Key Takeaways
- The article describes the journey of creating a multi-agent platform, a Language Model Operating System (LMOS), which was developed to solve the challenge of deploying LLM-powered applications for highly distributed scenarios with localized constraints.
- Initially built upon JVM-based tooling, the platform uses Kotlin to address its concurrency requirements and the need for domain-specific languages (DSLs).
- LMOS was used in an enterprise-grade scenario to replace vendor-provided solutions, drastically reducing the time-to-deployment of agents (from weeks to days).
- The platform uses isolated microservice-based agents and a layered architecture. It supports dynamic agent routing, version management, and rollback/rollout capabilities for deploying agents into production. It also includes an Agent DSL called ARC to help build LLM-powered agents.
- The platform was created to democratize agent development. It supports agents built with multiple languages and frameworks (Python, LangChain, LlamaIndex). It is open-source and publicly available as an incubated Eclipse Foundation project.
This is a summary of our talk at InfoQ Dev Summit Boston 2024. In this talk, we shared some of our company's key learnings in developing customer-facing LLM-powered applications deployed across Europe. We used multi-agent architecture and systems design to create an open-source set of tools, a framework, and a full-fledged platform to accelerate the development of AI agents at Deutsche Telekom. During the talk, we also showcased demos of agents running on our platform. Our approach focuses on enabling existing JVM engineers, who already possess domain knowledge and expertise in APIs, to build agents efficiently.
There is a central program for customer sales and service automation called Frag Magenta. Our initial task was simple: "How do you deploy GenAI across our European footprint, which is around ten countries?" This needed to cover all the channels customers use to reach us (chat, voice) and any related autonomous use cases. We also had to consider that these European countries would require support for different languages.
Especially at the time, when we were building RAG-based chatbots, this was not something that could really scale unless you had a platform to solve these hard challenges. We also needed to consider other constraints, such as building a prompt flow or handling use cases that require a different approach: in the voice channel, for example, you cannot send links. Essentially, this is where we started.
Inception
To attack the problem space, we formed a small team to look into the emerging GenAI scope. We were primarily looking into RAG-based systems and LLMs to see whether we could use them to achieve our goals. They are powerful constructs but also completely non-deterministic. How do you build applications where a stream of tokens or strings can control a program flow?
Classical computing, state machine-based workflow engines (e.g., Apache Airflow, Camunda, or Temporal), and orchestrators designed for deterministic process execution are not inherently equipped to handle the probabilistic and evolving nature of LLM output. What kind of paradigms could we bring in? At that point in time, LangChain was the primary framework for building LLM RAG applications. OpenAI had just released tool calling in its APIs. LangChain4j was an emerging port, and there was little else available in the JVM ecosystem.
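To make that question concrete, here is a minimal, hypothetical Kotlin sketch (the LlmClient and ToolCall types are ours, not from any framework named here) of how non-deterministic model output ends up steering program flow once tool calling is involved:

```kotlin
// Hypothetical types for illustration only; not from LangChain, LangChain4j, or LMOS.
data class ToolCall(val name: String, val arguments: Map<String, String>)

interface LlmClient {
    // The model decides, probabilistically, whether to answer directly or to request a tool call.
    fun complete(conversation: List<String>): ToolCall?
}

fun handleTurn(llm: LlmClient, conversation: List<String>): String =
    when (val call = llm.complete(conversation)) {
        null -> "I could not determine an action."          // free-text answer or no tool chosen
        else -> when (call.name) {
            "get_bill" -> fetchBill(call.arguments["customerId"])
            "get_contract" -> fetchContract(call.arguments["customerId"])
            else -> "Unknown tool requested: ${call.name}"  // the model may ask for tools that do not exist
        }
    }

fun fetchBill(customerId: String?) = "Bill for customer $customerId"         // stub
fun fetchContract(customerId: String?) = "Contract for customer $customerId" // stub
```

The branch the program takes on each turn depends entirely on what the model emits, which is the core difference from a deterministic workflow engine.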
It is important to note that we had made huge investments in the JVM stack. A lot of our transactional systems ran on it, and we already had SDKs and client libraries built on it for pulling data, along with observability platforms. What skill sets do you require to build these applications? Do you need an AI engineer? Do you need data scientists? At the time, most models were not production-ready, and you always needed a human in the loop. Considering our problem space and background, it was clear we could not just take a rudimentary approach to building something and expect it to work for all these countries with different business processes, APIs, and specifications.
This scenario also provided us with an opportunity: it was evident that there were no existing frameworks or design strategies to address this challenge. We already knew the existing models would only get better. So, what constructs could we build today that would stand the test of time in building a platform that allows the democratization of agents? That's how we started looking for open-source contributors within the company and brought a team together to treat this as a foundational platform that needed to be built.
Our Agent Platform Journey Map
We set out on a journey wherein we decided to build the next Heroku for agents. This was our mindset when we started recruiting people. In September 2023, we released our first use case: an FAQ RAG bot built on LangChain. We quickly realized we needed better abstractions and a foundational platform for the scale we were targeting.
We also wanted to leverage our existing investments in the JVM stack better: almost all of our infrastructure, including microservices and transactional API clients, was on JVM. There was no viable alternative in the JVM ecosystem. So, we started building a new agent framework using Kotlin, replacing the LangChain version in weeks (October 2023). Then, we started building the platform to manage the lifecycle of such agents, with the first tool calling agents being released in February 2024. Today, we have a fully open-source, multi-agent platform that provides the constructs to manage the entire lifecycle of agents: inter-agent communication, discovery, advanced routing capabilities, etc.
However, we are not paid to build frameworks and tooling. We are hired to solve business problems. With that in mind, it was also clear that the approaches of rudimentary prompt abstractions and functions on top would not scale. How many developers in data centers across all these countries would need to be hired, considering we have around 100 million customers in Europe alone?
We knew that voice models would emerge, so we needed something fundamental. We started looking at building the stack with one principle in mind: how can you take the greatest hits of classical computing and bake them into a platform? So we created a completely ground-up framework and ported the whole RAG pipeline, the RAG construct (or agent) we had released earlier, onto the new stack.
The stack had two layers. One we referred to as the kernel, which encapsulated the operating system constructs. The second was the Intelligent Agents platform, where developers created customer-facing use cases. This stack was referred to as LMOS, which stands for Language Models Operating System. We used Kotlin as the primary development language for two reasons: first, because of our investments in the JVM stack at the time; second, because of specific advantages inherent to Kotlin. We knew we had to democratize this stack, which could be done with domain-specific languages (DSLs), and the nature of this application requires advanced concurrency constructs. Kotlin inherently solves both problems.
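As a rough illustration of why Kotlin fit both needs (this is not LMOS code), the sketch below combines a small type-safe builder, the language feature DSLs such as ARC are built on, with coroutines for concurrent calls to backend APIs:

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.runBlocking

// A tiny type-safe builder: the kind of construct Kotlin DSLs are made of.
class PromptBuilder {
    private val parts = mutableListOf<String>()
    fun system(text: String) { parts += "SYSTEM: $text" }
    fun user(text: String) { parts += "USER: $text" }
    fun build() = parts.joinToString("\n")
}

fun prompt(block: PromptBuilder.() -> Unit): String = PromptBuilder().apply(block).build()

// Coroutines make concurrent fan-out to transactional APIs straightforward.
suspend fun loadContext(customerId: String): Pair<String, String> = coroutineScope {
    val bill = async { "bill data for $customerId" }         // stand-in for a billing API call
    val contract = async { "contract data for $customerId" } // stand-in for a contract API call
    bill.await() to contract.await()
}

fun main() = runBlocking {
    val (bill, contract) = loadContext("42")
    println(prompt {
        system("You are a billing assistant. Context: $bill; $contract")
        user("What's my bill?")
    })
}
```

The builder gives non-experts a declarative surface to write against, while coroutines keep the I/O-heavy agent code concurrent without callback plumbing.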
In February 2024, we released the first agent based on our new tooling: the billing agent. You could ask the Frag Magenta chatbot, "What's my bill?" and it would return it. This was a simple call, but built entirely on the new stack; we were not even using LangChain4j or Spring AI then. But as we started scaling our teams, we realized that we had to reduce the entry barrier, because there was still a lot of code that had to be written for each agent. This was when the DSL emerged as the democratizer, bringing that barrier down. We call it LMOS ARC, the Agent ReaCtor.
By July this year, we realized that we also needed to change the lifecycle of developing applications. Focusing only on frameworks and platforms would not be enough to accelerate our development pace. So, we ran an initiative called F9, which is derived from SpaceX's "Falcon 9". Using a completely cloud-native multi-agent platform, we were able to reduce the development time of an agent to 12 days.
We have started replacing some of the use cases in Frag Magenta with our new LLM-powered agents. So far, the use cases we have migrated have answered more than a million questions, with an 89% acceptable answer rate. This represents more than 300,000 human-agent conversations deflected, with a risk rate under 2%.
We benchmarked our solution against LLM-powered vendor products through A/B testing in production and found that our agent handovers performed better in about 38% of the same use cases.
Since our first release, agent development time has also been greatly reduced: for agents representing a domain entity (e.g., billing contracts), it was reduced from 2 months to 10 days. In the case of enhancing existing agents (adding new features or a new use case, such as resolving billing queries), this development time was brought down from weeks to an average of 2.5 days.
However, considering the fragility of these systems, you cannot release new features into production that fast, especially in a company of that size. There are many security factors involved that require a considerable amount of testing. For this reason, the actual pace of getting agents into production is about two agents per week. We need to design for failure. As a result, we now have the necessary constructs in the platform, allowing us to intervene and deploy a fix within hours. That, in essence, is what the platform stands for, which we refer to as our "agent computing platform".
Multi-Agent Architecture
In our multi-agent architecture, we have a single chatbot facing our customers and users. Behind it is a collection of agents, each focusing on a single business domain running as a separate, isolated microservice. There is an agent router between the chatbot and the agents to direct incoming requests to specific agents as needed. During a conversation, multiple agents can come into play. The services used by the agents, such as the customer API and the search API, are in the agent platform. All our RAG pipelines reside in the search API, outside the agents’ structure, thus simplifying the overall architecture.
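A simplified sketch of the routing idea is shown below; the names are hypothetical, and the real LMOS router is LLM-assisted and far more capable, but the shape is the same: a registry of agent capabilities and a function that picks a target agent for each request.

```kotlin
// Hypothetical illustration of the routing layer between the chatbot and the agents.
data class AgentDescriptor(val name: String, val capabilities: List<String>, val endpoint: String)

class AgentRouter(private val registry: List<AgentDescriptor>) {
    // In LMOS the classification is LLM-assisted; simple keyword matching stands in for it here.
    fun route(userMessage: String): AgentDescriptor? =
        registry.firstOrNull { agent ->
            agent.capabilities.any { capability -> userMessage.contains(capability, ignoreCase = true) }
        }
}

fun main() {
    val router = AgentRouter(
        listOf(
            AgentDescriptor("billing-agent", listOf("bill", "invoice"), "http://billing-agent"),
            AgentDescriptor("contract-agent", listOf("contract", "tariff"), "http://contract-agent"),
        )
    )
    println(router.route("What's my bill?")?.name) // -> billing-agent
}
```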
We chose this architectural design due to two key factors. The first was our need to upscale the number of teams working on the application. We had a very ambitious roadmap, and the only way we could achieve this was by having multiple teams work on the application in parallel. This is a great design for that. The second was related to the LLM prompts: they can be fragile. Whenever you make a change, no matter how small, you risk breaking the entire prompt. With a multi-prompt agent design, we can isolate the failure point to a single agent instead of having the whole chatbot collapse due to a broken prompt. This fragility is something we struggled with quite a bit at the beginning.
That's the top-level design. We can go one level deeper and take a look at the actual code of one of our first billing agents. We had a very traditional approach here: a billing agent class, an agent interface, an LLM executor to call the LLM, and a prompt repository. Although this was a good start, we identified key areas to improve. The top one was the high knowledge barrier involved: if you wanted to develop the chatbot, you basically had to be a Spring Boot developer. For a lot of our teammates, who were data scientists (and more familiar with Python), this was a problem. Even if you were a good Spring Boot developer, there was a lot of boilerplate code you needed to learn before you could become a productive member of the team.
We also needed design patterns to implement common tasks throughout the team. The high coupling with Spring Boot was a problem, too, because it made it difficult to reuse agents without it. We love Spring Boot, but we wanted to share our agents not only with other teams, but with the entire world.
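For illustration, here is a compressed, hypothetical reconstruction of that early style (the class and interface names are ours); even trimmed down, it hints at the boilerplate every new agent repeated:

```kotlin
// Hypothetical pre-ARC style: explicit classes wired together by a DI framework such as Spring Boot.
interface Agent {
    fun execute(conversationId: String, userMessage: String): String
}

interface LLMExecutor {
    fun complete(prompt: String): String
}

class PromptRepository {
    // In practice the prompt would come from configuration or a store, not a hard-coded string.
    fun get(key: String): String = "You are a billing assistant. Answer the question: %s"
}

class BillingAgent(
    private val llmExecutor: LLMExecutor,
    private val prompts: PromptRepository,
) : Agent {
    override fun execute(conversationId: String, userMessage: String): String {
        val prompt = prompts.get("billing").format(userMessage)
        return llmExecutor.complete(prompt)
    }
}
```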
These considerations resulted in ARC, a Kotlin DSL designed to help us build LLM-powered agents quickly and concisely. ARC combines the simplicity of a low-code solution with the power of an enterprise framework. This started as something simple and basic and has grown into our secret sauce for achieving our current agent development speed.
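To give a flavor of the DSL, here is a simplified sketch of an ARC-style agent definition; refer to the Eclipse LMOS ARC documentation for the authoritative syntax, since field names and details may differ:

```kotlin
// Simplified sketch of an ARC-style agent definition (typically written as a Kotlin script);
// see the Eclipse LMOS ARC documentation for the exact DSL fields.
agent {
    name = "billing-agent"
    description = "Answers questions about the customer's current bill."
    prompt {
        """
        You are a billing assistant.
        Answer questions about the customer's current bill concisely and accurately.
        """
    }
    tools = listOf("get_bill")
}
```

The point of the DSL is that this declaration replaces the hand-written class, executor, and prompt-repository wiring shown above.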
The LMOS Multi-Agent Platform
Our plan was never limited to a single agent. For this reason, we also needed to provide the necessary constructs for managing the entire lifecycle of multiple agents. When we started, we used to discuss this quite a lot: How do you design the society of agents? Should we use the actor approach? Should there be a supervisor? In essence, we decided that instead of reinventing the wheel, we needed to provide enough constructs to allow the extensibility of different patterns.
The Frag Magenta bot, for instance, is composed of multiple agents. That means you need discoverability, version management, dynamic routing, and routing between agents. You may have to deal with multi-intent queries, too. In addition, the size of our problem space, with multiple countries and business processes, also matters. How do you manage the agents' lifecycle when everything can break with one change to one prompt? What we learned as developers from building microservices and distributed systems still applies: we needed an enterprise-grade platform to run these agents.
These considerations resulted in the LMOS multi-agent platform. Similarly to Heroku, where deploying is just a git push to master, we wanted to create a platform where a developer could simply git-push an agent into production, and the platform would take care of everything else.
The LMOS platform is built on existing constructs around Kubernetes and Istio, with a custom control plane on top. In the platform, agents are first-class citizens, together with the idea of channels. A channel is the construct where we group agents to form a system, Frag Magenta, for example. Once agents are grouped, we need agent traffic management tied to tenant and channel management. Also, releasing agents is a continuous iteration process: you cannot just develop an agent, push it to production, and believe that everything will work well. You need rollout and rollback capabilities. In addition to all that, we have a module called LMOS RUNTIME, which bootstraps a given system with all the agents it requires.
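As an illustration of the channel idea, here is a hypothetical Kotlin model, not the platform's actual resource definitions, showing how agents might be grouped per tenant and channel with a canary weight for gradual rollout:

```kotlin
// Hypothetical model of the channel construct (not the actual LMOS resource definitions):
// a named grouping of agent versions per tenant, with canary weights for gradual rollout.
data class AgentRef(val name: String, val version: String)

data class Channel(
    val name: String,                                         // e.g. "frag-magenta-web"
    val tenant: String,                                       // e.g. "de"
    val agents: List<AgentRef>,
    val canaryTrafficPercent: Map<String, Int> = emptyMap(),  // agent name -> % routed to the new version
)

val fragMagentaWeb = Channel(
    name = "frag-magenta-web",
    tenant = "de",
    agents = listOf(AgentRef("billing-agent", "1.4.0"), AgentRef("contract-agent", "2.1.0")),
    canaryTrafficPercent = mapOf("billing-agent" to 10),      // roll a new billing agent out to 10% of traffic
)
```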
We built LMOS as an open platform where you can run not only ARC or other JVM and Kotlin agents: you can also bring your own agent developed using Python, LangChain, or LlamaIndex. The idea is that multiple agents can coexist within the platform as long as they follow its specifications. You can bring your non-ARC agent, wrap it into the fabric, deploy it, and even the routing will be taken care of by LMOS.
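Conceptually, bringing a non-ARC agent onto the platform means exposing it behind whatever contract the platform routes to. A hypothetical adapter (the interface and endpoint are ours, not the LMOS specification) might look like this:

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Hypothetical adapter: a Python/LangChain agent exposed over HTTP, wrapped so the
// platform can route to it like any other agent. Interface and endpoint are illustrative.
interface PlatformAgent {
    fun handle(conversationId: String, message: String): String
}

class RemoteLangChainAgent(private val baseUrl: String) : PlatformAgent {
    private val http = HttpClient.newHttpClient()

    override fun handle(conversationId: String, message: String): String {
        val request = HttpRequest.newBuilder()
            .uri(URI.create("$baseUrl/chat/$conversationId"))
            .POST(HttpRequest.BodyPublishers.ofString(message))
            .build()
        return http.send(request, HttpResponse.BodyHandlers.ofString()).body()
    }
}
```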
Summary
At the start of this journey, our vision was not only to create use cases for agents. We saw an opportunity: if we could create the next agent computing platform from the ground up, what would the layers look like? Would it resemble the network architecture or the traditional computing layers we already know? At the bottom, we have the foundational computing abstractions: prompt optimization, memory management, dealing with LLMs, and other low-level constructs. The layer above is the single-agent abstractions layer: how do you build abstractions for a single agent, and what tools and frameworks can we bring in to allow this?
On top of all that, you need to manage the lifecycle of agents, which is different from that of traditional microservices. It brings in additional requirements around shared conversation memory, continuous iteration, and the need to release only to specific channels. You also need the multi-agent collaboration layer, where we can build the society of agents. Such abstractions allow multiple sets of agents to be open and sovereign, so that we don't end up in a closed ecosystem of agents provided by any monopolies that might emerge in this space. We designed LMOS to absorb each of these layers. This is the vision.
Of course, we are building use cases, but this has been the construct in our minds since we started this journey. All of those layers and modules are open-sourced. We invite you to join us in defining our platform and the foundations of agentic computing on GitHub. LMOS is open-source and publicly available as an incubated Eclipse Foundation project.