





Defensible Moats: Unlocking Enterprise Value with Large Language Models


Summary

Nischal HP discusses risk mitigation, implementing environmental, social, and governance (ESG) frameworks to achieve sustainability goals, strategic procurement, spend analytics, and data compliance.

Bio

Nischal HP is a data science and engineering leader with over 13 years of experience in Fortune 500 companies and startups, known for leveraging machine learning to drive product advancements. He enjoys building and leading teams while fostering a culture of innovation and excellence.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

HP: I'm going to be talking about the work we're doing at scoutbee to enable large language models and generative AI applications in the enterprise landscape. Before we get technical and go down that road: a moat is the water around the castle that you see here. I want to take a little bit of attention away from large language models and talk about black swan events. Black swan events are unpredictable and have severe consequences. The funny thing about black swan events is that when one really happens, you look back, connect the dots, and say this was bound to happen. In the supply chain space, until about 5 or 6 years ago, 80% of the supply chain was predictable, in the sense that people knew when to expect deliveries and didn't have massive supply chain breaks. Twenty percent were surprises. Unfortunately, or maybe fortunately for us at scoutbee, this has flipped: you now see 80% surprises and 20% predictability. You might be wondering whether I'm making these numbers up, so let me walk you through some painful experiences we've all shared in the last 5 years. COVID-19 happened, and I think everybody up until then thought supply chain was this thing that just existed somewhere, until they went to a supermarket and couldn't find pasta, or, worst case, toilet paper was missing. You wondered what happened. The entire medical system came under strain; face masks, ventilators, everything was missing. You started to read a little bit about supply chain. Financial Times, Bloomberg, Wired Magazine, everybody started covering supply chain. We live in a time where the last few months have been terrible in terms of climate: forest fires everywhere, lots of floods. There's an ongoing war. We may be far from it, but it's causing a lot more disruption than any of us imagined. The other thing that happened: in the Suez Canal, one of the busiest waterways in the world, a ship went sideways. For weeks there was quite a struggle to get the ship back on track, and supply chain issues followed. When you look at these situations and ask yourself how to handle these events, you're going to need a wee bit more than a direct integration with ChatGPT. These problems cannot be solved by just enabling a large language model API.

Background

I'm Nischal, the Vice President of Data Science and Engineering at scoutbee. We're based out of Berlin, Germany. I've been building enterprise AI and MLOps for the last 7 years, and I've been in the supply chain space for the last 3.

Before that, I was in the insurance and med-tech industries. What's the purpose of this talk? We've not found the only solution; we think solving a problem of this scale requires a multitude of solutions. The goal is to present how we are enabling generative AI applications and large language models as part of our product stack. As takeaways, the presentation is broken down into two phases. The first is how we manage the entire data stack. The second is how you start thinking about reliability in large language models: how do you build safety nets? Because our users are not consumer users, they're business users. They're used to reliability. They're used to enterprise software. How do you bring that into the generative AI space? For those of you not working in this space, it might seem like we are in a big bubble that's about to pop at some point. Market analysts think otherwise. Generative AI is here to stay. 67% of IT leaders want to adopt generative AI within the next 18 months, and one-third of them want to make it their top priority.

Defensible Moats

A little bit about defensible moats before I jump into the data stack. Warren Buffett looks for economic castles protected by unbreachable moats, and invests in companies that have them. Wardley mapping is a very interesting tool for thinking about strategy: you have evolution on the x-axis and the value chain on the y-axis. A decade ago, around 2011 or 2012, when I started working in data science, the IP was basically feature engineering and statistical models; you did a lot of regression work, and that was your IP. That was not a commodity. The data that went along with it was what you actually focused on. A few years after that, the deep learning era kicked in, and we stopped worrying about all the features we had handcrafted to serve our applications. Those were not your moat anymore. Your moat was essentially thinking about networks and how you build your loss functions. It still required quite some data, so you still had a defensible moat if you had traffic and data coming in, but your features became a commodity. With OpenAI coming up with ChatGPT, the deep learning models that were your IP are not your IP anymore either; you're not in the race to build new models. Of course, there's still a lot of room for innovation in the deep learning space, but at the moment, if you're competing with the likes of ChatGPT and Meta's Llama 2, it's a race that maybe only a few companies can run, because of the data those companies have access to. That means you need to be smarter about where you want to build your moat.

What is the commodity you can use off the shelf? Where do you get stronger? This has been the journey of large language models over the last 2 to 3 years. Just in the last 6 months, between when I started making this presentation and now, two or three new large language models have appeared in this space. You can see that building these models themselves is not a defensible moat for us.

Full Data Stack - System of Records

An introduction to the full data stack. It's broken down into three segments: system of record, system of intelligence, and system of engagement, inspired by a blog post called "The New Moats" from Greylock. We bring data for our customers, our customers being Unilever, Walmart, Audi, and a lot of other large organizations. For them, we bring data from different places. We bring data from ERP systems. I know a whole lot of you have forgotten that ERPs exist, but they still do; lots of companies run on very large ERP installations. They have document stores. They have a ton of custom data systems. You'd be shocked to learn how many custom data systems they have. A bunch of these systems are still not on the cloud, so they're sitting in data centers managed by a group of people. A few years ago, we saw a shortage of COBOL programmers as well; that's because these systems still power a lot of these large organizations. What you see in these organizations is not really a data mesh architecture, maybe a little bit of one. The data is duplicated. It's incorrect. It's not up to date. The systems have different complexities, and they're heterogeneous, which means every enterprise customer we work with, and even the teams within those organizations, are all speaking a very different language. They're all looking at the same data points in their own space, and it's hard to understand what they mean. The first thing we wanted to do, and we started this journey about two-and-a-half years ago, was to standardize this data language and build a semantic layer. We invested in a piece of technology we call the knowledge graph, which was championed by Google, Amazon, and everybody else in the previous decade. We put together connections between the data points and worked on the ontology to create this knowledge graph. In the Netflix talk presented by Surabhi, she spoke about future-proofing your technology and actually placing bets and investing. This bet we made two-and-a-half years ago started to pay off with the generative AI landscape. I'll cover how that happened as we move through the presentation.

We couldn't fit our ontology on a slide because it was just too big; this is to give you a sense of what a knowledge graph can look like. It's from the supply chain space. As you can see, there's data coming from customers, third parties, supply chain events, and organizational data. The relationships between data points completely change the meaning of the data. The good thing with knowledge graphs is that you're actually trying to model how the real world works, how entities in the real world look. If you think about a manufacturing company that depends on other manufacturing companies, putting it as a record in a relational database table is actually bad practice, because it is a real-world entity, and a real-world entity is more than just a record in a database table. To enable these knowledge graphs, briefly touching on the technology, we partner with Neo4j Aura. We bring our data in large batches with Apache Airflow. We do data validation with Pydantic, and we scale that processing with Polars; Polars and Pydantic are some of the newer libraries in the Python landscape. We do data observability with Snowflake. Because we want to build a ton of different applications on top of this, we expose all of this data through GraphQL.
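
To make the validation step concrete, here is a minimal sketch of how Pydantic and Polars can work together in a batch pipeline of this kind. The Supplier fields, the ISO country check, and the data values are invented for illustration; they are not scoutbee's actual ontology or pipeline code.

```python
# Minimal sketch: Pydantic models guard records before they are loaded into a
# knowledge graph; Polars handles the batch. Field names are illustrative only.
from pydantic import BaseModel, field_validator
import polars as pl


class Supplier(BaseModel):
    supplier_id: str
    name: str
    country: str
    certifications: list[str] = []

    @field_validator("country")
    @classmethod
    def country_is_iso2(cls, v: str) -> str:
        if len(v) != 2:
            raise ValueError("expected ISO-3166 alpha-2 country code")
        return v.upper()


def validate_batch(df: pl.DataFrame) -> list[Supplier]:
    """Validate a Polars batch row by row; invalid rows are collected separately."""
    valid, invalid = [], []
    for row in df.iter_rows(named=True):
        try:
            valid.append(Supplier(**row))
        except Exception as exc:
            invalid.append((row, str(exc)))
    # In a real pipeline this would be an Airflow task writing `invalid` to an
    # observability table and `valid` into Neo4j via batched MERGE statements.
    return valid


if __name__ == "__main__":
    batch = pl.DataFrame(
        {
            "supplier_id": ["S-001", "S-002"],
            "name": ["Acme Metals", "Beta Polymers"],
            "country": ["de", "USA"],  # second row fails the ISO-2 check
            "certifications": [["ISO9001"], []],
        }
    )
    print(f"{len(validate_batch(batch))} valid record(s)")
```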

System of Intelligence

We'll jump into the system of intelligence layers. We have a machine learning inference layer and an agent-based framework. This is where things start to get interesting on the machine learning and generative AI side. In the machine learning inference layer, we're doing the traditional machine learning workloads of converting unstructured data to structured data. We're running smaller transformer models, because transformer models now run to 170 billion parameters; we're doing something very small, in the likes of RoBERTa and a bunch of other language models, to extract information we think is appropriate for our domain. The scale at which this operates is web scale, because we actually crawl the internet, respectfully and ethically. We're based out of Europe, so GDPR laws kick in. We're looking at about a billion pages every 3 to 6 months and extracting about 675 million entities. This builds our internal knowledge graph, about 3 billion relationships, and they need to be refreshed every few months. That's the traditional machine learning inference layer. With generative AI kicking in, we started hosting an open-source large language model, Llama 2 from Meta. There's a reason we went with hosting it ourselves. In the space we operate in, without access to the system of records, there is very little value we can actually bring to our customers. When you use ChatGPT and the like without domain knowledge, without access to all of this internal information, you can bring some value, but the moment you want to start working with intellectual property, organizations don't want to work with companies where data is being shipped off somewhere. We are in a position where every single data point we look at has to be managed and maintained by us. This added another challenge to our machine learning inference layer: we look at very different observability metrics for supporting inference with large language models. We're talking about how many tokens you can process per millisecond, what your throughput is in tokens per minute, and how large a GPU you need to run the model. Currently, we're running on one big machine with about 48GB of GPU memory, and we're also running another flavor of it on SageMaker. This is the scale at which our machine learning inference layer has to work. To support this, we're working with the Hugging Face transformers library and a bunch of other packages from Hugging Face. We built our own models with PyTorch. We're running our ML workloads with Spark, and our MLOps workflows with MLflow, S3, and Snowflake. The thing we realized the moment we added the LLM layer is that this ecosystem is not sufficient for us. We're starting to move away from Spark workloads to building machine learning inference workloads with Ray and Airflow. The other thing coming up is that we're moving away from MLOps to LLMOps. I've put an asterisk there because I'll talk about LLMOps in a non-traditional way as part of this presentation, as opposed to the traditional way of LLMOps, which people are still figuring out what that even means. That's our machine learning inference layer.
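
As a rough illustration of the token-throughput metric mentioned above, the sketch below loads a model through the Hugging Face transformers library and measures tokens generated per second. The model id, prompt, and settings are placeholders (Llama 2 checkpoints are gated and need an access token, and the real serving setup would look different); this is just the shape of the measurement.

```python
# Sketch: tokens/second for a self-hosted causal LM served via transformers.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # assumption: swap for the hosted checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",  # spread the weights across available GPU memory
)

prompt = "List three risks to watch in a semiconductor supply chain."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")
```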

As part of the system of intelligence, we bring a new layer, which is actually very nascent: the agent-based ecosystem. What are agent-based ecosystems, or what are generative AI apps? The ethos we have at scoutbee is that we want humans to be involved, to review the outputs for accuracy, suss out bias, and ensure these large language models are operating as intended. The goal is not to displace or replace humans, it's to augment them with capabilities they didn't have, so they can solve harder problems than searching for a data field across 100 different systems. That shouldn't be the nature of the work people in these organizations have to do. What are we building as part of the agent-based ecosystem? We're building conversational AI, supported by a multi-agent architecture and RAG. I'll walk through what each of those means. Before we jump into multi-agent, what's a typical agent structure? Who here has not worked with ChatGPT, or prompts, or any of this in the last few months? Just a quick introduction. When you ask ChatGPT a question, what you're typically doing is writing a prompt. An agent, by contrast, is something you design to solve a particular problem, and it can have multiple prompts. You also provide a persona, where you say, "I am this so-and-so person," to keep the conversation clean; you provide an instruction describing the problem the agent is designed to solve; and you give it a set of prompts to work with. You can think of an agent as a person tasked with a job, given all the necessary tools and data to do it. A multi-agent setup takes you from one person to a group of specialized people. Instead of one agent that does everything, you have multiple agents that talk to each other to solve a problem: one agent summarizes, another does analytical work, another refines. A user writes a prompt, and then the agents take over and help solve that problem.
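
To make the persona/instruction/prompts structure concrete, here is a small, purely illustrative sketch. The roles and wording are invented; they are not scoutbee's actual agents, just the shape of the idea.

```python
# Sketch of the agent structure described above: persona + instruction + prompts.
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    persona: str          # "I am a ..." framing that keeps the conversation on-domain
    instruction: str      # the specific problem this agent is designed to solve
    prompts: list[str] = field(default_factory=list)

    def system_prompt(self) -> str:
        return f"{self.persona}\n\nTask: {self.instruction}"


# A multi-agent setup is simply several specialized agents handing work to each other.
analyst = Agent(
    name="analyst",
    persona="I am a procurement analyst.",
    instruction="Compare candidate suppliers on risk and certification coverage.",
)
summarizer = Agent(
    name="summarizer",
    persona="I am a supply-chain analyst who writes concise executive summaries.",
    instruction="Summarize the retrieved supplier data for the user.",
)
refiner = Agent(
    name="refiner",
    persona="I am an editor.",
    instruction="Refine the draft answer so it directly addresses the user's question.",
)

pipeline = [analyst, summarizer, refiner]  # agents invoked in turn on the user's request
```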

As part of the multi-agent architecture, and I'll talk about RAG as well, every agent uses a flavor called Re+Act, not to be confused with the web framework React. Re+Act stands for reason and act. Why reason and act? One of the things large language models do so much better, and why they're all the rage right now, is that because of the amount of data they've been trained on, and the amount of prompts generated by experts, they've built the capability to reason through their analysis and their solution. This is an example of what that looks like: when you ask a large language model a question, you can also force it to reason out how it got to that answer. It comes up with a thought, acts on it, observes from that thought what to do next, and builds a chain of thoughts until it reaches the final answer. Why did we have to do this? Our business users need to know why we're asking them the questions we ask when we're trying to solve a problem. They need to know how we reached a certain solution, not just that we went from A to B, because the other thing large language models can do is hallucinate very coherently. The answer looks real, but it's factually incorrect, so you need a chain of reasons to get there.
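
Below is a minimal sketch of a Re+Act style loop under stated assumptions: the model interleaves Thought, Action, and Observation steps until it emits a final answer. The action names and the stub tool are hypothetical, and `llm_call` stands in for whatever client wraps the hosted model; this is the pattern, not scoutbee's implementation.

```python
# Sketch of a reason-and-act loop: Thought -> Action -> Observation -> ... -> finish.
REACT_TEMPLATE = """You are a supply-chain assistant.
Solve the question by interleaving Thought, Action, and Observation steps.

Available actions:
  search_suppliers[query]  -- look up suppliers in the knowledge graph
  finish[answer]           -- return the final answer

Question: {question}
Thought:"""


def run_action(step: str) -> str:
    """Stub tool dispatcher; a real system would hit the system of records."""
    if "search_suppliers[" in step:
        query = step.split("search_suppliers[", 1)[1].split("]", 1)[0]
        return f"(stub) 3 suppliers matched '{query}'"
    return "(stub) no recognized action"


def react_loop(llm_call, question: str, max_steps: int = 5) -> str:
    """llm_call(prompt, stop) -> str is supplied by the caller."""
    transcript = REACT_TEMPLATE.format(question=question)
    for _ in range(max_steps):
        step = llm_call(transcript, stop=["Observation:"])  # model writes Thought + Action
        transcript += step
        if "finish[" in step:
            return step.split("finish[", 1)[1].split("]", 1)[0]
        transcript += f"\nObservation: {run_action(step)}\nThought:"
    return "No answer within the step budget."
```

Because the full transcript of thoughts, actions, and observations is kept, the reasoning can be shown to the business user alongside the answer.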

We spoke about conversational AI, multi-agent, and the style we're using, and there was another acronym I mentioned: RAG. RAG stands for Retrieval Augmented Generation. What this actually means is that large language models know a lot, but they have no access to the system of records that you have access to. To bring in facts, and to make sure large language models are not hallucinating the answers, when you ask a question you can also provide a small subset of documents that you think contains the answer. The large language model then takes the question and that small set of documents you've identified, and generates an answer from the facts presented in those documents. Between document question answering and the applications people are building on top of their datasets, there's a lot of work going on in RAG. That's why vector databases are suddenly some of the coolest databases out there, even though they've been around for quite some time. We've used vector databases, not as vector databases exactly, but we've been storing and working with vectors for quite some time. Now you have the capability of finding this subset of documents pretty quickly. Coming back to betting on being future-proof and investing in technology where you go from strength to strength: we're doing RAG not only with vector databases, which I'll talk about next, but also with the knowledge graphs we've built as part of our system of records. When a user is having a conversation and asking us questions, because the knowledge graphs are representations of real-world entities with semantic meaning, we can convert that query and identify a subgraph on the knowledge graph that we can pick up to answer the question. The technology we invested in over the last two-and-a-half years has enabled us to build generative AI applications on top of it. We do the same thing with the documents we get, where we also have a vector database. It's not one tool that does it all; we have different datastores, and they come into action depending on the kind of question being asked and where the system of records actually lives.
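
A compressed sketch of that routing idea is below: ground the model's answer either in a retrieved subgraph or in vector-search results, depending on the question. All the helpers here are stubs standing in for real Neo4j, vector-database, and model-serving clients; the routing rule itself is an assumption for illustration.

```python
# Sketch: RAG over two backends -- a knowledge graph and a vector store.
def mentions_known_entities(question: str) -> bool:
    return "supplier" in question.lower()  # stub routing rule


def retrieve_subgraph(question: str) -> str:
    return "(stub) SupplierA -[SUPPLIES]-> PartX; SupplierA located_in DE"  # e.g. Cypher via Neo4j


def vector_search(question: str, top_k: int = 5) -> str:
    return "(stub) top document snippets relevant to the question"  # e.g. vector-DB query


def call_llm(prompt: str) -> str:
    return "(stub) grounded answer generated from the provided facts"


def answer_with_rag(question: str) -> str:
    if mentions_known_entities(question):
        context = retrieve_subgraph(question)       # knowledge-graph RAG
    else:
        context = vector_search(question, top_k=5)  # document RAG

    prompt = (
        "Answer the question using only the facts below.\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return call_llm(prompt)
```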

Quickly moving on from the agent-based ecosystem: this is a very common architecture pattern you will see for most large language model applications. You have your datastores sitting in different databases, and you need a way to connect to them easily. LlamaIndex is a framework that helps you do this; your agents can use LlamaIndex to talk to the system of records. For the agents themselves, there's been a lot of work in the last year or so on frameworks such as LangChain, plus OpenAI functions and everything else that's come out, and we've designed our agents using those frameworks. Basically, your agents talk to the user and enable an interaction with large language models, and whenever data is required to answer a question, they use LlamaIndex. There's also something called Llama Hub. We've not investigated or invested in Llama Hub because we bring data from a very different place, but through it you could integrate APIs from Slack, Notion, Salesforce, and a bunch of other places.
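
As a minimal LlamaIndex sketch of that pattern: index a folder of documents once, then query it as one of the agent's data sources. Import paths follow recent llama-index releases and may differ in older versions, the folder name is hypothetical, and by default this uses an OpenAI-backed LLM unless you configure your own model.

```python
# Sketch: index documents with LlamaIndex and query them.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./supplier_docs").load_data()  # hypothetical folder
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Which suppliers hold an ISO 14001 certification?")
print(response)
```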

The tools and technologies we use in the agent-based ecosystem layer: we work with LangChain. There's a whole debate in the community about whether LangChain is too bulky, whether it's something you should use or not. At the moment, it's made our jobs a whole lot easier when working with large language models, especially given that we are running our own version of Llama 2 in our ecosystem. Then we have Llama 2 from the Hugging Face library, and we've also put it on AWS SageMaker. If you want to quickly get some of these models up and running, Hugging Face inference containers on SageMaker probably take 5 or 10 minutes to set up. It's expensive, but it doesn't take much more than that to get running. For the first set of applications we started building, because we didn't want to invest in a whole lot of frontend engineers building conversational apps, we started working with Streamlit. Streamlit is a framework for building Python data apps. You can spin up a conversational web app with maybe 20 lines of code, and nothing more than that.
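
Roughly those "20 lines" look like the sketch below, using Streamlit's stock chat widgets. The `generate_answer` function is a placeholder for the real agent/LLM call.

```python
# Sketch: a minimal Streamlit chat app (run with `streamlit run app.py`).
import streamlit as st

st.title("Supplier discovery assistant")

if "messages" not in st.session_state:
    st.session_state.messages = []


def generate_answer(question: str) -> str:
    return f"(stub) Here is what I found about: {question}"  # placeholder for the agent call


# Replay the conversation so far.
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

# Handle a new user turn.
if prompt := st.chat_input("Describe the sourcing problem you want to solve"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    answer = generate_answer(prompt)
    st.session_state.messages.append({"role": "assistant", "content": answer})
    with st.chat_message("assistant"):
        st.markdown(answer)
```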

Systems of Engagement

Coming to the last layer of the data stack, systems of engagement: what part of the product are we actually enabling, and how do we think about this from a product perspective? We're looking at two very specific things as part of our product experience. One is the fact that we're working with a lot of data, and it's very easy to end up with a complex user interface. With generative AI applications and our agent-based framework, we're helping users move away from complex user interfaces toward something chat-based. We're not entirely chat-based; we have our own flavor of what we call chat-based. This is helping us solve tough, complex data problems for our users by giving them a much neater, cleaner user experience. The second, and most important, reason generative AI is essential for us: I'm not sure how many of you have multiple products as part of your platform, but we have three different products. They're in the same space, but each does a very specific set of operations, and users had to use all three depending on the kind of problem they were solving. What we've enabled with generative AI is an application that bridges the gap between these three products and lets users work through one interface. The application makes use of features from the different products as and when the user needs them. This is helping us build a cohesive product rather than different products for narrow solutions.

Why do we want to do this? One part is being part of the hype, which is good and fun, because technology teams love it. But when you're building a business and thinking about economies of scale, and you're investing so much in generative AI and all these experiments, what we essentially want to do is lead our users in a new direction. With the stack up to the machine learning inference, system of records, and product application layers, ignoring the agent-based ecosystem for a second, users would come and use our systems to search for data. They had a problem, so they would come in, search for some data, look at the analytics and the dashboards, and so on. We wanted to move them from that to saying: I have this problem, I don't know how to solve it, what can I do? And we wanted to support them with all the data they needed from the stack we've put together.

Recap

We're going from strength to strength. It's very important to understand that large language models and generative AI should be a tool in your stack; you still need your entire stack to drive value for your customers. Your defensible moats come from the combined power you build across the layers of the stack.

Feedback and Learnings

Once we did this, and we put a lot of effort into getting our first generative AI applications off the ground, we went into beta testing. This is where the magic happened: we started getting feedback. The good news was that our customers enjoyed the experience. They loved that they could chat through an application, work with their data, and so on, but there were concerns. They said: I asked the same question yesterday and I asked the same question today, and your system gives different answers, why? Some of the users are not native English speakers, and sometimes they use different text to express the same thing, but the application started behaving differently. They're used to idempotency; they're used to the fact that if they do the same thing again, the result should be the same. That's not the case with generative AI and large language models. They also spoke about the quality of the conversation. They said: we want to solve a particular problem in this space, but the large language model thinks it's in a completely different domain; the intent of the conversation is misunderstood. We'll talk a little bit about some of the findings we had.

Reliability in the World of Probability

We realized that if we are to build a stronger moat here and drive value for our customers, we have to address their concerns around reliability and build something that is a domain expert rather than something generic. Building trust with generative AI apps is not the easiest thing to do, and it's a work in progress. Building reliability in a world of probabilities is not easy. Enterprise product users are used to reliability in a good form and shape. With the generative AI use cases, they are extremely happy to use them, but it creates discomfort because they've entered a new world of uncertainty, a probabilistic world. The other aspect that really scares them is that a large language model can take you to very different destinations depending on what you ask of it. One could argue that this is the power of LLMs, but it might not drive the intended value for your customers in an enterprise landscape. Then there's the train of thought: as with humans, if you're having a long enough conversation and you ask things in a way that's confusing or challenging, large language models switch context. They can take you from trying to solve a problem in the supply chain space to fairyland, writing new stories about Narnia. You can really land in very different places with a large language model. The more you use them, the more you realize how quickly they switch context. There's also the challenge of switching between different agents, which is not trivial either. Large language models, or agents, can choose not to invoke other agents. The reason is that they can start hallucinating: instead of picking up data from the system of records and using it to solve a problem, they might just hallucinate the data themselves. They might claim to know about a supplier working out of the U.S., and the data looks so right that you might assume it's factually correct, but it's not.

We didn't want to move away from using large language models. We see the power, and we wanted the best of both worlds: use the creative and innovative power of LLMs while keeping control and building in reliability and analysis together. How do we start thinking about this? Where do we even begin? As we were asking ourselves this question, we stumbled upon a concept in large language models called Graph of Thoughts. The concept is very similar to how humans think. When you are asked a question, you are in a Bayesian world: you have many possible answers and many thoughts. Depending on who is asking, you go from Bayesian to frequentist; you choose a certain path and a certain answer. The next time somebody else asks you the same question, you might choose a very different path, and depending on the path you've chosen, you put together different thoughts to solve the problem. Large language models do this as well. What the Graph of Thoughts paper, which comes from ETH Zürich, talks about, building on the Re+Act idea of reasoning and acting, is storing this reasoning as a graph and using humans to tell us whether the reasoning is right, so you can fine-tune the models later. That's not the path we've gone down right now. What struck us was that you can actually store the reasoning state, the paths your large language models have taken. We went from knowing we had a problem to thinking about the observability part of it: using this, we can start observing what the large language model is doing, and depending on what it is doing, we can decide how we want to fix it.
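
A small sketch of that observability idea is below: record each reasoning step as a node in a graph so the path the model took can be inspected later. The use of networkx and the supply-chain content are purely illustrative assumptions, not the actual storage layer.

```python
# Sketch: store Thought / Action / Observation steps as a graph for later review.
import networkx as nx

reasoning = nx.DiGraph()


def record_step(graph: nx.DiGraph, step_id: str, kind: str, text: str, parent: str | None = None):
    """kind is 'thought', 'action', or 'observation'; parent links the step to its predecessor."""
    graph.add_node(step_id, kind=kind, text=text)
    if parent is not None:
        graph.add_edge(parent, step_id)


record_step(reasoning, "t1", "thought", "The user needs alternative gearbox suppliers in the EU.")
record_step(reasoning, "a1", "action", "search_suppliers[gearbox, region=EU]", parent="t1")
record_step(reasoning, "o1", "observation", "4 candidate suppliers returned", parent="a1")
record_step(reasoning, "t2", "thought", "Rank the candidates by delivery risk.", parent="o1")

# Later, a reviewer (or an automated check) can walk the stored path:
for node, data in reasoning.nodes(data=True):
    print(node, data["kind"], "->", data["text"])
```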

This was inspirational for us. It enabled us to think of a plan, to think about the execution as a graph that you could fine-tune over time. As a quick thought experiment, we realized that, on one side, by thinking about this as a graph and controlling it, we could bring some reliability and avoid some context switches. But that still won't stop the large language model from hallucinating and misinterpreting intent, and being very confident about it. It's very confident at being wrong; it almost feels like you're the dumb person on the other side most of the time. You have to be very careful to know what's really going on. Taking that inspiration, looking at the observability, and running these tests, what we realized was that in the reasoning the large language models produced, there was a big gap in domain understanding. We have a lot of business and domain knowledge that we couldn't inject very easily into prompts, which is why prompt engineering is one of the sought-after fields right now, where salaries are crazy. I saw some organization, I think, paying out a million for a prompt engineer, and there was news all over it; maybe it was fake, or generated by an LLM. Taking all of the domain knowledge that is spread among hundreds of people in your organization across years, and converting it into 1, 2, or 10 prompts, is very challenging. You're always going to miss a certain aspect of it. We saw that the reasoning of the large language model jumped very quickly from being told it is a person working in the supply chain space solving this problem, to thinking it's an aeronautical engineer working for Lockheed Martin, and it decided to go through very different sorts of reasoning.

We said, ok, so we're going to need some control here. We're going to have to bring a lot of domain knowledge into this ecosystem. How do we do this? We'd already worked with knowledge graphs before, so we thought maybe there's a way for us to build a new knowledge graph, which we call the meta-data knowledge graph. Essentially, we took the problems we wanted to solve, and from the business knowledge and the experts we had in the domain, put together an ontology we could use to work with a large language model and design a meta-data domain knowledge graph. What this knowledge graph contains is a problem, all the subproblems required to solve that problem, and the data each subproblem needs. It's not a single path; it's, again, a big graph. Depending on the way the user wants to solve that problem, the large language model guides the user through that process. When you come up with something and you're working on it, you have this weird worm in your head that constantly tells you you're wrong: "You're wrong. You're not seeing something. You're not thinking about this. Are you sure you want to invest your time and effort into this? What if it backfires?" I'm sure a lot of you think about this when you're building your large-scale machine learning models. Then a new paper was released, called Graph Neural Prompting, which talks about something very similar to what we were doing. There was a big sigh of relief; it was coincidental, and we were very happy it came out. We're not doing the graph neural prompting part, but essentially it describes the approach we take, where you have a very domain-specific knowledge graph, you have your question, and you're augmenting your large language model with the graph to solve the problems you want to solve.
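
To illustrate the shape of such a meta-data graph (problem, subproblems, and the data each needs), here is a tiny sketch. The problem names, data fields, and dependencies are invented for the example, not the actual ontology.

```python
# Sketch: a problem decomposed into subproblems, each with the data it needs.
META_GRAPH = {
    "find_alternative_supplier": {
        "subproblems": ["define_requirements", "shortlist_candidates", "assess_risk"],
    },
    "define_requirements": {
        "needs_data": ["part_specification", "annual_volume", "target_regions"],
    },
    "shortlist_candidates": {
        "needs_data": ["supplier_knowledge_graph", "certifications"],
        "depends_on": ["define_requirements"],
    },
    "assess_risk": {
        "needs_data": ["supply_chain_events", "financial_health"],
        "depends_on": ["shortlist_candidates"],
    },
}


def next_subproblems(problem: str, solved: set[str]) -> list[str]:
    """Which subproblems can the LLM guide the user through next, given what is already solved?"""
    return [
        sp
        for sp in META_GRAPH[problem]["subproblems"]
        if sp not in solved
        and all(dep in solved for dep in META_GRAPH.get(sp, {}).get("depends_on", []))
    ]


print(next_subproblems("find_alternative_supplier", solved=set()))
# -> ['define_requirements']
```

The graph constrains the conversation: at every turn, only the subproblems whose prerequisites are satisfied are offered to the large language model as the next step.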

What the meta-data graph actually does is that, based on the problem our user wants to solve in a particular domain, it helps traverse a subgraph, and the traversal across different subgraphs is enabled by a large language model. We put together a quick implementation, and we validated the idea with our users and our internal team. We now had an additional layer in our data stack. This was good. What we saw was that, because we started bringing in a lot of domain knowledge, the number of context switches dropped. The large language model held a conversation together and tried to solve a complex problem step by step with the user, without jumping from one context or domain to another. We still did not get the reliability we were looking for; there were still hallucinations. In spite of all the nudging and prompting, at any given point in time the large language models knew they wanted to perform an action and knew the data points they required to solve the problem, and essentially said, ok, we'll hallucinate this for you.

How do we reduce the hallucinations? How do we choose the right subgraph? How do we switch subgraphs in the middle of a conversation? The eyes cannot see what the mind does not know. So we went back to our observability. We started thinking about how we store the reasoning and manage the entire state. We did what humans do: when you come to a conclusion, you want to verify whether the conclusion you've reached is right or wrong. We did this thing called chain of verification; we didn't know that's what it was actually called. Essentially, at every single point in the work we are doing, we verify with another set of prompts: is this the right thing you've done? Do you understand the intent? Is this the right data to validate? This is what that looks like. When you ask an LLM to name some politicians who were born in New York, it comes up with a list: Hillary Clinton, Donald Trump, Michael Bloomberg. When you nudge the LLM and say, let's verify this, where was Hillary Clinton born, where was Donald Trump born, and Michael Bloomberg? You realize the LLM says Hillary Clinton was actually born in Chicago, Donald Trump in Queens, New York, and Michael Bloomberg in Boston. The first answer it gave you was incorrect, because it picked up a very different understanding of what the question meant. With chain of verification, you can verify every single action you perform with your large language model. Because we had the meta-data graph and the chain of verification in place, we could introduce a planner. One of the other aspects that was important for us, when we picked up the subgraph: one of the challenges with business users, when you try to solve a big problem with generative AI applications, is that they don't really know the road they're being asked to walk. Now that we had the process modeled on a graph, we could show the major milestones the user has to go through well before they actually hit each milestone. That brought a certain amount of certainty to using the generative AI application; it guided our users and reduced the anxiety of having a 10-minute conversation only to realize they've made all the wrong choices in life. We didn't want them to have that. So we introduced a planner into the mix.
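
A compact sketch of the chain-of-verification pattern is below: draft an answer, generate verification questions, answer them independently of the draft, then revise. The `llm` function is a stand-in for whatever model client is in use; the prompt wording is illustrative.

```python
# Sketch: chain of verification -- draft, verify, then revise.
def llm(prompt: str) -> str:
    return "(stub) model response"  # placeholder for the hosted model call


def chain_of_verification(question: str) -> str:
    draft = llm(f"Question: {question}\nAnswer:")

    verification_qs = llm(
        "List short fact-checking questions that would verify every claim in this answer:\n"
        f"Question: {question}\nDraft answer: {draft}"
    )

    # Answer each verification question without showing the draft, so the model
    # cannot simply repeat its earlier (possibly hallucinated) claims.
    verification_answers = llm(f"Answer each question independently:\n{verification_qs}")

    return llm(
        "Revise the draft so it is consistent with the verified facts.\n"
        f"Question: {question}\nDraft: {draft}\n"
        f"Verification questions: {verification_qs}\n"
        f"Verified answers: {verification_answers}\nFinal answer:"
    )
```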

With this, we unlocked another layer in our data stack. As you can see, with just a lightweight integration into something like ChatGPT, or Anthropic, or any of these systems, you will not find the value you can actually drive for your customers; a very simple, easy integration is not enough. You will have to build and invest quite a lot if you want to build generative AI applications in the enterprise space that actually drive value. With these two layers, we brought a bit more reliability and predictability into our probabilistic world. We're still in the process of measuring an appropriate baseline to see how much we've improved. What we've seen from one of these papers is that, with the techniques we've adopted, they report a 13.5% increase in performance, and with a bit more fine-tuning, close to 15% or 16% improvement. The other win was with our data science team. With the first version of the apps, when they handed them to the users and the sales team, they were like, we have no idea what's going to happen. Now they're like, we know what's going to happen; we have some sense of reliability here. It's predictable in the way it needs to behave, to a certain degree. This is still work in progress. In stage 2, we want to implement human-in-the-loop feedback, and we want to grow our meta-data graph with a combination of users and the LLMs themselves, so that we have the full breadth of the knowledge graph.

Summary

Without access to your full stack of data when building this entire thing, it might be very hard for you to build a defensible moat. Enabling LLMs requires a lot of effort. Making them reliable, predictable, and harmless requires far more effort and innovation. The generative AI space is just starting up; it's a very exciting time we live in. With great power comes great responsibility, so all of us practitioners have to take the reliability, predictability, and observability parts seriously, and make LLMs safer for everyone to use.

Questions and Answers

Participant 1: I saw that you moved back to a single-agent architecture from multi-agent. Recently there have been developments in that space with Auto-GPT and MetaGPT, especially the work from Microsoft, and I see a general trend away from single agents toward lots of specialized agents, augmented with things like set operating procedures for their communication. Do you think you're going to return to that in the future? What are your thoughts on that direction?

HP: What we did was reduce the amount of space a large language model can operate in. To increase reliability and predictability, we started adding constraints. When we added constraints, we saw we didn't really need to switch from one agent to another, because at any given point in time the meta-data knowledge graph is nudging and prompting the agent with the task it needs to solve. We're basically building prompts on the fly, depending on how the user is navigating the problem. We don't really need different agents, because one agent does everything: it knows what to summarize, when to go into long-form text prompting, when to provide a list, when to pick up data from a system of records. It is guided by the meta-data knowledge graph and the chain of verification. When we didn't have those two things, we definitely couldn't make do with a single agent; we had to go multi-agent. It depends on the domain you're working in and how much knowledge you have in that domain to design that graph. If you're working in a very open-ended space, then I think it's better to have multi-agent than single agent.

Participant 2: A lot of the stuff we're building has latencies [inaudible 00:52:17], and then you add a verification layer on top, so there's a latency question. I was wondering how you deal with that.

HP: You can treat every problem as a purely technical engineering one, or you can solve it as a combination of user experience, product experience, and engineering. While the verification is happening, we let the user know by constantly providing them a summary. We say, this is the problem you want to solve. If you're having a long conversation, we guide them with a summary on the side: you started with this problem, you gave us this input, this is what we are trying to do. We keep expanding that as the conversation goes on. That user experience helps us mask the fact that we would otherwise need millisecond latency. You have to be a little smart in how you handle this, but that's helping us for now. Let's talk again when this is in production and there are hundreds of people using it.

 


 

Recorded at:

Jun 28, 2024
