
Mind Your Language Models: An Approach to Architecting Intelligent Systems


Summary

Nischal HP discusses the intricacies of designing and implementing intelligent systems powered by LLMs, drawing upon practical insights gained from real-world deployments.

Bio

Nischal HP is an Engineering leader with over 13 years of experience specializing in infrastructure, services, and products within the realm of Artificial Intelligence. His journey has taken him through diverse domains such as Algorithmic trading, E-commerce, Credit risk scoring for students, Medical technology, Insurance technology, and Supply chain.

About the conference

Software is changing the world. QCon London empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

HP: I'm going to try and elaborate on our journey of architecting systems with large language models. Let's examine the timeline a bit. We've seen a whole lot of discussion around large language models in the past year and a little before that. It's not a miracle that large language models happened; they build on a lot of work that has been going on for years. The thing you see with ChatGPT, Gemini, and the like is that the models are just way bigger, and a lot more data has been fed to them.

Hence, they do what they do, which is amazing. In November 2022, DALL·E had come out a few months earlier, and then ChatGPT got released. There was a lot of buzz. We thought it was hype. It was also a time when the tech industry was going through ups and downs, and we just said, "This is hype. It's not going to last. This too shall pass, and we'll go back to our normal lives." In February 2023 there was a lot of buzz in the market because people were wondering whether the tech giants were going to run away with AI and monopolize it again, only to realize Meta had different plans. Ines' talk was about how big tech will not monopolize AI right now.

I think it's a wave that we are all witnessing. Open source took over, and we started seeing some really interesting adaptations of large language models and fine-tuning on Llama. We sat there as an organization and said, it's probably the era of POCs, and we might have to do something here ourselves. In December 2023, we asked ourselves, is this hype? Is this going to continue? We saw that companies' revenues were starting to go up.

OpenAI grew from $200 million to $1.6 billion in revenue. NVIDIA's revenue reached $18.1 billion, and I think their market cap is bigger than ever. Everyone's using ChatGPT and Gemini for their work. Everybody in the organization, be it marketing or sales, is making use of these tools. Deepfakes started to become a thing, to the point where a finance worker lost $25 million because he thought he got a message from his boss and just signed off on a $25 million transfer.

We're all sitting here, being tasked with bringing this AI into our organizations, and we're wondering, what do we do? In 2024 and beyond, AI occupies the spotlight. As I can see from everyone here, all of you are interested in understanding how to enable this. The investments continue. Amazon has started to invest in Anthropic. Inflection AI is now part of Microsoft. We had Jay from Cohere, who was doing some incredible stuff in that space. We have to ask ourselves, how can we bring something this powerful to our customers in a safe and reliable way, without losing the speed to actually innovate?

The global AI market is going to grow. Research says it's going to reach $1.8 trillion; currently it's close to $200 billion. Nine out of 10 organizations think AI is going to give them a competitive advantage, and around 4 in 5 companies consider AI a top priority. This is what the landscape looks like. There is, of course, a bit of hype, but everybody is investing across the entire landscape of machine learning, AI, and data. There are risks that come along with it, and organizations are thinking about them. A lot of organizations understand the risks and are working to address them.

There are organizations that have absolutely no idea how to go about working with these risks, so they're just sitting there waiting for other organizations to implement first and move forward. People are rebuilding their engine rooms. More than 60% of organizations are either running POCs or already bringing this into widespread adoption with their customers. We need to strike a balance, and that's what this talk is going to be about.

There's a lot of enthusiasm around AI. The AI revolution is here to stay, and it will change everything that we do. But deploying AI in production is quite complex, and it requires more than just thinking about it as a POC. How do we strike a balance? I'm sure a lot of you who started thinking about LLMs came across the Transformers image and the "Attention Is All You Need" paper. The AI landscape is changing every day, so the things I'm telling you right now will probably be outdated the moment I finish the talk. Not everything requires your attention. You don't have to change your entire stack every single day. The goal is: how can you build a stack that gives you the flexibility to innovate and at the same time drives value for your customers?

Background, and Outline

The purpose of my talk: everybody is talking about LLMs being awesome. I'm going to talk about everything that can possibly go wrong with LLMs, about our journey through the last 16 to 18 months and what it took for us to bring this into production, and about the effort, people, time, money, and also a few meltdowns that we had. I'm going to be a bit opinionated, because this is work that we've done. I want you all to take this with a grain of salt, because there are probably other approaches out there that you might want to take a look at in terms of how to do this.

I'm Nischal. I'm the VP of Data Science and ML Engineering at Scoutbee, based out of Berlin. I've been in the ML space for a little over 13 years; for the last 7 years, it's mostly been in the insurance and supply chain space. This is the overview: enabling LLMs in our product; improving the quality of conversation; improving the quality of and trust in results; improving data coverage and quality with ground truth data; and then, summary and takeaways. I'll go step by step and peel the onion one layer at a time.

Case Study: Supplier Discovery (Semantic Search)

Case study. I spent a lot of time thinking about what I was going to talk about. I thought I'd talk about some use case that every one of you could relate to, but I didn't find enough conviction, because it's not a use case I worked on. So I thought, let's talk about what we actually did as a company. We work in the supply chain space, and we help organizations such as Unilever, Walmart, and organizations of that size do supplier discovery. Essentially, it's a Google Search for suppliers, but Google does not work for them, because Google is not adapted to helping them in the supply chain space. There are a lot more nuances to understanding and thinking about supply chains. I'm sure all of you have faced disruptions, from getting GPUs in your AWS data centers to not finding toilet paper or pasta in your supermarket when COVID happened.

Supply chain has an impact on all of us, and every manufacturer is dependent on other manufacturers. The question is, how do they find manufacturers they can work with, not just to handle disruptions, but also to mitigate different kinds of risks and to work with innovative manufacturers? This is not a product we just brought to market: we've been a market leader for the past 6 or 7 years, so we've had different generations of this product, and we've been using ML for quite some time. We thought this would be a good opportunity to bring in large language models as a new generation of our product.

Efficiency and effectiveness: why are LLMs being sought after? Why are enterprises rebuilding their engine rooms? Efficiency is something all of us have been working towards, irrespective of which domain we are in. We want to make things faster. We want to make things more economical. We want to enable people to solve their tasks. Part of what we see with LLMs is exactly that: efficiency.

It's also about effectiveness, because LLMs can enable organizations, or the people working in them, to do things they could not do before. That means you ask a question or come with a problem statement, rather than relying on static dashboards to tell you what you're supposed to do. Then, based on the question you ask, the LLM, along with your product, figures out what data to bring in and helps you solve that problem. We're going to see how we can augment organizations to do this.

Stage 1 - Enabling LLMs in Our Product

Stage 1: enabling LLMs in our product. We did what every organization probably did, or starts to do right now: enable your application, connect it to ChatGPT or one of these providers through an API, and essentially you're good to go. We did some prompt engineering with LangChain and connected it to ChatGPT's API. This was, I think, January 2023. We put some credits in and started using it. The stack on the right is something we've had for a while: we were working largely with knowledge graphs, and we had smaller machine learning models.
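As a rough sketch of what this first integration looked like: a prompt template piped into a hosted chat model. The model name, prompt wording, and the `find_suppliers` wrapper are illustrative, and the LangChain interface shown is the current one rather than the exact early-2023 API.

```python
# Minimal sketch of stage 1: a prompt template piped into a hosted chat model.
# Requires: pip install langchain-openai langchain-core (and an OPENAI_API_KEY).
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a supplier discovery assistant. Help buyers find manufacturers."),
    ("human", "{user_request}"),
])

chain = prompt | llm  # LangChain expression language: the prompt feeds the model

def find_suppliers(user_request: str) -> str:
    """Send a free-text sourcing request to the hosted LLM and return its reply."""
    return chain.invoke({"user_request": user_request}).content

print(find_suppliers(
    "We need sustainable, fair market certified coffee beans from South America."
))
```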

Back then they were considered big, but now they're comparatively way smaller. We did some distributed inferencing with Spark to populate these knowledge graphs. When we did this, we asked ourselves, what could go wrong? It was very fast for us to do, and it didn't cost us a lot of time and money, so we said, let's see what our customers think. The one thing that stood out immediately was the lack of domain knowledge. The foundational models did not know what supplier discovery was about, and the conversations went haywire.

People started asking questions related to their domain, and the foundational models took them off on tangents, answering whatever questions about life they felt like asking. It also became very chatty, and people were tired. People used our application and went, "Can we just bring the old form back? This is just too much. I don't want to have this conversation." We also saw that the results coming up were hallucinations. They felt so real that, for a second while testing the system, we looked at them and wondered, is this really a supplier or manufacturer we don't know about? They were results fabricated by the large language model.

The other part, and I'm not sure how many of you deal with enterprise security and privacy, is that a lot of the customers we worked with were a little on edge. They said, a POC is fine with somebody like ChatGPT or one of these providers, but we don't think we want to run production workloads or use this product in production if it's integrated there, because we're concerned about what they're going to use the data for.

First, we thought, is there really a market for us to bring LLMs to our product? Does our product really need LLMs? The feedback we got from users was that they really enjoyed the experience. They were excited to use more of the product and wanted more of it. We realized we had identified a big market for a new generation of the product, but there were lots of things to solve before we could get there. First, we had to focus on efficiency even before effectiveness. We absolutely needed domain adaptation. We had to remove hallucinations. We had to build trust and reliability. We needed guardrails, and we needed it to be a little less chatty. That was the outcome of stage 1.

One of our users said: we have an issue with importing coffee due to the conflict in the Suez Canal; we need to find sustainable, fair market certified coffee beans from South America. The foundational model replied: Company Abc from Brazil has the best coffee ever. Coffee is very good for health. Coffee beans can be roasted. We said, "Yes, this is awesome. This sounds nice, but I don't think our customers will pay us money to use this product."

Stage 2 - Bringing Open-Source LLMs, Domain Adaptation, and Guardrails

In stage 2, the first thing we wanted to tackle was this: we didn't want to go ahead without knowing whether we could actually host a large language model ourselves. If we did all the development and then realized that data privacy and security were a blocker, all of our work would just go down the drain. So the first thing we did was bring in an open-source LLM. As you can see, the stack got a little bigger. We brought in LLaMA-13B, which was first dropped on a torrent somewhere and then finally made its way to Hugging Face. We put an API on top of it using FastChat, so we had an LLM API we could work with.
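A sketch of what talking to a self-hosted model behind FastChat's OpenAI-compatible server can look like. The host, port, and model name are assumptions; check the FastChat documentation for the exact serving commands.

```python
# Minimal sketch: call a self-hosted LLaMA-13B through FastChat's
# OpenAI-compatible REST endpoint instead of a third-party API.
# Assumes the FastChat controller, model worker, and API server are already
# running locally and serving a model registered as "vicuna-13b" (illustrative).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # self-hosted endpoint, data never leaves our infrastructure
    api_key="EMPTY",                      # FastChat does not require a real key
)

response = client.chat.completions.create(
    model="vicuna-13b",
    messages=[
        {"role": "system", "content": "You are a supplier discovery assistant."},
        {"role": "user", "content": "Find sustainable coffee bean suppliers in South America."},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```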

One thing that is also very common: even though there are plenty of large language models right now, you have to pick and choose carefully, because the prompt engineering you do for language model A will not carry over to language model B. All of the effort you put into prompt engineering has to change the moment you change your large language model.

We had to adapt our ChatGPT prompts to work with Llama. What we realized was that our cost and complexity just went up. We were now responsible for a new piece of software in our stack that we really didn't want to own, but we had to because of the domain we operate in. It was way more expensive than using an API from any of these providers.

Domain adaptation. This is one of the challenges any of you will probably face enabling this in your organization, be it for internal tooling or through your products: how do you bring domain knowledge to large language models? The first thought is, it can't be that crazy to retrain or fine-tune a large language model; why can't we just build our own? As a ballpark figure, based on some of the statistics that leaked around GPT-4, it took OpenAI about $63 million to ship GPT-4, not including the cost of people, infrastructure, and everything else. The API, by comparison, is about $30 for a million tokens.

You can see the big difference between using an API from one of these big houses and actually retraining a foundational model. You also need a lot of data to train a good foundational model. The good news is that you can do domain adaptation without retraining an entire model. There are different ways to do this: zero-shot learning, in-context learning, and few-shot learning. You can also build something called agents. Essentially, with an agent, you give it a set of instructions, some examples of how to deal with those instructions, and the capability to make requests to different systems.

Imagine giving a human a task: the human would try to understand the task, pick the relevant data, query different systems, summarize the answer, and give it to you. Agents typically do that. What we tried to do was feed all of our domain knowledge into an agent. We did some really heavy prompt engineering to enable this, at a time when documentation around prompt engineering was still quite poor. We had quite a few meltdowns building this, but we thought it was a good first step in the right direction.
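As an illustration of in-context domain adaptation, here is a hedged sketch of a few-shot prompt that teaches the model what a supplier discovery request looks like. The examples and field names are invented for the coffee scenario in this talk, not taken from the actual system.

```python
# Sketch: few-shot, in-context domain adaptation. Instead of fine-tuning,
# we prepend worked examples so the model learns the task format.
FEW_SHOT_EXAMPLES = [
    {
        "request": "We need ISO 9001 certified steel tube manufacturers in Eastern Europe.",
        "interpretation": {
            "commodity": "steel tubes",
            "region": "Eastern Europe",
            "certifications": ["ISO 9001"],
        },
    },
    {
        "request": "Looking for fair trade cocoa suppliers that can ship to Rotterdam.",
        "interpretation": {
            "commodity": "cocoa",
            "region": "any",
            "certifications": ["Fairtrade"],
        },
    },
]

def build_prompt(user_request: str) -> str:
    """Assemble instructions + examples + the new request into one prompt string."""
    lines = [
        "You are a supplier discovery assistant.",
        "Extract the commodity, region, and certifications from each request.",
        "",
    ]
    for ex in FEW_SHOT_EXAMPLES:
        lines.append(f"Request: {ex['request']}")
        lines.append(f"Interpretation: {ex['interpretation']}")
        lines.append("")
    lines.append(f"Request: {user_request}")
    lines.append("Interpretation:")
    return "\n".join(lines)
```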

The third thing we introduced was guardrails. As I'm telling you this, I'm sure all of you are sitting there looking at the presentation and thinking, I have to go verify whether what he's saying is right or wrong. Essentially, that's what guardrails are. You can't trust an LLM entirely, because you don't know if it's making the right decision, looking at the right data points, or on the right path. Guardrails are essentially a way for you to validate whether an LLM is doing the right thing.

There are different ways you can implement a guardrail. We started implementing a version of our own, because when we started this journey there weren't a lot of open-source libraries or companies we could work with. Right now, you have NeMo Guardrails coming from NVIDIA, and there's guardrails.ai, a company being built in the Bay Area that focuses entirely on guardrails. We implemented our guardrails with a bit of a twist, which was a Graph of Thoughts approach.

Talking about a business process: supplier discovery is not "I type in something and you get a result." A lot of the time in an enterprise landscape, you're essentially using AI to augment a business process, and these business processes are not linear. We needed something that could capture the dynamic nature of the business process, understand where the user is in it, and then invoke the different kinds of guardrails required to support that user. Thankfully, we saw a paper, which I think came out of ETH Zurich, on Graph of Thoughts. Essentially, we thought of our entire business process as a graph.

At any given point in time, we knew where the user was, and we invoked different guardrails to make sure the LLM was not misleading the user. That was a lot for stage 2. What can happen if you don't have guardrails? Air Canada enabled an LLM-powered chatbot for its users, and the chatbot went ahead and told a customer, "we owe you some money." Now the airline is liable for what the chatbot did. If you enable agents or LLMs without guardrails and without domain adaptation, they can start taking actions that are probably not in the best interest of the organization.
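A toy sketch of the idea: model the business process as a graph, track which node the conversation is in, and run only the guardrail checks attached to that node. The node names and checks below are invented for illustration, not the production guardrails.

```python
# Toy sketch of process-aware guardrails: the business process is a graph,
# each node carries its own validation checks, and we only run the checks
# for the node the user is currently in.
def must_confirm_understanding(reply: str, ctx: dict) -> bool:
    # Before searching, the LLM should play the request back to the user.
    return "is this understanding correct" in reply.lower()

def must_cite_known_suppliers(reply: str, ctx: dict) -> bool:
    # Every supplier mentioned must come from our data APIs, not be invented.
    known = ctx.get("known_suppliers", [])
    return any(name.lower() in reply.lower() for name in known)

PROCESS_GRAPH = {
    "clarify_need":     {"next": ["search_suppliers"], "checks": [must_confirm_understanding]},
    "search_suppliers": {"next": ["review_results"],   "checks": [must_cite_known_suppliers]},
    "review_results":   {"next": ["contact_supplier"], "checks": []},
}

def run_guardrails(node: str, reply: str, ctx: dict) -> bool:
    """Run only the guardrail checks attached to the current process node."""
    return all(check(reply, ctx) for check in PROCESS_GRAPH[node]["checks"])

# Example: block a reply in the search step that names no supplier from our data.
ok = run_guardrails("search_suppliers",
                    "Qwerty from Chile has great coffee.",
                    {"known_suppliers": ["CalmCoffee", "RoastYourBeans"]})
print(ok)  # False: the reply only mentions a supplier our data APIs never returned
```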

Just taking a step back: what did we identify as issues in stage 1? We needed guardrails, domain adaptation, and trust and reliability; we needed the system to be less chatty and to reduce hallucinations. With the changes we brought in as part of stage 2, we couldn't hit all of them, but domain adaptation and guardrails did increase trust and reliability. When users started to work with our product, they gave us feedback along the lines of: it now feels like the system understands the process, and we're quite happy that we don't have to worry about our data being shipped to another company.

The next big thing we had to solve, which remained a very big challenge for us, was hallucinations. This played a huge role in the trust and reliability of the system, because every time a user came in, the system gave them different answers, which essentially means they could never come back and reuse the system for the same problem. That's something we wanted to make sure would not happen. The awkward aspect of using open-source models compared to the big foundational models is that our users were happier with the quality of conversation on ChatGPT, and they started asking us, can we have the same quality, but without it being on ChatGPT? We had a bit of a situation there.

We were constantly thinking about how we could do this. One big challenge as we implemented stage 2 was that testing agents was a nightmare. We had absolutely no idea what the agents were trying to do. Sometimes they just went completely off key. You couldn't put breakpoints in their thinking, because you can't really know what they intend to do. Sometimes they invoked our data APIs; sometimes they didn't, and decided to make up their own answer. Debugging agents was a real challenge, and we were not comfortable with the idea of bringing agents into production.

With the changes we brought in, this is what the conversation started to look like. We said: we have an issue with the conflict in the Suez Canal; we need to get sustainable, fair market coffee beans from South America. The agent took this input and said, let me understand this: you have issues with shipping due to a conflict; your focus is on coffee suppliers in South America; you want suppliers who have sustainable and fair market certifications. It asked the user whether this understanding was correct. The user said, yes, that's correct. The agent went on and augmented the conversation.

This is where LLMs start to enhance what users can do, leading them from efficiency to effectiveness. The agent asked: given that Fairtrade is a sustainability certificate, can I also include the other ones? Previously, users had to go figure out what the sustainability certificates were themselves. We didn't have to train the foundational model for this; given the amount of data it had seen, it was already aware of what sustainability certificates were. The user essentially said, "It's ok. We're good to go. Let's move ahead."

Then the agent, instead of invoking our data APIs to pick up data, just randomly decided to start creating its own suppliers. It said: ABC from Brazil, Qwerty from Chile, and they all have sustainability certificates for their coffee growing. The user asked, but these don't look like real suppliers; can you tell me more? The agent said, sorry, I'm a supplier discovery agent, and I cannot answer that question for you. Suddenly you're back to square one, asking, what's the point of doing all this?

Stage 3 - Reducing Hallucinations with RAGs

We had to reduce the hallucinations and bring more trust and reliability into the system. We came across the idea of RAG, which stands for Retrieval-Augmented Generation. Jay from Cohere was talking about RAG, and a bunch of other speakers touched on it as well. I won't get into the nitty-gritty of what RAG is, but essentially, what it meant for us is that our engineering stack and system grew much bigger. What we're trying to do with RAG is, instead of asking the foundational model to answer the question from its own knowledge, we give it data and the context of the conversation, and we force it to use that context to answer the question.
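A minimal sketch of that "force it to use the context" step. The prompt wording, the example facts, and the assumption that facts come back from a knowledge-graph query are all illustrative.

```python
# Minimal RAG sketch: retrieve facts first, then force the model to answer
# only from those facts and to say so when the facts are missing.
def build_rag_prompt(question: str, retrieved_facts: list[str]) -> str:
    context = "\n".join(f"- {fact}" for fact in retrieved_facts)
    return (
        "Answer the user's question using ONLY the context below.\n"
        "If the context does not contain the answer, say you do not have that data.\n"
        "Cite the source given with each fact.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

facts = [
    "CalmCoffee (Brazil) produces certified sustainable coffee beans. Source: calmcoffee.com",
    "RoastYourBeans (Colombia) produces coffee beans, no certifications found. Source: roastyourbeans.com",
]
prompt = build_rag_prompt("Which South American coffee suppliers are sustainable?", facts)
# `prompt` is then sent to the self-hosted LLM from the earlier sketch.
```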

There are different ways to do RAG, and we used the Chain of Thought concept. Our planner and execution layer for LLMs went from having an agent and a guardrail to having Chain of Thought prompting, query rewriting, splitting into multiple generated queries, custom guardrails based on Graph of Thoughts, data retrieval based on those queries, and then summarization of all of it. This is one of the biggest things for me with LLMs right now: every time you want to make your system a little more robust, a little more reliable, there's probably a new service or a new piece of technology you need to add to enable it. From there we went on to thinking about how to do RAG with Chain of Thought prompting.

Essentially, the big challenge we saw with agents was the reasoning process behind why they did what they did. With Chain of Thought prompting, instead of going straight from question to answer, the LLM goes through a reasoning process. The good thing is that you can actually teach this reasoning process to an LLM. Instead of retraining your entire model, you can do few-shot Chain of Thought, where you take certain pieces of conversation along with some reasoning, provide them to the LLM, and the LLM understands: this is the reasoning process I need to follow when I'm working with the user.

It's basically like giving an LLM a roadmap to follow. If you know your domain and your processes well, you can do this quite easily. For this, of course, we used LangChain. People were asking, is LangChain the best framework out there? At the moment, it's probably the framework used most by the community. I'm not sure whether it's the best or not, but it does the job for us.
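A hedged sketch of few-shot Chain of Thought prompting: each example pairs a user request with the intermediate reasoning we want the model to imitate before it answers. The example content is invented for the coffee scenario.

```python
# Sketch of few-shot Chain of Thought: each example shows the reasoning steps
# we want the model to follow before it answers, not just the final answer.
COT_EXAMPLES = [
    {
        "request": "Shipping through the Suez Canal is blocked; we need fair trade coffee from South America.",
        "reasoning": (
            "Step 1: The user has a shipping disruption, so alternative regions matter.\n"
            "Step 2: The commodity is coffee beans; the region is South America.\n"
            "Step 3: Fairtrade is the required certification; similar certificates may also qualify.\n"
            "Step 4: Query the supplier data API with these filters, then summarize with citations."
        ),
    },
]

def build_cot_prompt(user_request: str) -> str:
    """Prepend worked reasoning examples so the model follows the same steps."""
    parts = ["Reason step by step before answering, following the examples.\n"]
    for ex in COT_EXAMPLES:
        parts.append(f"Request: {ex['request']}\nReasoning:\n{ex['reasoning']}\n")
    parts.append(f"Request: {user_request}\nReasoning:")
    return "\n".join(parts)
```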

Once users start to have a conversation with the system, and there's reasoning behind it, one pattern we saw in how people used this new generation of the product was that some were still typing in keywords, because they're used to the idea of Google Search. They didn't really know how to use a conversation-based system, so they typed in keywords. Other users, meanwhile, put in an entire story: this is the problem they were having, this is the data point they were looking at, these are the kinds of suppliers they want, this is the manufacturer they'd like to work with, and so on.

We ended up getting a whole passage in the very first message rather than a simple "tell me what your problem is". This is going to happen a lot when you enable a product or feature that's a bit open-ended. We decided we needed to understand what the user wants to do and transform it into a standard form, and at some points we might have to split it into several queries rather than one single query. This, backed by Chain of Thought, gave us the capability to say: we break the problem down into a, b, and c, and we need to make data calls to fetch data for each of these; using this data and this problem, we can find an answer.
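A sketch of that rewrite-and-split step, where a long first message is turned into a standard form and broken into sub-queries. The JSON schema and prompt are assumptions for illustration, not the production prompt.

```python
# Sketch: ask the LLM to rewrite a long, story-like first message into a
# standard form and split it into independent sub-queries we can retrieve for.
import json

REWRITE_PROMPT = """Rewrite the user's message into search queries.
Return JSON of the form:
{{"queries": [{{"intent": "...", "commodity": "...", "region": "...", "filters": ["..."]}}]}}
Split into several queries if the message contains more than one need.

User message: {message}
JSON:"""

def split_into_queries(message: str, llm_call) -> list[dict]:
    """llm_call is any function that takes a prompt string and returns the model's text."""
    raw = llm_call(REWRITE_PROMPT.format(message=message))
    return json.loads(raw)["queries"]
```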

Once you enable all of this, it's a lot of different pieces of technology, and you need observability to understand what is going on. There are two parts around the LLM that you have to observe. Part one is the LLM itself: like any other service, it has response times and a number of tokens it processes. The bigger your large language model, the slower it can get, which is why there's a lot of work happening in that space on making inference faster and running it on smaller GPUs rather than big clusters you have to provision. Part two is the actual conversation and the result.

Previously, when you built search systems and somebody typed in a query, you would return results backed by relevancy metrics: you would compute NDCG, precision, recall. Now it's a bit more complicated, because you need to understand whether the LLM understood the conversation, whether it picked up the relevant data from the data systems to answer the question, whether it picked up all of it, and how much of it was right. You have context precision and context recall. And, of course, there's the question of combining all of this information into the final answer.

The good thing about the science community and the open-source world right now is that if you think you have a problem, chances are lots of people have the same problem, and there's probably already a framework being built that you can work with. We came across the Ragas framework, an open-source framework that scores both generation and retrieval. It gives you ideas and pointers for understanding where you actually have to go to fix your system.
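A hedged sketch of how a Ragas evaluation can look. The metric names below exist in the versions we've seen, but the exact API surface changes between releases, so treat this as illustrative rather than definitive.

```python
# Sketch: scoring retrieval and generation with the open-source Ragas framework.
# Requires: pip install ragas datasets (API details vary by Ragas version).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_data = Dataset.from_dict({
    "question": ["Which South American coffee suppliers are sustainable?"],
    "answer": ["CalmCoffee in Brazil holds a sustainability certificate."],
    "contexts": [[
        "CalmCoffee (Brazil) produces certified sustainable coffee beans.",
        "RoastYourBeans (Colombia) has no certifications on record.",
    ]],
    "ground_truth": ["CalmCoffee is a certified sustainable coffee supplier in Brazil."],
})

scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # per-metric scores pointing at whether retrieval or generation needs fixing
```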

A quick summary of stage 3. In the previous stage, the biggest challenge we saw was hallucination, and testing agents was the other piece. With the introduction of RAG, hallucinations were drastically reduced. We use knowledge graphs as our data source, not a vector database. We basically stored everything: all the conversations users were having and the results we were showing them, in order to power our observability and product metrics. By eliminating agents, testing became a whole lot easier, because we could see exactly what the LLMs were trying to do and could figure out how to make it better.

With this, we had other challenges to start looking at. We were now showing results in an open-ended environment. Our users started to interrogate the data we were showing; they wanted to have a deeper conversation with the data itself to achieve their goal, and this was not yet enabled in our system. Another challenge was higher latency. With more people using the system, latency was quite high, so responses were slower, and our users started to get a bit annoyed because they had to wait several seconds before they could see the answer on their screens.

What does reducing hallucinations with RAG look like? We play out the same story: we're looking for coffee suppliers in South America, and we want them to be sustainable. What you can see on the right side of the screen is that, because we force the model to use data and to provide citations with data provenance, it actually tells the user: this CalmCoffee supplier produces coffee beans, they're based out of Brazil, and we found this information in this place. It lets our users go check the suppliers out themselves if they want to, and now they can start to trust that the answer comes from a certain place, without having to sit there and wonder where the information came from.

We also see that it surfaced RoastYourBeans as another supplier, and it let the user know that they don't have any sustainability certificates; we found this information on roastyourbeans.com. The moment we unlocked this for our users, the very next thing they did was ask: can you tell us a little more about RoastYourBeans in terms of their revenue and customers? How are their delivery times? Unfortunately, we didn't have that information to share with our users.

Stage 4 - Expanding, Improving, and Scaling the Data

That leads to stage 4: expanding, improving, and scaling the data. You can have large language models, but the effectiveness of your LLM is going to depend on the quality of the data you have. Otherwise it looks something like this: you hand a report to a person and then just pray, fingers crossed, that the numbers are right when somebody asks, can I trust your numbers? We didn't want that for our customers. We were thinking about different ways to enhance and scale data. One thing we looked at was bringing in data from different places and using our knowledge graph as the system of record.

In an enterprise landscape, whether you're working in the ERP space, the CRM space, or even just within your own organization, you have data sitting in different datastores. When we thought about bringing this data in, we realized that if you just vectorize and embed it and put it into embedding stores, you still have the challenge of understanding how the data is even related. So on top of it, we built a knowledge graph, which is a semantic understanding of the data sitting in different data fields. We started integrating different domain data: revenue from some data partners, some risk data from other partners, and even data coming from the customers themselves.

The other thing that was important for us was that, as we scaled the whole data operation, we didn't want to lose control of the explainability of the data itself and its provenance. We dabbled a bit with embeddings. They're relatively fast to work with, and you can take different questions and find answers that you'd probably never extract and build into a knowledge graph. The challenge was explaining embeddings to enterprise users across a very broad demographic, and correcting them was also difficult. We still have some experiments running on the embedding side.

I think we'll get to a point where we use both worlds. At the moment, everything we power through our product is based on knowledge graphs. This is just a sneak peek of what our knowledge graph ontology looks like. There's a saying that, at some point, most problems transform themselves into a graph. This is a current snapshot of our knowledge graph. We built this about two and a half years ago, and we put a lot of effort into designing the ontology. One of the things LLMs can do very well is, given your domain, design the ontology for you. What took us about 6 to 9 months of effort is something you can now build for your domain in maybe a few months using an LLM.
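A hedged sketch of bootstrapping an ontology draft with an LLM. The prompt wording and output shape are assumptions, and a human still has to review the result.

```python
# Sketch: using an LLM to draft a first-cut ontology for a domain.
# The output is a starting point for human review, not the final ontology.
import json

ONTOLOGY_PROMPT = """You are a knowledge engineer for the supplier discovery domain.
Propose an ontology as JSON with two lists:
  "entities": entity types (e.g. Supplier, Product, Certification, Location, Risk)
  "relations": triples of [subject_type, relation_name, object_type]
Domain description: {domain_description}
JSON:"""

def draft_ontology(domain_description: str, llm_call) -> dict:
    """Ask the LLM for a draft ontology and parse it for human review."""
    return json.loads(llm_call(ONTOLOGY_PROMPT.format(domain_description=domain_description)))

# Hypothetical usage: draft_ontology("B2B supplier discovery for manufacturers", my_llm)
```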

Once we had this ontology spanning different domains and entities, we had to think about how to populate it. An ontology is great, but now you need to bring in all of the data to fill it. We already had transformer-based models working on web content and other data types to bring in this information, but the problem was that the quality we needed the data to be at was much higher. So we sat there wondering: we need high-quality data; we have access to some training data we annotated ourselves; how can we get high-quality data in a short period of time to fine-tune a model? We used a superior LLM.

Basically, instead of taking months and a lot of human effort to generate annotated data, we took a much superior LLM and generated high-quality training data to fine-tune a smaller LLM, with humans in the loop to validate it. Our effort went down by 10x to 20x compared to what we would have spent with humans annotating the data alone, while still getting that gold-standard data by using an LLM and having humans in the loop validate it.
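A rough sketch of that pattern: a larger "teacher" model labels raw text, a human accepts or rejects each label, and accepted examples become fine-tuning data for the smaller model. The prompt, schema, and function names are illustrative.

```python
# Sketch: generate annotations with a stronger "teacher" LLM, keep a human in
# the loop, and write accepted examples out as fine-tuning data for a smaller model.
import json

LABEL_PROMPT = """Extract supplier facts from the passage as JSON:
{{"supplier": "...", "products": ["..."], "certifications": ["..."], "country": "..."}}
Passage: {passage}
JSON:"""

def generate_training_data(passages, teacher_llm_call, human_review, out_path="train.jsonl"):
    """Label passages with the teacher model, keep only human-approved examples."""
    with open(out_path, "w") as f:
        for passage in passages:
            label = json.loads(teacher_llm_call(LABEL_PROMPT.format(passage=passage)))
            if human_review(passage, label):  # human validates or rejects each label
                f.write(json.dumps({"input": passage, "output": label}) + "\n")
```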

The reason we wanted a smaller model adapted to a specific task is that it's easier to operate, and it's much more economical to run: you can't run massive models all the time, because they're very expensive and take a lot of GPUs. Currently, we're struggling to get GPUs in AWS. We've searched EU Frankfurt, Ireland, and North Virginia. It's seriously a challenge now to get big GPUs to host your LLMs.

The second part of the problem: we started getting high-quality data, and we started improving the knowledge graph. One interesting thing about semantic search is that when people interact with your system, even if they're working on the same problem, they don't use the same language. That means you need to be able to translate or understand the range of language your users might use to interact with your system.

We further expanded our knowledge graph by using an LLM to generate facts from the data we were looking at, based on our domain. We expanded these facts with all of their synonyms, all the different ways one could potentially ask for that piece of data, and put everything into the knowledge graph itself. So you can use LLMs to generate training data for your smaller models, and you can also use an LLM to augment and expand your data, if you know your domain well enough to do this.
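A sketch of that expansion step: for each extracted fact, ask the LLM for the alternative phrasings users might use, and store those alongside the fact in the graph. The prompt and the knowledge-graph client calls are illustrative assumptions.

```python
# Sketch: expand each extracted fact with synonyms / alternative phrasings
# so the knowledge graph can match the range of language users actually type.
SYNONYM_PROMPT = """List 5 alternative ways a buyer might ask for this fact,
one per line, no numbering.
Fact: {fact}
Alternatives:"""

def expand_fact(fact: str, llm_call) -> list[str]:
    """Return alternative phrasings of a fact, one per non-empty line."""
    lines = llm_call(SYNONYM_PROMPT.format(fact=fact)).splitlines()
    return [line.strip() for line in lines if line.strip()]

def add_fact_to_graph(graph, subject: str, predicate: str, obj: str, llm_call) -> None:
    fact = f"{subject} {predicate} {obj}"
    graph.add_triple(subject, predicate, obj)            # hypothetical knowledge-graph client
    for phrasing in expand_fact(fact, llm_call):
        graph.add_alias(subject, predicate, obj, phrasing)  # hypothetical alias storage
```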

The third thing we did for expansion was to start working with third-party data providers, and there is a massive number of them. We started working with providers of financial information, risk information, and so on, and brought all of that together into our knowledge graph.

Now for the engineering problem. All of this sounds great in theory: it's a POC, you run it on a few hundred documents, and everything is fine. Now you need to scale it to millions and probably billions of web pages and documents. That essentially means you have big data, you have big models, and you actually have big problems, because orchestrating and running this is a nightmare. Our ML pipelines had to run LLM workloads. LLM inference time had a big impact on throughput and cost, so we were trying to figure out how to reduce it and how to run these workloads in their optimal form.

Data scientists wanted to run experiments at scale, which they couldn't do at the time. We had to make sure the ML pipelines were observable and ideally used infrastructure efficiently, so you know which jobs can use CPU and which need GPU, and the infrastructure scales up and comes back down when it's not being used. We ended up changing our entire ML and LLMOps platform. To hit all of these goals, we had a very big challenge. How many of you are running Spark pipelines and ML workloads in your organization? It's going to be a bit harder to get Spark pipelines to run with LLMs as well. Another challenge was that our data science team was not Spark-aware, while our ML engineering team was. That meant for anything going into production or needing to run at scale, there had to be a translator from the data science world to the Spark world.

The other challenge was that Spark is written predominantly in Java and Scala, and our data scientists are far away from that world; they've worked with scientific packages in Python for a very long time. Observing and understanding Spark failures was a big challenge for them, and understanding how Spark utilizes the cluster and how exactly compute should be distributed was becoming tougher. We had our data pipelines, our ML pipelines, and our LLM workloads. There were so many different pieces, and we realized that if we kept running in this direction, it would be an absolute nightmare to maintain and manage everything. So we introduced a universal compute framework for ML, LLM, and data workloads: we started using Ray, an open-source framework that came out of UC Berkeley.

The enterprise version of it is run by Anyscale. What they provide is not just a way to work with Ray; the platform also lets us host and run large language models optimized for smaller GPUs and make them run faster with the click of a button, rather than us managing all of this ourselves. It still runs on our infrastructure, which means we didn't take a hit on privacy and security; we just found a better way to operate. We chose the path of buy rather than build, because build, at that point, was going to take us a very long time. It's a very cool project. If you're running massive data workloads or data pipelines, Ray is a very good framework to take a look at, to the point where your data science team can just use decorators to scale their code onto massive infrastructure, rather than having to figure out how to schedule it.
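A small sketch of what that decorator-driven scaling looks like with open-source Ray. The model-loading details are elided, and the `num_gpus` setting, batch size, and `load_pages` helper are illustrative assumptions.

```python
# Sketch: scaling batch LLM inference with Ray. Each remote task gets a GPU
# slot, and Ray handles scheduling the tasks across the cluster.
# Requires: pip install ray
import ray

ray.init()  # or ray.init(address="auto") to join an existing cluster

@ray.remote(num_gpus=1)
def extract_facts(batch_of_pages: list[str]) -> list[dict]:
    # Load the fine-tuned small model inside the task and run inference
    # on one batch of web pages (model loading omitted in this sketch).
    return [{"page": page[:50], "facts": []} for page in batch_of_pages]

pages = load_pages()  # hypothetical loader for the documents to process
batches = [pages[i:i + 64] for i in range(0, len(pages), 64)]
results = ray.get([extract_facts.remote(batch) for batch in batches])
```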

Outcome of stage 4. We ran through the whole script again, and we had various scripts we were testing. Finally, we got to a point where, when the user asked us to tell them a little more about a supplier, the system said: based on the data we've received from XYZ data partner, the revenue of the company is $80 million; here is the revenue. When they asked about the delivery quality of the supplier, the system said that this information is not available, either from the data partners or on the internet. What we started looking at is: every time a piece of data was missing, how can we enable the users to go and get that data?

We designed a Chain of Thought prompt to say: we can't get this data for you, but we do have their email address, so we can help you draft an email to start a conversation with the supplier and find out whether you can get more information from them.

Summary and Takeaways

Your product should warrant the use of an LLM. Your Elasticsearch clusters, your MongoDB databases, your regular databases all do a fantastic job. If your product doesn't warrant an LLM, you don't have to jump on the bandwagon; doing it is cool, but it can turn out to be very expensive. LLMs come at a cost: the cost of upskilling, running the model, and maintaining it. Brace yourself for failures and aim for continuous, sustainable improvement. An LLM is not a silver bullet. You still have to work on high-quality data, your data contracts, your data ops, and managing the entire data lifecycle.

Please compute your ROI. You'll have to invest a lot of time, money, and people in this, which means your product needs, at some point, to deliver that return on investment. Measure everything, because it can all look very cool and you can be lured in by the technology. Store all the data, the metadata, everything, and whenever you can, get humans in the loop to validate it. The idea of LLMs needs to be about efficiency and effectiveness, not about replacing humans, because the technology is not there.

Even though a lot of people are talking about artificial general intelligence, it's definitely not there. Please do not underestimate the value of guardrails, domain adaptation, and your user experience. There's a lot of work on the user experience side that you'll have to think about to bring out the best in LLMs and their interaction with users. I think it adds a lot of value to your product.

Take care of your team. Your team is going to have prompt engineering fatigue and burnout. Some of your data scientists might look at the work they did over the last decade and see that an API can now do it, so there's fear of LLMs replacing people. There are meltdowns. You have to embrace failure, because there will be a lot of failures before these systems make it into production. Actively invest in upskilling, because nobody knows these things yet; the field is nascent. A lot of people are producing very good content, there's free content out there, and there are workshops you can sign up for. Actively invest in upskilling, because it will help build a support system for your team.

System design: once you're past the POC stage, you have to think about sustainable improvements. You have to design your systems for flexibility, but at the same time for reliability. Version control everything, from your prompts to your data to your agents to your APIs. Version control everything and tag it with your metadata, so you can always go back and run automated tests. One plus one equals two, and all of us know this, but it's not just important to know the ones; it's also very important to understand the plus operator in between. Think about this as a whole system, rather than assuming an LLM alone can solve the problem for you.
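A tiny sketch of what versioning prompts alongside metadata can look like, so automated tests can pin a conversation to the exact prompt, model, and data snapshot that produced it. The registry shape, filenames, and version numbers are assumptions.

```python
# Sketch: treat prompts as versioned artifacts, tagged with the metadata
# needed to reproduce a run and to re-execute automated tests against it.
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class PromptVersion:
    name: str            # e.g. "query_rewrite"
    version: str         # bumped whenever the template text changes
    model: str           # which LLM this prompt was tuned against
    data_snapshot: str   # knowledge-graph / dataset snapshot it was tested with
    template: str

QUERY_REWRITE_V1_4 = PromptVersion(
    name="query_rewrite",
    version="1.4.0",
    model="llama-13b-finetuned",
    data_snapshot="kg-2024-03-01",
    template="Rewrite the user's message into search queries...",
)

# Stored in the repo next to the code, so tests can replay old versions.
with open("prompts/query_rewrite-1.4.0.json", "w") as f:
    json.dump(asdict(QUERY_REWRITE_V1_4), f, indent=2)
```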

 


 

Recorded at:

Oct 15, 2024
