Transcript
Rajpal: I'm going to be talking about guardrails for enterprise applications and breaking down the problem of reliability, safety, and consistency for large language models. My name is Shreya. I am currently the CEO and co-founder of Guardrails AI. I've spent about a decade in machine learning. I started out doing research, publishing papers at top venues in classical AI and deep learning. I then spent a number of years in self-driving cars, working on everything in machine learning for self-driving, including sensors, perception, and decision making. I really enjoyed that job, but then left to join an MLOps company based here in the Bay Area as their first hire, and led the ML infra team while I was there. I then started Guardrails AI, which is the company I now lead.
AI Evolution
This is why we care about this: in the last year or so, we've seen a Cambrian explosion of applications in AI. There are a lot of really interesting applications out there; Auto-GPT and fully agentic workflows are something that we've seen. In addition to that, we've also seen very specific applications in high-value use cases: mental health, medicine, law, sales. This is one of my favorite tweets that captures some of the early sentiment around what LLMs and AI will do for us, which is, software engineering as we know it is dead, AI can code better than us. We can see interest in AI really expanding over time. If you look at the timeline here, somewhere around this point is the ChatGPT release, and immediately interest in AI skyrocketed. Suddenly, everybody who was not in software, like my mom, was talking about artificial intelligence, and for the first time got what work I did. It's really nice to see everybody's imaginations sparked with what we'll be able to build with AI systems.
The reality is a little bit different. As you work with AI applications, you come across many issues that, at the end of the day, reduce the value you're able to get out of using AI. I think this was a case that got a lot of attention, because it was a very real-world implication of somebody overly relying on AI systems when they aren't at a place where they're very reliable. A lawyer used ChatGPT in court, cited fake cases, and is now commonly known as the ChatGPT lawyer. The quality of answers on Stack Overflow, for example, has gone down, along with a bunch of other use cases. I've quoted some figures from an article published by Sequoia Capital about the state of the world in generative AI. If you look at the path to 100 million users, ChatGPT is one of the fastest-climbing applications to reach 100 million users. But if you actually look at one-month retention, or daily active users of those applications, generative AI applications tend to have much lower retention. I haven't included the DAU graph here, but go check it out at this link, https://www.sequoiacap.com/article/generative-ai-act-two/. Much lower compared to traditional applications that have also seen that meteoric rise. This is a very interesting trend. It points to the fact that, right now, people aren't getting the value out of generative AI that they're used to getting from other software systems.
Why is this the case? A common symptom that we see is, my application completely worked in prototyping, it never failed; the moment I handed it off to someone else, it started erroring out, or my API stopped working. The root cause of that symptom is that machine learning is fundamentally non-deterministic. This is not a bug but a feature. If machine learning were fully deterministic, it wouldn't have the creativity that all of us have come to rely on for ChatGPT and other generative AI use cases. It wouldn't adapt so well, or it wouldn't be as flexible as it is for a lot of applications. A lot of the goal here is, how do we work with something that is still really useful but works differently than the traditional software we've been used to? I've outlined that problem here: any software API that you build with is going to be deterministic. Maybe that service might be down sometimes, but any query that you send it is going to give you the same response, time after time. As a result of this property, we're able to build really complex software systems that chain together multiple APIs that all depend on each other.
Machine learning is different. Even if availability or uptime might be high for machine learning models, fundamentally, you're going to end up getting different responses from a machine learning model every single time you use it. For example, let's say you are building a generative AI application. You read all of the best guides on how to do prompt engineering, and you end up constructing the best prompt for yourself. You're like, ok, I have this problem solved, now this prompt works. Every time you run that prompt in production, even without changing anything, you're going to end up seeing different responses. It doesn't mean that the ChatGPT or OpenAI API is down, it just means that the model fundamentally responds to the same input differently. At the same time, a lot of the applications that people are building using generative AI still marry LLMs to traditional software systems. You chain together multiple APIs, you try to get the LLM to reason and that output is sent to another LLM; chaining is a very common abstraction that has gotten a lot of popularity. This fundamentally breaks down and doesn't work the moment you factor in how common it is for AI model APIs to break down, or just generate outputs that are not in line with what you expect them to be or what you tested them to be.
Getting 'Correct' Outputs Always is Hard
To summarize the problem: getting correct outputs every time is hard. This is a fundamental assumption that we take for granted with traditional software systems. Some common issues of incorrectness that you end up seeing in applications are hallucinations, falsehoods, and lack of correct structure. For example, a lot of people use ML APIs to generate JSON payloads that you can then use to query a tool or something. Those JSON payloads are often structured in weird ways. This is all on the output side. Even on the input side, you might be susceptible to prompt injection attacks, which is essentially a technique where someone can cause your application to reveal your prompt, which might be your secret sauce, something that you don't actually want to reveal. In a very oversimplified way, this means asking the LLM nicely, but it often requires a lot of hacking around the limits of what the LLM will and won't do. The attacker basically ends up seeing the developer's prompt, which they can use maliciously. To further complicate the problem, the only tool that is available to any developer working with a generative AI system is the prompt. This is a very novel world where you can't really code your way out of these problems; you often have to, once again, write a bunch of English and figure out how different models respond to the same English prompt.
All of this combines into a situation where the use of LLMs is limited wherever correctness is critical. For example, one of the most common, most successful applications of LLMs that we see is GitHub Copilot. I use it in my workflows every day. Most of the folks at my company use it. If you look at what makes it tick, it's that the LLM output is generated, and if it's wrong, you just brush it off and move on. Only if it's correct do you accept that output and actually incorporate it. That pattern only goes so far. It means that you can't really use LLMs in healthcare applications, or in banking, or in corporate or enterprise applications where there's a substantial cost to the LLM being incorrect. This is a problem that I personally find very fascinating. Not just because you have this fascinating technology on your hands and have to figure out how to make it work with how we build software today, but also because this is a pattern that we saw a lot in self-driving cars, for example, where ML models are again stochastic at the end of the day, but they need to work in a real-world scenario where there's a substantial cost to that ML model being wrong.
Guardrails AI (Safety Firewall Around LLMs)
The problem here that I'm focused on, and that the Guardrails AI open source is focused on, is, how do we add correctness guarantees to LLMs? This is a screenshot of the open source. This is the entire problem that the open source focuses on. I'm going to dig deeper into how, architecturally, we situate ourselves in a part of the stack that makes it so that we're able to address these problems. Guardrails AI acts as a safety firewall around your LLMs. This is a high-level system overview. If you look at how traditional AI applications are built, there's some AI application, and within that AI application you have a component which is your LLM logic. In that LLM logic, there's an input, which is a prompt. That prompt gets sent to your LLM API. That LLM API returns an output, and that output flows back into your application. How Guardrails AI is different is that we essentially advocate for using a verification suite in addition to any LLM API that you use. This sits like a sidecar alongside your LLM API, acting as a safety firewall around your LLM, and making sure that all of the functional risks and guarantees that you care about are accounted for. Now, instead of sending your output back directly to your application logic, you would first pass that output to your verification suite. This verification suite has a bunch of independently implemented guardrails. These guardrails all look for different risk areas, different safety concerns within your system. One of these safety concerns could be making sure that there is no personally identifying information in your output. If you don't want to send any PII that is accidentally generated by the LLM to your user, this can be something that filters that. Another is making sure that there is no profanity in your output, especially if this is a customer-facing application.
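To make the pattern concrete, here is a minimal sketch of a "verification suite" of independent guardrails, assuming simple heuristic checks. This is an illustration of the idea described above, not the actual Guardrails AI API; the validator names, word list, and regex are hypothetical.

```python
# Minimal sketch of a verification suite: each guardrail is an independent
# check over the raw LLM output, and verify() collects every failure.
# Not the Guardrails AI API; names and heuristics here are illustrative only.
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ValidationResult:
    validator: str
    passed: bool
    reason: str = ""


def no_ssn_pii(output: str) -> ValidationResult:
    # Very rough PII heuristic: flag anything that looks like a US SSN.
    found = re.search(r"\b\d{3}-\d{2}-\d{4}\b", output)
    return ValidationResult("no_pii", found is None,
                            "output appears to contain an SSN" if found else "")


def no_profanity(output: str) -> ValidationResult:
    banned = {"damn", "hell"}  # placeholder word list
    hit = any(word in output.lower() for word in banned)
    return ValidationResult("no_profanity", not hit,
                            "output contains banned language" if hit else "")


def verify(output: str,
           validators: List[Callable[[str], ValidationResult]]) -> List[ValidationResult]:
    """Run every guardrail over the output and return only the failures."""
    results = [v(output) for v in validators]
    return [r for r in results if not r.passed]
```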
Let's say you're also building a commercial application. For example, you are McDonald's, and you want to build a chatbot for taking orders via chat; even if a customer asks you about Burger King, you never actually want to be talking about Burger King. This also extends beyond natural-language output. For example, let's say you're generating code with an LLM; there are a lot of code generation applications that large language models are really good at. You might want to make sure that any code that is generated is executable and correct for your development environment. Finally, let's say you're using LLMs to summarize an email, or summarize really long internal design documents; you can make sure that the summary is actually grounded in the reality of what that document is, is correct, and is not hallucinating anything. Now your workflow looks like this: instead of passing that output directly to your end user, the verification suite independently passes or fails each check. You can set policies on how much you care about each individual guardrail. Maybe mentioning competitors is not as bad for you, but profanity is really bad. That is one of those things that you can configure and make sure is handled appropriately. Only if you pass validation do you send the output on to your application logic. If you fail validation, Guardrails uses a very unique and novel capability of large language models, which is their ability to self-heal. Essentially, if you provide them enough context about why they're wrong, more often than not, LLMs have the ability to correct themselves and incorporate your feedback. In this case, Guardrails will automatically construct a new prompt for you that talks to the LLM in a way that makes it more likely to correct itself. This loop repeats until you pass verification, or until your policies give out and you are out of budget.
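A rough sketch of that re-ask loop, under a retry budget, might look like the following. Here `call_llm` is a stand-in for your LLM client and `verify` is the suite sketched earlier; neither is a real Guardrails AI function, and the re-ask wording is paraphrased.

```python
# Hedged sketch of the re-ask ("self-heal") loop: re-prompt the LLM with the
# validation failures until verification passes or the retry budget runs out.
def guarded_completion(prompt: str, call_llm, verify, max_reasks: int = 2) -> str:
    output = call_llm(prompt)
    for attempt in range(max_reasks + 1):
        failures = verify(output)
        if not failures:
            return output  # passed the whole verification suite
        if attempt == max_reasks:
            break  # out of re-ask budget
        # Construct a correction prompt that explains why the output failed.
        reask_prompt = (
            "Your previous response was incorrect for these reasons:\n"
            + "\n".join(f"- {f.reason}" for f in failures)
            + "\n\nPlease produce a corrected response to the original request:\n"
            + prompt
        )
        output = call_llm(reask_prompt)
    raise RuntimeError("Out of re-ask budget; output still fails validation")
```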
Why Not Use Prompt Engineering or a Better Model?
Why not use prompt engineering or a better model? First of all, an alternative to using this safety firewall is directly trying to control the LLM output with a prompt. As we discussed before, LLMs are stochastic. The same input, even with the best prompt available to you, does not guarantee the same output. Prompts also don't offer any guarantees, because LLMs don't always follow instructions. Anecdotally, a lot of people that I talk to and work with go to extremes like, something really bad will happen, a man will die, if you don't give your output in this certain fashion, which is not a very sustainable way of building good, reliable software. The other alternative is controlling the LLM output with a better model directly. The first option is to fine-tune a model, which allows you to really build a better foundation for your application. The issue remains that it's expensive and time consuming to train or fine-tune a model on custom data. I've spent my entire background in machine learning training or fine-tuning models, or building infrastructure to train and fine-tune models. A lot of the explosion that we're seeing with applications is because of how easy it is to now not have to do that, not have to build a dataset that works for your problem, not have to manage the training infrastructure, but instead just call an API and get the right output. Requiring that work, to some degree, limits the usability of generative AI applications. Alternatively, if you're using commercial APIs, so for example you rely on OpenAI GPT-4 or newer GPT releases, the issue there ends up being that you as a developer don't have any control over model version updates. As someone who works with these models every day, one of the things that happens is that even without official model version releases, under the hood the models are updated constantly. Once again, if you have a prompt that you think works really well for one specific model, and you just try to use it with the next better model, often the model will change out from under you, and those same prompts will stop working.
What Guardrails AI Does
Here's what Guardrails AI does in this framework of building more reliable generative AI applications. It's a fully open source library that, first of all, offers a framework for creating the custom validations that we talked about in the previous slide. Second, it offers the orchestration of prompting, verification, and re-prompting, or any other failure handling policies that you've implemented. Third, it is a library and catalog of many commonly used validators that you might need across your use cases; we're going to dig into a few of them later. Finally, it's a specification language for communicating requirements to LLMs, so that you're priming the LLM to be more correct than it ordinarily would be, by taking some of the prompt engineering pain away from you. How does each guardrail actually work under the hood? We talked about PII removal. We talked about executable code. A guardrail really is any executable program that you can use to make sure that your output works for you. One common pattern that we see here is grounding via an external system. For example, if you want to make sure that your code is executable, or if you generate a SQL query for some natural language question, you can make sure that the SQL works for you by hooking it up to your database, or a sandbox of your database, and checking that there are no syntax errors and that the output you get is actually output that works for you. You can use rules-based heuristics for making sure that your LLM output is correct for you. For example, let's say you're using your LLM to perform data extraction from a PDF. You read the PDF and generate structured data. You can make sure that any data you extract follows rules that you expect from that system: interest rates must always be followed by a percentage sign, dollar values must always be between x and y ranges. You can use traditional machine learning methods. For example, you can use embedding classifiers, decision trees, or K-Nearest Neighbors classifiers to make sure that any output that you get is correct for some constraints. Let's say you have a treasure trove of previous data that you know is correct. You can use more traditional ML methods to make sure that the new data that you extract is similar to the data that you know is correct. You can use high precision deep learning classifiers. This essentially entails not using GPT-scale models, but still using a model that you can reliably host and serve, which can act as a watchdog to make sure that your output is correct and doesn't violate any concerns. Finally, you can use the LLM as a self-reflection tool: ask the LLM to examine the response that you got, examine the correctness criteria that you specified, and make sure that the response respects that correctness criteria.
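As a small illustration of two of those guardrail styles, here is a sketch of grounding a generated SQL query against a sandbox database, plus a couple of rules-based heuristics like the ones mentioned above. This is not Guardrails AI code; sqlite3 is used only as a stand-in sandbox, and the thresholds and regex are assumptions.

```python
# Illustrative guardrails: ground generated SQL against a sandbox database,
# and apply simple rules-based heuristics to extracted fields.
import re
import sqlite3


def sql_is_executable(query: str, sandbox_path: str = ":memory:") -> bool:
    """Treat any syntax or schema error from the sandbox as a validation failure."""
    conn = sqlite3.connect(sandbox_path)  # point this at a sandbox copy of your DB
    try:
        # EXPLAIN parses and plans the query without mutating any data.
        conn.execute(f"EXPLAIN {query}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


def rate_has_percent_sign(value: str) -> bool:
    # Rule: interest rates must always be followed by a percentage sign.
    return bool(re.fullmatch(r"\d+(\.\d+)?%", value.strip()))


def dollar_value_in_range(value: float, low: float = 0.0, high: float = 10_000.0) -> bool:
    # Rule: dollar values must stay within an expected range (bounds are hypothetical).
    return low <= value <= high
```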
I also talked about handling invalid outputs, and I talked a little bit about re-asking. Guardrails, the open source framework, offers a bunch of policies that you can configure for how you want to handle your validators failing in your verification suite. For example, let's say you're extracting a JSON with two fields, one is the name, transaction fee, and one is the value of the transaction fee, which is 5, but you have some guardrail, a very easy rules-based guardrail here, which inline checks that the value is less than 2.5. You can handle this using a bunch of different policies that you configure. We talked about re-asking, where there's a new prompt that Guardrails constructs for you: for example, you're an assistant, correct your previous response, which is incorrect for XYZ reasons. That prompt is sent over to the LLM, and you end up getting a new response, which in this case is a transaction fee of 1.5. You can programmatically fix the output. This isn't always possible, especially for a lot of the more complex, more sophisticated guardrails that we have, but for simple rules-based guardrails, you can often just programmatically fix the value to, in this case, the maximum value of your threshold. You can filter any incorrect or offending values. In this case, you don't throw away the entire LLM response, you just get rid of the incorrect value of 5. This also works for string outputs. Let's say you have a large paragraph, and some sentences within that paragraph are hallucinated, or incorrect, or contain profanity. In that case, instead of throwing away the entire paragraph, you can just remove those few sentences that are incorrect, make sure the fluency of your overall passage isn't affected, and generate a new response. You can respond with a canned policy, or a statement about why a response isn't possible, for example, "Sorry, no appropriate response possible." This is more appropriate when the cost of getting some requirement wrong is so substantial that not answering the customer's question is actually the better compromise. This is one of my favorite ones, which is noop: you pass the incorrect output as-is to your customer. This is for when it isn't a very sensitive requirement for you. At the same time, every single guardrail that fails is logged for you. What you end up getting is very rich metadata about every single response that is served at runtime, and everything that's right or wrong about that response. You can use this to build better systems, or train a better model. We talked about how hard it is to create a dataset for model training; you can use that metadata for creating your model training dataset. Finally, you can also raise an exception, which you can handle in your code programmatically.
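Here is a hedged sketch of those on-fail policies applied to the transaction-fee example above. The policy names mirror the ideas described in the talk (re-ask, fix, filter, refrain, noop, exception) rather than the exact library API, and the payload shape is hypothetical.

```python
# Sketch of configurable on-fail policies for a simple rules-based guardrail:
# the transaction fee must be at most MAX_FEE.
from enum import Enum
from typing import Optional


class OnFail(Enum):
    REASK = "reask"
    FIX = "fix"
    FILTER = "filter"
    REFRAIN = "refrain"
    NOOP = "noop"
    EXCEPTION = "exception"


MAX_FEE = 2.5


def handle_fee_violation(payload: dict, policy: OnFail) -> Optional[dict]:
    fee = payload.get("transaction_fee")
    if fee is None or fee <= MAX_FEE:
        return payload  # guardrail passes, nothing to do
    if policy is OnFail.FIX:
        return {**payload, "transaction_fee": MAX_FEE}  # clamp to the threshold
    if policy is OnFail.FILTER:
        return {k: v for k, v in payload.items() if k != "transaction_fee"}
    if policy is OnFail.REFRAIN:
        return None  # caller sends a canned "no appropriate response" message
    if policy is OnFail.NOOP:
        return payload  # pass through unchanged, but log the failure elsewhere
    if policy is OnFail.EXCEPTION:
        raise ValueError(f"transaction_fee {fee} exceeds {MAX_FEE}")
    # OnFail.REASK would route back into the re-ask loop sketched earlier.
    return payload
```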
Example: Internal Chatbot with 'Correct' Responses
I want to now walk through some applications. Let's say you're building an internal chatbot and you care about getting correct responses from that chatbot every time you use it. Let's say you have a mobile application and you want a chatbot that is able to look at the help center articles of your application, so that anybody can come in, ask a natural-language question, and get an answer. This is often referred to as the Hello World of generative AI applications, which is a chat-with-your-data app. The correctness criteria that you care about are: first of all, don't hallucinate. Second, don't use any foul language. This is table stakes, like don't swear at your customer, basically. Finally, don't mention any competitors. If somebody asks you, who is the best burger joint in town, don't say Burger King when you're McDonald's. In this case, we implement three different guardrails to make sure that this problem works for us. Essentially, we solve this problem by implementing provenance guardrails, which include checking for embedding similarity with the source content that you have. This ensures that any output that is generated is similar to your source content. Then, making sure that any output that you have is classified to be correct using some NLI, or natural language inference, model. Finally, using LLM self-reflection.
How Do You Prevent Hallucinations?
We listed out this criterion of, how do you prevent hallucinations? This is a pretty complex problem and a very active area of research at this time. Often the question is, ok, I can't really control my model, how do I actually correct hallucinations in practice? Add guardrails. We have this thing that we like to call provenance guardrails, which essentially make sure that every single LLM utterance that is generated has some provenance in the source of truth that it came from. These guardrails are active and available for you to use today in the open source. Why this ends up being a problem is that these models are often so powerful and trained on the entire internet that even when you tell them, "Ok, these are my help center articles. Make sure that you don't respond with anything that isn't in the help center articles," the model will often quote things from there, but then use its general knowledge, or quote something that it has memorized from the internet. In spite of creating retrieval augmented generation applications that try to constrain the outputs of the model, you still end up getting hallucinations in practice. Provenance guardrails act as a way to do inline hallucination detection and mitigation while you're running and building an application.
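A minimal sketch of the embedding-similarity flavor of a provenance check might look like this. The `embed` function is a placeholder for whatever sentence-embedding model you use, and the 0.8 threshold is an arbitrary assumption that would need tuning; this is an illustration of the idea, not the library's implementation.

```python
# Illustrative provenance check: every sentence in the LLM answer must be
# sufficiently similar to at least one chunk of the source documents.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def ungrounded_sentences(answer_sentences: list,
                         source_chunks: list,
                         embed,
                         threshold: float = 0.8) -> list:
    """Return the answer sentences that have no sufficiently similar source chunk."""
    source_vecs = [embed(chunk) for chunk in source_chunks]
    flagged = []
    for sentence in answer_sentences:
        vec = embed(sentence)
        best = max(cosine(vec, sv) for sv in source_vecs)
        if best < threshold:
            flagged.append(sentence)  # candidate hallucination
    return flagged
```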
Example: Getting 'Correct' Responses in Practice
Let's see what this looks like in practice as you're building something. By now we're familiar with the diagram that we went over earlier. The only difference is that this time around, our verification suite consists of provenance, profanity, and peer-institution guardrails: making sure that every utterance has some source and isn't hallucinated, making sure that there's absolutely no profanity in your output, and making sure that there's no reference to peer institutions. As you can imagine, all of these are complex problems by themselves. It's almost as hard to solve for these safety issues as it is to build the application. All of these are guardrails that are available for you in the open source today, because we've done the hard work of doing the machine learning and solving these ML problems in the open source. Let's say you have this system and a user comes in with a question of, how do I change my password? If you're familiar with building AI applications, this turns into a prompt where you're first going to prime the LLM by saying something like, you're a knowledgeable customer service representative; somebody has a question, how do I change my password; make sure you answer this question to the best of your knowledge. Then you give it a list of articles that it is ok to answer from. You say, in my help center articles, these are the articles that might be relevant; as an LLM, reason over these articles and make sure that the output answers the customer's question to the best of your ability.
Let's say the initial output that you end up getting looks something like this, where you have some raw LLM output that says: log into your account, go to user settings by clicking on the top left corner, and click Change Password. This is a toy example; the settings aren't at the top left corner of your application. It's somewhat obfuscated, but this is a real example from someone I was working with who was using Guardrails for their use case, where the chatbot would hallucinate where certain components of the application should be, even though it has no ability to visually look at the application. These types of hallucinations end up being pretty common, and end up misleading your customers or not giving them useful responses. When this output passes through your verification systems, yes, there's no profanity in it, and there's also no reference to any peer institution or to a competitor of yours. Your provenance guardrail, however, would fail, because that utterance does not have any grounding in the source of truth, in the articles that you provided to your LLM. In this case, because we set re-asking as our error correction policy for provenance, a re-ask prompt is automatically generated; a simplified version of that re-ask prompt looks something like this. You talk about specifically what was wrong in your response, what the error looked like, as well as the new instructions. In this toy example, we end up getting the correct response, which is basically, click on the menu to go to settings. This time around your verification passes and you're able to route that output back out to your application.
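For concreteness, the two prompts in this example might be templated roughly as follows. The wording is paraphrased from the talk; these are not the exact templates Guardrails generates.

```python
# Hedged sketch of the initial retrieval-augmented prompt and the re-ask
# prompt generated when the provenance guardrail fails.
def build_rag_prompt(question: str, articles: list) -> str:
    joined = "\n\n".join(articles)
    return (
        "You are a knowledgeable customer service representative.\n"
        "Answer the customer's question using ONLY the help center articles below.\n\n"
        f"Articles:\n{joined}\n\n"
        f"Question: {question}\nAnswer:"
    )


def build_reask_prompt(question: str, bad_answer: str, reason: str) -> str:
    return (
        "Your previous answer was incorrect.\n"
        f"Previous answer: {bad_answer}\n"
        f"Problem: {reason}\n"
        "Please answer again, using only the provided articles.\n"
        f"Question: {question}\nAnswer:"
    )
```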
Complex Workflows with Guardrails
That is what happens in a one-shot workflow where you have a single LLM call, and you want to safeguard that LLM call with guardrails. Now we come back to our original motivation: I want to build really complex systems where I use my ML model APIs, but I want to make sure that those APIs are reliable and secure and safeguarded. What this ends up looking like is that, instead of a single LLM API call, you now have two LLM API calls, each with specific prompts that work for them. Typically, the output of one LLM call is fed into the prompt of the next call. In practice, this leads to the issue of compounding errors as well, as you can imagine, because you essentially don't have watertightness on any of these APIs. As you move down the chain, you end up getting more incorrect responses. The Guardrails way of working with these systems is essentially making sure that you have verification logic that surrounds every single LLM API call that you're making, so that the first output passes through your verification suite and only clears that suite once you know that the output is functionally correct. If it fails, we obviously go through our re-prompting strategy. If it passes validation, only then do you go on to the next stage, where you again go through this verification logic, and you send your response back to the AI application only on passing the verification logic. If we zoom out from this framework, overall, this reduces the degree of compounding error, and at the level of the entire workflow, you end up having applications that are much more robust to the inconsistencies of working with LLMs.
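Put together, a chained workflow with a guardrail around each call could be sketched like this. It reuses the hypothetical `guarded_completion` helper from the earlier sketch, and the stage prompts are made up for illustration.

```python
# Sketch of a two-stage chain where each LLM call is wrapped in its own
# verification step before its output feeds the next prompt.
def two_stage_pipeline(user_request: str, call_llm, verify_stage1, verify_stage2) -> str:
    # Stage 1: e.g., extract structured facts from the request.
    stage1_out = guarded_completion(
        f"Extract the key facts from this request as bullet points:\n{user_request}",
        call_llm, verify_stage1)
    # Stage 2 only runs on an output that already cleared stage-1 validation,
    # which limits how far errors can compound down the chain.
    return guarded_completion(
        f"Draft a customer reply based on these facts:\n{stage1_out}",
        call_llm, verify_stage2)
```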
Example: Guardrails for Customer Experience Chatbot
I've included some examples of the types of guardrails you typically end up needing as you're building a customer experience chatbot. We talked about provenance and hallucinations. Typically, there are inviolable constraints that you have, like making sure that you can trust the output, and making sure that your output is compliant. Or that there's brand safety and you aren't using language that wouldn't be supported by your brand. Making sure that you're not leaking any sensitive information. In general, making sure that your output has reliability, as well as data privacy. In practice, what this might end up looking like is that you only use vetted sources for building trust. You can encode guardrails for specific laws. For example, if you're a healthcare company, making sure that you're never giving any medical advice, because as an AI system, you're not allowed to give medical advice. Never using any angry or sarcastic language in your AI system. As well as making sure that you're able to control the input and output of any information that is sent into the LLM or retrieved from the LLM, and that there's no leakage of any sensitive data. Finally, only responding within category boundaries. Let's say you're building a customer support chatbot: if somebody is just chatting with you about something that doesn't pertain to the boundaries of that system, you make sure that you're guarding against improper use of the application that you build.
Examples of Validations
More examples of validations that we care about. We talked about hallucination, of course. Never giving any financial or healthcare advice is another common criterion. As you can probably expect, this is a pretty complex machine learning problem, starting with even detecting what financial advice or medical advice is. Never asking any private questions, never asking the customer for their social security number or any other private or sensitive information. Not mentioning competitors. Making sure that each sentence is from a verified source and is accurate. No profanity. Guarding against prompt injection, as well as never exposing the prompt or the source code of your application.
Summary
Guardrails AI is a fully open source framework. You can use it for creating any custom validator, as well as looking through the catalog of validators to plug into your AI applications, and making them safe and performant and reliable. You can use it for orchestrating the validation and verification of your AI systems. You can use it as a specification language to make sure that you're communicating your requirements to the LLM correctly. To learn more, you can check out the GitHub repo at github.com/ShreyaR/guardrails. Our website is guardrailsai.com.
Questions and Answers
Participant 1: I liked the name Guardrails, you somehow arranged the Wild Wild West into a path that you can actually control. When you create the validators and now your input is going to go to the validator, how do you route it to the appropriate one? You have some logic, let's say, I have three validators, I'm going to send to the first one, then I'm going to send to the second one, and I'm going to send to the third one, or how do you ascertain which one goes first?
Rajpal: There are two ways of doing it today. The first is that all validators are executed in parallel: you send your output, or your input, to every single validator at the same time. In addition to that, we also have a specification language that allows you to stagger the execution of those validators. Some validators are typically more expensive. I talked about hallucination extensively; we have multiple different ways of targeting hallucination. Typically, there's a tradeoff: the fastest and cheapest methods to run are also not the most accurate methods. You can stagger them in a way so that you run the fast and cheap one first, and only if it fails do you go down to the more expensive method. We have this specification language that allows you to pipeline them in different ways.
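A minimal sketch of that staggered execution, assuming two hypothetical check functions, would be:

```python
# Run the cheap check first; only fall back to the expensive, more accurate
# check when the cheap one fails. cheap_check and expensive_check are
# placeholders, e.g. an embedding-similarity screen and an NLI classifier.
def staggered_hallucination_check(sentence: str, cheap_check, expensive_check) -> bool:
    if cheap_check(sentence):          # fast, cheap screen
        return True
    return expensive_check(sentence)   # slower, more accurate fallback
```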
Participant 1: Is that why you actually have MapReduce them, where your mapper is doing things in parallel, and then the reduce takes the end?
Rajpal: The MapReduce example I included is another common pattern for building AI applications. A common issue that folks who are building with these systems have to wrangle is the context limit. With GPT-4, for example, or with most common GPT APIs, you can only send 4000 tokens, and let's say you have a massive book. In that case you have to break that book down into chunks of 4000 tokens each, do your map stage over each chunk, and then do a final reduce stage using the LLM. That was a system that uses that MapReduce framework, with Guardrails included at the mapping stage as well as at the reducing stage.
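A rough sketch of that map-reduce pattern with a guardrail at each stage might look like the following. It reuses the hypothetical `guarded_completion` helper from earlier, and approximates the token limit by character count purely for simplicity.

```python
# Illustrative map-reduce summarization: chunk the document to fit the context
# window, summarize each chunk (map), then summarize the summaries (reduce),
# with every LLM call passing through the verification suite.
def chunk(text: str, max_chars: int = 12_000) -> list:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def map_reduce_summarize(document: str, call_llm, verify) -> str:
    partial_summaries = [
        guarded_completion(f"Summarize this passage:\n{piece}", call_llm, verify)
        for piece in chunk(document)                      # map stage
    ]
    combined = "\n".join(partial_summaries)
    return guarded_completion(                            # reduce stage
        f"Combine these partial summaries into one summary:\n{combined}",
        call_llm, verify)
```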
Participant 2: Thank you, again, for that thought. I was going to ask a question about context windows [inaudible 00:39:55] because of course, every time that you have to re-prompt [inaudible 00:40:00] the context window. So my question I'm going to focus on is about cost, actually. So you mentioned that there's a guardrail without a procedure sometimes you re-ask the question [inaudible 00:40:14]. What has been your [inaudible 00:40:19] re-prompting to make it more accurate?
Rajpal: There are two points I would highlight with respect to cost. All of the re-prompting is configurable. If you're working with a guardrail that you don't care about as much, or there's some cost to getting it wrong but you can handle it with other methods, you don't always have to re-prompt: you can just filter out the incorrect values, use a programmatic fix when available, or just say, ok, maybe I can't answer this query. Typically, what I've seen people do is only configure re-prompting as their policy for a guardrail where, if this check fails, then my output just straight up isn't useful to me. For example, I really like the example of healthcare advice or financial advice, where some of the folks that we work with say, the cost of accidentally including healthcare advice in my output is so severe to me as an organization that I might as well not respond, in which case I'm not giving value to my customer, or I might take that extra hit of re-prompting. Those are typically the subsets of guardrails for which people configure re-prompting. The second point is not so much at the software level, but more at the scope-of-problem level: a lot of the places where people end up using these systems are where the alternatives are typically way more expensive. Often, the alternative is waiting in line for an hour to get an answer to your question, in which case companies are often ok with taking on that extra cost of re-prompting when necessary.
Participant 3: About your verification suite, how do you handle false positives?
Rajpal: A lot of the guardrails that we have are ML based as well, in which case they come with their own irreducible error, their own possibility of being wrong sometimes. How I like to look at this problem is that, on some level, what we're doing is ensembling, the technique of making different machine learning models work together, where each model has different strengths and weaknesses. My favorite analogy for this is that you're stacking together different sieves, where each sieve has holes in different locations. You end up getting systems where the stack of sieves is way more watertight than each sieve individually would be. For example, for provenance, or for our anti-hallucination guardrails, we have a bunch of internal testing which allows us to estimate what the efficacy of the system is. The idea is that you mix and match a few of these together and ensemble the guardrails, and that, together with the LLM, ends up getting you to a much more robust system overall. That said, you're working with machine learning; you're never going to have zero error. The idea is to drive down the error as much as possible, using the tools that are available to us.
Participant 3: Once you start getting false positives, will that translate somehow into a new guardrail? I don't know, [inaudible 00:44:06] dynamic guardrail or something like that where [inaudible 00:44:11].
Rajpal: I'll ground my answer in the example of provenance. For provenance, I talked about embedding similarity versus an NLI classifier. With embedding similarity, which is the fast, cheap, and not-as-accurate method of detecting hallucination, what we end up finding is that it has a lot of false positives. That gives us some estimate of the uncertainty of that system. What we end up doing is basically K-Nearest Neighbors classification, and based on that KNN classification, if the distance is greater than some threshold, then we probably don't have a very good estimate for this point, in which case you route it onward. The failure policy for this guardrail is to send it to the more performant but much more expensive guardrail, which is more accurate. That failover, failing over onto different systems, is part of it.
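A sketch of that failover logic, under the assumption of a hypothetical `knn_distance` function and an `expensive_check` fallback, might look like this. The distance threshold is an arbitrary illustration.

```python
# If the nearest-neighbor distance is too large, the cheap embedding/KNN
# guardrail declares itself uncertain and routes the sentence to the more
# expensive, more accurate check instead of returning a possible false positive.
def provenance_with_failover(sentence: str, knn_distance, expensive_check,
                             distance_threshold: float = 0.35) -> bool:
    dist = knn_distance(sentence)      # distance to the nearest source chunk
    if dist <= distance_threshold:
        return True                    # confidently grounded by the cheap check
    # Uncertain region: defer to the slower, higher-precision guardrail.
    return expensive_check(sentence)
```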
Participant 4: Along with having guardrails of different styles, have you considered maybe fine-tuning the model with an updated cost function that incorporates that data?
Rajpal: Our approach to that is basically to make it very easy for you as a developer to log the outcomes of those guardrails. You're running this live production system, you're serving thousands of requests every day, so after a month of running this you end up with a decent dataset, with features about what worked for each request and what didn't. You can directly use that dataset, which we make very easy to export, for your fine-tuning. It's a very interesting point that you could use it in the loss function when training a model, add it to your loss, and get a much more representative loss. I haven't seen anybody do that today, but I don't see a reason why not, because the guardrails, like I said, are independently implemented modules. Yes, theoretically, you could.