Denys Linkov on Micro Metrics for LLM System Evaluation

Live from the QCon San Francisco Conference, we are talking with Denys Linkov, Head of Machine Learning at Wisedocs and Advisor at Voiceflow. Linkov shares insights on using micro metrics to refine large language models (LLMs), highlighting the importance of granular evaluation, continuous iteration, and rigorous prompt engineering to create reliable and user-focused AI systems.

Transcript

Roland Meertens: Welcome to The InfoQ Podcast. Today, I'm talking with Denys Linkov, who is head of machine learning at Voiceflow, and we are live at QCon San Francisco. Welcome.

Denys Linkov: Thanks for having me.

Roland Meertens: Yes. You gave a talk yesterday called A Framework for Building Micro Metrics for LLM System Evaluation. Which micro metrics did you talk about? What is a micro metric and how does it compare to a macro metric?

Denys Linkov: So I think the broad idea of the talk is that you have these broader metrics like accuracy and then a plethora of data science metrics, F1, Rouge, all these different things. But oftentimes, technical folks get lost with what the business value is when you're building for something. And the challenge with the GenAI boom has been, we've had technology that has been looking for a problem to solve. We have this solutions looking for problems mentality. And because of that, you start optimizing for these things, almost this premature optimization that happens. And because of that, you have challenges where the metric doesn't reflect a user experience.

So the idea of a micro metric is to find something, either a problem you've seen in production or something that you foresee happening, and measure that specifically and use it to move some kind of business metric. So I gave the example of large language models switching languages. It's an issue we ran into. So you'd have somebody prompting in a non-English language, you'd go through a couple of turns, which are a couple of computer-person interactions, and then after a few, all of a sudden it changes back to English and people were quite upset about that. So we started measuring that as a way to say, how often does it occur? And we implemented a retry mechanism, which fixed 99% of these issues.

Roland Meertens: But the retry mechanism just means that you regenerate the answer while it is not having the correct language?

Denys Linkov: Yes, yes. So very simple solution, but it's something that we could track because we did a prompt template update that we had and we thought it caused it, but then when we measured it against both, it was not discernible. So it was just a fundamental flaw within the models. It was actually interesting, we were talking to the customer, they're like, "Hey, we want to use Claude". And we said, "Well, we're measuring a much higher occurrence of this with Claude". Claude wasn't as multilingual at the time, and we're like, "Hey, you should use ChatGPT". And they're like, "No, we like Claude". And we're like, "Okay, well".
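The language-switching metric and the retry mechanism described here can be sketched roughly as follows. This is a minimal illustration, not Voiceflow's implementation: `detect_language` is a toy stopword-based detector, and `generate` stands in for whatever LLM call the system makes. Both names are assumptions made for this example.

```python
from typing import Callable

# Tiny stopword lists used as a stand-in language detector (an assumption;
# a production system would use a proper language-identification model).
STOPWORDS = {
    "en": {"the", "and", "is", "you", "to", "of", "your"},
    "es": {"el", "la", "y", "es", "de", "que", "tu"},
    "fr": {"le", "la", "et", "est", "de", "vous", "votre"},
}


def detect_language(text: str) -> str:
    """Return the language whose stopwords overlap most with the text."""
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))


def generate_in_language(
    generate: Callable[[str], str],
    prompt: str,
    expected_lang: str,
    max_attempts: int = 3,
) -> tuple[str, int]:
    """Regenerate until the reply matches expected_lang or attempts run out.

    Returns the last reply and the number of language switches observed,
    which is the micro metric to log.
    """
    switches = 0
    reply = ""
    for _ in range(max_attempts):
        reply = generate(prompt)
        if detect_language(reply) == expected_lang:
            break
        switches += 1
    return reply, switches


# Hypothetical model that answers in English first, then in Spanish.
fake_replies = iter(["Your order is on the way", "Tu pedido es el que llega hoy"])
reply, switches = generate_in_language(lambda _: next(fake_replies), "hola", "es")
```

Logging `switches` per request gives the occurrence rate, and comparing it across prompt templates or model providers is what makes the kind of comparison described above possible.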

You're trying to get everything perfect with LLMs, it won't happen. But these are the nuanced tradeoffs that once you have something in production, you start to learn. And for every industry, it's going to be different. When you're building a different kind of application, you have the domain expertise to define what's actually happening. And we as a platform provider, we can only guess and see the problems that we encounter from customer complaints or our own interactions. But at the end of the day, if you're building something for a domain, you're the expert and you should learn to define these metrics.

Roland Meertens: And so, do you see a big difference between how, on a high level, so there's overall accuracy metrics and there's overall metrics and dashboards comparing different large language models, do you have a feeling that it still matters which use case you have, or is it just generally if you take the highest performing one, be it Claude or GPT-4, it generally performs better?

Denys Linkov: Yes. So within the talk, I walk through the challenges of how to evaluate models. So saying accuracy, how do you measure accuracy? You can measure it with an exact match, Regex, GPT-4, LLM as a judge, semantic similarity, all these different metrics. And there are flaws in each of them. So to get to a conclusion that this is the correct accuracy, that's not necessarily a good premise to work on. You need to think of it as an approximation. And even between people, if you read the blog posts of doing human feedback, the overlap isn't even that high sometimes between expert labelers or even the average person. You ask the average person what's their favorite type of animal, you're going to get different responses.
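To make the point about flawed scorers concrete, here is a small sketch that scores the same answers three ways. The scorers and the example answers are assumptions for illustration; `token_overlap` is only a crude stand-in for semantic similarity, and an LLM-as-a-judge scorer is left out because it would need a model call.

```python
import re

PATTERN = r"\bparis\b"  # hypothetical regex for the expected answer


def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive string equality after trimming whitespace."""
    return prediction.strip().lower() == reference.strip().lower()


def regex_match(prediction: str, pattern: str) -> bool:
    """True if the pattern appears anywhere in the prediction."""
    return re.search(pattern, prediction, flags=re.IGNORECASE) is not None


def token_overlap(prediction: str, reference: str) -> float:
    """Jaccard overlap of word sets, a crude stand-in for semantic similarity."""
    pred = set(re.findall(r"\w+", prediction.lower()))
    ref = set(re.findall(r"\w+", reference.lower()))
    union = pred | ref
    return len(pred & ref) / len(union) if union else 1.0


reference = "Paris"
predictions = ["Paris", "The capital of France is Paris.", "It is not Paris."]

for p in predictions:
    print(
        f"{p!r}: exact={exact_match(p, reference)}, "
        f"regex={regex_match(p, PATTERN)}, "
        f"overlap={token_overlap(p, reference):.2f}"
    )
```

Exact match rejects the correct full-sentence answer, while the regex accepts the incorrect "It is not Paris." Each scorer yields a different "accuracy" on the same data, which is why any single number is best treated as an approximation.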

Roland Meertens: What do you mean with the overlap in this case?

Denys Linkov: In the overlap of saying this is a correct response, you show 10 people an LLM response, are they going to say, "This is good, this is great", or are they just going to skim over it and not actually read the response? So getting to an agreeable answer is actually harder than it seems. And oftentimes, we use LLMs because we're lazy. We don't want to define good training and evaluation sets, but we have to move back to that. We have to unlearn this vibes-based prompting. We need to define those training and evaluation sets. We need to know exactly what we're looking for. And the more granular your metrics get, the more you can track it through a more complex system.

A RAG pipeline can be quite simple or quite complex, but even in very basic RAG, you're measuring retrieval metrics and you're measuring generation metrics. And within that, there's a whole bunch of macro, medium, micro metrics you can measure. And you as the domain expert need to know where is it worthwhile to spend time? So if I'm doing this for a news website, you might have the "correct ordering" for a given question on a topic, but that's actually going to evolve over time. We go back to the whole ML world of data drift, model drift. Data drift is going to evolve because the answer of what was the most recent storm that hit San Francisco is going to change on a day-to-day or weekly basis.
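As a rough illustration of the retrieval side of that split, the sketch below computes two common retrieval metrics over a ranked list of chunk ids. The document ids are made up for the example, and generation metrics would be measured separately on the final answer.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant chunk ids found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranked retrieval result and ground-truth labels.
retrieved = ["doc7", "doc2", "doc9", "doc4"]
relevant = {"doc2", "doc4"}

print(recall_at_k(retrieved, relevant, k=3))
print(reciprocal_rank(retrieved, relevant))
```

Because the set of relevant documents changes as the news changes, the `relevant` labels have to be refreshed over time, which is the drift problem described above.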

So information actually evolves, and that's where when you define your metrics, you define your priorities, you define your data sets, you're defining the system, and you're building that muscle to actually update it continuously. So it's not a one-time exercise either, it's a whole way of thinking and building.

Measuring hard to grasp metrics [05:25]

Roland Meertens: Yes, okay. So for example, for the retrieval, it's just relevancy of the retrieved documents, and I guess that's something you can measure, but for generation it's already harder. You've got things like the BLEU score or something. It's way harder to measure, especially for things like customer-facing agents, it's hard to measure what's friendliness or how well did the agent interpret the question.

Denys Linkov: Exactly. So for relevancy, in theory, yes, but when you have compound answers, so for example, there's a good data set that illustrates it called multi-hop QA, and that says, "Okay, if I have a question that requires multiple Wikipedia articles, how am I actually measuring the value or the accuracy based on each passage?" Because if you have a question that requires four different articles or chunks to pull in from, you're like, "Okay, these ones are required, but how do I rank them 1, 2, 3, 4?" It might not matter or it might. So this is where, when you're working with it, it's important. But on generation, for example, you might have a metric that says the model always has to say the user's name as a way of best practice. Even when we train people to do human-to-human interaction.

Roland Meertens: Sounds really annoying.

Denys Linkov: So that might be one metric or it might say only on the first turn, say the user's name or use hey instead of hi. So this is where it becomes interesting where you start having brand voices become defining factors and almost micro metrics. You say, "Here's my list of 100 things that define what my brand voice should sound like, and I'm going to measure my LLM response based on that".
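A brand-voice checklist like this can be expressed as a set of small programmatic checks. The sketch below encodes the two rules mentioned here (name the user on the first turn, open with "hey" rather than "hi"); the function names and the report structure are assumptions for illustration.

```python
import re
from typing import Callable

# Each check takes (reply, user_name, turn_index) and returns True if it passes.
Check = Callable[[str, str, int], bool]


def names_user_on_first_turn(reply: str, user_name: str, turn: int) -> bool:
    """Only the first turn (index 0) is required to mention the user's name."""
    return turn != 0 or user_name.lower() in reply.lower()


def greets_with_hey(reply: str, user_name: str, turn: int) -> bool:
    """Fails only when the reply opens with "hi" instead of "hey"."""
    return re.match(r"\s*hi\b", reply, flags=re.IGNORECASE) is None


BRAND_CHECKS: dict[str, Check] = {
    "names_user_on_first_turn": names_user_on_first_turn,
    "greets_with_hey": greets_with_hey,
}


def brand_voice_report(reply: str, user_name: str, turn: int) -> dict[str, bool]:
    """Run every brand check and return a pass/fail flag per check."""
    return {name: check(reply, user_name, turn) for name, check in BRAND_CHECKS.items()}


print(brand_voice_report("Hi there, how can I help?", "Ana", 0))
print(brand_voice_report("Hey Ana, welcome back!", "Ana", 0))
```

Each key in the report is its own micro metric, so pass rates can be tracked per rule rather than as one blended "brand voice" score.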

Roland Meertens: So I think especially the brand's response is something which a lot of companies are struggling with because often they have specific ways of talking, or I think generally everyone always has something like respect the customer or be friendly or kind. Those are the basic things, but there isn't really a well-defined metric for kindness.

Denys Linkov: So that's something that maybe at a high level, this is where you go back to statistics. If you have two models and you do some kind of user evaluation or human evaluation, it becomes just statistics: is this significant or not? And if it's not, it doesn't matter. But there are specific guidelines for companies. So those are the ones that you can measure. And some of them are quite programmatic, like searching for certain keywords. For example, what should you do? When I used to work at McDonald's as a cashier, if somebody came in and ordered a Whopper, there's no such thing as a Whopper at McDonald's, so how do you handle that situation? Is it, "Oh, hey, did you mean a Big Mac?" or "We don't serve that"? There'd be a list that the corporate marketing team could define on how the interaction should go.
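One simple way to ask "is this significant or not?" when comparing two models on a human evaluation is a two-proportion z-test. The sketch below is an illustration using only the standard library; the choice of test and the example counts are assumptions, not something specified in the conversation.

```python
import math


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both rates are identical at 0% or 100%; there is no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: raters preferred model A's reply in 78 of 100 cases
# and model B's reply in 70 of 100 cases.
z, p = two_proportion_z_test(78, 100, 70, 100)
print(f"z={z:.2f}, p={p:.3f}")
```

With these made-up counts the p-value is well above the usual 0.05 threshold, so an eight-point gap on 100 samples per model would not be enough to conclude that one model is better.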

Roland Meertens: Is there a policy on what to do with people who order Whoppers by accident?

Denys Linkov: I can't remember. This was a number of years ago. But these are the kinds of things, and it gets interesting because at the time, if we didn't know how to handle something, it's like, "Oh, you can talk to my manager or something". But what is that response in an LLM world? Is it then connecting them to a human? That seems logical. Or if we get into this multi-model world, is it escalating to a more expensive model, for example, to determine these more complex scenarios?

Balancing short-term and long-term improvements [08:31]

Roland Meertens: Yes, okay. Do you generally see some balance between short-term improvements, long-term improvements that people just try to tweak the prompt? I think generally when I see people working with prompts, they often only just generate one answer and then say, is it better or not?

Denys Linkov: Yes, I think prompt engineering is still fairly immature. I saw, I think a meme about when somebody says, "Oh, you're a prompt engineer, but you forget the whole engineering aspect of being rigorous and measuring and so forth". So I think it's still quite immature. We're seeing the rise of auto-optimization frameworks like DSPy is something that we use for defining training set, test set. You define an optimizer to try to go through and use the training set to find a good set of prompts. And then you can have almost these, they're similar to micro metrics, but they have this idea of assertions of what needs to be true, what needs to be validated for a prompt to be correct.

So I think that's going to continue to evolve where we're going to build that rigor back into the development. But multiple responses, I remember seeing this a few months ago when you were in the OpenAI chat interface or ChatGPT, it would sometimes pop up. It was like, "Oh, which one of these answers is better?" And they're looking for training data, obviously, but for an end-user application, that would be strange, unless it's very specific.

Unless maybe you're generating an email or editing a document, then it might make sense, because as a person, rather than doing content generation, you're doing content evaluation. You are guiding the model, you are doing human preference, and that takes a lot less cognitive load than creating something yourself. So that might become the design pattern. But we're also very, very early in UX patterns for generative AI, and we still haven't moved past that in a number of things like IDEs or chatbots, a lot of these things. We're still doing things we did 20 years ago.

Roland Meertens: Yes, I mean, another thing which I'm always a bit afraid of is consistency across answers. So if you upgrade or switch from one backend to another, does it change? Or the answers you gave in the past can be confusing for customers who suddenly get a completely different response.

Denys Linkov: Or even you type the same thing, because it's a question of how much context do you want to use? Do you want to use time of day? Because that might have some kind of statistical factors that you've studied in the past based on user behavior. Do you want to know if this is a returning customer? Do you have different behavior? If you want to do all these different things, that might depend, it might improve the relationship with the customer, but that's an intentional choice. When it's just variation between the same prompt, same answer, that gets confusing. And when you do a model upgrade, you also need to evaluate that as well.

Roland Meertens: Yes. Are you then also tracking mitigating regressions?

Denys Linkov: Yes. I think the test suite is really important. We actually had a few articles on our applied research blog about this where we upgraded from ChatGPT, the original version released in March 2023 to the November version, 1106, and we had a task for doing user intent classification, and we did the upgrade and accuracy dropped 10%. And this is the same model. It's just a version, but our prompt was-

Roland Meertens: More recent training data. What's that like?

Denys Linkov: Right. But we have no transparency from the model providers of what actually changed or any recommendations. They might say something like, "Oh, it's better at JSON data or something else", but they won't publish updated metrics or benchmarks. So this is, again, going back to your own evaluations, you need to have these things. So we learned this the hard way, releasing this into production, then we reverted back, and you actually have this conflict of if you have multiple models, certain versions will be better than others at certain tasks, or there might be a very specific monetary benefit.

The newer GPT-4o model version is half the price of the previous one, the initial release. So there, you might still have a bunch of customers, but you want to save money, but you need to write those evaluations and be sure of the migration. So you might start having to do a phased migration, and suddenly this model as an API isn't as simple as it is supposed to be.
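A regression check of the kind described here can be as small as a frozen evaluation set plus a gate on the accuracy drop. In the sketch below the evaluation examples, the stub classifiers, and the `max_drop` threshold are all hypothetical; in practice the two classifiers would wrap calls to the current and candidate model versions.

```python
from typing import Callable

Classifier = Callable[[str], str]

# Hypothetical frozen evaluation set of (utterance, expected intent) pairs.
EVAL_SET = [
    ("I want my money back", "refund"),
    ("Where is my package?", "order_status"),
    ("Cancel my subscription", "cancel"),
    ("Can I get a refund?", "refund"),
]


def accuracy(classify: Classifier, examples: list[tuple[str, str]]) -> float:
    """Share of examples where the predicted intent equals the label."""
    correct = sum(classify(text) == label for text, label in examples)
    return correct / len(examples)


def safe_to_upgrade(
    current: Classifier,
    candidate: Classifier,
    examples: list[tuple[str, str]],
    max_drop: float = 0.02,
) -> bool:
    """True if the candidate's accuracy is within max_drop of the current model."""
    return accuracy(current, examples) - accuracy(candidate, examples) <= max_drop


# Stub models standing in for two versions of the same hosted model.
labels = dict(EVAL_SET)
current_model: Classifier = lambda text: labels[text]
candidate_model: Classifier = lambda text: "refund"

print(safe_to_upgrade(current_model, candidate_model, EVAL_SET))
```

Running the same gate per customer or per task is one way to support a phased migration, since a cheaper version may pass for some workloads and fail for others.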

LLMs as a black box [12:29]

Roland Meertens: Yes, you're plugging in a black box, which means you're stuck with a black box. Even when there would be updates, it's not as in traditional software development where you know exactly which lines of code changed or which heuristics changed. You don't know what new data they plugged in or how many layers they pruned.

Denys Linkov: Or even when you're doing traditional model training, you have your hyperparameters, you do all these different runs, it might just be a different run that's happening. So training data might not have changed, but because the models are so complex, the convergence behavior is very, very different. So that's also a challenge there because you're dealing with such a big model, such a big system. The training is not deterministic either.

Roland Meertens: Yes, that's also a good point. In terms of micro metrics, are there specific micro metrics which are useful for you or which you have good experiences with which you think more people should start tracking?

Denys Linkov: I think there are various well-defined ones. In my talk I talked about a crawl-walk-run approach. So crawl is actually: start off with a macro metric, start off with accuracy or just some kind of way to have a feedback loop and measure what's happening. Once you start going into walk, then you can have slightly more specific metrics. For example, measuring retrieval versus generation in RAG. Once you get to micro, that's when you can do more specific things. One that I think everybody is doing is some kind of policy on content moderation. So what kind of response do you want to give when the user says something that's inappropriate? Whether the user says something that's inappropriate and is trying to elicit some kind of response from the model, or maybe the model generates something that goes against branding policy or it's a question that you don't want to respond on.

So that could be a micro metric is how many bad questions is the user asking, or how many out of domain questions because you might have seen the videos of people going on Amazon and then having the, what's it called, Rufus, write them like a Python query, because it's just a pure connection to an LLM versus other domains or other assistants that are deployed have more layers on that. And it's a trade-off in terms of complexity, but also measuring how often that occurs. And every industry is going to have a different trade-off between false positives and false negatives.
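The trade-off between false positives and false negatives for an out-of-domain or moderation filter can be tracked with a few counts. The sketch below assumes you have a labeled sample of messages; the boolean lists are made-up data for illustration.

```python
def confusion_rates(predicted: list[bool], actual: list[bool]) -> dict[str, float]:
    """Compute filter error rates.

    predicted[i] is True when the filter flagged message i;
    actual[i] is True when message i really was out of domain.
    """
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    false_neg = sum(a and not p for p, a in zip(predicted, actual))
    negatives = sum(not a for a in actual)
    positives = sum(actual)
    return {
        "false_positive_rate": false_pos / negatives if negatives else 0.0,
        "false_negative_rate": false_neg / positives if positives else 0.0,
        "flagged_share": sum(predicted) / len(predicted),
    }


# Hypothetical labeled sample of five user messages.
predicted = [True, False, True, False, False]
actual = [True, True, False, False, False]

rates = confusion_rates(predicted, actual)
print(rates)
```

A false positive here blocks a legitimate customer question, while a false negative lets the assistant write Python for a shopper, so which rate to minimize depends on the industry.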

Voiceflow [14:37]

Roland Meertens: Yes, interesting. You are working as head of machine learning at Voiceflow. What is Voiceflow actually doing?

Denys Linkov: So we're an AI orchestration platform. Generally, people use us for customer support, lead generation, these kinds of tasks where you need to build out different workflows that customers would handle. So I think the main differentiator, we've been around for six and a half years now. So we've gone from before the large language model boom, when we were actually more of an Alexa company, to now: how do you orchestrate large language models? So we have a few areas that we focus on. We focus on team collaboration. So how do you make sure that business teams and technical teams collaborate well? So that's our collaborative canvas, a low-code approach where you can define business logic and build out what you have. We also want to make sure you have as much control as possible.

So think about the example of custom GPTs. It's just a prompt and you go back and forth, it's like, "Hey, you're a language coach or whatnot". But when you want to build something more complex, it starts to make sense to have different API calls, different specific business logic, validate if this is a returning user or a new user. And this requires a lot more control and a lot more steps.

So we have the concept of workflows where for a given user intent, you define what process you should take the user through. Should it be more deterministic? Should it be LLM driven? And you can build those out and then you can organize it in a way where you can update each one. And the final component is that it's hosted, so you don't have to run it on your own environment, figure out how do I manage my own vector database? How do I manage and build my own web chat? How do I do state management for different users? How do I synchronize with my system? So the hosted aspect lets you build a prototype very quickly and then launch to production.

Roland Meertens: Nice. So if people define workflows?

Denys Linkov: Yes, exactly.

Roland Meertens: Try to figure out how to handle these edge cases.

Denys Linkov: So there's different ways that you can instrument the system. You can have different prompts or different API calls you make to say, "Okay, we have built-in things like content moderation", which comes out of the box because that's generally useful. But other things like, "Okay, did the user answer this question?" I can trigger an event that gets logged into my event-based system. It could be even milestones that users go through a workflow. Let's say I'm opening a new bank account. Where does the user drop off? You get useful analytics there similar to web or mobile.

And all these things, we have custom functions and components you can write, so you can reuse this. For example, you have a business user who says, "I need to do these five things". The developer goes, builds that function. And the business user then explicitly defines the business requirements based on those functions, inputs and outputs. So workflows give you that control. I think we've had a lot of teams who try to use custom GPTs or even just LangChain out of the box. And then, it becomes quite complex when you start to understand your user and now need a lot more control to actually build that user experience that you need.

Roland Meertens: So now, these multimodal models are up and coming, are they essentially replacing what you're doing or are they making your job way easier?

Denys Linkov: It's another tool for people to use. For example, we do a series called Making and Breaking Bots on YouTube where we'll pick a scenario and try to rebuild an app experience. So we were doing one with a refund flow where somebody can take a picture of a receipt. So that becomes a very interesting way of triggering an event that happens. User uploads a receipt, we then send it to a model. You can choose the model that you're using as part of the platform, then you get a response of what are the items. And then, you can cross-match that to some kind of ERP or API-based system to verify, okay, is this the user's receipt? Does the order number match? And all of this. So it fits into the framework that you have all these different models and tools, but you need to orchestrate them and figure out what you're actually doing.

Roland Meertens: Yes, interesting. And so, do you then balance somehow general knowledge, which these AIs give you with the need to program these assistants and constraining users to a specific workflow that they don't just type the wrong thing and immediately go off the rails?

Denys Linkov: Yes. So again, it's because we're at the platform level, everybody's going to build differently, and that's, again, it goes to our thesis that you know your customers best, you know your business best. We provide that platform level, and it's a variety. You can just say, "I'm going to do user input, plug it into ChatGPT and just make a loop on that". And that's the simplest assistant.

Roland Meertens: Very easy bot.

Denys Linkov: Very easy bot. You put a system prompt that's two blocks, you hit publish, you're in production in less than five minutes end-to-end. Now, you can have more control where you have very strictly defined requirements of what you want to do. If you're working at a bank, certain legal policies should be outputted verbatim. So when somebody says, "Hey, what is your refund policy?" or "What are my terms and conditions?", there should be no interpretation by the large language model. There should be no RAG there. It should go down a certain workflow and articulate it to the user. Or there might be a different search where somebody has a specific question on the terms and conditions, and you have a predefined document.

Again, probably not RAG-based to avoid any kind of overriding, but just give me that workflow. So really in between, a lot of people start off with build the most basic thing, crawl, and then go through and figure out, once you're in production, oh, this is actually what users are asking, this is what we care about.
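The split between verbatim answers and LLM-driven answers can be sketched as a small router. The intent names, the placeholder policy strings, and the `llm` callable below are assumptions for illustration and do not reflect any particular platform's API.

```python
from typing import Callable

# Placeholder policy texts (hypothetical); the real wording would be the
# approved legal copy, returned exactly as written.
VERBATIM_POLICIES = {
    "refund_policy": "<approved refund policy text>",
    "terms_and_conditions": "<approved terms and conditions text>",
}


def respond(intent: str, user_message: str, llm: Callable[[str], str]) -> str:
    """Return approved text verbatim for policy intents, else defer to the LLM."""
    if intent in VERBATIM_POLICIES:
        # Deterministic path: no generation and no retrieval, so no rewording.
        return VERBATIM_POLICIES[intent]
    return llm(user_message)


# Stub LLM so the example runs without a model provider.
echo_llm: Callable[[str], str] = lambda message: f"LLM answer to: {message}"

print(respond("refund_policy", "What is your refund policy?", echo_llm))
print(respond("small_talk", "How is your day going?", echo_llm))
```

Starting with the plain LLM path and moving intents into the deterministic table as production traffic reveals what users actually ask matches the crawl-first approach described above.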

Roland Meertens: Encounter problems.

Denys Linkov: Exactly. It's treating AI as a product. We're still fairly early in that regard. People just launch something. I think the hardest conversations we've had with people or the most illuminating is when they said, "Oh, we launched a chatbot three years ago". And we're like, "And?" They're like, "Yes, it's launched". We're like, "What do you mean? Your products have changed, your customers have changed, technology has changed. You should be updating just like you do with software". We have this whole agile release process, release off of trunk, different types of release processes. Don't just build and forget, it's a living artifact.

Roland Meertens: I guess those are the people who are responsible for those frustrating chatbots which just keep throwing you in the loop.

Denys Linkov: So it's a very interesting building experience. We get to see, we work with a quarter million free users and then 60 enterprises, and everybody's building different things. So it's a very interesting position to be in working with customers and helping them solve those kinds of problems.

Roland Meertens: Yes, interesting. And you also integrate with other APIs. Is this still a challenge, or would it be easier if these APIs also just had a natural language interface?

Denys Linkov: This was one of the hot takes that I shared at another conference and on a LinkedIn post, is that I think just tool use is still quite immature for what we need. We define business processes all the time. We say, "I want to call a database". You should know exactly what you want from this database. It shouldn't be an "Oh, go look in the fridge and see what's tasty" kind of thing. It's like, "I want to fetch for this user, this information". So just write the API code. Use GenAI to help you write it faster. Have a way to update it or parameterize it so you're not constantly rewriting it, but you know what you want. Go write the code for what you want. We don't need, oh, imagine going to your database. It's like, "Hey, what's Bobby's user ID?" That's not a natural language query. That's a very specific question that you're asking.
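"What's Bobby's user ID?" as explicit, parameterized code might look like the sketch below. It uses Python's built-in `sqlite3` module with an in-memory database; the `users` table and the `fetch_user_id` helper are hypothetical and exist only for this example.

```python
import sqlite3
from typing import Optional


def fetch_user_id(conn: sqlite3.Connection, name: str) -> Optional[int]:
    """Look up one user's id with a parameterized query."""
    row = conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None


# In-memory database with a hypothetical users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("Bobby",))

print(fetch_user_id(conn, "Bobby"))
# The placeholder treats this input as a plain string, so it matches no row
# and cannot alter the table.
print(fetch_user_id(conn, "Robert'); DROP TABLE users;--"))
```

The query is fixed and only the name varies, which is the "parameterize it so you're not constantly rewriting it" idea, and the `?` placeholder also keeps hostile input from being executed as SQL.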

Roland Meertens: Yes. Tell me more about the IDs of maybe Bobby. Probably Bobby.

Denys Linkov: Yes, Bobby drop tables.

LinkedIn courses on machine learning [21:39]

Roland Meertens: He says his name starts with Bobby. Yes, okay. Interesting. You also created a lot of LinkedIn courses, right?

Denys Linkov: Yep.

Roland Meertens: How are those going? Do you get a lot of people who attend them?

Denys Linkov: So they're online courses, so they're on a variety of different topics, starting off with basic prompt engineering, all the way to fairly advanced courses on AI pricing and grounding techniques to avoid hallucinations. So I've made eight or nine right now, it's blurring together on top of that, but it's a very interesting experience. LinkedIn has a very large and professional studio down in Carpinteria in the US and then I think they have a few global locations as well. So it's been a great experience. I did my first course in April 2023, so they were looking for instructors to help build out the AI content, and it's a very smooth operation, how it works.

So generally, the process is that there's a content manager, you work with the content manager on defining, they have a certain set of priorities, you have a certain set of interests, of courses to teach, aligning on that, creating a table of contents, and then from that you write the contents yourself and then work with a producer to actually bring that course live into the catalog.

Roland Meertens: Yes, interesting. What kind of people are following these courses? Are these people who are just getting started with AI? Are these people who have already a lot of experience there?

Denys Linkov: It's a variety. I think generally from what I've seen in the courses, it's typically people who are curious about the topic. So you'd see some people who are more advanced, but I don't think the platform has that reputation yet as being the expert curated course. Generally, the platform is available on educational providers, so universities, libraries, and so forth. And also, a lot of companies use it internally as their LMS, so you get a wide variety. Sometimes I've had people message me, we've had very in-depth conversations after they take the course, but generally the ones that are best performing are the intro to GPT-4 and other ones, because people just want to know what is this whole AI thing that's happening.

Roland Meertens: But it's also interesting that people need a course instead of just go to openai.com.

Denys Linkov: Yes. Everybody learns in different ways.

Roland Meertens: Yes, it's fascinating. Cool. Anyways, thank you very much for joining The InfoQ Podcast and for being here at QCon in San Francisco.

Denys Linkov: Thanks for having me.

