
Micro Metrics for LLM System Evaluation at QCon SF 2024

At the QCon San Francisco Conference 2024, Denys Linkov spoke on the complexities of evaluating large language models (LLMs) and the importance of micro metrics. He emphasized that while LLMs have immense potential, their inherent complexity often leads to challenges in real-world applications, particularly when measuring and improving their performance.

He provided a framework for creating, tracking, and refining micro metrics tailored to LLM systems. He noted the importance of integrating robust observability systems, aligning metrics with business objectives, and adapting metrics as the system evolves.

One of the most significant insights from his talk was the risk of overreliance on a single metric such as semantic similarity. He illustrated this with an example where multiple models mistakenly identified “I am a potato” as the best match for the phrase “I like to eat potatoes.” Such errors underscore the limitations of simplistic approaches and the need for a more nuanced, multidimensional evaluation strategy.
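To make that failure mode concrete, the sketch below ranks candidate responses against a reference phrase by embedding cosine similarity. It assumes the sentence-transformers package; the model name and the candidate strings are illustrative and not taken from the talk.

```python
# Minimal sketch: ranking candidate responses by embedding cosine similarity.
# Assumes the sentence-transformers package; the model name and candidate
# strings are illustrative and not taken from Linkov's talk.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "I like to eat potatoes"
candidates = [
    "I am a potato",
    "Potatoes are my favorite food",
    "I enjoy eating fries",
]

ref_emb = model.encode(reference, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the reference and every candidate.
scores = util.cos_sim(ref_emb, cand_embs)[0].tolist()

for candidate, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {candidate}")

# A high similarity score does not guarantee the candidate actually matches
# the user's intent, which is why a single metric like this can mislead.
```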

"The goal of metrics is to save human time and improve user experiences. If your metrics aren’t driving business or technical decisions, they’re not doing their job." - Linkov

He also discussed the challenges of using LLMs as judges for their own performance, a practice that can introduce biases. For example, he cited research showing how LLMs like GPT-4 often misalign with human judgment, especially when evaluating shorter prompts.
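The sketch below shows what an LLM-as-judge loop typically looks like; it assumes the OpenAI Python client, and the model name, rubric, and score scale are illustrative. Linkov's caution applies here: the judge's scores should be periodically compared against human labels, especially for short prompts where misalignment is more likely.

```python
# Minimal LLM-as-judge sketch, assuming the OpenAI Python client.
# The model name, rubric, and score scale are illustrative assumptions;
# judge scores like these can drift from human judgment and should be
# spot-checked against human labels.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Score the answer from 1 (unhelpful) to 5 (fully correct and helpful).
Judge only the content; do not reward length. Reply with a single digit."""

def judge(question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # keep the judge as deterministic as possible
    )
    return int(response.choices[0].message.content.strip())

# Comparing scores for short versus padded answers is one way to probe
# for the length bias mentioned above.
print(judge("What is 2 + 2?", "4"))
print(judge("What is 2 + 2?", "The answer, after careful consideration, is 4."))
```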

He proposed building micro metrics that target specific aspects of an LLM’s performance, similar to detailed feedback in performance reviews, where actionable insights are far more valuable than vague praise or criticism.
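A minimal sketch of what micro metrics can look like in code: each one is a small, targeted check scored per conversation turn rather than one aggregate quality number. The specific checks and thresholds below are assumptions for illustration, not metrics Linkov prescribed.

```python
# Sketch of micro metrics: small, targeted checks scored per response rather
# than one aggregate quality number. The checks below (refusal detection and
# a length budget) are illustrative assumptions, not metrics from the talk.
from typing import Callable, Dict

# Each micro metric maps (user_message, response) to a score in [0.0, 1.0].
MicroMetric = Callable[[str, str], float]

def non_refusal(user_message: str, response: str) -> float:
    """Flags unnecessary refusals, a failure mode worth tracking on its own."""
    refusal_phrases = ("i can't help", "i cannot assist", "i'm unable to")
    return 0.0 if any(p in response.lower() for p in refusal_phrases) else 1.0

def within_length_budget(user_message: str, response: str, max_words: int = 120) -> float:
    """Penalizes overly long answers that bury the resolution."""
    return 1.0 if len(response.split()) <= max_words else 0.0

MICRO_METRICS: Dict[str, MicroMetric] = {
    "non_refusal": non_refusal,
    "within_length_budget": within_length_budget,
}

def score_turn(user_message: str, response: str) -> Dict[str, float]:
    """Scores one conversation turn against every registered micro metric."""
    return {name: fn(user_message, response) for name, fn in MICRO_METRICS.items()}

print(score_turn("How do I reset my password?",
                 "I can't help with that request."))
```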

He outlined a phased approach to automation metrics, progressing from foundational to advanced practices. In the Crawl phase, teams focus on basics like response time. The Walk phase emphasizes maturity with metrics such as resolution rate. The Run phase drives innovation with proactive support copilots. Using customer service as an example, he suggested that starting with a few relevant metrics and iterating on them can enable faster success and a more refined automation strategy.
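One way to encode that progression is a phased metrics plan that only activates the metrics a team is ready to act on. The metric names per phase below are illustrative assumptions about a customer-service assistant, not a list from the talk.

```python
# Sketch of a phased (crawl/walk/run) metrics plan for a customer-service
# assistant. The concrete metric names per phase are illustrative assumptions
# about how a team might start small and iterate.
PHASED_METRICS = {
    "crawl": ["response_time_ms", "error_rate"],
    "walk":  ["resolution_rate", "escalation_rate", "csat_after_bot"],
    "run":   ["proactive_suggestion_acceptance", "agent_time_saved_min"],
}

def active_metrics(phase: str) -> list[str]:
    """Tracks only the metrics for phases reached so far, keeping dashboards small."""
    order = ["crawl", "walk", "run"]
    reached = order[: order.index(phase) + 1]
    return [metric for p in reached for metric in PHASED_METRICS[p]]

print(active_metrics("walk"))
```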

Observability was another major theme in Linkov’s presentation. Borrowing concepts from traditional software engineering, he advocated for robust systems to monitor metrics, logs, and traces. These tools enable engineers to identify and address issues in real time, such as unexpected language shifts during conversations. He shared an example where a user reported a German-language chatbot suddenly responding in English.
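A minimal observability check for the language-shift example might detect the language of each assistant turn and emit a warning when it drifts from the conversation language. This sketch assumes the langdetect package; the logger name and the logged snippet length are arbitrary choices.

```python
# Minimal observability sketch for the language-shift example: detect the
# language of each assistant turn and log a warning when it differs from the
# conversation language. Assumes the langdetect package; the logger name and
# snippet length are arbitrary choices.
import logging
from langdetect import detect

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("llm.observability")

def check_language_shift(conversation_language: str, response: str) -> bool:
    """Returns True (and logs a warning) if the response language drifted."""
    detected = detect(response)
    if detected != conversation_language:
        logger.warning(
            "language_shift expected=%s detected=%s sample=%r",
            conversation_language, detected, response[:80],
        )
        return True
    return False

# A German conversation where the model suddenly answers in English
# would trip this check:
check_language_shift("de", "Sure, I can help you reset your password.")
```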

Aligning metrics with business goals was another key takeaway. He stressed that metrics should drive both technical and business decisions, helping teams prioritize improvements that deliver the greatest value.

Developers and engineers interested in Linkov’s insights can explore his LinkedIn Learning courses. His slides and a video of his QCon SF presentation are expected to be available on the conference website in the coming weeks.
