Writing Evals for LLM Agents

October 17, 2024

LLM-powered software experiences present a unique challenge within software engineering. Because models are non-deterministic, a variety of factors complicate evaluation:

  • Asking the same question multiple times can result in different answers each time.
  • Different foundation models or model versions respond in unique or unexpected ways.
  • Small changes to prompts can drastically change responses and performance.

Given this amount of uncertainty, how should we evaluate program correctness and robustness? The industry uses a variety of benchmarks to compare foundation models, but these benchmarks may not reflect our actual usage and can’t measure the effect of prompt changes. Instead of writing tests that must all pass for the program to be considered working, we measure ongoing performance against a suite of tests. Using these tests, we can quickly see the impact of switching to a different foundation model or tweaking the prompt.
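As a rough illustration, here is a minimal sketch of what such a suite can look like in Python. The names (`EvalCase`, `run_eval_suite`, the grading callables) are hypothetical and stand in for whatever harness a project actually uses; the key idea is that each case is run repeatedly and the result is an averaged pass rate rather than a hard pass/fail.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    question: str                  # user question fed to the agent
    grade: Callable[[str], bool]   # returns True if the response is acceptable

def run_eval_suite(agent: Callable[[str], str],
                   cases: list[EvalCase],
                   runs: int = 3) -> float:
    """Run every case `runs` times and return the averaged pass rate (0.0-1.0)."""
    passes, total = 0, 0
    for case in cases:
        for _ in range(runs):
            total += 1
            try:
                if case.grade(agent(case.question)):
                    passes += 1
            except Exception:
                pass  # any exception (e.g., a broken query) counts as a failure
    return passes / total if total else 0.0

# Example usage with a stub agent standing in for a real model call.
suite = [
    EvalCase("Who scored the most points in week 3?",
             grade=lambda answer: "points" in answer.lower()),
]

def stub_agent(question: str) -> str:
    return f"Mock answer about points for: {question}"

print(f"pass rate: {run_eval_suite(stub_agent, suite):.2%}")
```

Because the harness returns a rate instead of a binary result, the same suite can be re-run against a new model or a revised prompt and the two numbers compared directly.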

For the Fantasy Football LLM Agents, I wrote a suite of tests that checks that generated SQL queries execute successfully and that each user question receives a contextually correct answer. In addition, each test suite is run multiple times, and pass rates are averaged across all runs. The results are tabulated below:

Agent               Llama 3.1-8B   GPT-4o-mini   GPT-4o
Basic               38.89%         44.45%        68.06%
C3                  15.27%         20.83%        12.50%
Few Shot            63.89%         69.44%        62.50%
Chain of Thought    69.44%         73.61%        81.94%
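For concreteness, the sketch below shows one way the two per-question checks could be implemented. The database path and the keyword-based correctness check are assumptions made for illustration; a real project might instead compare against a known query result or use an LLM judge to grade contextual correctness.

```python
import sqlite3

def sql_executes(query: str, db_path: str = "fantasy_football.db") -> bool:
    """Check that the generated SQL actually runs against the stats database."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            conn.execute(query).fetchall()
        finally:
            conn.close()
        return True
    except sqlite3.Error:
        return False

def answer_is_contextual(answer: str, expected_keywords: list[str]) -> bool:
    """A crude proxy for 'contextually correct': the answer mentions the facts we expect."""
    answer = answer.lower()
    return all(keyword.lower() in answer for keyword in expected_keywords)

def check_case(generated_sql: str, answer: str, expected_keywords: list[str]) -> bool:
    """A run passes only if the SQL executes and the answer looks contextually correct."""
    return sql_executes(generated_sql) and answer_is_contextual(answer, expected_keywords)
```

Plugging a check like `check_case` into the averaging harness above yields pass rates like those in the table, one per agent/model combination.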

The capabilities unlocked by evals far surpass those of simple unit tests. Evals allow us to measure application performance across models and agents. They also open up new conversations and data-driven analysis of tradeoffs. Teams can weigh devoting more engineering time to improving prompting vs. paying for a better model. Analysts can also model cost per user against the average expected spend with LLM vendors. If a team is serious about delivering a well-engineered, LLM-powered solution, custom evals are a must.