Introduction
This is a guide to designing good evaluations for AI agents in real-world domains. It grew out of my experience studying and analysing how frontier labs do capability evaluations that are actually useful. The examples here are from telecom, specifically 5G networks, because that’s the real-world domain I work in.
When I started at GSMA, I set out to replicate, for telecom, the evaluation work I had done for METR and UK AISI. I was training models on telecom domain knowledge at the same time, and I realised that this training did not translate into better performance on my benchmarks at all, let alone give me any guidance on what data my language models needed to learn from.
A more effective approach I came up with when designing evaluations:
- Centre on a use case: a piece of real work from the industry
- Come up with a hypothetical agent that should be able to solve it, and name its capabilities
- Work out the environment to test it in; that rarely comes for free, so I put together a recipe: a set of open-source components that, combined, simulate the use case closely enough to act on
- Break the problem into a suite of sequential tasks scored in isolation
- Find an MVP agent harness that lets me evaluate fairly: raw bash and Python tools first, more advanced setups later
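To make the "sequential tasks scored in isolation" step concrete, here is a minimal Python sketch of one way to read it. Everything in it is an assumption for illustration: the `Task` and `run_suite` names, the toy agent, and the two-task fault example are hypothetical, not a real harness. The idea it shows is that each task starts from a known-good state built by a reference solution, so a failure on one task does not drag down the score of the next.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]


@dataclass
class Task:
    name: str
    # Known-good solution, used only to build the next task's starting state.
    reference: Callable[[State], State]
    # Scorer for this task alone.
    check: Callable[[State], bool]


def run_suite(tasks: List[Task],
              agent: Callable[[str, State], State],
              initial_state: State) -> Dict[str, bool]:
    """Run tasks in order, scoring each one from a reference starting state."""
    scores: Dict[str, bool] = {}
    state = dict(initial_state)
    for task in tasks:
        # The agent works on a copy, so it cannot corrupt the shared state.
        attempt = agent(task.name, dict(state))
        scores[task.name] = task.check(attempt)
        # Advance with the reference solution, not the agent's output.
        state = task.reference(state)
    return scores


def toy_agent(task_name: str, state: State) -> State:
    # Hypothetical agent: misses the alarm, but can diagnose one if given it.
    if task_name == "diagnose" and "alarm" in state:
        state["root_cause"] = "misconfigured_cell"
    return state


tasks = [
    Task("detect",
         reference=lambda s: {**s, "alarm": "high_drop_rate"},
         check=lambda s: s.get("alarm") == "high_drop_rate"),
    Task("diagnose",
         reference=lambda s: {**s, "root_cause": "misconfigured_cell"},
         check=lambda s: s.get("root_cause") == "misconfigured_cell"),
]

scores = run_suite(tasks, toy_agent, {"cell": "A1"})
print(scores)
```

In this toy run the agent fails `detect` but still gets credit for `diagnose`, because the second task starts from the reference alarm rather than from the agent's own failed attempt. In an end-to-end run the same agent would fail both, which hides the capability it actually has.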
The recipe is the step that makes this different from coding. To simulate a coding environment, the bottleneck is reconstructing a codebase and its data; everything is already software, so a container plus libraries and synthetic data gets you a working world. In engineering domains the bottleneck is usually private layers and hardware: the live state and data of the running system, and the proprietary systems that hold them. In regulated industries it is usually some mix of hardware, private data, regulation, humans, and critical infrastructure.
- In telecom, the live network state and its performance data sit in systems that operators and vendors hold, and the radio equipment is physical and proprietary
- In energy, utilities hold the grid state in their control systems, and substations and meters are hardware in the field
- In oil, process data lives in the plant control systems, and the plant itself is physical equipment on a site
- In manufacturing, the state of the line lives in the factory’s control systems, and the machines on the floor are physical