
The Art of Evals

A guide to evaluating AI agents.

This guide outlines how I build evaluations in five phases: finding a use case, mapping the capabilities of an agent that could solve it, simulating the environment, creating a task suite, and iterating on each isolated task. It grew out of my work at METR, UK AISI and GSMA, where I lead agentic evaluations in telecoms. Some of the samples come from telecoms; as a regulated industry, it has components that are hard to replicate: real, private physical infrastructure, lots of regulations, and lots of slow humans (aren’t we all, after all?).

This guide reflects how I see evals; I believe the approach carries over to other domains with similar problems. If it’s useful, or you have any questions, find me on LinkedIn.