The Art of Evals

This guide outlines how I build evaluations in five phases: finding a use case, mapping the capabilities of an agent that could solve it, simulating the environment, creating a task suite, and iterating on each isolated task. It grew out of my work at METR, UK AISI and GSMA, where I lead agentic evaluations in telecoms. Some of the examples come from telecoms: as a regulated industry, it presents components that are hard to replicate, including real private physical infrastructure, lots of regulation, and lots of slow humans (aren’t we all, after all?).

Contents

1 Introduction

2 Structure of an evaluation

Use case, Agent, Environment, Suite, Task

3 Examples

Stock Pilot, Radio energy saving, Transport fault repair

This guide reflects how I see evals; I believe the approach carries over to other domains with similar problems. If it’s useful, or you have any questions, find me on LinkedIn.
