Use case

Finding a problem to translate

Building real enterprise evaluations depends entirely on the quality of your use case.

Most of the time, instead of inventing your own problem, you can look for problems that practitioners actually face. Industries usually publish their use cases and customer stories in blogs, news articles, and white papers, which already gives you a good starting point. Examples: GSMA, Databricks, Microsoft.

From these, you can infer that an AI system already exists that tries to solve the problem, and reverse-engineer an agent and an evaluation for it.

Generalising the problem

A test I always run when picking a use case: could a hypothetical model with expert-level knowledge of this field solve this problem, without being trained for it specifically?

If only a task-specific model can pass it, the evaluation might not be that generalisable after all, and we might end up measuring the training data, not the capabilities emerging from it.

From use case to task suite

The figure below runs this chapter’s idea end to end on a telecom operator: six use cases, the capabilities they decompose into, and the task suites those become.

[Figure: Map of Capabilities (AI Agent)]
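The decomposition described in this section, from use cases to capabilities to task suites, can be sketched as a small data structure. This is only a minimal illustration under stated assumptions: the class names, the `build_task_suites` helper, and every use case, capability, and task name below are hypothetical placeholders, not the six use cases or the capabilities shown in the figure.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One evaluation item: a prompt plus a description of what passing means."""
    prompt: str
    check: str


@dataclass(frozen=True)
class UseCase:
    """A practitioner problem and the capabilities it decomposes into."""
    name: str
    capabilities: tuple[str, ...]


def build_task_suites(
    use_cases: list[UseCase],
    tasks_by_capability: dict[str, list[Task]],
) -> dict[str, list[Task]]:
    """Return one task suite per capability that some use case needs.

    Capabilities with no tasks yet map to an empty list, so gaps are visible.
    Tasks for capabilities that no use case needs are left out.
    """
    needed = {cap for uc in use_cases for cap in uc.capabilities}
    return {cap: list(tasks_by_capability.get(cap, [])) for cap in sorted(needed)}


# Hypothetical placeholder data (NOT taken from the figure).
use_cases = [
    UseCase("network_fault_triage", ("log_analysis", "root_cause_reasoning")),
    UseCase("complaint_routing", ("intent_classification", "log_analysis")),
]
tasks_by_capability = {
    "log_analysis": [Task("Summarise the alarms in this log.", "names the failing cell")],
    "root_cause_reasoning": [Task("Why did the link drop?", "matches the known cause")],
    "billing_maths": [Task("Compute the prorated fee.", "exact amount")],
}

suites = build_task_suites(use_cases, tasks_by_capability)
for capability, tasks in suites.items():
    print(f"{capability}: {len(tasks)} task(s)")
```

Keying suites by capability rather than by use case reflects the generalisation test above: a capability shared by several use cases is evaluated once, and an empty suite flags a capability that still needs tasks.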

With the use case picked, the next chapter imagines the agent that would solve it.
