Use case
Finding a problem to translate
Real enterprise evaluations are only as good as the use case they are built on.
Most of the time, instead of inventing your own problem, you can look for problems that practitioners actually face. Companies and industry bodies often publish their use cases and customer stories in blogs, news articles, and white papers, which already gives you a good starting point. Examples: GSMA, Databricks, Microsoft.
From these, you can infer that an AI system already tries to solve the problem, and reverse-engineer an agent and an evaluation for it.
Generalising the problem
A test I always do when picking one: could a hypothetical model with expert-level knowledge of this field solve this problem, without being trained for it specifically?
If only a task-specific model can pass it, the evaluation might not be that generalisable after all, and we might end up measuring the training data, not the capabilities emerging from it.
From use case to task suite
The figure below runs this chapter’s idea end to end on a telecom operator: six use cases, the capabilities they decompose into, and the task suites those capabilities become.
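The same decomposition (use case, then capabilities, then task suites) can also be written down as a small data structure. What follows is only a minimal sketch: the class names (`UseCase`, `Capability`, `Task`), the `uncovered` helper, and the telecom example values are all hypothetical illustrations, not the contents of the figure or of any published use case.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    """One evaluation item: a prompt plus a description of what counts as a pass."""
    prompt: str
    pass_criterion: str


@dataclass
class Capability:
    """A skill the use case depends on, with the task suite that probes it."""
    name: str
    task_suite: list[Task] = field(default_factory=list)


@dataclass
class UseCase:
    """A practitioner problem, where it was found, and what it decomposes into."""
    name: str
    source: str
    capabilities: list[Capability] = field(default_factory=list)

    def task_suites(self) -> dict[str, list[Task]]:
        # Map each capability name to the tasks that exercise it.
        return {c.name: c.task_suite for c in self.capabilities}

    def uncovered(self) -> list[str]:
        # Capabilities that have no tasks yet, i.e. gaps in the evaluation.
        return [c.name for c in self.capabilities if not c.task_suite]


# Hypothetical example values, for illustration only.
use_case = UseCase(
    name="network fault triage",
    source="published customer story (placeholder)",
    capabilities=[
        Capability(
            name="alarm correlation",
            task_suite=[
                Task(
                    prompt="Group these alarms by likely common cause.",
                    pass_criterion="Groups match the expert-labelled clusters.",
                )
            ],
        ),
        Capability(name="root-cause explanation"),  # no tasks written yet
    ],
)

print(use_case.uncovered())
```

Keeping capabilities as their own layer between the use case and the tasks makes it easy to see which parts of the use case the evaluation does not cover yet.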
With the use case picked, the next chapter imagines the agent that would solve it.