Task – The Art of Evals

Task

A task has four parts you design: the starting state (the isolated environment for that trial), the prompt (the only part the agent sees), the tools granted, and the scored outcome the grader reads
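As a rough illustration, the four parts can be held in one record, with a helper that exposes only what the agent is meant to see. This is a minimal sketch: the `Task` class, its field names, `agent_view`, and the example values are all hypothetical and not part of any particular harness.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Task:
    """One task definition. All names here are illustrative assumptions."""
    task_id: str
    starting_state: str                      # e.g. a snapshot name for the isolated environment
    prompt: str                              # the natural-language description shown to the agent
    tools: Tuple[str, ...]                   # names of the tools granted for the trial
    grader: Callable[[Dict], float]          # reads the outcome, returns a score in [0, 1]
    hidden: Dict = field(default_factory=dict)  # ground truth, never rendered to the agent


def agent_view(task: Task) -> Dict:
    """Return only the parts the agent is allowed to see."""
    return {"prompt": task.prompt, "tools": list(task.tools)}


# Invented example values, for illustration only.
example = Task(
    task_id="demo-001",
    starting_state="snapshot-a",
    prompt="The new sensor never shows up on the dashboard. Get it reporting.",
    tools=("shell", "web_search"),
    grader=lambda outcome: 1.0 if outcome.get("reporting") else 0.0,
    hidden={"expected_interval_s": 30},
)
```

Keeping the hidden ground truth in a separate field makes it harder to serialise it into agent-visible context by accident.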

Write the prompt so the agent must infer the work

The prompt describes the use case in natural language. The model infers the technical requirements from domain knowledge. If a model with no expertise in your domain could score above 0.5, the prompt leaks too much. What the prompt must leave out is covered with the other leak channels in Keep the answer hidden

Difficulty comes from the domain

A failure must point at a missing capability. A capability task should be hard because the work is hard, not because the setup is. The difference is in what a failure tells you. When an agent fails a well-built task, the failure points at a missing capability. When it fails because a tool lacked documentation or the sandbox was flaky, the score records a fact about the infrastructure instead. Anthropic describes the same effect on the specification side, where ambiguity in the task becomes noise in the metrics,¹ and METR ran human QA on most HCAST families to remove difficulties that were unfair in exactly this way.² On the task side, nothing about the setup, tools, docs, or boot sequence should be the reason an agent fails

The same logic decides what the agent gets access to. If a real practitioner would search the web, run code, or consult the governing standard, the environment should allow it. Withholding those tools makes the task harder, but the extra difficulty exists only inside the eval, so the score stops predicting performance on the real work. And it decides longevity: the same task should still work next year, on a different agent harness, with a different tool set; difficulty that depends on this year’s setup is the setup again, measured from another angle

One primary bottleneck per task. A task that requires five things to go right in sequence before any feedback arrives tests luck as much as capability: the agent can fail at any link, and the final score cannot say which one. Harder tasks can be built by chaining smaller ones, but when a task cleaves cleanly into two sequential subtasks, split it; the suite then gets two clean measurements instead of one it cannot interpret

Easier and harder variants

A variant changes one part of the task (the prompt, input data, initial state, or difficulty level) while keeping the core scoring idea stable

Use variants to create:

easier guided cases (prefill fields, add partial guidance, relax a bound);
harder unaided cases (remove the guidance)

Deriving an easier variant from a hard task is cheap; the reverse, taking an easy task and making it genuinely hard while keeping it well-scoped, is much harder, so author the hard case first. Do not add variants just to increase task count. A variant should teach the suite something about the agent’s capability or reliability
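One way to keep variants cheap is to derive the guided case from the hard one rather than author it separately. The sketch below assumes a hypothetical `TaskSpec` record and an `easier_variant` helper; the field names and example values are invented for illustration.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical task record; only the fields a variant may change."""
    task_id: str
    prompt: str
    prefilled_fields: Tuple[str, ...] = ()   # fields filled in for the agent
    tolerance: float = 0.01                  # scoring bound, hidden from the agent


def easier_variant(hard: TaskSpec, hint: str,
                   prefill: Tuple[str, ...], tolerance: float) -> TaskSpec:
    """Derive a guided variant: add guidance, prefill fields, relax a bound."""
    return replace(
        hard,
        task_id=f"{hard.task_id}-guided",
        prompt=f"{hard.prompt}\n\nHint: {hint}",
        prefilled_fields=prefill,
        tolerance=tolerance,
    )


hard = TaskSpec(task_id="demo-001", prompt="Bring the new device online.")
guided = easier_variant(
    hard,
    hint="Start with the registration logs.",
    prefill=("site_id",),
    tolerance=0.05,
)
```

Because the hard task is the source and the record is immutable, the unaided case stays untouched and the guided case differs only in the named fields.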

Grading the outcome

Scores range from 0 to 1. A competent agent should score above 0.9; an incompetent agent below 0.1.³ Document what 0.0, 0.3, 0.5, 0.8, and 1.0 look like for each task

A grader can inspect the transcript, the outcome, or both. There are three grader families:

Grader type	Methods	Strengths	Weaknesses	Use
Code-based	String checks, binary tests, static analysis, outcome verification, tool-call checks, transcript metrics	Fast, cheap, objective, reproducible, easy to debug ⁴	Brittle if the expected pattern is too narrow; weak for subjective quality	Default where the desired outcome can be checked directly
Model-based	Rubric scoring, natural-language assertions, pairwise comparison, reference-based evaluation, multi-judge consensus	Flexible, scalable, captures nuance, handles freeform outputs	Non-deterministic, more expensive, needs calibration against human judgement ⁵	Use where deterministic checks cannot capture the relevant quality dimension
Human	SME review, spot checks, A/B testing, inter-annotator agreement ⁶	Best signal for expert judgement; useful for calibration	Slow, expensive, hard to scale	Use for QA, baselines, calibration, and reviewing surprising failures

Prefer code-based outcome grading where correctness can be checked directly. In engineering domains, correctness depends on precise parameter values and measurable behaviour, and LLM graders add non-determinism that is hard to validate,⁷ so use model-based and human graders only for dimensions that deterministic checks cannot capture

How scoring functions should behave:

Deterministic. Same submission, same score. Agent runs vary, so some noise in outcomes is fine, but the scoring function itself must not introduce noise; scoring stays fully automated, with no manual grading step that could drift between runs
Format-tolerant. Convert a value submitted as a string into a number; return 0 without errors on an unparseable submission; tolerate whitespace, quoting, and case
Outcome-led. Do not require a specific sequence of steps unless the sequence is what the task measures. Agents may find valid approaches you did not anticipate, so the default is to score the artefact, state, or measurement produced.⁸
Partial where useful. If a task has meaningful components, award partial credit for them. An agent that diagnoses the right failing network function but misses the remediation is not equivalent to one that never identifies the fault
Pass the human-expert litmus test. Could a human expert in the domain be evaluated with this same scoring mechanism? If the scoring only works for AI agents, it is overfit
Grounded thresholds. Every threshold traces to a normative standard, a published measurement, or an operational rationale explaining why this boundary matters. Criteria that trace to published standards or documented procedures make the same rubric work on any conformant platform
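A minimal sketch of a scoring function with several of these properties: deterministic, format-tolerant, outcome-led, and partial. The submission keys (`fault`, `measurement`), the equal 0.5 weights, and the example labels are assumptions made for illustration; a real task would ground its weights and tolerance as described above.

```python
import math
from typing import Optional


def parse_number(raw: object) -> Optional[float]:
    """Best-effort conversion to a finite float; None if the value is unparseable."""
    if isinstance(raw, bool):
        return None
    if isinstance(raw, str):
        raw = raw.strip().strip("'\"").strip()   # tolerate whitespace and quoting
    elif not isinstance(raw, (int, float)):
        return None
    try:
        value = float(raw)
    except (ValueError, OverflowError):
        return None
    return value if math.isfinite(value) else None


def normalize_label(raw: object) -> str:
    """Strip whitespace and quotes and ignore case; non-strings become ''."""
    if not isinstance(raw, str):
        return ""
    return raw.strip().strip("'\"").strip().casefold()


def score_submission(submission: dict, expected_fault: str,
                     target: float, tolerance: float) -> float:
    """0.5 for the right diagnosis, 0.5 for a measurement within tolerance."""
    score = 0.0
    if normalize_label(submission.get("fault")) == normalize_label(expected_fault):
        score += 0.5
    value = parse_number(submission.get("measurement"))
    if value is not None and abs(value - target) <= tolerance:
        score += 0.5
    return score
```

An unparseable measurement contributes 0 rather than raising, so one malformed field does not erase credit for a correct diagnosis.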

A reference solution proves the task is solvable. Before trusting a score, write a known-good answer and confirm that every grader scores it at 1.0; this verifies that the graders accept a correct answer. It also distinguishes a broken grader from a broken task. With a capable model, a zero pass rate across many trials usually points at the task or grader, not the agent,¹ and re-running the reference solution says where to look first: if the known-good answer no longer passes, suspect the grader; if it still passes, suspect the task or prompt
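The check and the triage rule can be sketched as follows. The function names, the grader signature (outcome in, score out), and the example graders are hypothetical; the logic mirrors the paragraph above and is not a prescribed implementation.

```python
from typing import Callable, Dict

Grader = Callable[[dict], float]


def check_reference(reference: dict, graders: Dict[str, Grader]) -> Dict[str, float]:
    """Return the graders that score the known-good answer below 1.0."""
    failing = {}
    for name, grade in graders.items():
        score = grade(reference)
        if score < 1.0:
            failing[name] = score
    return failing


def triage(pass_rate: float, reference: dict, graders: Dict[str, Grader]) -> str:
    """Suggest where to look first when a capable model never passes."""
    if pass_rate > 0.0:
        return "no triage needed"
    if check_reference(reference, graders):
        return "inspect the grader first"
    return "inspect the task or prompt first"


# Invented example: one sound set of graders, one with a grader that rejects everything.
reference = {"answer": 42}
good_graders = {"exact": lambda o: 1.0 if o.get("answer") == 42 else 0.0}
bad_graders = {**good_graders, "always_zero": lambda o: 0.0}
```

Running `check_reference` whenever a grader or environment changes catches a regression before it shows up as an unexplained drop in pass rate.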

One dimension per judge. If a dimension needs a model grader (an LLM-as-a-judge), use one judge per dimension rather than one judge across all of them; each verdict is then made against a single rubric, and a moving score traces to a single judge. Give each judge the anchors the task already documents, and let it return “Unknown” when the transcript does not show enough to decide.¹ Without that exit, a judge with too little to go on invents a score rather than abstaining, and an inconclusive dimension is scored as if it had been judged
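A sketch of how per-dimension verdicts might be collected so that “Unknown” stays visible instead of being folded into a number. The `Judge` type, the stub judges, and the dimension names are assumptions; a real judge would call a model with a single-dimension rubric, which is not shown here.

```python
from typing import Callable, Dict, Optional

# A judge scores one dimension in [0, 1], or returns None for "Unknown".
Judge = Callable[[str], Optional[float]]


def run_judges(transcript: str, judges: Dict[str, Judge]) -> Dict[str, Optional[float]]:
    """Run one judge per dimension and keep the verdicts separate."""
    return {dimension: judge(transcript) for dimension, judge in judges.items()}


def summarise(verdicts: Dict[str, Optional[float]]) -> dict:
    """Average only the judged dimensions and report the abstentions."""
    judged = {d: s for d, s in verdicts.items() if s is not None}
    unknown = sorted(d for d, s in verdicts.items() if s is None)
    mean = sum(judged.values()) / len(judged) if judged else None
    return {"mean_of_judged": mean, "unknown": unknown}


# Stub judges standing in for model calls (illustration only).
stub_judges: Dict[str, Judge] = {
    "diagnosis": lambda t: 1.0 if "root cause" in t else None,
    "clarity": lambda t: 0.5,
    "safety": lambda t: None,  # the transcript shows nothing about this dimension
}
verdicts = run_judges("The agent states the root cause and applies a fix.", stub_judges)
summary = summarise(verdicts)
```

Reporting the abstentions alongside the mean keeps an inconclusive dimension from being read as a judged one.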

Keep the answer hidden

Passing must require solving the problem, not exploiting a loophole or replaying a memorised solution. The answer (scoring bounds, answer keys, reference implementations, expected values) can reach the agent through three channels: the prompt, the environment at run time, and the training data. Close all three

In the prompt. The agent should not be able to read the answer off the task description. Leave out:

final or intermediate values;
formula and parameter names;
configuration identifiers and priority levels;
cause-class labels and named remediations;
clause, section, or rule numbers from the governing standard or code;
any names from a hidden reference implementation
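The list above lends itself to a simple lint that runs before a prompt ships. The following is a sketch under stated assumptions: `find_prompt_leaks`, the clause-number pattern, and the example terms are invented, and plain substring matching will produce false positives on short terms, so findings need human review.

```python
import re
from typing import List

# Rough pattern for "clause 4.2", "Section 5.3.2", "rule 7", "§ 3.1" and similar.
CLAUSE_PATTERN = re.compile(r"(?:clause|section|rule|§)\s*\d+(?:\.\d+)*", re.IGNORECASE)


def find_prompt_leaks(prompt: str, hidden_terms: List[str]) -> List[str]:
    """List hidden terms and clause-style references that appear in the prompt."""
    findings = []
    lowered = prompt.casefold()
    for term in hidden_terms:
        if term.casefold() in lowered:
            findings.append(f"hidden term: {term}")
    for match in CLAUSE_PATTERN.finditer(prompt):
        findings.append(f"clause reference: {match.group(0)}")
    return findings


# Invented examples: parameter names taken from a hypothetical hidden reference.
leaky = find_prompt_leaks(
    "Set max_retries per section 5.3.2 of the standard.",
    ["max_retries", "PRIORITY_HIGH"],
)
clean = find_prompt_leaks(
    "The new sensor never shows up on the dashboard. Get it reporting.",
    ["max_retries", "PRIORITY_HIGH"],
)
```

The hidden terms would come from the task's own hidden metadata, so the lint stays in step with the grader.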

At run time. The ground truth lives in hidden metadata, never in anything the agent can inspect: not retrieval pipelines, not agent-accessible file systems, not prompts. The agent’s knowledge comes from public standards documents, which contain the normative requirements the bounds derive from; the bounds themselves stay hidden. Runtime verification confirms the outcome is real; in the factory example, you can’t fake device registration or data-plane connectivity

If an agent can pass by any of these, the grader is measuring the loophole:⁹

echoing hidden values;
gaming a parser;
targeting a brittle string check;
optimising to a stated threshold without satisfying the operational requirement

The contamination test: could an agent find the exact scoring bounds by searching for the sample ID, prompt text, or a distinctive phrase? If yes, the pipeline is contaminated; never mix hidden ground truth into agent-visible context.¹⁰
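A run-time version of the same check can walk everything the agent is able to read and look for ground-truth values. This is a minimal sketch: `scan_visible_files`, the file names, and the example values are hypothetical, and exact string matching will miss reformatted or re-encoded values.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def scan_visible_files(root: Path, hidden_values: List[str]) -> List[Tuple[str, str]]:
    """Return (relative path, value) pairs where a hidden value appears in a readable file."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for value in hidden_values:
            if value in text:
                hits.append((str(path.relative_to(root)), value))
    return hits


# Demo on a throwaway directory standing in for the agent-accessible file system.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "README.txt").write_text("Sensor setup notes.", encoding="utf-8")
    (root / "notes.json").write_text('{"expected_latency_ms": "17.25"}', encoding="utf-8")
    demo_hits = scan_visible_files(root, ["17.25", "ref_impl_v2"])
```

An empty result is not proof of a clean environment, but a non-empty one shows that hidden ground truth has reached agent-visible context.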

In training data. A task only measures capability if the model has not already seen it; grading is outcome-led, so an agent that reproduces a memorised solution scores the same as one that derived it. The solution is the part that must never leave your hands: if it leaks, the task is dead.¹¹ The prompt matters too, but the solution is the fatal one

While a suite is in use, nothing about it goes on the public web: no repos, gists, blog posts, or link-shared drives; what leaks ends up in training corpora, and an agent with web access can retrieve it at run time. If an AI assistant helps you build a task, turn off training, history, and memory; assume the provider trains on your inputs unless you’ve turned it off. Treat any task you publish as spent

The public benchmarks survive by holding most of their tasks back: GDPval keeps 1,100 of its 1,320 tasks private and releases only a scrubbed 220-task subset,¹² and FrontierMath keeps all but a handful of its problems private.¹³ Anything they do release carries a canary string, a unique marker that says it must never appear in training data, so contamination can at least be detected;¹⁴ even then, public benchmarks like GPQA have still ended up in training sets.¹⁵
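A canary can be as simple as a random identifier embedded in every released file and searched for later. The sketch below assumes an invented marker format and helper names; it does not reproduce the canary string of any published benchmark.

```python
import uuid
from typing import Dict, List


def make_canary() -> str:
    """Create a unique marker; the wording is illustrative, not a standard format."""
    return f"EVAL-CANARY {uuid.uuid4()} (must not appear in training corpora)"


def stamp(document: str, canary: str) -> str:
    """Prepend the canary to a document before release."""
    return f"{canary}\n{document}"


def find_contaminated(samples: Dict[str, str], canary: str) -> List[str]:
    """Return the names of text samples that contain the canary."""
    return sorted(name for name, text in samples.items() if canary in text)


canary = make_canary()
released = stamp("Task description goes here.", canary)
flagged = find_contaminated(
    {"crawl-a": released, "crawl-b": "unrelated text"},
    canary,
)
```

As the paragraph above notes, a canary only helps detect contamination; it does not prevent released tasks from entering a corpus.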

1. Anthropic, “Demystifying evals for AI agents” (2025). https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
2. Rein et al. (METR), “HCAST: Human-Calibrated Autonomy Software Tasks” (2025). https://arxiv.org/abs/2503.17354
3. METR, “Desiderata”, METR Task Development Guide. https://taskdev.metr.org/desiderata/
4. Chen et al., “Evaluating Large Language Models Trained on Code” (2021). https://arxiv.org/abs/2107.03374
5. Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” (2023). https://arxiv.org/abs/2306.05685
6. Artstein & Poesio, “Inter-Coder Agreement for Computational Linguistics” (2008). https://aclanthology.org/J08-4004/
7. Schroeder & Wood-Doughty, “Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge” (2024). https://arxiv.org/abs/2412.12509
8. Jimenez et al., “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” (2023). https://arxiv.org/abs/2310.06770
9. Krakovna et al., “Specification gaming: the flip side of AI ingenuity” (2020). https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/
10. Oren et al., “Proving Test Set Contamination in Black Box Language Models” (2023). https://arxiv.org/abs/2310.17623
11. Magar & Schwartz, “Data Contamination: From Memorization to Exploitation” (2022). https://aclanthology.org/2022.acl-short.18/
12. OpenAI, “GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks” (2025). https://arxiv.org/abs/2510.04374
13. Epoch AI, “FrontierMath” (2024). https://epoch.ai/frontiermath/tiers-1-4/about
14. Google, “BIG-bench: training on test set (canary string).” https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/training_on_test_set/README.md
15. “BIG-Bench Canary Contamination in GPT-4,” LessWrong (2024). https://www.lesswrong.com/posts/kSmHMoaLKGcGgyWzs/big-bench-canary-contamination-in-gpt-4
