Stock Pilot
I took this example from a Claude Code dev session: Stock Pilot, an inventory-management agent
Use case
A mid-size outdoor-gear retailer keeps 250 stocked products (SKUs) across 3 warehouses, buys from 12 suppliers, and has 90 days of sales history to go on. Someone has to do the daily round: spot what’s running low, work out how much to reorder and from whom, place the orders, tell the ops team in chat, and write up the week. Stock Pilot is the agent hired for that round.
Agent
Each of the six capabilities below is graded by a task in the suite:
Environment
- To read the business, the agent queries PostgreSQL, the system of record, which is seeded with the same data every run; the reorder policy supplies the thresholds that decide what counts as low
- To act on it, the agent writes purchase orders and ERP updates back to PostgreSQL, posts alerts to Mattermost (standing in as a Slack-compatible chat), and drafts supplier mail into Mailpit; WireMock returns a fixed disruption feed
- To prove the result, the grader reads what the run left behind: the order rows, the posted message, the drafted mail, the weekly report, and the forecasts its tools returned. A usage meter records tokens and wall-clock time for the efficiency and latency graders
Suite
- What’s running low? Look up a single SKU’s stock (R1), then list everything below its reorder point (R2)
- How much to reorder? Forecast demand over two weeks (R6 and R7), harder when a promotion skews the baseline (R8)
- Who to buy from, and on what terms? Weigh the suppliers’ lead times, place the purchase order, and write it back to the ERP (R3 to R5)
- How did the week go? Generate the ops summary (R9)
- Can it run the day end to end? Chain those moves into a daily low-stock sweep, then a promo reorder, then a batch of repeated alerts (F1 to F3)
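As an illustration of the first question, the R2 check can be sketched as a single query over the stock table. This is a minimal sketch, not the suite's actual code: the `inventory` table, its `on_hand` and `reorder_point` columns, the sample rows, and the function name are all assumptions, and an in-memory SQLite database stands in for the seeded PostgreSQL instance so the example runs on its own. It also assumes "below" means strictly less than the reorder point.

```python
import sqlite3


def skus_below_reorder_point(conn):
    """Return SKUs whose on-hand quantity is strictly below their reorder point."""
    rows = conn.execute(
        "SELECT sku FROM inventory WHERE on_hand < reorder_point ORDER BY sku"
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical schema and rows; SQLite stands in for the seeded ERP database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (sku TEXT PRIMARY KEY, on_hand INTEGER, reorder_point INTEGER)"
)
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?)",
    [("TENT-2P", 4, 10), ("STOVE-01", 25, 10), ("BOOT-42", 9, 12)],
)

low_stock = skus_below_reorder_point(conn)
print(low_stock)
```

Because the database is seeded identically on every run, a query like this always has one correct answer, which is what lets R2 be graded by comparing sets.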
| ID | Task | What it tests | Grader |
|---|---|---|---|
| R1 | Stock lookup | single read | exact_match |
| R2 | Below reorder point | list all SKUs below threshold | set_match |
| R3–R5 | Create PO · lead times · cycle-count | write paths, joins | action_taken / set_match |
| R6–R7 | Reorder rec · 14-day forecast | formula, baseline forecast | numeric_tolerance ±20% |
| R8 | Promo-month forecast | mean anchoring | numeric_tolerance ±25% |
| R9 | Weekly report | report structure | llm_judge |
| F1 | Daily low-stock sweep | stockout management | composite (action + wall-clock budget + ranked top-3) |
| F2 | Promo reorder with forecast | recommendation quality | regex_present (numeric confidence) |
| F3 | Batch low-stock alerts | cost of 10 routine alerts | efficiency (output-token budget) |
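The simpler graders named in the table can be sketched in a few lines. This is a rough sketch under assumptions, not the suite's implementation: the function signatures are hypothetical, the tolerance is assumed to be relative to the expected value, and the token-budget check is assumed to pass when usage is at or under the budget.

```python
def exact_match(expected, actual):
    """Pass only when the answer equals the expected value (the R1 style of check)."""
    return expected == actual


def set_match(expected, actual):
    """Pass when the same items come back, ignoring order and duplicates (R2 style)."""
    return set(expected) == set(actual)


def numeric_tolerance(expected, actual, tolerance):
    """Pass when actual is within a relative tolerance of expected.

    Assumption: tolerance is a fraction of the expected value, e.g. 0.20 for +/-20%.
    """
    if expected == 0:
        return actual == 0
    return abs(actual - expected) <= tolerance * abs(expected)


def within_token_budget(output_tokens, budget):
    """Pass when the run used no more output tokens than its budget (F3 style)."""
    return output_tokens <= budget
```

Under these assumptions, a forecast of 115 units against an expected 100 passes at ±20%, while 130 against 100 fails even at the looser ±25% used for the promo-month task.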
Tools
Each task asks the agent to take an action, and each action needs one tool, pointed at the part of the environment it touches. That makes four reads on the Postgres ERP, two writes back to it, an alert to the chat, a supplier note to the mail sink, and a disruption check on the WireMock feed, plus three heavier tools that forecast demand, compare supplier quotes, and write the weekly report.
Highlighted below, each of the last three wraps a whole subagent and hands back a block of prose: the forecast reads 90 days of history, the comparison weighs the supplier quotes, and the report runs a full loop to fill a template. Every tool result lands raw in the agent’s context window, and these three return the most text.
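One way to see which tools fill the context window is to total the size of their results per tool. The sketch below is illustrative only: the `ToolResult` type, the tool names, the sample payload sizes, and the four-characters-per-token estimate are all assumptions rather than anything measured in the session.

```python
from dataclasses import dataclass


@dataclass
class ToolResult:
    tool: str  # hypothetical tool name
    text: str  # raw result text as it would land in the context window


def rough_tokens(text):
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def context_cost_by_tool(results):
    """Sum estimated tokens per tool, largest contributor first."""
    totals = {}
    for result in results:
        totals[result.tool] = totals.get(result.tool, 0) + rough_tokens(result.text)
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Made-up payloads: two short lookups and one long prose forecast.
example = [
    ToolResult("get_stock", "x" * 40),
    ToolResult("forecast_demand", "y" * 4000),
    ToolResult("get_stock", "x" * 40),
]
cost = context_cost_by_tool(example)
print(cost)
```

With these invented sizes the single prose-returning call outweighs both lookups combined, which is the pattern the paragraph above describes for the three subagent-backed tools.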