Stock Pilot

I took this example from a Claude Code dev session: Stock Pilot, an inventory-management agent.

Use case

A mid-size outdoor-gear retailer keeps 250 stocked products (SKUs) across 3 warehouses, buys from 12 suppliers, and has 90 days of sales history to go on. Someone has to do the daily round: spot what’s running low, work out how much to reorder and from whom, place the orders, tell the ops team in chat, and write up the week. Stock Pilot is the agent hired for that round.

Agent

Each of the six capabilities below is graded by a task in the suite:

Monitor and flag

Watch stock levels; surface SKUs below the reorder point.

Forecast demand

Project demand, especially around promos and seasonality.

Choose suppliers

Weigh price against lead-time and reliability.

Place orders

Create purchase orders and update ERP records.

Notify ops

Send Slack alerts and supplier emails.

Report weekly

Generate the ops summary.
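As a rough illustration of the first capability, the sketch below flags every (SKU, warehouse) pair whose on-hand count is strictly below its reorder point. The `StockRow` shape, the field names, the per-warehouse reorder point, and the sample SKUs are all assumptions made for the example; the actual ERP schema is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StockRow:
    """Hypothetical row shape; the real ERP schema is not given."""
    sku: str
    warehouse: str
    on_hand: int
    reorder_point: int


def below_reorder_point(rows: list[StockRow]) -> list[StockRow]:
    """Return rows strictly below their reorder point, largest shortfall first."""
    low = [r for r in rows if r.on_hand < r.reorder_point]
    # Sort by shortfall (descending), then by SKU and warehouse for a stable order.
    return sorted(
        low,
        key=lambda r: (-(r.reorder_point - r.on_hand), r.sku, r.warehouse),
    )


# Made-up sample data, for illustration only.
rows = [
    StockRow("TENT-2P", "east", on_hand=4, reorder_point=10),
    StockRow("TENT-2P", "west", on_hand=25, reorder_point=10),
    StockRow("STOVE-01", "east", on_hand=10, reorder_point=10),  # at the point, not below
    StockRow("BOOT-44", "north", on_hand=1, reorder_point=20),
]
flagged = [(r.sku, r.warehouse) for r in below_reorder_point(rows)]
print(flagged)
```

Sorting by shortfall is a design choice for the sketch, so the most urgent rows come first; a row sitting exactly at its reorder point is not flagged, matching the "below" wording.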

Environment

Simulated components

Group	Component	Role
Software	PostgreSQL	ERP system of record
Software	Mattermost	Slack-compatible chat
Software	Mailpit	supplier email sink
Software	WireMock	seeded disruption feed
Data	seed generator	250 SKUs · 3 warehouses · 12 suppliers · 90d sales
Data	reorder policy	reorder points · thresholds
Observability	usage meter	tokens · wall-clock

To read the business, the agent queries PostgreSQL, the system of record, which is seeded with the same data every run; the reorder policy supplies the thresholds that decide what counts as low.
To act on it, the agent writes purchase orders and ERP updates back to PostgreSQL, posts to Mattermost, a Slack-compatible chat, and drafts supplier mail into Mailpit; WireMock returns a fixed disruption feed.
To prove the result, the grader reads what the run left behind: the order rows, the posted message, the drafted mail, the weekly report, and the forecasts its tools returned. A usage meter records tokens and wall-clock time for the efficiency and latency graders.

Suite

What’s running low? Look up a single SKU’s stock (R1), then list everything below its reorder point (R2).
How much to reorder? Forecast demand over two weeks (R6 and R7), which gets harder when a promotion skews the baseline (R8).
Who to buy from, and on what terms? Weigh the suppliers’ lead times, place the purchase order, and write it back to the ERP (R3 to R5).
How did the week go? Generate the ops summary (R9).
Can it run the day end to end? Chain those moves into a daily low-stock sweep, then a promo reorder, then a batch of repeated alerts (F1 to F3).

ID	Task	What it tests	Grader
R1	Stock lookup	single read	exact_match
R2	Below reorder point	list all SKUs below threshold	set_match
R3–R5	Create PO · lead times · cycle-count	write paths, joins	action_taken / set_match
R6–R7	Reorder rec · 14-day forecast	formula, baseline forecast	numeric_tolerance ±20%
R8	Promo-month forecast	mean anchoring	numeric_tolerance ±25%
R9	Weekly report	report structure	llm_judge
F1	Daily low-stock sweep	stockout management	composite (action + wall-clock budget + ranked top-3)
F2	Promo reorder with forecast	recommendation quality	regex_present (numeric confidence)
F3	Batch low-stock alerts	cost of 10 routine alerts	efficiency (output-token budget)
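To make the deterministic grader names in the table concrete, here is a minimal sketch of what `exact_match`, `set_match`, `numeric_tolerance`, and `regex_present` might look like. The function signatures, the relative-error definition of tolerance, and the confidence pattern are assumptions for illustration; the suite's real graders are not shown in this write-up.

```python
import re
from typing import Iterable


def exact_match(expected: str, actual: str) -> bool:
    """Pass only if the answer equals the expected value (ignoring outer whitespace)."""
    return expected.strip() == actual.strip()


def set_match(expected: Iterable[str], actual: Iterable[str]) -> bool:
    """Pass if both collections hold the same items, regardless of order."""
    return set(expected) == set(actual)


def numeric_tolerance(expected: float, actual: float, tolerance: float) -> bool:
    """Pass if the relative error is within `tolerance` (0.20 means ±20%)."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) <= tolerance * abs(expected)


def regex_present(pattern: str, text: str) -> bool:
    """Pass if the pattern occurs anywhere in the text."""
    return re.search(pattern, text) is not None


# Hypothetical pattern for "a numeric confidence", e.g. "confidence: 85%".
CONFIDENCE_PATTERN = r"confidence[^0-9]{0,20}\d{1,3}(?:\.\d+)?\s*%"

print(numeric_tolerance(100, 118, 0.20))   # within ±20%
print(numeric_tolerance(100, 121, 0.20))   # outside ±20%
print(regex_present(CONFIDENCE_PATTERN, "Reorder 120 units; confidence: 85%"))
```

The `llm_judge`, `composite`, and `efficiency` graders are left out because they depend on a judge model, a wall-clock budget, and a token meter that a short sketch cannot reproduce faithfully.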

Tools

Each task asks the agent to take an action, and each action needs one tool, pointed at the part of the environment it touches. That’s four reads on the PostgreSQL ERP, two writes back to it, an alert to the chat, a supplier note to the mail sink, and a disruption check on the WireMock feed; then three heavier tools that forecast demand, compare supplier quotes, and write the weekly report.

Highlighted below, each of the last three wraps a whole subagent and hands back a block of prose: the forecast reads 90 days of history, the comparison weighs the supplier quotes, and the report runs a full loop to fill a template. Every tool result lands raw in the agent’s context window, and these three return the most.

Stock Pilot

12 tools · the agent's access in the sandbox

get_stock_level

forecast_demand

compare_supplier_quotes

generate_weekly_report

list_low_stock

get_sales_velocity

get_supplier_catalog

draft_email_to_supplier

create_purchase_order

update_erp_record

send_slack_alert

search_web_for_disruptions

Highlighted: the three tools that each wrap a hardcoded subagent (forecast_demand, compare_supplier_quotes, generate_weekly_report).
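One way to see why the subagent tools matter for context is to group the 12 tools by the part of the environment they touch and tally how much text each group returns. The sketch below does that; the grouping is inferred from the tool names and the description above, characters are used as a crude stand-in for tokens, and the sample results are made up, so treat all three as assumptions.

```python
from collections import Counter

# Grouping inferred from the tool names; an assumption, not the suite's own config.
TOOL_GROUPS: dict[str, list[str]] = {
    "erp_read": [
        "get_stock_level",
        "list_low_stock",
        "get_sales_velocity",
        "get_supplier_catalog",
    ],
    "erp_write": ["create_purchase_order", "update_erp_record"],
    "chat": ["send_slack_alert"],
    "mail": ["draft_email_to_supplier"],
    "disruption_feed": ["search_web_for_disruptions"],
    "subagent": [
        "forecast_demand",
        "compare_supplier_quotes",
        "generate_weekly_report",
    ],
}

# Reverse lookup: tool name -> group name.
GROUP_OF = {tool: group for group, tools in TOOL_GROUPS.items() for tool in tools}


def chars_by_group(calls: list[tuple[str, str]]) -> Counter:
    """Sum the characters each group's results add to the context window."""
    totals: Counter = Counter()
    for tool, result in calls:
        totals[GROUP_OF[tool]] += len(result)
    return totals


# Made-up results, sized only to illustrate the imbalance.
calls = [
    ("get_stock_level", "42"),
    ("send_slack_alert", "ok"),
    ("forecast_demand", "x" * 1200),
]
totals = chars_by_group(calls)
print(totals.most_common(1))
```

With real transcripts, the same tally could be run over logged tool calls to check whether the three subagent tools do in fact dominate what lands in the context window.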
