Stock Pilot

I took this example from a Claude Code dev session: Stock Pilot, an inventory-management agent

Use case

A mid-size outdoor-gear retailer keeps 250 stocked products (SKUs) across 3 warehouses, buys from 12 suppliers, and has 90 days of sales history to go on. Someone has to do the daily round: spot what’s running low, work out how much to reorder and from whom, place the orders, tell the ops team in chat, and write up the week. Stock Pilot is the agent hired for that round

Agent

Each of the six capabilities below is graded by a task in the suite:

Environment

Simulated components

  • Software: PostgreSQL (ERP system of record), Mattermost (Slack-compatible chat), Mailpit (supplier email sink), WireMock (seeded disruption feed)
  • Data: seed generator (250 SKUs · 3 warehouses · 12 suppliers · 90d sales), reorder policy (reorder points · thresholds)
  • Observability: usage meter (tokens · wall-clock)

  • To read the business, PostgreSQL holds the system of record, seeded with the same data every run, with the reorder policy supplying the thresholds that decide what counts as low
  • To act on it, the agent writes purchase orders and ERP updates back to PostgreSQL, posts to Mattermost as a Slack-compatible chat, and drafts supplier mail into Mailpit; WireMock returns a fixed disruption feed
  • To prove the result, the grader reads what the run left behind: the order rows, the posted message, the drafted mail, the weekly report, and the forecasts its tools returned; a usage meter records tokens and wall-clock for the efficiency and latency graders
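
As a rough illustration of the usage meter in the last bullet, here is a minimal sketch of a counter that accumulates tokens and wall-clock time and checks them against a budget. The class name, its fields, and the budget values are assumptions made for illustration; the source does not show how the real meter is built:

```python
import time
from dataclasses import dataclass, field


@dataclass
class UsageMeter:
    """Hypothetical meter: accumulates token counts and elapsed wall-clock time."""

    input_tokens: int = 0
    output_tokens: int = 0
    # Start the clock when the meter is created (assumed to be the start of the run).
    _started: float = field(default_factory=time.perf_counter)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        # Called once per model turn with that turn's token counts.
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    def wall_clock_seconds(self) -> float:
        return time.perf_counter() - self._started

    def within_budget(self, max_output_tokens: int, max_seconds: float) -> bool:
        # Both limits must hold; the limits themselves are made-up inputs here.
        return (
            self.output_tokens <= max_output_tokens
            and self.wall_clock_seconds() <= max_seconds
        )


meter = UsageMeter()
meter.record(input_tokens=100, output_tokens=40)
meter.record(input_tokens=50, output_tokens=10)
```

A grader for efficiency or latency would then only need to read the meter after the run, which keeps measurement separate from the agent's own tools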

Suite

  • What’s running low? Look up a single SKU’s stock (R1), then list everything below its reorder point (R2)
  • How much to reorder? Forecast demand over two weeks (R6 and R7), harder when a promotion skews the baseline (R8)
  • Who to buy from, and on what terms? Weigh the suppliers’ lead times, place the purchase order, and write it back to the ERP (R3 to R5)
  • How did the week go? Generate the ops summary (R9)
  • Can it run the day end to end? Chain those moves into a daily low-stock sweep, then a promo reorder, then a batch of repeated alerts (F1 to F3)
| ID | Task | What it tests | Grader |
|---|---|---|---|
| R1 | Stock lookup | single read | exact_match |
| R2 | Below reorder point | list all SKUs below threshold | set_match |
| R3–R5 | Create PO · lead times · cycle-count | write paths, joins | action_taken / set_match |
| R6–R7 | Reorder rec · 14-day forecast | formula, baseline forecast | numeric_tolerance ±20% |
| R8 | Promo-month forecast | mean anchoring | numeric_tolerance ±25% |
| R9 | Weekly report | report structure | llm_judge |
| F1 | Daily low-stock sweep | stockout management | composite (action + wall-clock budget + ranked top-3) |
| F2 | Promo reorder with forecast | recommendation quality | regex_present (numeric confidence) |
| F3 | Batch low-stock alerts | cost of 10 routine alerts | efficiency (output-token budget) |
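
To make the grader names in the table concrete, here is a minimal sketch of how the four deterministic ones might be written. Only the names exact_match, set_match, numeric_tolerance and regex_present come from the table; the signatures, the choice to measure tolerance relative to the expected value, and the example pattern are assumptions:

```python
import re


def exact_match(actual, expected) -> bool:
    # R1-style check: the answer must equal the expected value exactly.
    return actual == expected


def set_match(actual, expected) -> bool:
    # R2-style check: same members, with order and duplicates ignored.
    return set(actual) == set(expected)


def numeric_tolerance(actual: float, expected: float, rel_tol: float) -> bool:
    # Assumption: the ±20% / ±25% band is measured relative to the expected value.
    if expected == 0:
        return actual == 0
    return abs(actual - expected) <= rel_tol * abs(expected)


def regex_present(text: str, pattern: str) -> bool:
    # F2-style check: passes if the pattern occurs anywhere in the text.
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Hypothetical pattern for "a numeric confidence"; the real one is not given.
CONFIDENCE_PATTERN = r"confidence[:=]?\s*\d+(\.\d+)?%?"
```

With these definitions a forecast of 118 units against an expected 100 passes at ±20% while 121 does not, and regex_present only checks that a number is attached to the word "confidence", not that the number is sensible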

Tools

Each task asks the agent to take an action, and each action needs one tool, pointed at the part of the environment it touches. That’s four reads on the Postgres ERP, two writes back to it, an alert to the chat, a supplier note to the mail sink, and a disruption check on the WireMock feed; then three heavier tools that forecast demand, compare supplier quotes, and write the weekly report

Each of the last three, highlighted in the list below, wraps a whole subagent and hands back a block of prose: the forecast reads 90 days of history, the comparison weighs the supplier quotes, and the report runs a full loop to fill a template. Every tool result lands raw in the agent’s context window, and these three return the most

Stock Pilot
12 tools · the agent's access in the sandbox
  • get_stock_level
  • forecast_demand *
  • compare_supplier_quotes *
  • generate_weekly_report *
  • list_low_stock
  • get_sales_velocity
  • get_supplier_catalog
  • draft_email_to_supplier
  • create_purchase_order
  • update_erp_record
  • send_slack_alert
  • search_web_for_disruptions
Highlighted with an asterisk: the three tools that each wrap a hardcoded subagent.
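
One way to check the tool count against the breakdown in the text (four reads, two writes, one chat alert, one supplier note, one disruption check, three subagent tools) is to tag each tool with the part of the environment it touches. The tool names come from the list above; the category labels, and which four tools count as ERP reads, are inferred from the names and should be read as assumptions:

```python
from collections import Counter

# Tool name -> environment target. The labels are hypothetical, chosen for this sketch.
TOOLS = {
    "get_stock_level": "erp_read",
    "list_low_stock": "erp_read",
    "get_sales_velocity": "erp_read",
    "get_supplier_catalog": "erp_read",
    "create_purchase_order": "erp_write",
    "update_erp_record": "erp_write",
    "send_slack_alert": "chat",
    "draft_email_to_supplier": "mail",
    "search_web_for_disruptions": "disruption_feed",
    "forecast_demand": "subagent",
    "compare_supplier_quotes": "subagent",
    "generate_weekly_report": "subagent",
}

# Tally tools per target so the totals can be compared with the prose.
counts = Counter(TOOLS.values())

# The three heavier tools whose results are blocks of prose.
SUBAGENT_TOOLS = sorted(name for name, kind in TOOLS.items() if kind == "subagent")
```

Under that mapping the tallies add up to the 12 tools listed, and the three subagent tools stand apart as the ones most likely to dominate the context window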