Transport fault repair

Use case

China Mobile Zhejiang and ZTT built a system that autonomously finds and fixes faults in their transport network, the backbone links that carry traffic between sites. They documented it in the GSMA Foundry use-case library.

The eval rebuilds that loop as eight graded tasks. When a run starts, a hidden fault takes a router link down, and the agent receives the alarm burst the fault triggered plus a live command line into the routers. The headline score, pass^3, rewards restoring the path on three runs in a row.
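The headline metric can be made concrete. A minimal sketch, assuming pass^3 follows the usual pass^k definition (the probability that all k independent trials of a task succeed, estimated as C(c, k)/C(n, k) from c successes in n trials) and that the suite score is a plain mean over tasks; the task outcomes below are hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int = 3) -> float:
    """Estimate pass^k, the chance that k independent trials all succeed.

    Uses C(c, k) / C(n, k) for c successes in n trials. When n == k this is
    1.0 only if every trial succeeded, and 0.0 otherwise.
    """
    if trials < k:
        raise ValueError("need at least k trials per task")
    return comb(successes, k) / comb(trials, k)


def suite_score(results: dict[str, list[bool]], k: int = 3) -> float:
    """Mean pass^k over tasks; results maps a task ID to its per-trial outcomes."""
    scores = [pass_hat_k(sum(runs), len(runs), k) for runs in results.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes: t7 restored the path in all three trials, t8 in only two.
results = {"t7": [True, True, True], "t8": [True, False, True]}
print(suite_score(results))  # 0.5
```

With exactly three trials the estimator is all-or-nothing, which is why a task that succeeds twice out of three contributes nothing.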

Agent

Each of the six capabilities below is graded by at least one task in the suite:

Read the alarms

Collapse an X.733 alarm burst into distinct events and decide whether it warrants a work order.

Localise the fault

Pin the fault to one device or link, cross-checked against live OSPF adjacency.

Diagnose the misconfiguration

Name the wrong parameter and its current value, read from the router over the CLI.

Plan the repair

Write the ordered repair steps and judge whether a draft plan is complete.

Apply the repair

Change the router config over vtysh until adjacency, routes, and config are healthy again.

Verify reachability

Confirm end-to-end connectivity is restored after the repair.
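Localisation (t3) in a chain reduces to finding the one link whose adjacency is missing. A sketch, assuming the routers form a simple linear chain and that each router's neighbour table (for example FRR's `show ip ospf neighbor`) has already been parsed into the set of neighbours in Full state; the router names and input shape are hypothetical.

```python
from typing import Optional


def link_is_up(a: str, b: str, neighbours: dict[str, set[str]]) -> bool:
    """An adjacency counts as up only if both ends list each other as Full neighbours."""
    return b in neighbours.get(a, set()) and a in neighbours.get(b, set())


def localise(routers: list[str], neighbours: dict[str, set[str]]) -> Optional[tuple[str, str]]:
    """Return the single down link in a linear chain, or None if zero or several are down."""
    down = [(a, b) for a, b in zip(routers, routers[1:]) if not link_is_up(a, b, neighbours)]
    return down[0] if len(down) == 1 else None


# Hypothetical chain r1-r2-r3-r4 where the r2-r3 adjacency never reached Full.
chain = ["r1", "r2", "r3", "r4"]
seen = {"r1": {"r2"}, "r2": {"r1"}, "r3": {"r4"}, "r4": {"r3"}}
print(localise(chain, seen))  # ('r2', 'r3')
```

Requiring both ends to agree is the cross-check: a one-sided neighbour entry (say, stuck short of Full on one side) still counts as a down link.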

Environment

Simulated components

Figure: FRR metro router chain (OSPF adjacency, routes); data-network host (reachability); fault record (seeded OSPF misconfig); alarm burst (X.733, seeded); ping and traceroute (reachability probes). Legend: HW sim, Software, Data, Observability.

To read the fault: FRR, the open-source router software, runs the OSPF routing protocol across the chain of routers, and a seeded burst of standard X.733 alarms gives the agent something to triage. A hidden fault record seeds the same misconfiguration every trial.
To act on it: vtysh, the routers' command line, lets the agent inspect and rewrite config on the live device while the rest of the chain stays up.
To prove the result: the grader re-reads the OSPF adjacencies between routers, their routes, and their running config, and runs an independent ping.
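Those re-reads plus the ping combine into a verdict that never consults the agent's narrative. A minimal sketch, assuming the snapshot has already been collected from the routers; the LiveState and Expected shapes, router names, prefix, and the hello-interval line standing in for the seeded fault are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LiveState:
    """Hypothetical snapshot the grader collects after the agent submits."""
    full_adjacencies: set[frozenset[str]]  # router pairs whose OSPF adjacency is Full
    routes: dict[str, set[str]]            # router -> prefixes in its routing table
    running_config: dict[str, str]         # router -> running-config text
    ping_ok: bool                          # outcome of the grader's own probe


@dataclass
class Expected:
    links: set[frozenset[str]]
    routes: dict[str, set[str]]
    fixed_line: tuple[str, str]            # (router, config line that must be present)


def grade_repair(state: LiveState, want: Expected) -> dict[str, bool]:
    """Score from live state only; the agent's own report is never consulted."""
    router, line = want.fixed_line
    config_lines = {ln.strip() for ln in state.running_config.get(router, "").splitlines()}
    return {
        "adjacency": want.links <= state.full_adjacencies,
        "routes": all(p <= state.routes.get(r, set()) for r, p in want.routes.items()),
        "config": line in config_lines,
        "ping": state.ping_ok,
    }


link = frozenset({"r2", "r3"})
state = LiveState(
    full_adjacencies={link},
    routes={"r1": {"10.0.4.0/24"}},
    running_config={"r3": "interface eth1\n ip ospf hello-interval 10\n"},
    ping_ok=True,
)
want = Expected(links={link}, routes={"r1": {"10.0.4.0/24"}},
                fixed_line=("r3", "ip ospf hello-interval 10"))
checks = grade_repair(state, want)
print(all(checks.values()))  # True
```

Returning the four checks separately, rather than one boolean, shows which part of the repair failed when a trial does not pass.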

Suite

What just happened? Collapse the alarm burst into distinct events, then decide whether it’s worth a work order (t1 and t2)
Where’s the fault? Pin it to one device or link (t3)
What’s misconfigured? Name the wrong parameter and its value (t4)
What’s the repair? Write the ordered plan, then check a draft of it for gaps (t5 and t6)
Does the fix hold? Apply it on the live router (t7)
Is the path back? Confirm reachability end to end (t8)
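The two planning tasks (t5, t6) can share one reference plan. A sketch, assuming each step has already been normalised to a canonical label; the step names, the equal weighting of coverage and ordering, and the pairwise ordering measure are assumptions, not the suite's published grader.

```python
from itertools import combinations


def score_plan(plan: list[str], reference: list[str], w_cover: float = 0.5) -> float:
    """Partial credit for an ordered repair plan (t5).

    coverage: share of reference steps that appear in the plan.
    ordering: share of covered step pairs the plan keeps in reference order.
    """
    covered = [s for s in reference if s in plan]
    coverage = len(covered) / len(reference)
    pairs = list(combinations(covered, 2))  # each pair is (earlier, later) in the reference
    if pairs:
        ordering = sum(plan.index(a) < plan.index(b) for a, b in pairs) / len(pairs)
    else:
        ordering = 1.0 if covered else 0.0
    return w_cover * coverage + (1 - w_cover) * ordering


def omitted_steps(draft: list[str], reference: list[str]) -> list[str]:
    """Reference steps missing from a draft plan (t6 asks for the omitted step by name)."""
    return [s for s in reference if s not in draft]


reference = ["back up config", "fix hello-interval", "verify adjacency", "verify reachability"]
plan = ["fix hello-interval", "back up config", "verify adjacency"]
print(round(score_plan(plan, reference), 3))  # 0.708
print(omitted_steps(plan, reference))         # ['verify reachability']
```

Counting ordered pairs rather than exact positions means one swapped step costs some credit instead of zeroing the plan.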

ID	Task	What it tests	Grader
t1	Collapse the alarm burst into distinct events	alarm correlation	Jaccard over the dedup tuple, partial credit
t2	Decide whether the burst needs a work order	triage against live state	verdict vs live state
t3	Localise the fault to a device or link	fault localisation	entity match + live adjacency cross-check
t4	Diagnose the misconfigured parameter and value	config diagnosis	parameter and current-value match
t5	Write the ordered repair plan	repair planning	step coverage + ordering, partial credit
t6	Judge whether a draft plan is complete	plan-completeness judgment	verdict plus the named omitted step
t7	Apply the repair via CLI	live repair	live adjacency, routes, and config
t8	Verify reachability after repair	reachability verification	verdict vs scorer re-probe
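t1's grader compares event sets, so the task hinges on which fields form the dedup tuple. A minimal sketch, assuming the tuple is (managed object, event type, probable cause), so that repeats and severity changes fold into one event; the Alarm record, object names, and alarm values are hypothetical, though event type, probable cause, and perceived severity are X.733 alarm fields.

```python
from typing import NamedTuple


class Alarm(NamedTuple):
    """A few X.733-style fields; the record shape is hypothetical."""
    managed_object: str
    event_type: str
    probable_cause: str
    perceived_severity: str
    event_time: float


Key = tuple[str, str, str]


def dedup_key(a: Alarm) -> Key:
    # Assumed dedup tuple: repeats and severity escalations of the same
    # problem on the same object collapse into one event.
    return (a.managed_object, a.event_type, a.probable_cause)


def collapse(burst: list[Alarm]) -> set[Key]:
    return {dedup_key(a) for a in burst}


def jaccard(predicted: set[Key], reference: set[Key]) -> float:
    """Intersection size over union size: partial credit for a partly right event list."""
    if not predicted and not reference:
        return 1.0
    return len(predicted & reference) / len(predicted | reference)


burst = [
    Alarm("r3/eth1", "communicationsAlarm", "communicationProtocolError", "major", 0.0),
    Alarm("r3/eth1", "communicationsAlarm", "communicationProtocolError", "critical", 5.0),
    Alarm("r2/eth0", "communicationsAlarm", "communicationProtocolError", "major", 0.4),
]
truth = collapse(burst)                        # two distinct events
print(jaccard({dedup_key(burst[0])}, truth))   # 0.5
```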

Tools

Tools exist only for the steps that touch the live network. Collapsing the alarm burst, planning the repair, and judging a draft are reasoning steps over what the agent is handed, so they need no tool. Localising, diagnosing, and applying the fix (t3, t4, t7) mean reading and rewriting router config, so the agent gets vtysh, the FRR command line, plus a general shell for anything vtysh doesn't cover. Verifying (t8) means checking the path end to end, so the agent also gets a reachability probe and a traceroute.
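How the agent's vtysh and exec_shell calls reach a router depends on the sandbox, which isn't specified here. A sketch, assuming each FRR router is a Docker container reached with `docker exec` and that vtysh is driven non-interactively with its `-c` option; the container name r3, interface eth1, and the hello-interval value are hypothetical stand-ins for the seeded fault.

```python
import shlex


def vtysh_argv(container: str, *commands: str) -> list[str]:
    """Build argv that runs vtysh commands inside a router container.

    vtysh runs multiple -c options in order, so config mode must be
    entered explicitly before interface-level commands.
    """
    argv = ["docker", "exec", container, "vtysh"]
    for cmd in commands:
        argv += ["-c", cmd]
    return argv


# Inspection (t3, t4): the OSPF neighbour table on one router.
show = vtysh_argv("r3", "show ip ospf neighbor")

# Repair (t7): restore a hypothetical hello-interval on a hypothetical interface.
fix = vtysh_argv("r3", "configure terminal", "interface eth1", "ip ospf hello-interval 10")

print(shlex.join(fix))
# To run it: subprocess.run(fix, capture_output=True, text=True, check=True)
```

Building an argv list instead of a shell string avoids quoting bugs when a command contains spaces.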

Transport O&M agent

4 tools · the agent's access in the sandbox

vtysh

exec_shell

ping_pair

traceroute

Four tools carry one incident from alarm to verified repair. The grader never inspects these calls; it re-reads the live network state on its own. Two of the tools expose exactly that state: vtysh shows the adjacency, routes, and config the grader re-reads, and ping_pair the reachability it re-probes.
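A re-probe can keep a single unlucky ping from deciding t8. A sketch, assuming the probe shells out to Linux ping and that the verdict is a strict majority over several independent rounds; ping_pair's real interface and output are not documented here, so the function names and the majority rule are hypothetical.

```python
import re

# Summary line from iputils ping ("3 received") or BusyBox ping ("3 packets received").
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def replies(ping_output: str) -> int:
    """Number of echo replies reported in a ping summary line."""
    m = SUMMARY.search(ping_output)
    if m is None:
        raise ValueError("no ping summary line found")
    return int(m.group(2))


def reachable(rounds: list[str]) -> bool:
    """Strict majority of independent probe rounds must get at least one reply."""
    ok = sum(replies(out) > 0 for out in rounds)
    return 2 * ok > len(rounds)


up = "3 packets transmitted, 3 received, 0% packet loss, time 2003ms"
down = "3 packets transmitted, 0 received, 100% packet loss, time 2047ms"
print(reachable([up, down, up]))    # True
print(reachable([down, down, up]))  # False
```

The agent's t8 verdict would then be scored against this independent result, not against the agent's own probes.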

Notes

Impractical task	Problem	Practical alternative
Break a live metro backbone to test the repair	Injecting an OSPF fault on a production network drops customer traffic	Run a six-container FRR/OSPF chain in Docker and seed the fault there
Wait for a real config drift to take an adjacency down	Real misconfigurations are rare, slow, and unrepeatable	Inject one fault from a hidden record at boot, the same fault every trial
Triage against a vendor alarm and CLI platform	Proprietary, vendor-locked, and different per operator	Emit a seeded X.733 alarm burst and expose the routers through FRR’s vtysh CLI
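Repeatability comes from deriving everything in the burst from the hidden record. A minimal sketch, assuming a hypothetical record reduced to the down link plus a seed, and a hypothetical alarm dictionary that borrows X.733 field names; the managed-object naming scheme is invented for illustration.

```python
import random


def seeded_burst(fault_link: tuple[str, str], seed: int, repeats: int = 4) -> list[dict]:
    """Deterministic X.733-style alarm burst for one down OSPF link.

    A private random.Random(seed) means the same fault record always yields
    the same burst: same alarms, same severities, same timestamps, same order.
    """
    rng = random.Random(seed)
    a, b = fault_link
    alarms = []
    for obj in (f"{a}/ospf-nbr-{b}", f"{b}/ospf-nbr-{a}"):  # both ends of the link raise
        for _ in range(repeats):                           # repeats the agent must collapse
            alarms.append({
                "managedObject": obj,
                "eventType": "communicationsAlarm",
                "probableCause": "communicationProtocolError",
                "perceivedSeverity": rng.choice(["major", "critical"]),
                "eventTime": round(rng.uniform(0.0, 30.0), 3),
            })
    alarms.sort(key=lambda al: al["eventTime"])
    return alarms


burst = seeded_burst(("r2", "r3"), seed=7)
print(len(burst), len({al["managedObject"] for al in burst}))  # 8 2
```

Using a private generator rather than the module-level random functions keeps other code in the harness from perturbing the sequence.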

Inconsistency	Problem	Better
Let the agent in before OSPF has converged	A still-flapping adjacency makes the same fault look different each run	Converge OSPF and gate on the fault actually manifesting before the agent’s first turn
Score the repair from the agent’s report	A narrated fix can’t be trusted, and the path may still be down	Re-read live adjacency, routes, and running-config after the agent submits
Judge reachability from a single probe	One ping can pass or fail on incidental container timing	Re-probe with an independent scorer and judge the verdict against it
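The convergence gate amounts to two waits before the agent's first turn: OSPF converged before the fault is injected, then the fault visibly present afterwards. A sketch of one such wait, assuming the harness supplies a polling predicate (for instance, "the target adjacency is no longer Full"); the function, its defaults, and the fake clock used to exercise it are hypothetical.

```python
import time
from typing import Callable


def wait_until_stable(check: Callable[[], bool], stable: int = 3, interval: float = 1.0,
                      timeout: float = 60.0, clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() until it holds on `stable` consecutive polls; False on timeout.

    Requiring a streak, not a single True, filters out a flapping adjacency.
    """
    deadline = clock() + timeout
    streak = 0
    while clock() < deadline:
        streak = streak + 1 if check() else 0
        if streak >= stable:
            return True
        sleep(interval)
    return False


class FakeClock:
    """Test double: sleeping advances time instantly."""
    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


# Hypothetical "fault has manifested" observations: one flap, then steady.
obs = iter([False, True, False, True, True, True])
clk = FakeClock()
ok = wait_until_stable(lambda: next(obs), clock=clk.now, sleep=clk.sleep)
print(ok, clk.t)  # True 5.0
```

Injecting the clock and sleep functions lets the gate be tested without real delays.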

Radio energy saving