---
name: agentic-error-recovery
description: Review or design error recovery for agentic AI systems. Use when an agent pipeline fails silently, loops forever, double-executes actions, or when hardening an agent for production — covers retries, idempotency, loop guards, fallbacks, and observability.
---

# Agentic Error Recovery Review

You are hardening an agentic AI system (LLM agents with tool use, multi-step plans, or autonomous loops) for production. Agents fail differently from normal services: the failure is often a *plausible wrong action*, not an exception. Review the system against each failure class below; for each, report what exists, what's missing, and the concrete fix.

## Failure class 1 — The model output is wrong or malformed

- Tool-call arguments are schema-validated before execution, and validation failures are sent back to the model for repair (bounded to 2–3 attempts) instead of crashing the run.
- Free-text outputs that downstream code parses (JSON, dates, IDs) go through a tolerant parser plus a repair prompt, never a bare `JSON.parse`.
- Hallucinated references (nonexistent file paths, record IDs, URLs) are verified to exist before the agent acts on them.
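
The validate-then-repair loop above can be sketched as follows. This is a minimal illustration, not a reference implementation: `ask_model` is a hypothetical callable that sends a prompt and returns the model's raw text, the schema is a plain name-to-type mapping rather than a real JSON Schema, and the bound of 3 is taken from the 2–3 range in the checklist.

```python
import json
from typing import Callable

MAX_REPAIR_ATTEMPTS = 3  # assumption: upper end of the 2-3 bound above


def validate_args(args: dict, schema: dict[str, type]) -> list[str]:
    """Return human-readable problems; an empty list means the args are valid."""
    errors = []
    for name, expected in schema.items():
        if name not in args:
            errors.append(f"missing required field '{name}'")
        elif not isinstance(args[name], expected):
            errors.append(f"field '{name}' should be {expected.__name__}")
    for name in args:
        if name not in schema:
            errors.append(f"unexpected field '{name}'")
    return errors


def get_valid_args(ask_model: Callable[[str], str], prompt: str,
                   schema: dict[str, type]) -> dict:
    """Ask for tool arguments, feeding validation errors back for repair."""
    message = prompt
    errors: list[str] = []
    for _ in range(MAX_REPAIR_ATTEMPTS):
        raw = ask_model(message)
        try:
            args = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc.msg}"]
        else:
            if isinstance(args, dict):
                errors = validate_args(args, schema)
            else:
                errors = ["top-level value must be a JSON object"]
            if not errors:
                return args
        # Send the specific problems back instead of crashing the run.
        message = (f"{prompt}\n\nYour previous reply was rejected: "
                   f"{'; '.join(errors)}. Reply with corrected JSON only.")
    raise ValueError(
        f"arguments still invalid after {MAX_REPAIR_ATTEMPTS} attempts: {errors}")
```

When the bound is exhausted the function raises, so the caller can route the run to its escalation path rather than executing a tool with bad arguments.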

## Failure class 2 — The action executed, then something failed

This is the dangerous one. An agent that retries a step containing a side effect will double-send the email, double-charge, double-post.

- Every tool with a side effect is classified: idempotent / idempotent-with-key / unsafe-to-retry.
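- Tools that are not naturally idempotent use idempotency keys derived from the run ID and step number, or a check-before-write pattern. Tools that support neither are marked unsafe-to-retry and are never retried automatically.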
- The agent's transcript/journal records an intent entry before a side-effecting tool runs and marks the side effect as committed BEFORE reporting success, so a crash between action and record shows up on resume as an intent with no commit.
- Resume logic replays from the journal rather than re-planning from scratch when possible.
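
A minimal sketch of run-and-step idempotency keys with journal replay is shown below. Everything named here is hypothetical: `Journal` is an in-memory stand-in for a durable append-only log, `make_key` and `run_step` are illustrative helpers, and the tool is assumed to accept an `idempotency_key` argument. The sketch also writes an intent entry before executing, which is one possible way to make the crash window between action and record detectable.

```python
import hashlib


def make_key(run_id: str, step: int, tool: str) -> str:
    """Stable key: the same run, step, and tool always map to the same key."""
    return hashlib.sha256(f"{run_id}:{step}:{tool}".encode()).hexdigest()[:32]


class Journal:
    """In-memory stand-in for a durable append-only log (assumption)."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, key: str, status: str, result=None) -> None:
        self.entries.append({"key": key, "status": status, "result": result})

    def latest(self, key: str):
        """Most recent entry for this key, or None if the step never started."""
        for entry in reversed(self.entries):
            if entry["key"] == key:
                return entry
        return None


def run_step(journal: Journal, run_id: str, step: int,
             tool_name: str, tool_fn, args: dict):
    key = make_key(run_id, step, tool_name)
    last = journal.latest(key)
    if last and last["status"] == "committed":
        return last["result"]  # replay from the journal, do not execute again
    if last and last["status"] == "intent":
        # The action may or may not have happened; do not guess.
        raise RuntimeError(
            f"step {step} ({tool_name}) has an uncommitted intent; verify before retrying")
    journal.append(key, "intent")
    result = tool_fn(idempotency_key=key, **args)
    journal.append(key, "committed", result)
    return result
```

Calling `run_step` twice for the same run and step executes the tool once and returns the journaled result the second time. An intent without a commit is surfaced as an error so that a human or a check-before-write probe decides what happens next.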

## Failure class 3 — Loops and runaway behavior

- Hard ceilings exist: max steps per run, max tokens per run, max wall-clock time, max cost. Each ceiling has a defined behavior when hit: fail loudly and summarize partial progress, never silently truncate.
- Loop detection: the same tool call with identical args N times in a row aborts with a diagnostic, not another retry.
- Sub-agent spawning (if any) has depth and fan-out limits.
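
The ceilings and the repeat check can live in one small guard object that the agent loop calls before every tool execution. The sketch below is illustrative: `RunGuard`, its default limits, and the two exception names are assumptions, and a cost ceiling would follow the same pattern as the token ceiling.

```python
import json
import time
from dataclasses import dataclass, field


class CeilingHit(Exception):
    """A hard limit was reached; the caller should fail loudly and summarize."""


class LoopDetected(Exception):
    """The same tool call repeated too many times in a row."""


@dataclass
class RunGuard:
    max_steps: int = 50          # illustrative defaults, not recommendations
    max_tokens: int = 200_000
    max_seconds: float = 600.0
    max_repeats: int = 3         # the "N" in the loop-detection rule
    steps: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)
    _last_call: str = ""
    _repeats: int = 0

    def check(self, tool: str, args: dict, tokens_used: int) -> None:
        """Call once per step, before executing the tool call."""
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise CeilingHit(f"step ceiling of {self.max_steps} exceeded")
        if self.tokens > self.max_tokens:
            raise CeilingHit(f"token ceiling of {self.max_tokens} exceeded")
        if time.monotonic() - self.started > self.max_seconds:
            raise CeilingHit(f"wall-clock ceiling of {self.max_seconds}s exceeded")
        # Canonical JSON so that key order does not hide a repeated call.
        call = json.dumps([tool, args], sort_keys=True, default=str)
        self._repeats = self._repeats + 1 if call == self._last_call else 1
        self._last_call = call
        if self._repeats >= self.max_repeats:
            raise LoopDetected(
                f"{tool} called {self._repeats} times in a row with identical args: {call}")
```

Both exceptions carry a diagnostic message, so the handler can write the run summary and stop instead of retrying.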

## Failure class 4 — External dependencies fail

- Provider errors are classified by type: rate limits (retry with backoff + jitter), transient 5xx (bounded retry), content refusals (do NOT retry identically; rephrase or escalate), context overflow (compact and retry once).
- A fallback model or degraded mode exists for hard provider outages, and it's actually wired, not just designed.
- Tool timeouts are set per tool, not one global value; slow tools can't stall the whole run past the run ceiling.
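
One way to express the per-class retry policy is sketched below. The error shape is invented for illustration: `ProviderError` and the codes `context_length_exceeded` and `content_refusal` are hypothetical, and real provider SDKs expose status and error codes differently. The backoff uses full jitter, meaning a uniform random delay between zero and an exponentially growing cap.

```python
import random
import time
from enum import Enum


class ErrorKind(Enum):
    RATE_LIMIT = "rate_limit"
    TRANSIENT = "transient"
    REFUSAL = "refusal"
    CONTEXT_OVERFLOW = "context_overflow"
    FATAL = "fatal"


class ProviderError(Exception):
    """Hypothetical error shape; map your SDK's exceptions onto it."""

    def __init__(self, status: int, code: str = "") -> None:
        super().__init__(f"{status} {code}".strip())
        self.status = status
        self.code = code


def classify(err: ProviderError) -> ErrorKind:
    if err.status == 429:
        return ErrorKind.RATE_LIMIT
    if err.code == "context_length_exceeded":  # hypothetical code
        return ErrorKind.CONTEXT_OVERFLOW
    if err.code == "content_refusal":          # hypothetical code
        return ErrorKind.REFUSAL
    if 500 <= err.status < 600:
        return ErrorKind.TRANSIENT
    return ErrorKind.FATAL


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_recovery(call, compact, max_retries: int = 4, sleep=time.sleep):
    """Retry rate limits and 5xx with backoff; compact once on overflow."""
    attempt = 0
    compacted = False
    while True:
        try:
            return call()
        except ProviderError as err:
            kind = classify(err)
            if kind in (ErrorKind.RATE_LIMIT, ErrorKind.TRANSIENT) and attempt < max_retries:
                sleep(backoff_delay(attempt))
                attempt += 1
            elif kind is ErrorKind.CONTEXT_OVERFLOW and not compacted:
                compact()  # caller-supplied: shrink the context in place
                compacted = True
            else:
                raise  # refusals and fatal errors are never retried identically
```

A refusal propagates on the first occurrence, which leaves the rephrase-or-escalate decision to the caller. Passing `sleep` as a parameter keeps the policy testable without real delays.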

## Failure class 5 — Nobody can tell what happened

- Every run has a trace: each step's prompt summary, model, tool calls with args, results, tokens, latency, and cost — queryable by run ID.
- Metrics tag every failure with one of the classes above, so "agent reliability" is a dashboard, not a vibe.
- A human-readable run summary is produced on failure: what was attempted, what succeeded, what state was left behind, what's safe to retry.
- Escalation path: runs that exhaust recovery land in a review queue with full context, not a generic error log.
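
As a rough illustration of the failure summary, the sketch below builds one from per-step trace records. The `StepTrace` fields and the failure-class labels are hypothetical. A real trace would also carry the prompt summary, model, args, results, and cost listed above.

```python
from dataclasses import dataclass


@dataclass
class StepTrace:
    step: int
    tool: str
    ok: bool
    has_side_effect: bool = False
    failure_class: str = ""  # hypothetical labels, e.g. "dependency"
    tokens: int = 0
    latency_ms: float = 0.0


def summarize_failure(run_id: str, steps: list[StepTrace]) -> str:
    """Report what was attempted, what succeeded, what state was left, what is safe to retry."""
    succeeded = [s for s in steps if s.ok]
    failed = [s for s in steps if not s.ok]
    committed = [s for s in succeeded if s.has_side_effect]
    safe = [s for s in failed if not s.has_side_effect]
    unsafe = [s for s in failed if s.has_side_effect]
    classes = sorted({s.failure_class for s in failed if s.failure_class})

    def names(items: list[StepTrace]) -> str:
        return ", ".join(f"#{s.step} {s.tool}" for s in items) or "none"

    return "\n".join([
        f"Run {run_id}: {len(steps)} steps attempted, "
        f"{len(succeeded)} succeeded, {len(failed)} failed.",
        f"Failure classes: {', '.join(classes) or 'unclassified'}",
        f"Side effects already committed: {names(committed)}",
        f"Safe to retry: {names(safe)}",
        f"Needs human check before retry: {names(unsafe)}",
    ])
```

The same records can feed the per-class metrics, since each failed step already carries its class label.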

## Output format

1. **Risk verdict** — which failure classes are unhandled, ranked by blast radius.
2. **Findings per class** — present/missing/partial, with file:line evidence where you reviewed code.
3. **Fix plan** — ordered list; idempotency and ceilings come before observability polish.
4. **What to test** — chaos cases worth scripting (kill mid-step, force malformed output, simulate 429 storms).
