Agentic Error Recovery

agentic-error-recovery.md · updated 2026-06-12

Reviews an agentic AI system against the five failure classes that actually break agents in production: malformed model output, side effects that double-execute, runaway loops, provider failures, and missing observability. Returns a risk verdict and an ordered fix plan.

Use this when

  • An agent pipeline fails silently or double-executes actions
  • You are about to give an agent autonomous access to user or business data
  • You are designing retry and fallback behavior for multi-step agent runs

SKILL.md
---
name: agentic-error-recovery
description: Review or design error recovery for agentic AI systems. Use when an agent pipeline fails silently, loops forever, double-executes actions, or when hardening an agent for production — covers retries, idempotency, loop guards, fallbacks, and observability.
---

# Agentic Error Recovery Review

You are hardening an agentic AI system (LLM agents with tool use, multi-step plans, or autonomous loops) for production. Agents fail differently from normal services: the failure is often a *plausible wrong action*, not an exception. Review the system against each failure class below; for each, report what exists, what's missing, and the concrete fix.

## Failure class 1 — The model output is wrong or malformed

- Tool-call arguments are schema-validated before execution, and validation failures are sent back to the model for repair (bounded to 2–3 attempts) instead of crashing the run.
- Free-text outputs that downstream code parses (JSON, dates, IDs) go through a tolerant parser plus a repair prompt, never a bare JSON.parse.
- Hallucinated references (nonexistent file paths, record IDs, URLs) are checked against reality before acting on them.
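
When proposing a fix for this class, a small sketch can make the validate-then-repair loop concrete. The version below is illustrative only and uses Python's `json.loads` in the role of `JSON.parse`. The `ask_model_to_repair` callback, the `required` field map, and the limit of 3 repairs are assumptions made for the example, not part of any particular framework.

```python
import json
from typing import Any, Callable

MAX_REPAIRS = 3  # assumption: upper end of the "2-3 attempts" bound


def validate_args(args: Any, required: dict[str, type]) -> list[str]:
    """Return human-readable problems; an empty list means the args are valid."""
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    problems = []
    for name, expected in required.items():
        if name not in args:
            problems.append(f"missing field '{name}'")
        elif not isinstance(args[name], expected):
            problems.append(f"field '{name}' must be {expected.__name__}")
    return problems


def parse_with_repair(
    raw: str,
    required: dict[str, type],
    ask_model_to_repair: Callable[[str, list[str]], str],  # hypothetical callback
) -> dict:
    """Parse and validate model output, asking for a repair a bounded number of times."""
    problems: list[str] = []
    for attempt in range(MAX_REPAIRS + 1):  # one initial try plus MAX_REPAIRS repairs
        try:
            args = json.loads(raw)
            problems = validate_args(args, required)
        except json.JSONDecodeError as exc:
            problems = [f"invalid JSON: {exc.msg}"]
        if not problems:
            return args
        if attempt < MAX_REPAIRS:
            # Feed the concrete problems back to the model instead of crashing the run.
            raw = ask_model_to_repair(raw, problems)
    raise ValueError(f"output still invalid after {MAX_REPAIRS} repairs: {problems}")
```

A real system would likely swap `validate_args` for a full schema validator; the point of the sketch is the bounded loop and the fact that the error text goes back to the model.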

## Failure class 2 — The action executed, then something failed

This is the dangerous one. An agent that retries a step containing a side effect will double-send the email, double-charge, double-post.

- Every tool with a side effect is classified: idempotent / idempotent-with-key / unsafe-to-retry.
- Tools that are unsafe to retry as-is are made safe with an idempotency key derived from the run ID and step number, or with a check-before-write pattern.
- The agent's journal records the intent before each side effect runs and marks it committed BEFORE reporting success, so a crash between the action and the record shows up on resume as an intent without a commit.
- Resume logic replays from the journal rather than re-planning from scratch when possible.
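
One possible shape for the key derivation and journal, offered as a sketch rather than a reference design. It assumes an in-memory dictionary standing in for durable storage, and it adds an explicit "intent" record written before the action so that an interrupted step can be told apart from one that never started. The names `Journal`, `run_side_effect`, and the returned status strings are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable


def idempotency_key(run_id: str, step_index: int, tool_name: str) -> str:
    """Derive a stable key so a retried step always reuses the same key."""
    raw = f"{run_id}:{step_index}:{tool_name}".encode()
    return hashlib.sha256(raw).hexdigest()[:32]


@dataclass
class Journal:
    """In-memory stand-in; a real journal must survive process crashes."""
    entries: dict[str, str] = field(default_factory=dict)  # key -> "intent" | "committed"


def run_side_effect(
    journal: Journal,
    run_id: str,
    step_index: int,
    tool_name: str,
    action: Callable[[str], None],  # receives the key so the downstream API can dedupe
) -> str:
    key = idempotency_key(run_id, step_index, tool_name)
    state = journal.entries.get(key)
    if state == "committed":
        return "skipped"  # already done; replay from the journal instead of re-executing
    if state == "intent":
        # A previous attempt died between intent and commit: the action may or
        # may not have happened. Surface that instead of blindly retrying.
        return "ambiguous"
    journal.entries[key] = "intent"      # write-ahead record
    action(key)
    journal.entries[key] = "committed"   # recorded before the step reports success
    return "executed"
```

On an `"ambiguous"` result the caller would fall back to check-before-write (query the downstream system for the key) or escalate to a human, depending on how the tool was classified.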

## Failure class 3 — Loops and runaway behavior

- Hard ceilings exist: max steps per run, max tokens per run, max wall-clock time, max cost. Each ceiling has a defined behavior when hit (fail loudly or summarize partial progress, but never silently truncate).
- Loop detection: identical tool call with identical args N times in a row aborts with a diagnostic, not another retry.
- Sub-agent spawning (if any) has depth and fan-out limits.
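
A guard object checked once per step is one way to express these limits. The sketch below is an assumption-laden example: the class name `RunGuard`, the default numbers, and the `RunAborted` exception are invented for illustration, and the cost ceiling is left out for brevity (it would follow the same pattern as tokens).

```python
import json
import time
from dataclasses import dataclass, field


class RunAborted(Exception):
    """Raised loudly when a ceiling is hit or a loop is detected."""


@dataclass
class RunGuard:
    max_steps: int = 50          # illustrative defaults, not recommendations
    max_tokens: int = 200_000
    max_seconds: float = 600.0
    max_repeats: int = 3         # the "N times in a row" threshold
    steps: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)
    last_call: str = ""
    repeat_count: int = 0

    def check(self, tool_name: str, args: dict, tokens_used: int) -> None:
        """Call before executing each tool call; raises RunAborted on any violation."""
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise RunAborted(f"step ceiling hit ({self.max_steps})")
        if self.tokens > self.max_tokens:
            raise RunAborted(f"token ceiling hit ({self.max_tokens})")
        if time.monotonic() - self.started > self.max_seconds:
            raise RunAborted(f"wall-clock ceiling hit ({self.max_seconds}s)")
        # Canonical JSON so key order cannot hide an identical call.
        call = tool_name + json.dumps(args, sort_keys=True)
        self.repeat_count = self.repeat_count + 1 if call == self.last_call else 1
        self.last_call = call
        if self.repeat_count >= self.max_repeats:
            raise RunAborted(
                f"loop detected: {tool_name} called {self.repeat_count} times "
                "in a row with identical args"
            )
```

The caller catches `RunAborted` at the top of the run loop and produces the partial-progress summary there, so the abort is visible rather than a silent truncation.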

## Failure class 4 — External dependencies fail

- Provider errors are classified and handled by type: rate limits (retry with backoff + jitter), transient 5xx (bounded retries), content refusals (do NOT retry identically; rephrase or escalate), context overflow (compact and retry once).
- A fallback model or degraded mode exists for hard provider outages, and it's actually wired, not just designed.
- Tool timeouts are set per tool, not one global value; slow tools can't stall the whole run past the run ceiling.
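
The error-to-action mapping can be sketched as a small classifier plus a jittered backoff function. This is an illustration under assumptions: `error_kind` is a hypothetical normalized label that your own client wrapper would set (providers report refusals and context overflow in different ways), and the base and cap values are placeholders.

```python
import random
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    RETRY_BOUNDED = "retry_bounded"
    REPHRASE_OR_ESCALATE = "rephrase_or_escalate"
    COMPACT_AND_RETRY_ONCE = "compact_and_retry_once"
    FAIL = "fail"


def classify(status: Optional[int], error_kind: Optional[str] = None) -> Action:
    """Map a provider failure to a recovery action.

    `error_kind` is a hypothetical label produced by the caller's client wrapper.
    """
    if error_kind == "content_refusal":
        return Action.REPHRASE_OR_ESCALATE   # retrying the same prompt will not help
    if error_kind == "context_overflow":
        return Action.COMPACT_AND_RETRY_ONCE
    if status == 429:
        return Action.RETRY_WITH_BACKOFF
    if status is not None and 500 <= status < 600:
        return Action.RETRY_BOUNDED
    return Action.FAIL


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: a uniform delay between 0 and the capped exponential step."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Keeping classification separate from the retry loop also gives the metrics layer a single place to read the failure type from.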

## Failure class 5 — Nobody can tell what happened

- Every run has a trace: each step's prompt summary, model, tool calls with args, results, tokens, latency, and cost — queryable by run ID.
- Failures are classified into the classes above in metrics, so "agent reliability" is a dashboard, not a vibe.
- A human-readable run summary is produced on failure: what was attempted, what succeeded, what state was left behind, what's safe to retry.
- Escalation path: runs that exhaust recovery land in a review queue with full context, not a generic error log.
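
As a rough picture of what "queryable by run ID" and "human-readable summary" can mean in practice, the sketch below emits one JSON line per step and builds a failure summary from the collected steps. The field names and the `failure_class` labels are assumptions for the example, and only a subset of the trace fields listed above is shown.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class StepTrace:
    run_id: str
    step: int
    model: str
    tool: str
    args: dict
    ok: bool
    tokens: int
    latency_ms: int
    failure_class: Optional[str] = None  # e.g. "malformed_output", "provider_error"


def to_log_line(trace: StepTrace) -> str:
    """One JSON object per step, so a log store can filter on run_id."""
    return json.dumps(asdict(trace), sort_keys=True)


def summarize_failure(traces: list[StepTrace]) -> str:
    """Build a short summary of what succeeded and what failed in one run."""
    succeeded = [t for t in traces if t.ok]
    failed = [t for t in traces if not t.ok]
    lines = [
        f"Run {traces[0].run_id}: {len(succeeded)} step(s) succeeded, "
        f"{len(failed)} failed."
    ]
    for t in failed:
        label = t.failure_class or "unclassified"
        lines.append(f"- step {t.step} ({t.tool}) failed: {label}")
    return "\n".join(lines)
```

A fuller summary would also state what was left behind and which steps are safe to retry, which is where the side-effect classification from failure class 2 feeds in.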

## Output format

1. **Risk verdict** — which failure classes are unhandled, ranked by blast radius.
2. **Findings per class** — present/missing/partial, with file:line evidence where you reviewed code.
3. **Fix plan** — ordered list; idempotency and ceilings come before observability polish.
4. **What to test** — chaos cases worth scripting (kill mid-step, force malformed output, simulate 429 storms).

Install in Claude Code

mkdir -p ~/.claude/skills/agentic-error-recovery && curl -fsSL https://harshrastogi.tech/skills/agentic-error-recovery.md -o ~/.claude/skills/agentic-error-recovery/SKILL.md

Then ask Claude Code for the task — the skill is picked up automatically. For a project-scoped install, use .claude/skills/ inside your repo instead.

Using a different agent?

Skills are plain markdown. Paste the file into any capable AI assistant alongside your task, or wire it into any agent framework that supports system instructions.

Tags

AI Agents, Reliability, Idempotency, Production