AI & Machine Learning

Agentic AI in Production: Error Recovery, Observability, and Scaling Patterns

Lessons from running agentic AI systems at scale — how we handle failures, monitor agent behavior, manage costs, and scale from single agents to multi-agent pipelines at Modelia.ai and Asynq.ai.

Harsh Rastogi
Mar 5, 2026 · 15 min
Agentic AI · AI Systems · DevOps · TypeScript · Observability

The Hard Part Isn't Building Agents — It's Running Them

Building your first AI agent is exciting. It calls tools, reasons through problems, and produces results that feel magical. Then you deploy it to production and everything breaks.

At Asynq.ai, our candidate evaluation agent worked flawlessly in development. In production, it hallucinated tool parameters, got stuck in loops, occasionally produced evaluations that contradicted its own reasoning, and cost 3x what we budgeted. At Modelia.ai, our image generation pipeline agent would sometimes approve obviously flawed images because it optimized for completing the workflow rather than quality.

These aren't edge cases — they're the reality of agentic AI in production. This post covers the patterns we developed to handle them.

Failure Modes of Agentic Systems

After running agents in production across two companies, I've categorized failures into five types:

1. Tool Parameter Hallucination

The agent calls the right tool but with fabricated parameters — IDs that don't exist, enum values that aren't valid, or dates in wrong formats.

typescript
// The agent might generate: { model_id: "model-abc-123" }
// when the actual ID is: "mod_7f3a2b1c"

// Solution: Validate ALL tool inputs before execution
function createSafeToolExecutor(tool: Tool): SafeExecutor {
  return async (rawInput: unknown) => {
    // Step 1: Schema validation
    const parseResult = tool.parameters.safeParse(rawInput);
    if (!parseResult.success) {
      return {
        success: false,
        error: `Invalid input: ${parseResult.error.issues.map(i => i.message).join('; ')}`,
        hint: `Expected schema: ${JSON.stringify(tool.parameters.shape, null, 2)}`,
      };
    }

    // Step 2: Referential integrity check
    const input = parseResult.data;
    if (input.model_id) {
      const exists = await db.model.exists({ id: input.model_id });
      if (!exists) {
        const suggestions = await db.model.findMany({
          take: 3,
          orderBy: { usedAt: 'desc' },
          select: { id: true, name: true },
        });
        return {
          success: false,
          error: `Model ID "${input.model_id}" not found.`,
          hint: `Available models: ${suggestions.map(m => `${m.id} (${m.name})`).join(', ')}`,
        };
      }
    }

    return tool.execute(input);
  };
}

The key insight is returning helpful hints when validation fails. The agent uses these hints to self-correct on the next iteration.

2. Infinite Loops

The agent gets stuck repeating the same action because the tool result doesn't change the state in a way the agent recognizes.

typescript
class LoopDetector {
  private history: string[] = [];
  private readonly windowSize = 5;

  recordAction(action: string): boolean {
    this.history.push(action);
    // Bound memory for long-running agents
    if (this.history.length > this.windowSize * 4) {
      this.history = this.history.slice(-this.windowSize);
    }

    if (this.history.length < this.windowSize) return false;

    const recent = this.history.slice(-this.windowSize);

    // Check for exact repetition (A, A, A, A, A)
    if (recent.every(a => a === recent[0])) {
      return true; // Loop detected
    }

    // Check for alternating pattern (A, B, A, B, A)
    if (recent.length >= 4) {
      const isAlternating = recent.every((a, i) =>
        a === recent[i % 2]
      );
      if (isAlternating) return true;
    }

    return false;
  }

  getBreakingPrompt(): string {
    const repeatedAction = this.history[this.history.length - 1];
    return `You appear to be repeating the action "${repeatedAction}" without making progress. \
Please try a different approach or explain why you're stuck so we can help.`;
  }
}
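One refinement worth making explicit (our addition, not shown in the class above): comparing raw action strings misses loops where the model reorders JSON keys in otherwise identical tool calls. Feeding the detector a canonical signature instead fixes that:

```typescript
// Build a stable signature for a tool call so the loop detector compares
// semantically identical actions, regardless of key order in the input object.
function actionSignature(toolName: string, input: unknown): string {
  const canonicalize = (value: unknown): unknown => {
    if (Array.isArray(value)) return value.map(canonicalize);
    if (value !== null && typeof value === 'object') {
      return Object.keys(value as Record<string, unknown>)
        .sort()
        .reduce<Record<string, unknown>>((acc, key) => {
          acc[key] = canonicalize((value as Record<string, unknown>)[key]);
          return acc;
        }, {});
    }
    return value;
  };
  return `${toolName}:${JSON.stringify(canonicalize(input))}`;
}
```

Then call `detector.recordAction(actionSignature(name, input))` instead of passing the raw tool-call string.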

3. Context Window Overflow

Long-running agents accumulate tool results that can exceed the model's context window. The agent starts "forgetting" early decisions.

typescript
interface ContextManager {
  messages: Message[];
  tokenCount: number;
  maxTokens: number;
}

function compactContext(ctx: ContextManager): ContextManager {
  if (ctx.tokenCount < ctx.maxTokens * 0.75) return ctx;

  const systemMsg = ctx.messages[0];
  const recentMessages = ctx.messages.slice(-6);

  // Summarize the middle section
  const middleMessages = ctx.messages.slice(1, -6);
  const toolResults = middleMessages
    .filter(m => m.role === 'tool')
    .map(m => {
      // Tool content isn't always valid JSON — fall back to the raw text
      let parsed: any;
      try {
        parsed = JSON.parse(m.content);
      } catch {
        parsed = { success: true, summary: truncate(m.content, 200) };
      }
      // Keep only essential data from tool results
      return {
        tool: m.toolName,
        success: parsed.success,
        summary: parsed.summary || truncate(JSON.stringify(parsed.data), 200),
      };
    });

  const summaryMessage: Message = {
    role: 'user',
    content: `[Context Summary] Previous actions and results:\n${
      toolResults.map(r => `- ${r.tool}: ${r.success ? 'Success' : 'Failed'} — ${r.summary}`).join('\n')
    }\n\nContinue with the workflow based on these results.`,
  };

  return {
    messages: [systemMsg, summaryMessage, ...recentMessages],
    tokenCount: estimateTokens([systemMsg, summaryMessage, ...recentMessages]),
    maxTokens: ctx.maxTokens,
  };
}
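`compactContext` leans on two helpers we didn't show. Minimal sketches follow; the chars/4 token count is a rough rule of thumb for English text, so swap in a real tokenizer for accuracy:

```typescript
// Truncate long strings to a fixed budget, marking the cut with an ellipsis.
function truncate(text: string, maxLength: number): string {
  return text.length <= maxLength ? text : `${text.slice(0, maxLength - 1)}…`;
}

// Minimal message shape for illustration.
interface Message {
  role: string;
  content: string;
  toolName?: string;
}

// Crude token estimate: ~4 characters per token. Good enough for deciding
// when to compact; not good enough for billing.
function estimateTokens(messages: Message[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}
```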

4. Goal Drift

The agent subtly shifts away from the original objective. At Modelia.ai, we noticed our agents sometimes optimizing for "interesting" images rather than brand-compliant ones.

typescript
// Inject goal reminders every N turns
function addGoalReminder(messages: Message[], originalGoal: string, turnNumber: number): Message[] {
  if (turnNumber % 5 === 0 && turnNumber > 0) {
    return [
      ...messages,
      {
        role: 'user',
        content: `[Reminder] Your original goal: "${originalGoal}". \
Ensure your next actions directly serve this goal. If you've drifted, course-correct now.`,
      },
    ];
  }
  return messages;
}

5. Cost Explosion

Each LLM call costs money. An agent making 30 tool calls at 4K tokens each adds up fast.

typescript
class CostTracker {
  private totalInputTokens = 0;
  private totalOutputTokens = 0;
  private totalThinkingTokens = 0;

  // Pricing per million tokens (Claude Sonnet 4 example)
  private pricing = {
    input: 3.00,
    output: 15.00,
    thinking: 3.00,
  };

  addUsage(usage: { input_tokens: number; output_tokens: number; thinking_tokens?: number }) {
    this.totalInputTokens += usage.input_tokens;
    this.totalOutputTokens += usage.output_tokens;
    this.totalThinkingTokens += usage.thinking_tokens || 0;
  }

  getCurrentCost(): number {
    return (
      (this.totalInputTokens / 1_000_000) * this.pricing.input +
      (this.totalOutputTokens / 1_000_000) * this.pricing.output +
      (this.totalThinkingTokens / 1_000_000) * this.pricing.thinking
    );
  }

  isOverBudget(maxCost: number): boolean {
    return this.getCurrentCost() > maxCost;
  }
}
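In the agent loop we check spend after every turn and stop gracefully, returning what the agent has so far instead of throwing. A sketch, where `runTurn` is a stand-in for one model call plus tool execution and `getCurrentCost` would be `costTracker.getCurrentCost` in practice:

```typescript
// Budget guard around the agent loop: stop cleanly once spend exceeds the cap
// rather than burning more tokens.
interface TurnResult {
  done: boolean;
  output?: string;
}

async function runWithBudget(
  runTurn: () => Promise<TurnResult>,
  getCurrentCost: () => number,
  maxCostUsd: number,
  maxTurns = 30,
): Promise<{ output?: string; stoppedReason: 'done' | 'budget' | 'turns' }> {
  for (let turn = 0; turn < maxTurns; turn++) {
    const result = await runTurn();
    if (result.done) return { output: result.output, stoppedReason: 'done' };
    if (getCurrentCost() > maxCostUsd) {
      return { output: result.output, stoppedReason: 'budget' };
    }
  }
  return { stoppedReason: 'turns' };
}
```

Capping both turns and dollars matters: a cheap model in a tight loop blows through turns first, a thinking-heavy model blows through budget first.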

Observability: Seeing Inside Your Agents

You can't fix what you can't see. Here's our observability stack for agentic systems:

Structured Logging

Every agent action produces a structured log entry:

typescript
interface AgentLogEntry {
  timestamp: string;
  workflowId: string;
  agentName: string;
  turn: number;
  action: 'llm_call' | 'tool_call' | 'tool_result' | 'thinking' | 'final_output' | 'error';
  data: {
    model?: string;
    toolName?: string;
    toolInput?: unknown;
    toolOutput?: unknown;
    thinkingContent?: string;
    textContent?: string;
    error?: string;
    tokenUsage?: { input: number; output: number; thinking?: number };
    latencyMs?: number;
  };
}

class AgentLogger {
  private entries: AgentLogEntry[] = [];

  log(entry: Omit<AgentLogEntry, 'timestamp'>) {
    const fullEntry = { ...entry, timestamp: new Date().toISOString() };
    this.entries.push(fullEntry);

    // Stream to your logging system
    console.log(JSON.stringify(fullEntry));
  }

  getTrace(): AgentLogEntry[] {
    return this.entries;
  }

  getToolCallSummary(): { tool: string; calls: number; errors: number; avgLatencyMs: number }[] {
    const byTool = new Map<string, { calls: number; errors: number; totalLatency: number }>();

    for (const entry of this.entries.filter(e => e.action === 'tool_call' || e.action === 'tool_result')) {
      const name = entry.data.toolName || 'unknown';
      const existing = byTool.get(name) || { calls: 0, errors: 0, totalLatency: 0 };

      if (entry.action === 'tool_call') existing.calls++;
      if (entry.action === 'tool_result' && entry.data.error) existing.errors++;
      if (entry.data.latencyMs) existing.totalLatency += entry.data.latencyMs;

      byTool.set(name, existing);
    }

    return Array.from(byTool.entries()).map(([tool, stats]) => ({
      tool,
      calls: stats.calls,
      errors: stats.errors,
      avgLatencyMs: stats.calls > 0 ? Math.round(stats.totalLatency / stats.calls) : 0,
    }));
  }
}

Dashboard Metrics

We push these to Grafana for real-time monitoring:

typescript
// Prometheus-style metrics
const agentMetrics = {
  workflowsStarted: new Counter('agent_workflows_started_total', 'Total workflows started', ['agent_name']),
  workflowsCompleted: new Counter('agent_workflows_completed_total', 'Total workflows completed', ['agent_name', 'status']),
  turnsPerWorkflow: new Histogram('agent_turns_per_workflow', 'Turns per workflow', {
    buckets: [1, 3, 5, 10, 15, 20, 25, 30],
    labelNames: ['agent_name'],
  }),
  toolCallLatency: new Histogram('agent_tool_call_duration_ms', 'Tool call latency', {
    buckets: [50, 100, 250, 500, 1000, 2500, 5000],
    labelNames: ['tool_name'],
  }),
  costPerWorkflow: new Histogram('agent_workflow_cost_usd', 'Cost per workflow in USD', {
    buckets: [0.01, 0.05, 0.10, 0.25, 0.50, 1.00, 2.00, 5.00],
    labelNames: ['agent_name'],
  }),
  escalationRate: new Gauge('agent_escalation_rate', 'Percentage of workflows requiring human intervention', ['agent_name']),
};

Scaling: From Single Agent to Pipeline

At Modelia.ai, we evolved from a monolithic agent to a pipeline of specialized agents. Here's the architecture:

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Planner    │───▶│  Generator   │───▶│  QC Agent    │───▶│  Publisher   │
│   Agent      │    │  Agent       │    │              │    │  Agent       │
└──────────────┘    └──────────────┘    └──────┬───────┘    └──────────────┘
                                               │
                                     (score < 80)
                                               │
                                        ┌──────▼───────┐
                                        │  Refinement  │──── (back to Generator)
                                        │  Agent       │
                                        └──────────────┘

Each agent is small, focused, and independently testable. Communication happens through a shared state object:

typescript
interface WorkflowState {
  id: string;
  status: 'planning' | 'generating' | 'reviewing' | 'refining' | 'publishing' | 'complete' | 'failed';
  brief: CreativeBrief;
  plan?: ShootPlan;
  generatedImages: GeneratedImage[];
  qualityResults: QualityResult[];
  publishedItems: CatalogEntry[];
  errors: WorkflowError[];
  metadata: {
    startedAt: string;
    currentAgent: string;
    totalCost: number;
    totalTurns: number;
  };
}

class WorkflowOrchestrator {
  private agents: Map<string, AgentRunner>;
  private state: WorkflowState;

  async run(brief: CreativeBrief): Promise<WorkflowState> {
    this.state = initializeState(brief);

    try {
      // Step 1: Plan
      this.state.status = 'planning';
      this.state.plan = await this.runAgent('planner', {
        brief: this.state.brief,
      });

      // Step 2: Generate images (can be parallelized)
      this.state.status = 'generating';
      const generatePromises = this.state.plan.shots.map(shot =>
        this.runAgent('generator', { shot, brief: this.state.brief })
      );
      this.state.generatedImages = await Promise.all(generatePromises);

      // Step 3: Quality check each image
      this.state.status = 'reviewing';
      for (const image of this.state.generatedImages) {
        let qcResult = await this.runAgent('quality-checker', {
          image,
          guidelines: this.state.brief.brandGuidelines,
        });

        // Step 3b: Refine, re-generate, and re-check until the image passes
        // or we hit the retry cap. Without the re-check, a refined image
        // keeps its original failing score and never gets published.
        while (qcResult.score < 80 && image.retryCount < 3) {
          this.state.status = 'refining';
          const refined = await this.runAgent('refiner', {
            image,
            feedback: qcResult.feedback,
          });
          image.url = refined.url;
          image.retryCount++;

          qcResult = await this.runAgent('quality-checker', {
            image,
            guidelines: this.state.brief.brandGuidelines,
          });
        }

        this.state.qualityResults.push(qcResult);
      }

      // Step 4: Publish approved images
      this.state.status = 'publishing';
      const approved = this.state.generatedImages.filter((_, i) =>
        this.state.qualityResults[i]?.score >= 80
      );

      for (const image of approved) {
        const entry = await this.runAgent('publisher', {
          image,
          product: this.state.brief.product,
        });
        this.state.publishedItems.push(entry);
      }

      this.state.status = 'complete';
    } catch (error) {
      this.state.status = 'failed';
      this.state.errors.push({
        agent: this.state.metadata.currentAgent,
        message: error instanceof Error ? error.message : String(error),
        timestamp: new Date().toISOString(),
      });
    }
    }

    return this.state;
  }
}
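For completeness, the `initializeState` helper the orchestrator calls is just a constructor for the shared state. A sketch, with `CreativeBrief` reduced to a stand-in type and a random suffix standing in for whatever ID scheme you use:

```typescript
// `CreativeBrief` is a stand-in here; the returned object matches the
// WorkflowState shape defined earlier.
type CreativeBrief = Record<string, unknown>;

function initializeState(brief: CreativeBrief) {
  return {
    id: `wf_${Math.random().toString(36).slice(2, 10)}`,
    status: 'planning' as const,
    brief,
    generatedImages: [],
    qualityResults: [],
    publishedItems: [],
    errors: [],
    metadata: {
      startedAt: new Date().toISOString(),
      currentAgent: 'planner',
      totalCost: 0,
      totalTurns: 0,
    },
  };
}
```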

Testing Agentic Systems

Unit tests don't cut it for agents. You need scenario-based tests with mocked tool results (Jest shown here; Vitest's `vi.fn` works the same way):

typescript
describe('Fashion Workflow Agent', () => {
  it('should regenerate when quality score is below threshold', async () => {
    let callCount = 0;
    const mockTools = new Map([
      ['generate_fashion_image', jest.fn(async () => ({ url: 'https://img.test/1.png', success: true }))],
      ['evaluate_image_quality', jest.fn(async () => {
        // First call returns a low score, second a high one
        return callCount++ === 0
          ? { score: 45, feedback: 'Poor composition', brand_compliance: false }
          : { score: 88, feedback: 'Good quality', brand_compliance: true };
      })],
      ['publish_to_catalog', jest.fn(async () => ({ entry_id: 'cat_123', success: true }))],
    ]);

    const result = await runClaudeAgent(testGoal, agentConfig, mockTools);

    expect(result.success).toBe(true);
    // Agent should have called generate twice (original + retry)
    expect(mockTools.get('generate_fashion_image')).toHaveBeenCalledTimes(2);
    // Agent should have called evaluate twice
    expect(mockTools.get('evaluate_image_quality')).toHaveBeenCalledTimes(2);
    // Agent should have published only after quality passed
    expect(mockTools.get('publish_to_catalog')).toHaveBeenCalledTimes(1);
  });

  it('should escalate after max retries', async () => {
    const mockTools = new Map([
      ['generate_fashion_image', jest.fn(async () => ({ url: 'https://img.test/1.png' }))],
      ['evaluate_image_quality', jest.fn(async () => ({ score: 30, feedback: 'Unacceptable quality' }))],
      ['escalate_to_human', jest.fn(async () => ({ ticket_id: 'ESC-456' }))],
    ]);

    const result = await runClaudeAgent(testGoal, agentConfig, mockTools);

    expect(result.success).toBe(true);
    expect(mockTools.get('escalate_to_human')).toHaveBeenCalledTimes(1);
    expect(mockTools.get('publish_to_catalog')).not.toHaveBeenCalled();
  });
});

Key Takeaways

  • Validation is your first line of defense — Validate tool inputs for schema correctness AND referential integrity. Return helpful hints on failure.
  • Loop detection is mandatory — Monitor for repetitive patterns and inject course-correction prompts before the agent wastes tokens and budget.
  • Context management determines agent lifespan — Without compaction, long-running agents degrade. Summarize aggressively.
  • Goal reminders prevent drift — Periodically remind the agent of its original objective, especially in long workflows.
  • Track cost per workflow — Set hard budget limits and alert on anomalies. An uncapped agent will surprise you with bills.
  • Structured logging enables debugging — Log every turn, tool call, and result. Without traces, you're flying blind when things go wrong.
  • Specialize your agents — A pipeline of focused agents outperforms one monolithic agent. Each agent is simpler to test, debug, and improve.
  • Test with scenarios, not units — Mock tool results and verify the agent makes correct decisions across multi-step workflows.

Harsh Rastogi - Full Stack Engineer

Full Stack Engineer building production AI systems at Modelia. Previously at Asynq and Bharat Electronics Limited. Published researcher.
