The Hard Part Isn't Building Agents — It's Running Them
Building your first AI agent is exciting. It calls tools, reasons through problems, and produces results that feel magical. Then you deploy it to production and everything breaks.
At Asynq.ai, our candidate evaluation agent worked flawlessly in development. In production, it hallucinated tool parameters, got stuck in loops, occasionally produced evaluations that contradicted its own reasoning, and cost 3x what we budgeted. At Modelia.ai, our image generation pipeline agent would sometimes approve obviously flawed images because it optimized for completing the workflow rather than quality.
These aren't edge cases — they're the reality of agentic AI in production. This post covers the patterns we developed to handle them.
Failure Modes of Agentic Systems
After running agents in production across two companies, I've categorized failures into five types:
1. Tool Parameter Hallucination
The agent calls the right tool but with fabricated parameters — IDs that don't exist, enum values that aren't valid, or dates in the wrong format.
// The agent might generate: { model_id: "model-abc-123" }
// when the actual ID is: "mod_7f3a2b1c"
// Solution: Validate ALL tool inputs before execution
function createSafeToolExecutor(tool: Tool): SafeExecutor {
return async (rawInput: unknown) => {
// Step 1: Schema validation
const parseResult = tool.parameters.safeParse(rawInput);
if (!parseResult.success) {
return {
success: false,
error: `Invalid input: ${parseResult.error.issues.map(i => i.message).join('; ')}`,
hint: `Expected schema: ${JSON.stringify(tool.parameters.shape, null, 2)}`,
};
}
// Step 2: Referential integrity check
const input = parseResult.data;
if (input.model_id) {
const exists = await db.model.exists({ id: input.model_id });
if (!exists) {
const suggestions = await db.model.findMany({
take: 3,
orderBy: { usedAt: 'desc' },
select: { id: true, name: true },
});
return {
success: false,
error: `Model ID "${input.model_id}" not found.`,
hint: `Available models: ${suggestions.map(m => `${m.id} (${m.name})`).join(', ')}`,
};
}
}
return tool.execute(input);
};
}

The key insight is returning helpful hints when validation fails. The agent uses these hints to self-correct on the next iteration.
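To show how those hints close the loop, here's a minimal, illustrative sketch (none of these names are our production API): each failed result is appended to the transcript, so the next attempt sees the error and hint. In a real agent, the `pickToolInput` stub is the LLM call.

```typescript
// Illustrative retry loop. `SafeResult` mirrors the shape the safe executor
// above returns; `pickToolInput` stands in for the model choosing tool inputs.
interface SafeResult {
  success: boolean;
  error?: string;
  hint?: string;
  data?: unknown;
}

type ToolInputPicker = (transcript: string[]) => { model_id: string };

async function runWithSelfCorrection(
  execute: (input: { model_id: string }) => Promise<SafeResult>,
  pickToolInput: ToolInputPicker,
  maxAttempts = 3,
): Promise<SafeResult> {
  const transcript: string[] = [];
  let last: SafeResult = { success: false, error: 'not attempted' };
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    last = await execute(pickToolInput(transcript));
    if (last.success) return last;
    // Feed the validation error and hint back so the next attempt can self-correct
    transcript.push(`Tool failed: ${last.error} ${last.hint ?? ''}`);
  }
  return last;
}
```

In the production version, that transcript line becomes a tool-result message in the conversation rather than a plain string.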
2. Infinite Loops
The agent gets stuck repeating the same action because the tool result doesn't change the state in a way the agent recognizes.
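Before the full detector class, here's the wiring in miniature: where a repetition check hooks into the turn loop. The check here is deliberately trivial (exact repetition only; the detector below also catches alternating patterns), and all names are illustrative.

```typescript
// Minimal wiring sketch, not our production API.
function isStuck(actions: string[], windowSize = 3): boolean {
  if (actions.length < windowSize) return false;
  const recent = actions.slice(-windowSize);
  return recent.every(a => a === recent[0]);
}

function withLoopGuard(messages: string[], actions: string[]): string[] {
  if (!isStuck(actions)) return messages;
  const repeated = actions[actions.length - 1];
  // Inject a course-correction prompt instead of silently burning turns
  return [
    ...messages,
    `You appear to be repeating "${repeated}" without making progress. Try a different approach.`,
  ];
}
```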
class LoopDetector {
private history: string[] = [];
private readonly windowSize = 5;
recordAction(action: string): boolean {
this.history.push(action);
if (this.history.length < this.windowSize) return false;
const recent = this.history.slice(-this.windowSize);
// Check for exact repetition (A, A, A, A, A)
if (recent.every(a => a === recent[0])) {
return true; // Loop detected
}
// Check for alternating pattern (A, B, A, B, A)
if (recent.length >= 4) {
const isAlternating = recent.every((a, i) =>
a === recent[i % 2]
);
if (isAlternating) return true;
}
return false;
}
getBreakingPrompt(): string {
const repeatedAction = this.history[this.history.length - 1];
return `You appear to be repeating the action "${repeatedAction}" without making progress. \
Please try a different approach or explain why you're stuck so we can help.`;
}
}

3. Context Window Overflow
Long-running agents accumulate tool results that can exceed the model's context window. The agent starts "forgetting" early decisions.
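The compaction function below leans on two helpers it doesn't define, `truncate` and `estimateTokens`. Here are sketches of both, using the common ~4-characters-per-token heuristic; that ratio is a rough approximation for English text, so swap in a real tokenizer where accuracy matters.

```typescript
// Helpers assumed by compactContext. The 4-chars-per-token ratio is a crude
// heuristic; it overestimates for code and underestimates for dense prose.
function truncate(text: string, maxChars: number): string {
  return text.length <= maxChars ? text : text.slice(0, maxChars - 1) + '…';
}

function estimateTokens(messages: { content: string }[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}
```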
interface ContextManager {
messages: Message[];
tokenCount: number;
maxTokens: number;
}
function compactContext(ctx: ContextManager): ContextManager {
if (ctx.tokenCount < ctx.maxTokens * 0.75) return ctx;
const systemMsg = ctx.messages[0];
const recentMessages = ctx.messages.slice(-6);
// Summarize the middle section
const middleMessages = ctx.messages.slice(1, -6);
const toolResults = middleMessages
.filter(m => m.role === 'tool')
.map(m => {
const parsed = JSON.parse(m.content);
// Keep only essential data from tool results
return {
tool: m.toolName,
success: parsed.success,
summary: parsed.summary || truncate(JSON.stringify(parsed.data), 200),
};
});
const summaryMessage: Message = {
role: 'user',
content: `[Context Summary] Previous actions and results:\n${
toolResults.map(r => `- ${r.tool}: ${r.success ? 'Success' : 'Failed'} — ${r.summary}`).join('\n')
}\n\nContinue with the workflow based on these results.`,
};
return {
messages: [systemMsg, summaryMessage, ...recentMessages],
tokenCount: estimateTokens([systemMsg, summaryMessage, ...recentMessages]),
maxTokens: ctx.maxTokens,
};
}

4. Goal Drift
The agent subtly shifts away from the original objective. At Modelia.ai, we noticed our agents sometimes optimizing for "interesting" images rather than brand-compliant ones.
// Inject goal reminders every N turns
function addGoalReminder(messages: Message[], originalGoal: string, turnNumber: number): Message[] {
if (turnNumber % 5 === 0 && turnNumber > 0) {
return [
...messages,
{
role: 'user',
content: `[Reminder] Your original goal: "${originalGoal}". \
Ensure your next actions directly serve this goal. If you've drifted, course-correct now.`,
},
];
}
return messages;
}

5. Cost Explosion
Each LLM call costs money. Thirty tool calls at 4K tokens each add up fast, and because the full context is resent on every turn, real input usage grows faster than linearly with conversation length.
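To make that concrete, some back-of-envelope arithmetic. The per-turn output volume (~500 tokens) is an assumption for illustration, and the prices match the Sonnet-class numbers in the tracker below; since context actually grows each turn, treat this as a floor, not an estimate.

```typescript
// Back-of-envelope: 30 tool-calling turns, ~4K input and ~500 output tokens
// per turn, at $3/M input and $15/M output.
const turns = 30;
const inputCost = (turns * 4_000 / 1_000_000) * 3;   // 120K input tokens → $0.36
const outputCost = (turns * 500 / 1_000_000) * 15;   // 15K output tokens → $0.225
const perWorkflow = inputCost + outputCost;          // ≈ $0.59 per run
const perDayAt10kRuns = perWorkflow * 10_000;        // ≈ $5,850/day at 10K runs
```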
class CostTracker {
private totalInputTokens = 0;
private totalOutputTokens = 0;
private totalThinkingTokens = 0;
// Pricing per million tokens (Claude Sonnet 4 example)
private pricing = {
input: 3.00,
output: 15.00,
thinking: 3.00,
};
addUsage(usage: { input_tokens: number; output_tokens: number; thinking_tokens?: number }) {
this.totalInputTokens += usage.input_tokens;
this.totalOutputTokens += usage.output_tokens;
this.totalThinkingTokens += usage.thinking_tokens || 0;
}
getCurrentCost(): number {
return (
(this.totalInputTokens / 1_000_000) * this.pricing.input +
(this.totalOutputTokens / 1_000_000) * this.pricing.output +
(this.totalThinkingTokens / 1_000_000) * this.pricing.thinking
);
}
isOverBudget(maxCost: number): boolean {
return this.getCurrentCost() > maxCost;
}
}

Observability: Seeing Inside Your Agents
You can't fix what you can't see. Here's our observability stack for agentic systems:
Structured Logging
Every agent action produces a structured log entry:
interface AgentLogEntry {
timestamp: string;
workflowId: string;
agentName: string;
turn: number;
action: 'llm_call' | 'tool_call' | 'tool_result' | 'thinking' | 'final_output' | 'error';
data: {
model?: string;
toolName?: string;
toolInput?: unknown;
toolOutput?: unknown;
thinkingContent?: string;
textContent?: string;
error?: string;
tokenUsage?: { input: number; output: number; thinking?: number };
latencyMs?: number;
};
}
class AgentLogger {
private entries: AgentLogEntry[] = [];
log(entry: Omit<AgentLogEntry, 'timestamp'>) {
const fullEntry = { ...entry, timestamp: new Date().toISOString() };
this.entries.push(fullEntry);
// Stream to your logging system
console.log(JSON.stringify(fullEntry));
}
getTrace(): AgentLogEntry[] {
return this.entries;
}
getToolCallSummary(): { tool: string; calls: number; errors: number; avgLatencyMs: number }[] {
const byTool = new Map<string, { calls: number; errors: number; totalLatency: number }>();
for (const entry of this.entries.filter(e => e.action === 'tool_call' || e.action === 'tool_result')) {
const name = entry.data.toolName || 'unknown';
const existing = byTool.get(name) || { calls: 0, errors: 0, totalLatency: 0 };
if (entry.action === 'tool_call') existing.calls++;
if (entry.action === 'tool_result' && entry.data.error) existing.errors++;
if (entry.data.latencyMs) existing.totalLatency += entry.data.latencyMs;
byTool.set(name, existing);
}
return Array.from(byTool.entries()).map(([tool, stats]) => ({
tool,
calls: stats.calls,
errors: stats.errors,
avgLatencyMs: stats.calls > 0 ? Math.round(stats.totalLatency / stats.calls) : 0,
}));
}
}

Dashboard Metrics
We push these to Grafana for real-time monitoring:
// Prometheus-style metrics (prom-client constructor shape)
const agentMetrics = {
workflowsStarted: new Counter({ name: 'agent_workflows_started_total', help: 'Total workflows started', labelNames: ['agent_name'] }),
workflowsCompleted: new Counter({ name: 'agent_workflows_completed_total', help: 'Total workflows completed', labelNames: ['agent_name', 'status'] }),
turnsPerWorkflow: new Histogram({ name: 'agent_turns_per_workflow', help: 'Turns per workflow', labelNames: ['agent_name'], buckets: [1, 3, 5, 10, 15, 20, 25, 30] }),
toolCallLatency: new Histogram({ name: 'agent_tool_call_duration_ms', help: 'Tool call latency', labelNames: ['tool_name'], buckets: [50, 100, 250, 500, 1000, 2500, 5000] }),
costPerWorkflow: new Histogram({ name: 'agent_workflow_cost_usd', help: 'Cost per workflow in USD', labelNames: ['agent_name'], buckets: [0.01, 0.05, 0.10, 0.25, 0.50, 1.00, 2.00, 5.00] }),
escalationRate: new Gauge({ name: 'agent_escalation_rate', help: 'Percentage of workflows requiring human intervention', labelNames: ['agent_name'] }),
};

Scaling: From Single Agent to Pipeline
At Modelia.ai, we evolved from a monolithic agent to a pipeline of specialized agents. Here's the architecture:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Planner │───▶│ Generator │───▶│ QC Agent │───▶│ Publisher │
│ Agent │ │ Agent │ │ │ │ Agent │
└──────────────┘ └──────────────┘ └──────┬───────┘ └──────────────┘
│
(score < 80)
│
┌──────▼───────┐
│ Refinement │──── (back to Generator)
│ Agent │
└──────────────┘

Each agent is small, focused, and independently testable. Communication happens through a shared state object:
interface WorkflowState {
id: string;
status: 'planning' | 'generating' | 'reviewing' | 'refining' | 'publishing' | 'complete' | 'failed';
brief: CreativeBrief;
plan?: ShootPlan;
generatedImages: GeneratedImage[];
qualityResults: QualityResult[];
publishedItems: CatalogEntry[];
errors: WorkflowError[];
metadata: {
startedAt: string;
currentAgent: string;
totalCost: number;
totalTurns: number;
};
}
class WorkflowOrchestrator {
private agents: Map<string, AgentRunner>;
private state: WorkflowState;
async run(brief: CreativeBrief): Promise<WorkflowState> {
this.state = initializeState(brief);
try {
// Step 1: Plan
this.state.status = 'planning';
this.state.plan = await this.runAgent('planner', {
brief: this.state.brief,
});
// Step 2: Generate images (can be parallelized)
this.state.status = 'generating';
const generatePromises = this.state.plan.shots.map(shot =>
this.runAgent('generator', { shot, brief: this.state.brief })
);
this.state.generatedImages = await Promise.all(generatePromises);
// Step 3: Quality check each image
this.state.status = 'reviewing';
for (const image of this.state.generatedImages) {
const qcResult = await this.runAgent('quality-checker', {
image,
guidelines: this.state.brief.brandGuidelines,
});
if (qcResult.score < 80 && image.retryCount < 3) {
// Step 3b: Refine and re-generate
this.state.status = 'refining';
const refined = await this.runAgent('refiner', {
image,
feedback: qcResult.feedback,
});
image.url = refined.url;
image.retryCount++;
}
this.state.qualityResults.push(qcResult);
}
// Step 4: Publish approved images
this.state.status = 'publishing';
const approved = this.state.generatedImages.filter((_, i) =>
this.state.qualityResults[i]?.score >= 80
);
for (const image of approved) {
const entry = await this.runAgent('publisher', {
image,
product: this.state.brief.product,
});
this.state.publishedItems.push(entry);
}
this.state.status = 'complete';
} catch (error) {
this.state.status = 'failed';
this.state.errors.push({
agent: this.state.metadata.currentAgent,
message: error.message,
timestamp: new Date().toISOString(),
});
}
return this.state;
}
}

Testing Agentic Systems
Unit tests don't cut it for agents. You need scenario-based tests with mocked tool results:
describe('Fashion Workflow Agent', () => {
it('should regenerate when quality score is below threshold', async () => {
let callCount = 0;
const mockTools = new Map([
['generate_fashion_image', jest.fn(async () => ({ url: 'https://img.test/1.png', success: true }))],
['evaluate_image_quality', jest.fn(async () => {
// First call returns a low score, the second a high one
return callCount++ === 0
? { score: 45, feedback: 'Poor composition', brand_compliance: false }
: { score: 88, feedback: 'Good quality', brand_compliance: true };
})],
['publish_to_catalog', jest.fn(async () => ({ entry_id: 'cat_123', success: true }))],
]);
const result = await runClaudeAgent(testGoal, agentConfig, mockTools);
expect(result.success).toBe(true);
// Agent should have called generate twice (original + retry)
expect(mockTools.get('generate_fashion_image')).toHaveBeenCalledTimes(2);
// Agent should have called evaluate twice
expect(mockTools.get('evaluate_image_quality')).toHaveBeenCalledTimes(2);
// Agent should have published only after quality passed
expect(mockTools.get('publish_to_catalog')).toHaveBeenCalledTimes(1);
});
it('should escalate after max retries', async () => {
const mockTools = new Map([
['generate_fashion_image', jest.fn(async () => ({ url: 'https://img.test/1.png' }))],
['evaluate_image_quality', jest.fn(async () => ({ score: 30, feedback: 'Unacceptable quality' }))],
['escalate_to_human', jest.fn(async () => ({ ticket_id: 'ESC-456' }))],
['publish_to_catalog', jest.fn(async () => ({ entry_id: 'cat_999', success: true }))],
]);
const result = await runClaudeAgent(testGoal, agentConfig, mockTools);
expect(result.success).toBe(true);
expect(mockTools.get('escalate_to_human')).toHaveBeenCalledTimes(1);
expect(mockTools.get('publish_to_catalog')).not.toHaveBeenCalled();
});
});

Key Takeaways
- Validation is your first line of defense — Validate tool inputs for schema correctness AND referential integrity. Return helpful hints on failure.
- Loop detection is mandatory — Monitor for repetitive patterns and inject course-correction prompts before the agent wastes tokens and budget.
- Context management determines agent lifespan — Without compaction, long-running agents degrade. Summarize aggressively.
- Goal reminders prevent drift — Periodically remind the agent of its original objective, especially in long workflows.
- Track cost per workflow — Set hard budget limits and alert on anomalies. An uncapped agent will surprise you with bills.
- Structured logging enables debugging — Log every turn, tool call, and result. Without traces, you're flying blind when things go wrong.
- Specialize your agents — A pipeline of focused agents outperforms one monolithic agent. Each agent is simpler to test, debug, and improve.
- Test with scenarios, not units — Mock tool results and verify the agent makes correct decisions across multi-step workflows.
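Several of these guardrails compose naturally. As a parting sketch, here's a per-turn guard (hypothetical names throughout) that compresses what the LoopDetector and CostTracker above enforce into a single check you can run before every model call:

```typescript
// Hypothetical per-turn guard combining hard budget caps and loop detection.
// `costSoFar` and `recentActions` would come from the trackers shown earlier.
type GuardVerdict = { proceed: boolean; reason?: string };

function guardTurn(
  costSoFar: number,
  maxCost: number,
  recentActions: string[],
): GuardVerdict {
  if (costSoFar > maxCost) {
    return { proceed: false, reason: `Budget exceeded: $${costSoFar.toFixed(2)} > $${maxCost.toFixed(2)}` };
  }
  const last3 = recentActions.slice(-3);
  if (last3.length === 3 && last3.every(a => a === last3[0])) {
    return { proceed: false, reason: `Repeating "${last3[0]}" without progress` };
  }
  return { proceed: true };
}
```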
