The Hard Part Isn't Building Agents — It's Running Them
Building your first AI agent is exciting. It calls tools, reasons through problems, and produces results that feel magical. Then you deploy it to production and everything breaks.
At Asynq.ai, our candidate evaluation agent worked flawlessly in development. In production, it hallucinated tool parameters, got stuck in loops, occasionally produced evaluations that contradicted its own reasoning, and cost 3x what we budgeted. At Modelia.ai, our image generation pipeline agent would sometimes approve obviously flawed images because it optimized for completing the workflow rather than quality.
These aren't edge cases — they're the reality of agentic AI in production. This post covers the patterns we developed to handle them.
Failure Modes of Agentic Systems
After running agents in production across two companies, I've categorized failures into five types:
1. Tool Parameter Hallucination
The agent calls the right tool but with fabricated parameters — IDs that don't exist, enum values that aren't valid, or dates in the wrong format.
// The agent might generate: { model_id: "model-abc-123" }
// when the actual ID is: "mod_7f3a2b1c"
// Solution: Validate ALL tool inputs before execution
function createSafeToolExecutor(tool: Tool): SafeExecutor {
return async (rawInput: unknown) => {
// Step 1: Schema validation
const parseResult = tool.parameters.safeParse(rawInput);
if (!parseResult.success) {
return {
success: false,
error: `Invalid input: ${parseResult.error.issues.map(i => i.message).join('; ')}`,
hint: `Expected schema: ${JSON.stringify(tool.parameters.shape, null, 2)}`,
};
}
// Step 2: Referential integrity check
const input = parseResult.data;
if (input.model_id) {
const exists = await db.model.exists({ id: input.model_id });
if (!exists) {
const suggestions = await db.model.findMany({
take: 3,
orderBy: { usedAt: 'desc' },
select: { id: true, name: true },
});
return {
success: false,
error: `Model ID "${input.model_id}" not found.`,
hint: `Available models: ${suggestions.map(m => `${m.id} (${m.name})`).join(', ')}`,
};
}
}
return tool.execute(input);
};
}

The key insight is returning helpful hints when validation fails. The agent uses these hints to self-correct on the next iteration.
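To show how those hints close the loop, here's a minimal, illustrative sketch (none of these names are our production API): each failed result is appended to the transcript, so the next attempt sees the error and hint. In a real agent, the `pickToolInput` stub is the LLM call.

```typescript
// Illustrative retry loop. `SafeResult` mirrors the shape the safe executor
// above returns; `pickToolInput` stands in for the model choosing tool inputs.
interface SafeResult {
  success: boolean;
  error?: string;
  hint?: string;
  data?: unknown;
}

type ToolInputPicker = (transcript: string[]) => { model_id: string };

async function runWithSelfCorrection(
  execute: (input: { model_id: string }) => Promise<SafeResult>,
  pickToolInput: ToolInputPicker,
  maxAttempts = 3,
): Promise<SafeResult> {
  const transcript: string[] = [];
  let last: SafeResult = { success: false, error: 'not attempted' };
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    last = await execute(pickToolInput(transcript));
    if (last.success) return last;
    // Feed the validation error and hint back so the next attempt can self-correct
    transcript.push(`Tool failed: ${last.error} ${last.hint ?? ''}`);
  }
  return last;
}
```

In the production version, that transcript line becomes a tool-result message in the conversation rather than a plain string.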
2. Infinite Loops
The agent gets stuck repeating the same action because the tool result doesn't change the state in a way the agent recognizes.
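Before the full detector class, here's the wiring in miniature: where a repetition check hooks into the turn loop. The check here is deliberately trivial (exact repetition only; the detector below also catches alternating patterns), and all names are illustrative.

```typescript
// Minimal wiring sketch, not our production API.
function isStuck(actions: string[], windowSize = 3): boolean {
  if (actions.length < windowSize) return false;
  const recent = actions.slice(-windowSize);
  return recent.every(a => a === recent[0]);
}

function withLoopGuard(messages: string[], actions: string[]): string[] {
  if (!isStuck(actions)) return messages;
  const repeated = actions[actions.length - 1];
  // Inject a course-correction prompt instead of silently burning turns
  return [
    ...messages,
    `You appear to be repeating "${repeated}" without making progress. Try a different approach.`,
  ];
}
```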
class LoopDetector {
private history: string[] = [];
private readonly windowSize = 5;
recordAction(action: string): boolean {
this.history.push(action);
if (this.history.length < this.windowSize) return false;
const recent = this.history.slice(-this.windowSize);
// Check for exact repetition (A, A, A, A, A)
if (recent.every(a => a === recent[0])) {
return true; // Loop detected
}
// Check for alternating pattern (A, B, A, B, A)
if (recent.length >= 4) {
const isAlternating = recent.every((a, i) =>
a === recent[i % 2]
);
if (isAlternating) return true;
}
return false;
}
getBreakingPrompt(): string {
const repeatedAction = this.history[this.history.length - 1];
return `You appear to be repeating the action "${repeatedAction}" without making progress. \
Please try a different approach or explain why you're stuck so we can help.`;
}
}

3. Context Window Overflow
Long-running agents accumulate tool results that can exceed the model's context window. The agent starts "forgetting" early decisions.
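The compaction function below leans on two helpers it doesn't define, `truncate` and `estimateTokens`. Here are sketches of both, using the common ~4-characters-per-token heuristic; that ratio is a rough approximation for English text, so swap in a real tokenizer where accuracy matters.

```typescript
// Helpers assumed by compactContext. The 4-chars-per-token ratio is a crude
// heuristic; it overestimates for code and underestimates for dense prose.
function truncate(text: string, maxChars: number): string {
  return text.length <= maxChars ? text : text.slice(0, maxChars - 1) + '…';
}

function estimateTokens(messages: { content: string }[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}
```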
interface ContextManager {
messages: Message[];
tokenCount: number;
maxTokens: number;
}
function compactContext(ctx: ContextManager): ContextManager {
if (ctx.tokenCount < ctx.maxTokens * 0.75) return ctx;
const systemMsg = ctx.messages[0];
const recentMessages = ctx.messages.slice(-6);
// Summarize the middle section
const middleMessages = ctx.messages.slice(1, -6);
const toolResults = middleMessages
.filter(m => m.role === 'tool')
.map(m => {
const parsed = JSON.parse(m.content);
// Keep only essential data from tool results
return {
tool: m.toolName,
success: parsed.success,
summary: parsed.summary || truncate(JSON.stringify(parsed.data), 200),
};
});
const summaryMessage: Message = {
role: 'user',
content: `[Context Summary] Previous actions and results:\n${
toolResults.map(r => `- ${r.tool}: ${r.success ? 'Success' : 'Failed'} — ${r.summary}`).join('\n')
}\n\nContinue with the workflow based on these results.`,
};
return {
messages: [systemMsg, summaryMessage, ...recentMessages],
tokenCount: estimateTokens([systemMsg, summaryMessage, ...recentMessages]),
maxTokens: ctx.maxTokens,
};
}

4. Goal Drift
The agent subtly shifts away from the original objective. At Modelia.ai, we noticed our agents sometimes optimizing for "interesting" images rather than brand-compliant ones.
// Inject goal reminders every N turns
function addGoalReminder(messages: Message[], originalGoal: string, turnNumber: number): Message[] {
if (turnNumber % 5 === 0 && turnNumber > 0) {
return [
...messages,
{
role: 'user',
content: `[Reminder] Your original goal: "${originalGoal}". \
Ensure your next actions directly serve this goal. If you've drifted, course-correct now.`,
},
];
}
return messages;
}

5. Cost Explosion
Each LLM call costs money. Thirty tool calls at 4K tokens each add up fast, and because the full context is resent on every turn, real input usage grows faster than linearly with conversation length.
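To make that concrete, some back-of-envelope arithmetic. The per-turn output volume (~500 tokens) is an assumption for illustration, and the prices match the Sonnet-class numbers in the tracker below; since context actually grows each turn, treat this as a floor, not an estimate.

```typescript
// Back-of-envelope: 30 tool-calling turns, ~4K input and ~500 output tokens
// per turn, at $3/M input and $15/M output.
const turns = 30;
const inputCost = (turns * 4_000 / 1_000_000) * 3;   // 120K input tokens → $0.36
const outputCost = (turns * 500 / 1_000_000) * 15;   // 15K output tokens → $0.225
const perWorkflow = inputCost + outputCost;          // ≈ $0.59 per run
const perDayAt10kRuns = perWorkflow * 10_000;        // ≈ $5,850/day at 10K runs
```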
class CostTracker {
private totalInputTokens = 0;
private totalOutputTokens = 0;
private totalThinkingTokens = 0;
// Pricing per million tokens (Claude Sonnet 4 example)
private pricing = {
input: 3.00,
output: 15.00,
thinking: 3.00,
};
addUsage(usage: { input_tokens: number; output_tokens: number; thinking_tokens?: number }) {
this.totalInputTokens += usage.input_tokens;
this.totalOutputTokens += usage.output_tokens;
this.totalThinkingTokens += usage.thinking_tokens || 0;
}
getCurrentCost(): number {
return (
(this.totalInputTokens / 1_000_000) * this.pricing.input +
(this.totalOutputTokens / 1_000_000) * this.pricing.output +
(this.totalThinkingTokens / 1_000_000) * this.pricing.thinking
);
}
isOverBudget(maxCost: number): boolean {
return this.getCurrentCost() > maxCost;
}
}

Observability: Seeing Inside Your Agents
You can't fix what you can't see. Here's our observability stack for agentic systems:
Structured Logging
Every agent action produces a structured log entry:
interface AgentLogEntry {
timestamp: string;
workflowId: string;
agentName: string;
turn: number;
action: 'llm_call' | 'tool_call' | 'tool_result' | 'thinking' | 'final_output' | 'error';
data: {
model?: string;
toolName?: string;
toolInput?: unknown;
toolOutput?: unknown;
thinkingContent?: string;
textContent?: string;
error?: string;
tokenUsage?: { input: number; output: number; thinking?: number };
latencyMs?: number;
};
}
class AgentLogger {
private entries: AgentLogEntry[] = [];
log(entry: Omit<AgentLogEntry, 'timestamp'>) {
const fullEntry = { ...entry, timestamp: new Date().toISOString() };
this.entries.push(fullEntry);
// Stream to your logging system
console.log(JSON.stringify(fullEntry));
}
getTrace(): AgentLogEntry[] {
return this.entries;
}
getToolCallSummary(): { tool: string; calls: number; errors: number; avgLatencyMs: number }[] {
const byTool = new Map<string, { calls: number; errors: number; totalLatency: number }>();
for (const entry of this.entries.filter(e => e.action === 'tool_call' || e.action === 'tool_result')) {
const name = entry.data.toolName || 'unknown';
const existing = byTool.get(name) || { calls: 0, errors: 0, totalLatency: 0 };
if (entry.action === 'tool_call') existing.calls++;
if (entry.action === 'tool_result' && entry.data.error) existing.errors++;
if (entry.data.latencyMs) existing.totalLatency += entry.data.latencyMs;
byTool.set(name, existing);
}
return Array.from(byTool.entries()).map(([tool, stats]) => ({
tool,
calls: stats.calls,
errors: stats.errors,
avgLatencyMs: stats.calls > 0 ? Math.round(stats.totalLatency / stats.calls) : 0,
}));
}
}

Dashboard Metrics
We push these to Grafana for real-time monitoring:
// Prometheus-style metrics (prom-client constructor shape)
const agentMetrics = {
workflowsStarted: new Counter({ name: 'agent_workflows_started_total', help: 'Total workflows started', labelNames: ['agent_name'] }),
workflowsCompleted: new Counter({ name: 'agent_workflows_completed_total', help: 'Total workflows completed', labelNames: ['agent_name', 'status'] }),
turnsPerWorkflow: new Histogram({ name: 'agent_turns_per_workflow', help: 'Turns per workflow', labelNames: ['agent_name'], buckets: [1, 3, 5, 10, 15, 20, 25, 30] }),
toolCallLatency: new Histogram({ name: 'agent_tool_call_duration_ms', help: 'Tool call latency', labelNames: ['tool_name'], buckets: [50, 100, 250, 500, 1000, 2500, 5000] }),
costPerWorkflow: new Histogram({ name: 'agent_workflow_cost_usd', help: 'Cost per workflow in USD', labelNames: ['agent_name'], buckets: [0.01, 0.05, 0.10, 0.25, 0.50, 1.00, 2.00, 5.00] }),
escalationRate: new Gauge({ name: 'agent_escalation_rate', help: 'Percentage of workflows requiring human intervention', labelNames: ['agent_name'] }),
};

Scaling: From Single Agent to Pipeline
At Modelia.ai, we evolved from a monolithic agent to a pipeline of specialized agents. Here's the architecture:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Planner │───▶│ Generator │───▶│ QC Agent │───▶│ Publisher │
│ Agent │ │ Agent │ │ │ │ Agent │
└──────────────┘ └──────────────┘ └──────┬───────┘ └──────────────┘
│
(score < 80)
│
┌──────▼───────┐
│ Refinement │──── (back to Generator)
│ Agent │
└──────────────┘

Each agent is small, focused, and independently testable. Communication happens through a shared state object:
interface WorkflowState {
id: string;
status: 'planning' | 'generating' | 'reviewing' | 'refining' | 'publishing' | 'complete' | 'failed';
brief: CreativeBrief;
plan?: ShootPlan;
generatedImages: GeneratedImage[];
qualityResults: QualityResult[];
publishedItems: CatalogEntry[];
errors: WorkflowError[];
metadata: {
startedAt: string;
currentAgent: string;
totalCost: number;
totalTurns: number;
};
}
class WorkflowOrchestrator {
private agents: Map<string, AgentRunner>;
private state: WorkflowState;
async run(brief: CreativeBrief): Promise<WorkflowState> {
this.state = initializeState(brief);
try {
// Step 1: Plan
this.state.status = 'planning';
this.state.plan = await this.runAgent('planner', {
brief: this.state.brief,
});
// Step 2: Generate images (can be parallelized)
this.state.status = 'generating';
const generatePromises = this.state.plan.shots.map(shot =>
this.runAgent('generator', { shot, brief: this.state.brief })
);
this.state.generatedImages = await Promise.all(generatePromises);
// Step 3: Quality check each image
this.state.status = 'reviewing';
for (const image of this.state.generatedImages) {
const qcResult = await this.runAgent('quality-checker', {
image,
guidelines: this.state.brief.brandGuidelines,
});
if (qcResult.score < 80 && image.retryCount < 3) {
// Step 3b: Refine and re-generate
this.state.status = 'refining';
const refined = await this.runAgent('refiner', {
image,
feedback: qcResult.feedback,
});
image.url = refined.url;
image.retryCount++;
}
this.state.qualityResults.push(qcResult);
}
// Step 4: Publish approved images
this.state.status = 'publishing';
const approved = this.state.generatedImages.filter((_, i) =>
this.state.qualityResults[i]?.score >= 80
);
for (const image of approved) {
const entry = await this.runAgent('publisher', {
image,
product: this.state.brief.product,
});
this.state.publishedItems.push(entry);
}
this.state.status = 'complete';
} catch (error) {
this.state.status = 'failed';
this.state.errors.push({
agent: this.state.metadata.currentAgent,
message: error.message,
timestamp: new Date().toISOString(),
});
}
return this.state;
}
}

Testing Agentic Systems
Unit tests don't cut it for agents. You need scenario-based tests with mocked tool results:
describe('Fashion Workflow Agent', () => {
it('should regenerate when quality score is below threshold', async () => {
let callCount = 0;
const mockTools = new Map([
['generate_fashion_image', jest.fn(async () => ({ url: 'https://img.test/1.png', success: true }))],
['evaluate_image_quality', jest.fn(async () => {
// First call returns a low score, the second a high one
return callCount++ === 0
? { score: 45, feedback: 'Poor composition', brand_compliance: false }
: { score: 88, feedback: 'Good quality', brand_compliance: true };
})],
['publish_to_catalog', jest.fn(async () => ({ entry_id: 'cat_123', success: true }))],
]);
const result = await runClaudeAgent(testGoal, agentConfig, mockTools);
expect(result.success).toBe(true);
// Agent should have called generate twice (original + retry)
expect(mockTools.get('generate_fashion_image')).toHaveBeenCalledTimes(2);
// Agent should have called evaluate twice
expect(mockTools.get('evaluate_image_quality')).toHaveBeenCalledTimes(2);
// Agent should have published only after quality passed
expect(mockTools.get('publish_to_catalog')).toHaveBeenCalledTimes(1);
});
it('should escalate after max retries', async () => {
const mockTools = new Map([
['generate_fashion_image', jest.fn(async () => ({ url: 'https://img.test/1.png' }))],
['evaluate_image_quality', jest.fn(async () => ({ score: 30, feedback: 'Unacceptable quality' }))],
['escalate_to_human', jest.fn(async () => ({ ticket_id: 'ESC-456' }))],
['publish_to_catalog', jest.fn(async () => ({ entry_id: 'cat_999', success: true }))],
]);
const result = await runClaudeAgent(testGoal, agentConfig, mockTools);
expect(result.success).toBe(true);
expect(mockTools.get('escalate_to_human')).toHaveBeenCalledTimes(1);
expect(mockTools.get('publish_to_catalog')).not.toHaveBeenCalled();
});
});

Key Takeaways
- Validation is your first line of defense — Validate tool inputs for schema correctness AND referential integrity. Return helpful hints on failure.
- Loop detection is mandatory — Monitor for repetitive patterns and inject course-correction prompts before the agent wastes tokens and budget.
- Context management determines agent lifespan — Without compaction, long-running agents degrade. Summarize aggressively.
- Goal reminders prevent drift — Periodically remind the agent of its original objective, especially in long workflows.
- Track cost per workflow — Set hard budget limits and alert on anomalies. An uncapped agent will surprise you with bills.
- Structured logging enables debugging — Log every turn, tool call, and result. Without traces, you're flying blind when things go wrong.
- Specialize your agents — A pipeline of focused agents outperforms one monolithic agent. Each agent is simpler to test, debug, and improve.
- Test with scenarios, not units — Mock tool results and verify the agent makes correct decisions across multi-step workflows.
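Several of these guardrails compose naturally. As a parting sketch, here's a per-turn guard (hypothetical names throughout) that compresses what the LoopDetector and CostTracker above enforce into a single check you can run before every model call:

```typescript
// Hypothetical per-turn guard combining hard budget caps and loop detection.
// `costSoFar` and `recentActions` would come from the trackers shown earlier.
type GuardVerdict = { proceed: boolean; reason?: string };

function guardTurn(
  costSoFar: number,
  maxCost: number,
  recentActions: string[],
): GuardVerdict {
  if (costSoFar > maxCost) {
    return { proceed: false, reason: `Budget exceeded: $${costSoFar.toFixed(2)} > $${maxCost.toFixed(2)}` };
  }
  const last3 = recentActions.slice(-3);
  if (last3.length === 3 && last3.every(a => a === last3[0])) {
    return { proceed: false, reason: `Repeating "${last3[0]}" without progress` };
  }
  return { proceed: true };
}
```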
