---
name: llm-cost-audit
description: Audit a codebase's LLM usage for cost. Use when AI infrastructure spend is growing, before scaling an AI feature, or when asked to "make the AI cheaper" — finds caching wins, model-tier mismatches, token waste, and missing usage controls.
---

# LLM Cost Audit

You are auditing every LLM call in this codebase for cost efficiency. The goal is a ranked list of savings with estimated impact — not a rewrite. Real platforms have cut 30–60% of AI infra cost with the patterns below without hurting quality.

## Step 1 — Inventory every call site

Search for LLM SDK usage (anthropic, openai, generative AI clients, raw fetch to inference endpoints, internal gateway wrappers). For each call site, record:

- model used, max_tokens, where the prompt comes from
- call frequency (per request? per item in a loop? cron?)
- whether the output is user-facing or internal/intermediate

Flag immediately: LLM calls inside loops over collections, calls on every keystroke/page load, and retries without backoff.
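
The inventory pass can be partly mechanized. Below is a minimal sketch, assuming Python sources and only two call patterns (`.messages.create(` for the Anthropic SDK and `.chat.completions.create(` for the OpenAI SDK). The names `CALL_PATTERNS`, `CallSite`, and `scan_source` are hypothetical, and the loop detection is a naive indentation heuristic, so treat hits as leads to read by hand rather than a complete inventory.

```python
import re
from dataclasses import dataclass

# Assumed patterns; extend with the gateway wrappers and SDKs this codebase uses.
CALL_PATTERNS = {
    "anthropic": re.compile(r"\.messages\.create\("),
    "openai": re.compile(r"\.chat\.completions\.create\("),
}
LOOP_START = re.compile(r"^\s*(?:for|while)\b")


@dataclass
class CallSite:
    path: str
    line: int
    sdk: str
    in_loop: bool  # True when the call sits under a for/while by indentation
    text: str


def scan_source(path: str, source: str) -> list[CallSite]:
    """Return LLM call sites found in one file's source text."""
    sites = []
    loop_indents = []  # indentation of each enclosing for/while line
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(line) - len(line.lstrip())
        # Leaving a loop body: drop loops indented at or beyond this line.
        while loop_indents and loop_indents[-1] >= indent:
            loop_indents.pop()
        for sdk, pattern in CALL_PATTERNS.items():
            if pattern.search(line):
                sites.append(CallSite(path, lineno, sdk, bool(loop_indents), stripped))
        if LOOP_START.match(line):
            loop_indents.append(indent)
    return sites
```

Calls flagged with `in_loop=True` are the first candidates for the "flag immediately" list; call frequency and whether output is user-facing still need manual review.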

## Step 2 — Model-tier mismatches

For each call site, ask: does this task need this model?

- Classification, extraction, routing, yes/no checks, title generation → smallest model tier.
- Multi-step reasoning, code generation, user-facing long-form → larger tier, but check if a mid tier was ever evaluated.
- Flag any place where one "default model" constant serves every task in the app — per-task model selection is usually the single biggest lever.
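
One way to replace a single default-model constant is a per-task lookup. The sketch below is illustrative only: the task names mirror the bullets above, while `Tier`, `MODEL_FOR_TIER`, `TIER_FOR_TASK`, `model_for`, and the model identifiers are placeholders, not real model names.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    MID = "mid"
    LARGE = "large"


# Placeholder identifiers; substitute the provider's actual model names.
MODEL_FOR_TIER = {
    Tier.SMALL: "small-model-id",
    Tier.MID: "mid-model-id",
    Tier.LARGE: "large-model-id",
}

# Task-to-tier assignments following the guidance above.
TIER_FOR_TASK = {
    "classification": Tier.SMALL,
    "extraction": Tier.SMALL,
    "routing": Tier.SMALL,
    "yes_no_check": Tier.SMALL,
    "title_generation": Tier.SMALL,
    "multi_step_reasoning": Tier.LARGE,
    "code_generation": Tier.LARGE,
    "long_form_user_facing": Tier.LARGE,
}


def model_for(task: str) -> str:
    """Resolve a task label to a model; unknown tasks fail loudly."""
    try:
        tier = TIER_FOR_TASK[task]
    except KeyError:
        # Raising avoids silently falling back to the most expensive model.
        raise ValueError(f"no tier assigned for task {task!r}") from None
    return MODEL_FOR_TIER[tier]
```

Moving a task from `Tier.LARGE` to `Tier.MID` then becomes a one-line change that can be evaluated per task.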

## Step 3 — Caching

- **Prompt caching**: system prompts, few-shot examples, and document context that repeat across calls should use the provider's prompt cache. Estimate the savings: repeated prefix tokens × call volume, billed at the provider's cached-read rate instead of the full input rate.
- **Response caching**: identical or near-identical requests (same input doc, same question) should hit an application-level cache (Redis keyed on a hash of normalized input). Look for deterministic tasks (temperature 0 or extraction tasks) — those are safe to cache aggressively.
- **Negative caching**: failed or refused generations that will fail again identically should be cached too, so the same doomed request is not paid for repeatedly.
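
The response-cache and negative-cache ideas above can be sketched as follows. This is a minimal illustration under stated assumptions: an in-memory dict stands in for Redis, normalization only collapses whitespace, and `DeterministicFailure`, `ResponseCache`, `cache_key`, and `cached_call` are hypothetical names. `call_model` is whatever function the application already uses to reach the provider.

```python
import hashlib
import json
import time


class DeterministicFailure(Exception):
    """Hypothetical error for failures expected to repeat (e.g. a refusal)."""


def cache_key(model: str, prompt: str, params: dict) -> str:
    """Hash of the model, whitespace-normalized prompt, and sampling params."""
    normalized = " ".join(prompt.split())
    payload = json.dumps(
        {"model": model, "prompt": normalized, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory stand-in for Redis with separate TTLs for successes and failures."""

    def __init__(self, ttl_ok=3600.0, ttl_fail=300.0, clock=time.monotonic):
        self._store = {}
        self._ttl_ok = ttl_ok
        self._ttl_fail = ttl_fail
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, ok, value = entry
        if self._clock() >= expires_at:
            del self._store[key]
            return None
        return ok, value

    def put(self, key, ok, value):
        ttl = self._ttl_ok if ok else self._ttl_fail
        self._store[key] = (self._clock() + ttl, ok, value)


def cached_call(cache, call_model, model, prompt, params):
    """Serve from cache when possible; cache repeatable failures as well."""
    key = cache_key(model, prompt, params)
    hit = cache.get(key)
    if hit is not None:
        ok, value = hit
        if not ok:
            raise DeterministicFailure(value)  # negative-cache hit, no model call
        return value
    try:
        value = call_model(model, prompt, params)
    except DeterministicFailure as exc:
        cache.put(key, False, str(exc))
        raise
    cache.put(key, True, value)
    return value
```

Transient errors (timeouts, rate limits) are deliberately not cached here; only failures the caller marks as repeatable go into the negative cache, with a shorter TTL.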

## Step 4 — Token waste

- Prompts that ship the whole document when a section would do; history that grows unbounded in multi-turn flows (no summarization or windowing).
- max_tokens set far above what's consumed (costs nothing directly, since billing is on tokens actually generated, but leaves runaway outputs uncapped); missing stop sequences.
- Verbose output formats: asking for JSON with long keys, prose wrappers around structured data, or chain-of-thought that is generated and returned even though users never see it.
- Retries that resend full context on parse failures instead of repairing locally.
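
Two of the fixes above, windowing conversation history and repairing parse failures locally, can be sketched as below. The token estimate is an assumed heuristic of roughly four characters per token and should be replaced by the provider's tokenizer for real numbers; `approx_tokens`, `window_history`, and `parse_json_locally` are hypothetical helper names.

```python
import json
import re


def approx_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token.
    return max(1, len(text) // 4)


def window_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose estimated tokens fit the budget."""
    kept = []
    used = 0
    for msg in reversed(messages):
        cost = approx_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def parse_json_locally(raw: str):
    """Try to recover a JSON object from a wrapped reply before paying for a retry."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span, which strips prose wrappers.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None
```

A retry that resends the full context is only needed when `parse_json_locally` returns `None`. Windowing drops old turns outright; summarizing them instead is the alternative when early context matters.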

## Step 5 — Controls and observability

- Does per-feature/per-tool usage metering exist? If not, recommend tagging every call with a feature label and recording tokens in/out; you cannot optimize what you don't attribute.
- Spend alerts and per-user/per-tenant rate limits for abuse-prone surfaces.
- A/B or shadow-test path to validate model downgrades safely before committing.
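
Per-feature attribution needs very little machinery to start. The sketch below assumes token counts are read from each provider response's usage fields and passed in by the caller; `Usage` and `UsageMeter` are hypothetical names, and a real deployment would likely emit these numbers to a metrics system rather than hold them in memory.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Usage:
    calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


class UsageMeter:
    """Aggregates token usage by feature label."""

    def __init__(self):
        self._by_feature = defaultdict(Usage)

    def record(self, feature: str, input_tokens: int, output_tokens: int) -> None:
        usage = self._by_feature[feature]
        usage.calls += 1
        usage.input_tokens += input_tokens
        usage.output_tokens += output_tokens

    def report(self) -> list[tuple[str, Usage]]:
        """Features sorted by total tokens, heaviest first."""
        return sorted(
            self._by_feature.items(),
            key=lambda item: item[1].total_tokens,
            reverse=True,
        )
```

Calling `meter.record("search_summary", input_tokens, output_tokens)` after each model call is enough to show which feature dominates spend.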

## Output format

1. **Top savings, ranked** — each with: call site (file:line), current pattern, proposed change, estimated % of that call's cost saved, risk level, and how to validate quality is unchanged.
2. **Quick wins** — changes safe to ship this week.
3. **Instrumentation gaps** — what must be measured before further optimization.

Estimates may be rough — state assumptions (volume, token counts). A directionally-correct ranked list beats false precision.
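
To keep the ranking honest, it can help to convert each per-call percentage into an absolute figure before sorting. A minimal sketch, assuming a rough monthly cost estimate exists for each call site; the `Finding` fields and `rank` helper are hypothetical, not a required schema.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    call_site: str           # file:line
    proposed_change: str
    est_monthly_cost: float  # assumed current spend for this call site
    est_pct_saved: float     # 0-100, share of that call's cost
    risk: str                # e.g. "low", "medium", "high"

    @property
    def est_monthly_savings(self) -> float:
        return self.est_monthly_cost * self.est_pct_saved / 100.0


def rank(findings: list[Finding]) -> list[Finding]:
    """Order findings by estimated absolute savings, largest first."""
    return sorted(findings, key=lambda f: f.est_monthly_savings, reverse=True)
```

A 20% saving on a heavily used call site can outrank a 90% saving on a rarely used one, which is why the sort key is the absolute estimate rather than the percentage.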