Debugging AI Agents: A Practical Guide for Developers
Common failure modes like hallucination, tool call errors, and cost runaway — and how trace data helps you fix them fast.
Your AI agent just did something inexplicable. Maybe it hallucinated a function that doesn't exist. Maybe it called the same tool 47 times in a row. Maybe it spent $12 on a task that should have cost $0.02. Whatever happened, you need to figure out why — fast.
This guide covers the most common agent failure modes and how to use trace data to diagnose and fix them.
The Five Agent Failure Modes
1. Hallucinated Tool Calls
The agent invokes a tool that doesn't exist, passes invalid parameters to a real tool, or fabricates a tool's result without ever making the call. This is the most common failure mode and the hardest to catch without trace data.
How to diagnose: Look at the trace for tool call steps where the function name doesn't match any registered tool, or where the parameters don't match the expected schema. Pay special attention to cases where the agent "claims" a tool returned a result but no actual tool execution appears in the trace.
Fix: Strengthen your tool descriptions, add parameter validation with clear error messages, and consider using structured output mode (function calling) instead of free-text tool invocation.
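The schema check described above can be sketched as a small validation layer that runs before any tool executes. Everything here (the `REGISTERED_TOOLS` registry shape and the `validate_tool_call` helper) is illustrative, not part of any particular framework:

```python
# Minimal sketch of pre-execution tool-call validation.
# REGISTERED_TOOLS and its shape are assumptions for illustration.
REGISTERED_TOOLS = {
    "search_web": {"required": {"query"}, "allowed": {"query", "max_results"}},
    "read_file": {"required": {"path"}, "allowed": {"path"}},
}

def validate_tool_call(name, params):
    """Return a list of problems with a proposed tool call (empty if valid)."""
    problems = []
    spec = REGISTERED_TOOLS.get(name)
    if spec is None:
        # Hallucinated tool name: nothing registered under this name.
        problems.append(f"unknown tool: {name}")
        return problems
    missing = spec["required"] - params.keys()
    extra = params.keys() - spec["allowed"]
    if missing:
        problems.append(f"missing parameters: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected parameters: {sorted(extra)}")
    return problems
```

Returning the problem list (rather than raising) lets you feed the error message straight back to the agent as a tool result, which is often enough for it to self-correct on the next step.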
2. Infinite Loops
The agent gets stuck in a cycle — retrying a failed tool call, re-asking the same question, or oscillating between two states without making progress. This is expensive (burning tokens on every iteration) and often invisible until you check the bill.
How to diagnose: In your traces, look for repeated patterns: the same tool being called with identical or nearly identical parameters, or the agent producing similar outputs in consecutive steps. Track step count per session — anything above your expected maximum is a red flag.
Fix: Implement hard limits on loop iterations and total steps per session. Add a "progress check" that forces the agent to evaluate whether it's making forward progress every N steps. Set per-session cost caps.
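A simple way to catch the "same call, over and over" pattern is a sliding window over recent tool calls. This is a hedged sketch — `detect_loop`, the window size, and the repeat threshold are all illustrative choices you'd tune for your workload:

```python
from collections import deque

def detect_loop(calls, window=6, threshold=3):
    """Flag a session when the same (tool, params) pair appears
    `threshold` times within the last `window` steps.
    Returns the step index where the loop was detected, else None."""
    recent = deque(maxlen=window)  # deque drops the oldest call automatically
    for step, call in enumerate(calls):
        recent.append(call)
        if recent.count(call) >= threshold:
            return step
    return None
```

In production you'd run this check incrementally as each step streams into your trace store, and pair it with a hard cap on total steps per session as a backstop.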
3. Cost Runaway
Cost runaway isn't always a loop. Sometimes the agent legitimately needs many steps, but the cost is disproportionate to the value: a research task that chains 30 web searches, or a code generation task that iterates 15 times on compilation errors. The agent is "working" but burning far more resources than intended.
How to diagnose: Track cumulative cost per session in real time. Compare against historical averages for similar task types. Flag sessions that exceed 3x the median cost.
Fix: Set budget limits per task type, not just per session. Implement escalation: when cost exceeds a threshold, pause the agent and notify a human. Consider using cheaper models for intermediate reasoning steps.
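The "3x the median" check above is a one-liner once you have per-session costs. A minimal sketch, assuming you can query costs by session id (the function name and input shape are illustrative):

```python
from statistics import median

def flag_expensive_sessions(costs_by_session, multiplier=3.0):
    """Return session ids whose cost exceeds `multiplier` x the median
    cost across the sessions given (e.g. sessions of the same task type)."""
    med = median(costs_by_session.values())
    return sorted(s for s, c in costs_by_session.items() if c > multiplier * med)
```

Note that the comparison group matters: compute the median per task type, as the text suggests, or a cheap task type will make every expensive-but-legitimate task look anomalous.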
4. Context Window Overflow
As agents accumulate conversation history, tool results, and intermediate reasoning, they can exceed the model's context window. The result: the model starts "forgetting" earlier instructions, losing track of its goal, or producing incoherent output as important context gets truncated.
How to diagnose: Monitor token count per LLM call. When input tokens approach the model's context limit, quality degrades. Look for traces where output quality drops sharply after a certain step count.
Fix: Implement context summarization — periodically compress the conversation history into a summary. Use separate memory stores for long-term context. Prune tool results to include only relevant information.
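Pruning can be sketched as keeping the system prompt plus the most recent turns that fit a token budget. This is a simplified illustration: the chars-divided-by-4 token estimate is a crude heuristic (use your model's real tokenizer in practice), and the summarization of dropped turns mentioned above is not shown:

```python
def prune_history(messages, token_budget,
                  count_tokens=lambda m: len(m["content"]) // 4):
    """Keep messages[0] (the system prompt) plus the most recent messages
    that fit within token_budget. Older turns are dropped here; in a real
    system they would be compressed into a summary instead."""
    system, rest = messages[0], messages[1:]
    kept, used = [], count_tokens(system)
    for msg in reversed(rest):           # walk backwards from the newest turn
        cost = count_tokens(msg)
        if used + cost > token_budget:
            break                        # budget exhausted; drop everything older
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))
```

Always reserving space for the system prompt first is the key design choice: it's the context the model "forgets" with the worst consequences.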
5. Silent Failures
The scariest failure mode: the agent completes successfully, returns a confident response, and is completely wrong. No error, no exception, no retry. Just a plausible-sounding answer that happens to be fabricated.
How to diagnose: This requires output validation — checking the agent's final output against ground truth or running automated quality checks. Trace data helps by showing you what the agent based its answer on: did it actually retrieve the data it claims, or did it skip the retrieval step entirely?
Fix: Add verification steps to critical agent workflows. Require citation of sources. Implement confidence scoring. For high-stakes decisions, require human-in-the-loop approval.
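One cheap verification step that falls straight out of trace data: cross-check the sources the agent cites against the retrieval steps the trace actually recorded. The field names (`type`, `url`) are assumptions about your trace format:

```python
def verify_citations(answer_sources, trace_steps):
    """Return the cited sources that do NOT correspond to any retrieval
    step recorded in the trace — i.e. citations the agent may have
    fabricated. Step/field names are illustrative."""
    retrieved = {s["url"] for s in trace_steps
                 if s.get("type") == "retrieval" and "url" in s}
    return [src for src in answer_sources if src not in retrieved]
```

A non-empty result doesn't prove the answer is wrong, but it's a strong signal to route that session to human review.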
Setting Up Alerts That Actually Matter
Not all alerts are created equal. Here's what to alert on (and what to just log):
- Alert immediately: Cost per session exceeds 5x median, error rate > 20% over 5 minutes, any agent stuck (no progress for > 60 seconds)
- Alert on threshold: Daily cost exceeds budget, new tool call error type detected, latency p95 > 2x normal
- Log only: Individual tool call failures (expected at some rate), token usage per request (for trend analysis), model response time fluctuations
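The tiers above translate directly into a routing function over per-session metrics. A sketch under stated assumptions: the metric field names and the hardcoded median are placeholders, and only a subset of the rules is shown:

```python
MEDIAN_COST = 0.05  # assumed historical median cost per session, in dollars

def classify(metric):
    """Map a session metric snapshot to one of the three alert tiers."""
    if metric.get("cost", 0) > 5 * MEDIAN_COST:
        return "alert_immediately"          # cost exceeds 5x median
    if metric.get("error_rate_5m", 0) > 0.20:
        return "alert_immediately"          # error rate > 20% over 5 minutes
    if metric.get("seconds_since_progress", 0) > 60:
        return "alert_immediately"          # agent stuck
    if metric.get("daily_cost", 0) > metric.get("daily_budget", float("inf")):
        return "alert_threshold"            # daily budget blown
    return "log_only"
```

Keeping the rules as plain code (rather than buried in a dashboard UI) makes them reviewable and testable like any other part of your system.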
The Power of Session Replay
When a user reports a bad experience or you spot an anomaly in your metrics, session replay lets you watch exactly what the agent did — step by step, in order, with full context. It's the difference between "something went wrong around 3 PM" and "the agent received this prompt, made these 8 tool calls, got a timeout on step 5, retried with different parameters, and returned an incomplete result."
Session replay turns debugging from guesswork into forensics. It's the single most valuable capability in an agent observability tool.
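At its simplest, replay is just rendering recorded steps in order. A minimal sketch assuming a JSONL trace with `step`, `type`, `tool`, and `status` fields (all illustrative — real trace schemas vary):

```python
import json

def replay(trace_lines):
    """Yield one human-readable line per step of a JSONL trace, in order."""
    for line in trace_lines:
        step = json.loads(line)
        yield (f'{step["step"]:>3}  {step["type"]:<10} '
               f'{step.get("tool", "-"):<12} {step["status"]}')
```

Even this bare-bones view makes the "timeout on step 5, retried with different parameters" story above legible at a glance.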
Stop guessing. Start tracing.
OpenClaw Trace captures every step of your agent's execution. Free, open source, one command to start.