AI Incident Response: What to Do When Your LLM Fails in Production
At 2:47 AM, your PagerDuty alert fires. You pull up the dashboard. Error rate is normal. Latency is normal. Throughput is normal. Nothing looks wrong.
Meanwhile, your LLM has been giving customers subtly incorrect answers for six hours.
This is the defining challenge of AI incident response: the system can be functionally "up" while catastrophically broken. Traditional SRE playbooks weren't built for this. On-call engineers trained to look for 5xx errors and latency spikes will miss the class of failure that matters most in AI systems.
Here's how to close that gap.
Why LLM Failures Are Different
Software incidents usually announce themselves. A database goes down — connections fail, errors spike, the monitoring dashboard turns red. You page the on-call, identify the root cause, roll back or patch.
LLM failures don't work this way for four reasons:
1. Non-deterministic outputs. The same prompt produces different outputs across calls. This is by design, but it makes regression detection hard. You can't diff today's output against yesterday's and expect a match, even when nothing has changed.
2. Quality degradation is silent. When a model starts hallucinating more, outputting shorter responses, or drifting off-domain, the API returns HTTP 200. Your load balancer sees healthy responses. Your error rate monitor sees nothing wrong. The only signal is in the content — and content isn't a metric your infrastructure understands by default. We covered this problem in depth in our post on detecting AI model drift before your users do.
3. No stack traces for quality issues. When a traditional service breaks, you get a stack trace. When an LLM starts hallucinating, you get a response object that looks correct. The cost of the failure is measured in user trust and downstream business impact, not log lines. As we documented in the hidden cost of AI hallucinations in production, these silent failures compound over time before anyone notices.
4. Costs can spike independently of traffic. A prompt injection attack or a runaway agentic loop can cause your token spend to multiply without any corresponding increase in user-facing activity. By the time your monthly invoice arrives, thousands of dollars may have been wasted on a failure mode that had no traditional error signature. Our LLM cost monitoring guide details how to catch this in real time.
The LLM Incident Taxonomy
Not all AI incidents are equal. Treating a cost anomaly the same as a safety violation wastes time; missing a quality degradation because you're only watching for outages leaves your users exposed. Here are the four categories every AI incident response plan needs to cover:
(a) Total Outage
What it looks like: The model API is unreachable or returning consistent errors. HTTP 5xx rates spike. Latency jumps or requests time out entirely.
Detection: Standard infrastructure monitoring catches this — error rate alerts, latency thresholds, health check failures. This is the incident type traditional SRE tools handle well.
Response: Fail over to a backup model, serve a graceful degraded experience, or queue requests for retry. The playbook here is identical to a downstream service outage.
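A minimal failover sketch of that playbook, assuming a primary and a backup provider client (call_primary and call_backup here are placeholders for your own SDK calls): transient errors get a brief retry, a sustained outage fails over, and the last resort is a graceful degraded message rather than a raw error.

```python
# Minimal failover sketch for a total outage. call_primary and call_backup are
# placeholders for your own provider clients; swap in real SDK calls.
import time

class UpstreamError(Exception):
    """Raised when a model provider returns a 5xx or times out."""

def call_primary(prompt: str) -> str:
    raise UpstreamError("primary provider unreachable")  # simulate an outage

def call_backup(prompt: str) -> str:
    return f"[backup model] response to: {prompt}"

def generate_with_failover(prompt: str, retries: int = 2, backoff_s: float = 0.5) -> str:
    # Retry the primary briefly; transient blips shouldn't trigger failover.
    for attempt in range(retries):
        try:
            return call_primary(prompt)
        except UpstreamError:
            time.sleep(backoff_s * (attempt + 1))
    # Primary is down: fail over to the backup model rather than erroring out.
    try:
        return call_backup(prompt)
    except UpstreamError:
        # Last resort: a graceful degraded experience instead of a raw 500.
        return "We're having trouble generating a full answer right now. Please try again shortly."

if __name__ == "__main__":
    print(generate_with_failover("What is our refund policy?"))
```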
(b) Quality Degradation
What it looks like: Response coherence drops, hallucination frequency increases, outputs drift from expected format or domain, or response length collapses. The API is healthy; the outputs are not.
Detection: This requires semantic monitoring — scoring outputs against quality benchmarks, tracking hallucination signals (hedge language, numerical inconsistencies, unsupported factual claims), monitoring format compliance rates and response length distributions. Standard infrastructure metrics are blind to this failure type.
Escalation threshold: Quality score drops more than 15% from baseline over any 30-minute window, or format compliance falls below 95% over 1,000 calls.
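As a sketch of what that check can look like in code, assuming you already have an evaluator producing a 0-to-1 quality score per response and a format validator: two rolling windows, one alert per threshold. The baseline value and window sizes below are placeholders.

```python
# Rolling-window check for the quality degradation thresholds above.
# Wire the alert strings into your own pager; the scorer is assumed to exist upstream.
from collections import deque
from statistics import mean

BASELINE_QUALITY = 0.82         # rolling baseline from your own eval harness
QUALITY_DROP_THRESHOLD = 0.15   # >15% drop from baseline triggers a page
FORMAT_COMPLIANCE_FLOOR = 0.95  # <95% compliance over 1,000 calls triggers a page

quality_window = deque(maxlen=600)   # e.g. roughly 30 minutes of scored samples
format_window = deque(maxlen=1000)   # last 1,000 calls, True = format-compliant

def record_call(quality_score: float, format_ok: bool) -> list[str]:
    """Record one scored response and return any alerts that should fire."""
    quality_window.append(quality_score)
    format_window.append(format_ok)
    alerts = []
    if len(quality_window) == quality_window.maxlen:
        drop = (BASELINE_QUALITY - mean(quality_window)) / BASELINE_QUALITY
        if drop > QUALITY_DROP_THRESHOLD:
            alerts.append(f"quality score down {drop:.0%} vs baseline")
    if len(format_window) == format_window.maxlen:
        compliance = sum(format_window) / len(format_window)
        if compliance < FORMAT_COMPLIANCE_FLOOR:
            alerts.append(f"format compliance at {compliance:.1%} over last 1,000 calls")
    return alerts
```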
Response: Pin the model version if a recent update coincides with the degradation. Activate a stricter prompt version. Route traffic to an alternate model while investigating. Do not just wait — quality degradation compounds as users receive bad outputs.
(c) Cost Anomaly
What it looks like: Token usage spikes without a corresponding increase in request volume. Cost-per-request jumps. A single user or endpoint is consuming disproportionate tokens.
Detection: Track tokens-per-request as a standalone metric alongside total spend. Set rate limits per user and per session. Alert on any request exceeding 3x the p95 token count baseline.
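A minimal sketch of that tokens-per-request check, assuming the p95 baseline is computed from a rolling in-memory window rather than pulled from your metrics store:

```python
# Per-request token tracking with a 3x-p95 anomaly alert.
from collections import deque

token_history = deque(maxlen=10_000)  # recent tokens-per-request samples

def p95(samples) -> float:
    ordered = sorted(samples)
    return ordered[int(0.95 * (len(ordered) - 1))]

def record_request(user_id: str, tokens_used: int) -> str | None:
    """Record a request's token count; return an alert string if it's anomalous."""
    alert = None
    if len(token_history) >= 100:  # wait for a minimal baseline before alerting
        baseline = p95(token_history)
        if tokens_used > 3 * baseline:
            alert = (f"user {user_id} used {tokens_used} tokens, "
                     f">3x the p95 baseline of {baseline:.0f}")
    token_history.append(tokens_used)
    return alert
```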
Response: Rate-limit the offending user or endpoint immediately. Investigate whether the spike is a prompt injection attack, a runaway agentic loop, or a legitimate load pattern. Adjust token limits at the API layer before proceeding.
(d) Safety Violation
What it looks like: Harmful, offensive, or materially incorrect outputs reach users. This includes PII leakage, policy violations, dangerous advice, or content that creates legal or reputational exposure.
Detection: Content classifiers on outputs, PII pattern matching, policy keyword detection, user-flagging mechanisms. Detection must happen before outputs reach users — once an output has been delivered, the damage is already done. Our guide on AI guardrails in production covers the five-layer approach to catching violations at the infrastructure layer.
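A minimal pre-delivery screen might look like the sketch below. The regex patterns and policy keywords are illustrative stand-ins, not a complete safety layer; a production setup would add content classifiers on top.

```python
# Pre-delivery screen: regex-based PII matching plus a policy keyword list.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
POLICY_KEYWORDS = {"internal use only", "confidential", "do not distribute"}

def screen_output(text: str) -> list[str]:
    """Return the list of violations found; empty means safe to deliver."""
    violations = [f"pii:{name}" for name, pattern in PII_PATTERNS.items()
                  if pattern.search(text)]
    lowered = text.lower()
    violations += [f"policy:{kw}" for kw in POLICY_KEYWORDS if kw in lowered]
    return violations

def deliver(text: str) -> str:
    violations = screen_output(text)
    if violations:
        # Block before the user sees it; the caller should page on-call as a P0.
        raise RuntimeError(f"output blocked, violations: {violations}")
    return text
```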
Response: This is a P0 incident regardless of scale. Block the output immediately, notify affected users if exposure has occurred, engage legal and compliance as appropriate, and treat root-cause analysis as mandatory before resuming normal operation.
Building the Incident Response Playbook
A playbook that only handles outages leaves three of your four failure modes unaddressed. Here's the structure that covers all of them:
Detection layer: For each failure type, define the exact metric, threshold, and alerting mechanism. Quality degradation requires semantic scoring. Cost anomalies require per-request spend tracking. Safety violations require output classifiers. Total outages require standard infrastructure monitors. You need all four instrumented.
Escalation thresholds: Define quantitative triggers for paging the on-call engineer vs. auto-remediating vs. escalating to engineering lead. Example: cost anomaly >$50/hour auto-throttles and pages; safety violation of any severity pages immediately with no auto-remediation; quality degradation at severity 1 pages after 10 minutes, severity 2 creates a ticket.
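One way to keep those triggers auditable is to express the matrix as data and route on it. The trigger expressions, severity labels, and action names below are placeholders for your own tooling.

```python
# Escalation matrix as data, so routing decisions are reviewable in one place.
ESCALATION_RULES = {
    "cost_anomaly": {
        "trigger": "spend_rate_usd_per_hour > 50",
        "actions": ["auto_throttle", "page_oncall"],
    },
    "safety_violation": {
        "trigger": "any_severity",
        "actions": ["page_oncall"],        # never auto-remediate silently
    },
    "quality_degradation_sev1": {
        "trigger": "sustained_for_minutes >= 10",
        "actions": ["page_oncall"],
    },
    "quality_degradation_sev2": {
        "trigger": "threshold_crossed",
        "actions": ["create_ticket"],
    },
}

def route(incident_type: str) -> list[str]:
    """Look up which actions fire for an incident type; unknown types page by default."""
    rule = ESCALATION_RULES.get(incident_type)
    return rule["actions"] if rule else ["page_oncall"]

print(route("cost_anomaly"))  # ['auto_throttle', 'page_oncall']
```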
Rollback strategies (a short code sketch of all three follows the list):
- Model pinning: Lock to the last known-good model version. Most providers support version pinning in the API call. Document which version you're pinned to and when you pinned it.
- Prompt versioning: Maintain named prompt versions with checksums. When quality degrades, you can roll back to a prior prompt without a code deploy.
- Traffic shifting: Route a percentage of traffic to an alternate model or provider while diagnosing. Start at 10%, validate quality, increase if stable.
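A minimal sketch of the three levers above, assuming illustrative model names, prompt texts, and a 10% starting traffic split:

```python
# Model pinning, checksummed prompt versions, and percentage-based traffic shifting.
import hashlib
import random

# Model pinning: record exactly which version you are pinned to and when.
PINNED_MODEL = {"name": "primary-model", "version": "2024-06-01", "pinned_at": "2024-06-15T03:10Z"}

# Prompt versioning: named versions with checksums, so a rollback is a config change.
PROMPT_VERSIONS = {
    "v3": "You are a support assistant. Answer only from the provided context.",
    "v2": "You are a support assistant. Answer concisely.",
}
PROMPT_CHECKSUMS = {name: hashlib.sha256(text.encode()).hexdigest()[:12]
                    for name, text in PROMPT_VERSIONS.items()}
ACTIVE_PROMPT = "v3"

# Traffic shifting: start at 10% to the alternate model, raise it if quality holds.
ALTERNATE_TRAFFIC_FRACTION = 0.10

def pick_model() -> str:
    if random.random() < ALTERNATE_TRAFFIC_FRACTION:
        return "alternate-model"
    return f"{PINNED_MODEL['name']}@{PINNED_MODEL['version']}"

def rollback_prompt(to_version: str) -> str:
    """Switch the active prompt without a code deploy; returns the checksum for the incident log."""
    global ACTIVE_PROMPT
    ACTIVE_PROMPT = to_version
    return PROMPT_CHECKSUMS[to_version]

print(pick_model(), rollback_prompt("v2"))
```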
Communication protocol: Define who gets notified and when. For safety violations, legal and comms need to know within 30 minutes regardless of hour. For quality degradation, the product team should know within the first SLA window. Total outages follow your existing incident communication process.
Post-Incident Analysis for AI Systems
Traditional postmortems ask what failed, when, why, and how to prevent it. For AI incidents, you need five additional data points that standard postmortem templates won't capture:
1. Prompt version at time of incident. What was the exact prompt — not the template, the rendered prompt — that was in use when degradation started? Store prompt versions with checksums and timestamps.
2. Model version and configuration. Which model, which version, which temperature and sampling parameters? Provider model updates are silent — you won't get a notification that behavior changed unless you're pinned to an explicit version.
3. Sample outputs. Capture a statistically representative sample of outputs from the incident window. Qualitative analysis of actual outputs often reveals patterns that metrics don't — a systematic bias, a new failure mode, a prompt injection technique.
4. Drift metrics at incident time. What were your semantic similarity scores, hallucination rates, and format compliance rates in the 2 hours before the incident was declared? Drift often precedes failure — the post-incident analysis should determine whether the signal was visible and missed, or genuinely invisible.
5. Input distribution shift. Did the distribution of incoming requests change before the incident? New user segments, new query types, or a traffic source change can degrade a model that worked fine on previous inputs. Compare the input distribution in the incident window against your baseline.
These five data points turn a vague "the model started behaving badly" postmortem into a reproducible diagnosis with a clear prevention story.
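A sketch of what capturing those five data points at alert time can look like, assuming the field names and sampling strategy shown here rather than any particular tool:

```python
# Incident snapshot captured at alert time, so the postmortem starts with
# evidence instead of log archaeology. Field names are illustrative.
import hashlib
import json
import random
from datetime import datetime, timezone

def capture_incident_snapshot(rendered_prompt: str,
                              model_config: dict,
                              recent_outputs: list[str],
                              drift_metrics: dict,
                              input_counts_baseline: dict,
                              input_counts_incident: dict,
                              sample_size: int = 50) -> dict:
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        # 1. Exact rendered prompt, stored by checksum plus full text.
        "prompt_checksum": hashlib.sha256(rendered_prompt.encode()).hexdigest(),
        "prompt_text": rendered_prompt,
        # 2. Model version and sampling configuration in effect.
        "model_config": model_config,
        # 3. A random sample of outputs from the incident window.
        "sample_outputs": random.sample(recent_outputs, min(sample_size, len(recent_outputs))),
        # 4. Drift metrics from the hours preceding the alert.
        "drift_metrics": drift_metrics,
        # 5. Input distribution before vs. during the incident.
        "input_distribution": {"baseline": input_counts_baseline,
                               "incident": input_counts_incident},
    }

snapshot = capture_incident_snapshot(
    rendered_prompt="You are a support assistant...",
    model_config={"model": "primary-model", "version": "2024-06-01", "temperature": 0.2},
    recent_outputs=["example output A", "example output B"],
    drift_metrics={"semantic_similarity": 0.71, "hallucination_rate": 0.08},
    input_counts_baseline={"billing": 120, "shipping": 300},
    input_counts_incident={"billing": 110, "shipping": 90, "unknown": 400},
)
print(json.dumps(snapshot, indent=2)[:200])
```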
Automated Detection Before Incidents Reach Users
The pattern above assumes you detect the incident after it starts. The better approach is to detect the signal before it becomes an incident.
GuardLayer monitors all four failure categories in real time — tracking quality scores, semantic drift, cost-per-request, and safety violations against configurable thresholds. When a threshold is crossed, alerts fire before users are affected. The post-incident question "when did this start?" becomes answerable from the dashboard, with exact timestamps, sample outputs, and drift metrics already captured.
The SRE teams we talk to spend too much time in retrospective investigation — pulling logs, reconstructing timelines, figuring out what the model was doing at 2 AM. That work should be automated. Your on-call engineer should arrive at the incident with context, not have to build it from scratch.
AI incidents will happen. The teams that handle them well aren't the ones who were lucky enough to catch them early — they're the ones who built detection into the system before the first incident occurred.
GuardLayer monitors LLM quality, cost, and safety violations in real time. Start a free trial — no DevOps required.