7 Metrics Every ML Team Should Track in Production
Most ML teams have monitoring. They're tracking the usual suspects — server uptime, HTTP 5xx rates, p99 latency at the infrastructure layer. What they're not tracking is the stuff that actually breaks AI systems: silent model degradation, token cost drift, grounding failures that never throw an exception.
This is the ML production monitoring checklist we use at GuardLayer. Seven metrics, each with a practical example of what going wrong looks like and how to catch it.
1. Model Latency P50 / P95
What it is: The median and 95th-percentile response time from your model provider — measured from request sent to first token received (TTFT) and last token received (total completion time).
Why it matters: P50 tells you what typical users experience. P95 tells you what your worst-case 1-in-20 user sees. A P50 of 800ms with a P95 of 12 seconds is a sign your traffic is bimodal — some requests are triggering extremely long completions, probably due to uncapped max_tokens or prompts that induce verbose chain-of-thought.
What going wrong looks like: You deploy a new prompt template on Friday. By Monday, your P95 latency has tripled. The prompt is correct — it's just much more verbose than the previous version, and you didn't cap output tokens. Users experience 15-second waits. No error rate spike. No alert fires.
How to catch it: Separate latency tracking per endpoint, model, and prompt template version. Alert when P95 crosses 2× the 7-day rolling baseline, not just a fixed threshold.
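A rough sketch of that baseline rule, assuming you already log raw latency samples keyed by endpoint, model, and prompt template version (the data shapes, sample-count guard, and function names here are illustrative, not a prescribed implementation):

```python
from statistics import quantiles

def p95(values):
    """95th-percentile of a list of latency samples (milliseconds)."""
    return quantiles(values, n=100)[94]

def latency_alerts(current_window, baseline_windows, factor=2.0):
    """Compare the current P95 per (endpoint, model, template_version)
    against a baseline built from the prior 7 days of samples.

    Both arguments map (endpoint, model, template_version) -> list of
    latency samples in ms. Returns the keys that breached 2x baseline."""
    alerts = []
    for key, samples in current_window.items():
        baseline_samples = baseline_windows.get(key, [])
        if len(samples) < 20 or len(baseline_samples) < 20:
            continue  # too few samples to trust either percentile
        current, baseline = p95(samples), p95(baseline_samples)
        if current > factor * baseline:
            alerts.append((key, current, baseline))
    return alerts
```

The point of keying by template version is that a Friday prompt change shows up as its own series on Monday, rather than being averaged into the old template's baseline.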
2. Token Cost Per Request
What it is: The average (and p95) cost in dollars per API call, computed as input tokens × your provider's input rate plus output tokens × the output rate, broken down by model and endpoint.
Why it matters: Token costs are the most common vector for billing surprises. A prompt that worked fine in testing can be 10× more expensive in production because real user inputs are longer than your test fixtures, or because users found a way to trigger unexpectedly verbose completions.
What going wrong looks like: A customer-facing summarization feature averages $0.003/request in staging. After launch, average real-world documents are 3× longer than test data. Cost per request climbs to $0.022. At 50,000 daily requests, that's $1,100/day instead of $150/day — a number that doesn't show up until the monthly invoice.
How to catch it: Track cost per request as a time-series metric. Set rolling alerts when daily average cost-per-request exceeds 150% of the prior 7-day average. Break this down by model (GPT-4o vs. Claude 3.5 vs. Gemini) to catch cases where routing logic sends expensive models traffic intended for cheap ones.
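A minimal sketch of the per-request cost calculation and the 150% rolling alert. The pricing table below is a placeholder; substitute your provider's actual per-token rates:

```python
# Illustrative $ per 1M tokens; replace with your provider's real pricing.
PRICING = {
    "gpt-4o":         {"input": 2.50, "output": 10.00},
    "claude-3-5":     {"input": 3.00, "output": 15.00},
    "gemini-1.5-pro": {"input": 1.25, "output": 5.00},
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one call; input and output tokens are priced separately."""
    rates = PRICING[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

def cost_drift_alert(today_costs, prior_7d_costs, threshold=1.5):
    """Fire when today's average cost per request exceeds 150% of the
    prior 7-day average. Both arguments are lists of per-request costs."""
    if not today_costs or not prior_7d_costs:
        return False
    today_avg = sum(today_costs) / len(today_costs)
    baseline_avg = sum(prior_7d_costs) / len(prior_7d_costs)
    return today_avg > threshold * baseline_avg
```

Run the same check per model and per endpoint so a routing bug that sends cheap-tier traffic to an expensive model surfaces as its own alert rather than a small bump in the blended average.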
3. Hallucination / Grounding Score
What it is: A 0–100 score indicating how well model responses are grounded in retrieved context or verified facts — typically produced by a fast evaluator model running in parallel with production traffic.
Why it matters: Hallucinations don't throw exceptions. An LLM confidently producing incorrect information is indistinguishable at the API layer from one producing correct information. Without an active evaluation layer, you find out about grounding failures from support tickets — or worse, from users who made a decision based on bad output.
What going wrong looks like: Your RAG-based knowledge base assistant scores 91% grounding on your evaluation set. Three weeks after a knowledge base update, the chunking strategy for new documents differs subtly from old ones. Grounding score drops to 67% for queries about new content. Users report the chatbot is "making things up." No infrastructure metric fired.
How to catch it: Run a lightweight evaluator (5-signal prompt: numerical mismatch, temporal claims, invented entities, hedge language, length anomaly) against a sample of production responses. Alert when the 24-hour rolling grounding score drops more than 8 points from the 7-day baseline.
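One way the sampling and baseline comparison could look, assuming the evaluator itself is a separate call that returns a 0–100 score; the helper names and the 5% sample rate are assumptions to tune for your traffic volume:

```python
import random

def sample_for_evaluation(responses, rate=0.05):
    """Sample a fraction of production responses for grounding evaluation."""
    return [r for r in responses if random.random() < rate]

def grounding_alert(last_24h_scores, last_7d_scores, max_drop=8.0):
    """Alert when the 24-hour rolling grounding score falls more than
    `max_drop` points below the 7-day baseline. Scores are 0-100."""
    if not last_24h_scores or not last_7d_scores:
        return False
    rolling_24h = sum(last_24h_scores) / len(last_24h_scores)
    baseline_7d = sum(last_7d_scores) / len(last_7d_scores)
    return (baseline_7d - rolling_24h) > max_drop
```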
4. Retrieval Quality Drift
What it is: The similarity score between user queries and the documents being retrieved from your vector store — measured as cosine similarity for the top-k results, tracked over time.
Why it matters: In RAG systems, the retrieval layer is where most quality failures originate. If your embedding model, chunking strategy, or index contents change — even subtly — retrieval relevance degrades silently. The LLM will still produce fluent-sounding responses. They'll just be grounded in the wrong documents.
What going wrong looks like: A quarterly knowledge base refresh adds 40,000 new documents using a slightly different chunking function (512 tokens vs. 256 tokens from before). Retrieval similarity scores for typical queries drop from 0.82 to 0.71. Answers start drawing from tangentially related documents. Users notice "the assistant doesn't seem to know about X anymore" — even though X is in the knowledge base.
How to catch it: Log top-k retrieval scores for every production query. Track the rolling median similarity score per collection. Alert when 7-day median drops more than 0.05 from the prior 30-day baseline.
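A sketch of the logging and drift check, assuming your vector store client returns cosine similarity scores with each hit. `store.search` and its result fields are placeholders for whatever your client actually exposes:

```python
from statistics import median

def log_retrieval_scores(store, query, collection, k=5):
    """Hypothetical helper: run retrieval and return the top-k cosine
    similarity scores so they can be logged alongside the request."""
    results = store.search(query=query, collection=collection, top_k=k)
    return [hit.score for hit in results]

def retrieval_drift_alert(last_7d_scores, prior_30d_scores, max_drop=0.05):
    """Alert when the 7-day median top-k similarity for a collection drops
    more than 0.05 below the prior 30-day median."""
    if not last_7d_scores or not prior_30d_scores:
        return False
    return (median(prior_30d_scores) - median(last_7d_scores)) > max_drop
```

Tracking the median per collection (rather than globally) is what lets a chunking change in one quarterly refresh stand out instead of being averaged away by older, unaffected content.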
5. Error Rate by Model Version
What it is: The percentage of requests that result in provider errors (rate limits, context length exceeded, content filter blocks, timeouts) — segmented by model version and broken down by error type.
Why it matters: Aggregate error rate hides version-specific regressions. When OpenAI ships GPT-4o-2025-03 and your router starts sending traffic to it, any error pattern specific to that version gets diluted in your aggregate metrics. Error rates by version let you isolate problems to their source.
What going wrong looks like: Your default model is updated by your provider from gpt-4o-2024-08-06 to gpt-4o-2025-01-15. The new version has a shorter effective context window for certain content types, causing context_length_exceeded errors for your longest documents. Aggregate error rate nudges from 0.3% to 0.9%. By the time you notice, 6,000 requests have failed silently and retried — tripling your actual API spend for those requests.
How to catch it: Tag every API call with the exact model version string. Group error rates by model_version + error_type. Alert when any (model_version, error_type) pair exceeds 1% in a rolling 1-hour window.
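A sketch of that grouping logic, assuming each logged call carries the exact model version string and an error type (None on success); the event shape is illustrative:

```python
from collections import Counter

def error_rate_alerts(events, threshold=0.01):
    """`events` is an iterable of dicts from the last rolling hour, each with
    'model_version' and 'error_type' (None for successful calls).
    Returns (model_version, error_type, rate) tuples exceeding the threshold."""
    totals = Counter()
    errors = Counter()
    for e in events:
        totals[e["model_version"]] += 1
        if e["error_type"] is not None:
            errors[(e["model_version"], e["error_type"])] += 1
    alerts = []
    for (version, err_type), count in errors.items():
        rate = count / totals[version]
        if rate > threshold:
            alerts.append((version, err_type, rate))
    return alerts
```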
6. Input / Output Token Ratio Anomalies
What it is: The ratio of output tokens to input tokens per request, tracked as a time-series distribution. Normal production traffic has a characteristic ratio range for each endpoint.
Why it matters: Unexpected shifts in this ratio signal prompt injection, jailbreak attempts, or abuse patterns — before they appear in cost or error metrics. An attacker who successfully extends your output via injection will move your I/O ratio sharply. A prompt leak that exposes your system prompt in responses will do the same.
What going wrong looks like: Your customer support bot typically produces 150–300 tokens per response against 200–400 token inputs — ratio of roughly 0.5–1.5. A new jailbreak technique circulates on social media; users discover it on your platform. A subset of sessions start producing 2,000–4,000 token outputs. The I/O ratio for affected sessions spikes to 8–12. No policy rule fires. Cost per affected session is 15× normal.
How to catch it: Compute the I/O token ratio per request. Track the p95 ratio per endpoint on a rolling 4-hour basis. Alert when endpoint-level p95 I/O ratio exceeds 2× the 7-day rolling p95.
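A sketch of the ratio check using the output-to-input definition above; the window sizes and minimum sample counts are assumptions you would tune per endpoint:

```python
from statistics import quantiles

def io_ratio(output_tokens, input_tokens):
    """Output-to-input token ratio for a single request."""
    return output_tokens / max(input_tokens, 1)

def io_ratio_alert(last_4h_ratios, last_7d_ratios, factor=2.0):
    """Alert when an endpoint's 4-hour p95 output/input ratio exceeds
    2x its 7-day rolling p95."""
    if len(last_4h_ratios) < 20 or len(last_7d_ratios) < 20:
        return False  # not enough traffic to compute a stable percentile
    p95_4h = quantiles(last_4h_ratios, n=100)[94]
    p95_7d = quantiles(last_7d_ratios, n=100)[94]
    return p95_4h > factor * p95_7d
```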
7. SLA Compliance Percentage
What it is: The percentage of requests that complete within your defined SLA threshold (e.g., p95 < 3 seconds, error rate < 0.5%) in each rolling 24-hour period, tracked per customer tier.
Why it matters: Aggregates lie. Your overall p95 latency might look fine while a specific customer — the one on your Enterprise plan with an SLA in their contract — is experiencing degraded service. SLA compliance as a per-customer, per-tier metric is the only way to catch this before they raise it with your account manager.
What going wrong looks like: You have 12 Enterprise customers on a 2-second p95 SLA. A new large free-tier customer starts sending very high-volume traffic with long prompts. Resource contention degrades p95 for Enterprise customers to 3.4 seconds. Your aggregate p95 looks fine — the volume from Enterprise customers is dwarfed by the free-tier traffic. Three Enterprise customers notice and escalate simultaneously.
How to catch it: Track SLA compliance as an explicit metric per customer and per tier. Compute the percentage of requests meeting your SLA threshold in each rolling 24-hour window. Alert when any paid customer tier's compliance drops below 98% for two consecutive hours.
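A sketch of the per-tier compliance check. The thresholds (3-second latency, 98% floor, two consecutive hours) come from the numbers above; the field and tier names are placeholders:

```python
def sla_compliance(requests, latency_sla_ms=3000):
    """Fraction of requests in a rolling 24-hour window that completed
    within the SLA latency threshold and without error."""
    if not requests:
        return 1.0
    ok = sum(1 for r in requests
             if r["latency_ms"] <= latency_sla_ms and not r["error"])
    return ok / len(requests)

def sla_breach(hourly_compliance_by_tier, floor=0.98, consecutive=2):
    """`hourly_compliance_by_tier` maps tier -> list of hourly compliance
    values, most recent last. Returns paid tiers that have been below the
    floor for the last `consecutive` hours."""
    breaches = []
    for tier, series in hourly_compliance_by_tier.items():
        if tier == "free":
            continue  # only paid tiers carry contractual SLAs here
        if len(series) >= consecutive and all(v < floor for v in series[-consecutive:]):
            breaches.append(tier)
    return breaches
```

Computing this per customer as well as per tier is what surfaces the single Enterprise account being squeezed by a noisy free-tier neighbor before the escalation email arrives.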
Putting It Together
These seven metrics cover the failure modes that don't show up in standard infrastructure monitoring: latency regressions, cost drift, silent grounding failures, retrieval degradation, version-specific errors, abuse patterns, and SLA violations.
Most teams are tracking two or three of these informally. The difference between informal tracking and production-grade observability is:
- Baselines — alert on deviation from historical norms, not just absolute thresholds
- Segmentation — break every metric down by model version, customer tier, and endpoint
- Evaluation — run active scoring (grounding, retrieval quality) against production traffic samples, not just logged inputs and outputs
If you're building this yourself, expect 6–8 weeks of engineering time to get all seven implemented and tuned. For most teams shipping production AI, that's time better spent on the product.
GuardLayer tracks all 7 automatically — instrumented in under 10 minutes, no changes to your inference stack. See pricing details here.
GuardLayer Engineering Team — April 2026