The Express Middleware Trap That Broke Our AI Monitoring

On March 3rd, we shipped a routine refactor to our ingestion pipeline. On March 14th, a customer asked why their error rate dashboard showed zero violations for the past week and a half.

Their system hadn't gone quiet. It had been throwing errors the whole time. We just weren't catching them.

This is the post-mortem: what broke, why it was invisible for 11 days, and what we changed to make sure it can't happen again.


The Setup

GuardLayer's core loop is straightforward. AI API calls hit our proxy, we evaluate every request and response against the customer's configured rules, log the results, and forward the traffic. The evaluation pipeline runs as Express middleware — a chain of functions that inspect, score, and optionally block requests before they reach the upstream provider.

The middleware stack looked roughly like this:

authMiddleware → rateLimiter → requestValidator → evaluationPipeline → proxyForward → responseEvaluator → logger

Each middleware calls next() to pass control to the next function. If something throws, Express's error handler catches it, logs it, and returns an appropriate response. Standard Express patterns. Nothing exotic.
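
For readers less familiar with the pattern, here's a minimal sketch of that wiring. The names mirror the stack above; the real code differs:

const express = require('express');
const app = express();

// Each middleware receives (req, res, next) and hands off via next().
app.use(authMiddleware);
app.use(rateLimiter);
app.use(requestValidator);
app.use(evaluationPipeline);
app.use(proxyForward);
app.use(responseEvaluator);
app.use(logger);

// Express treats a four-argument function as an error handler.
// A synchronous throw, or an explicit next(err), lands here.
app.use((err, req, res, next) => {
  console.error(err);
  res.status(500).json({ error: 'internal error' });
});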

The evaluation pipeline — the part that actually checks for hallucinations, policy violations, cost anomalies — is async. It awaits calls to our scoring models, runs regex-based content filters, and queries the customer's rule configuration from the database. A typical evaluation takes 15-40ms.
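
Simplified, the monolithic middleware looked something like this. loadCustomerConfig and the scoring helpers are stand-ins for our internal functions:

const evaluationPipeline = async (req, res, next) => {
  // Rule configuration lives in the customer's database record.
  const config = await loadCustomerConfig(req.customerId);

  req.evaluationResults = {
    content: await runContentFilters(req.body, config),        // regex-based
    cost: await evaluateCost(req.body, config),
    hallucination: await scoreHallucination(req.body, config), // scoring model call
  };

  next();
};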


The Refactor

We were splitting our monolithic evaluationPipeline middleware into smaller, composable functions: contentFilter, costEvaluator, hallucinationScorer, policyEnforcer. The goal was testability. Each function would be independently unit-testable and configurable per customer.

The refactored stack:

authMiddleware → rateLimiter → requestValidator → contentFilter → costEvaluator → hallucinationScorer → policyEnforcer → proxyForward → responseEvaluator → logger
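
That payoff is concrete: each evaluator can be exercised in isolation with a hand-rolled req and next. A rough sketch using node:test, assuming the underlying scoring call is stubbed in test setup (our actual suite may differ):

const { test } = require('node:test');
const assert = require('node:assert');

test('hallucinationScorer records a result and passes control on', async () => {
  const req = {
    body: { prompt: 'hello' },
    customerConfig: {},
    evaluationResults: {},
  };
  let nextCalled = false;

  await hallucinationScorer(req, {}, () => { nextCalled = true; });

  assert.ok(nextCalled, 'middleware should call next()');
  assert.ok('hallucination' in req.evaluationResults);
});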

The refactor was clean. Tests passed. Code review approved. We deployed on March 3rd.


The Bug

Here's the code that shipped for hallucinationScorer:

const hallucinationScorer = async (req, res, next) => {
  try {
    const score = await scoreHallucination(req.body, req.customerConfig);
    req.evaluationResults.hallucination = score;
    next();
  } catch (err) {
    req.evaluationResults.hallucination = { error: err.message };
    next();
  }
};

See the problem? Look at the catch block. When the hallucination scorer throws — network timeout to the scoring model, malformed input, database connection failure — it swallows the error, writes it to the evaluation results object, and calls next() as if nothing happened.

That's not inherently wrong. We wanted the pipeline to be fault-tolerant. A failure in hallucination scoring shouldn't block the customer's API request from reaching their model. The design intention was: score what you can, log failures, keep traffic flowing.

The actual bug was one layer deeper. The responseEvaluator middleware — which runs after the proxy forwards the request — checks req.evaluationResults for violations. But when it found an error property instead of a score property, it silently skipped that evaluator's results. No violation recorded. No error logged. No metric incremented.

// responseEvaluator.js — the broken check
if (results.hallucination && results.hallucination.score > threshold) {
  violations.push({ type: 'hallucination', severity: 'high', ... });
}

When results.hallucination was { error: "Connection timeout" } instead of { score: 0.87, confidence: 0.92 }, the condition results.hallucination.score > threshold evaluated to undefined > 0.7, which is false. No violation. No crash. No log. Just silence.

And this wasn't limited to the hallucination scorer. The same pattern existed in costEvaluator and policyEnforcer. Any evaluation failure silently disappeared.


Why It Was Invisible for 11 Days

Three compounding factors:

1. No error-level logs. The catch blocks didn't log at error level. They wrote to req.evaluationResults, which only appears in our structured event logs — a different pipeline that nobody monitors in real time. Our alerting was wired to Express error handler invocations and explicit console.error calls. Neither fired.

2. The health check passed. Our /health endpoint verifies database connectivity, upstream provider reachability, and middleware chain integrity (it sends a synthetic request through the pipeline). But the synthetic request was well-formed and small — it never triggered the edge cases that caused the scorers to throw. The health check returned 200 throughout the entire incident.

3. Metrics showed "normal" traffic. Request volume, latency, and success rate all looked healthy. The only anomaly was the absence of violation events — which is indistinguishable from "the customer's AI system is behaving well" unless you have a baseline expectation for violation frequency. We didn't.

The customer noticed because they'd been tracking a specific hallucination pattern and expected to see it flagged. When zero violations appeared for 11 days, they investigated. We should have caught it first.


The Fix

Four changes, deployed within 6 hours of the customer report:

1. Explicit error handling in the response evaluator. The evaluation results checker now distinguishes between "evaluator returned a clean score" and "evaluator failed to run." Failed evaluators generate a distinct evaluation_failure event with the error details, separate from the violation pipeline.

// Fixed responseEvaluator.js
for (const [evaluator, result] of Object.entries(results)) {
  // A failed evaluator is a first-class event, not a silent skip.
  if (result.error) {
    logEvaluationFailure(evaluator, result.error, req);
    metrics.increment('evaluation.failure', { evaluator });
    continue;
  }
  // Only clean scores participate in violation detection.
  if (result.score > thresholds[evaluator]) {
    violations.push({ type: evaluator, severity: result.severity, ... });
  }
}

2. Evaluation coverage metric. We now track the percentage of requests that successfully complete all configured evaluators. A drop from 100% to 85% in evaluation.coverage triggers an alert. This would have caught the March 3rd regression within minutes.
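
A sketch of how the metric is derived per request; the config shape and metrics client here are illustrative:

// After the response evaluator runs, compare what completed cleanly
// against what the customer configured.
const configured = Object.keys(req.customerConfig.evaluators);
const succeeded = configured.filter(
  (name) => req.evaluationResults[name] && !req.evaluationResults[name].error
);

metrics.gauge('evaluation.coverage', succeeded.length / configured.length, {
  customer: req.customerId,
});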

3. Synthetic violation injection. Our health check now includes a test request designed to trigger at least one violation (using a known-bad prompt that matches a guardrail rule). If the health check returns zero violations, the check fails. This validates the entire pipeline end-to-end, not just the middleware chain.
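
Sketched out, with the prompt, route, and runPipeline helper as placeholders (existing connectivity checks omitted):

// A prompt seeded to match a known guardrail rule.
const KNOWN_BAD_PROMPT = 'synthetic input that must trip at least one rule';

app.get('/health', async (req, res) => {
  const result = await runPipeline({ body: { prompt: KNOWN_BAD_PROMPT } });

  // Zero violations on a known-bad input means some evaluator is
  // silently broken, so the check fails loudly.
  if (result.violations.length === 0) {
    return res.status(500).json({ status: 'evaluation pipeline degraded' });
  }
  res.json({ status: 'ok' });
});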

4. Error-level logging in catch blocks. Every middleware catch block now calls console.error with structured context in addition to writing to the evaluation results. This feeds our existing alerting pipeline. A single evaluation timeout now generates a visible alert.
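
Applied to the hallucinationScorer from earlier, the pattern looks roughly like this (the structured fields are illustrative):

const hallucinationScorer = async (req, res, next) => {
  try {
    const score = await scoreHallucination(req.body, req.customerConfig);
    req.evaluationResults.hallucination = score;
    next();
  } catch (err) {
    // Still fault-tolerant: traffic keeps flowing. But the failure is
    // now visible to the existing console.error-based alerting.
    console.error('evaluation middleware failed', {
      evaluator: 'hallucinationScorer',
      customer: req.customerId,
      error: err.message,
    });
    req.evaluationResults.hallucination = { error: err.message };
    next();
  }
};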


Lessons

The async middleware trap is subtle. In synchronous Express middleware, an unhandled throw crashes the request — ugly but visible. In async middleware with try/catch, a swallowed error is silent by default. The middleware completes "successfully," next() fires, and downstream code runs against partial or corrupted state. This isn't a bug in Express. It's a design pattern that demands discipline.
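
A side-by-side sketch of the trap:

// Synchronous middleware: an unhandled throw reaches Express's
// error handler, and the caller sees a loud failure.
app.use((req, res, next) => {
  throw new Error('visible: error handler fires, request gets a 500');
});

// Async middleware with a permissive catch: the error vanishes.
// next() fires and downstream code runs against partial state.
app.use(async (req, res, next) => {
  try {
    await somethingThatFails(); // placeholder for any awaited call
  } catch (err) {
    next(); // "success": no log, no error handler, no signal
  }
});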

Fault tolerance requires observability at the fault boundary. Making a system fault-tolerant by swallowing errors is only half the job. The other half is ensuring every swallowed error is observable — logged, metered, and alertable. If you catch an error and don't log it at error level, you've traded a crash for a blind spot.

Test for the absence of expected signals, not just the presence of errors. Our monitoring was built to detect bad things happening. It wasn't built to detect good things not happening. The violation pipeline going silent looked like success. We now monitor for expected event rates and alert when they drop.

Health checks that don't exercise failure paths are theater. A health check that only validates the happy path will happily report 200 while your system is silently broken. Inject known failures. Validate that your system catches them. If the health check can't distinguish "working" from "broken but not crashing," it's not a health check.


What This Means for Your AI Systems

If you're running AI in production behind Express (or any middleware-based framework), audit your async error handling. Look for catch blocks that call next() without logging. Look for downstream code that assumes upstream middleware succeeded because it didn't throw.

These bugs don't show up in your error rate. They show up as missing data — violations you expected to catch but didn't, evaluations that silently failed, guardrails that quietly stopped guarding.

GuardLayer exists to catch exactly these kinds of silent failures. Our monitoring layer tracks evaluation coverage, violation baselines, and pipeline integrity — so a broken middleware function becomes a visible alert, not an 11-day blind spot.

Start monitoring your AI pipeline → Free trial. No DevOps required.