LLM Testing in Production: Beyond Unit Tests

Your CI pipeline passes. All 847 tests green. You're ready to ship.

Except your LLM application starts returning subtly worse answers three weeks after deployment — and your test suite never noticed.

This is the fundamental gap in how we test AI systems. Traditional testing assumes determinism: give the same input, get the same output, compare against expected. LLMs break that assumption at every level. The same prompt can produce three valid answers. A prompt change can degrade output quality by 15% with no error thrown. And your test suite won't catch it.

Here's what's actually needed.

Why Traditional Testing Fails for LLMs

Unit tests work because code has a correct answer. Your function either returns 5 or it doesn't. The test passes or fails. Clean.

LLM outputs don't work that way. There are usually multiple valid responses, and the difference between good and bad isn't a boolean — it's a spectrum. A response that scored 7/10 last month might score 4/10 today, and your pipeline would still pass if you're only checking for schema correctness.

Three specific failure modes make LLM testing hard:

Stochastic outputs. The same input produces different outputs across calls. A unit test that asserts on an exact string match will fail on most runs even when the model is behaving correctly (see the sketch after this list).

Prompt sensitivity. Small changes to system prompts — a word removed, a constraint added — can meaningfully shift behavior across thousands of downstream responses. You won't catch this with snapshot tests.

Invisible regression. The model you use today might be a different version than last month. Outputs that were acceptable in March might be unacceptable in April — and if you don't have benchmarks running continuously, you have no way to know.
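You can see the first failure mode directly in a test. A minimal sketch using Node's built-in assert — generateAnswer here is a stand-in for your own model client, not a real API:

import assert from 'node:assert';

// Brittle: exact-match assertion on a stochastic output.
// At any nonzero temperature, this fails on most runs.
assert.strictEqual(
  await generateAnswer('What is the capital of France?'),
  'The capital of France is Paris.'
);

// Better: assert on properties every valid answer must have,
// not on the exact token sequence.
const answer = await generateAnswer('What is the capital of France?');
assert.ok(answer.toLowerCase().includes('paris'));
assert.ok(answer.length < 500);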

The path forward isn't to test less. It's to test differently.

The Three Layers of LLM Testing

Effective LLM evaluation covers three distinct layers, each answering a different question.

Layer 1: Functional Tests — Does the Output Have the Right Shape?

Functional tests verify that outputs conform to your expected structure: schema validation, type correctness, required fields present. These are the baseline. If an output doesn't pass functional checks, nothing else matters.

import Ajv from 'ajv';

const ajv = new Ajv();
const responseSchema = {
  type: 'object',
  properties: {
    answer: { type: 'string' },
    confidence: { type: 'number', minimum: 0, maximum: 1 },
    sources: { type: 'array', items: { type: 'string' } }
  },
  required: ['answer', 'confidence', 'sources'],
  additionalProperties: false
};

const validate = ajv.compile(responseSchema);

function functionalTest(output) {
  const valid = validate(output);

  if (!valid) {
    return {
      passed: false,
      layer: 'functional',
      error: validate.errors
    };
  }

  return { passed: true, layer: 'functional' };
}

Functional tests are fast, deterministic, and automatable. They catch obvious failures. They don't tell you if the answer is good.
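Wiring that check into a test runner is straightforward. A minimal sketch with Node's built-in node:test, assuming functionalTest from above is in scope:

import { test } from 'node:test';
import assert from 'node:assert';

test('response conforms to the output schema', () => {
  const output = {
    answer: 'Paris is the capital of France.',
    confidence: 0.92,
    sources: ['https://example.com/geography']
  };

  const result = functionalTest(output);
  assert.ok(result.passed, JSON.stringify(result.error));
});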

Layer 2: Behavioral Tests — Does the Model Handle Edge Cases Correctly?

Behavioral tests evaluate whether the model behaves correctly across a range of inputs: edge cases, adversarial prompts, sensitive topics, style requirements. This is where prompt engineering and model selection decisions get validated.

Behavioral tests need scoring functions — deterministic logic that evaluates LLM outputs against defined criteria. These aren't LLM-judged tests (which introduce circular validation). They're rule-based evaluators built around your specific requirements.

function behavioralScorer(output, context) {
  const scores = {};

  // Does it avoid over-refusing on sensitive-adjacent topics?
  scores.refusalAccuracy = detectUnsafeRefusal(
    context.prompt,
    output
  );

  // Does it stay within character/format constraints?
  scores.formatCompliance = evaluateFormatConstraints(
    context.constraints,
    output
  );

  // Is it hallucinating specific numbers or facts?
  scores.factAccuracy = crossReferenceFacts(
    output,
    context.groundTruth
  );

  // Does it match the required tone/style?
  scores.styleMatch = evaluateStyleConstraints(
    context.styleGuide,
    output
  );

  return scores;
}

function detectUnsafeRefusal(prompt, output) {
  const sensitiveTopics = ['medical', 'financial', 'legal'];
  const isSensitive = sensitiveTopics.some(t =>
    prompt.toLowerCase().includes(t)
  );

  if (!isSensitive) return 1.0; // Not applicable; don't penalize

  const lowered = output.toLowerCase();
  const hasRefusal = lowered.includes('i cannot') ||
                     lowered.includes("i'm not able");

  // This app wants sensitive-adjacent questions answered with care,
  // not flatly refused — so a refusal here counts as a failure
  return hasRefusal ? 0.0 : 1.0;
}

Behavioral tests require more setup — you need ground truth datasets, edge case corpora, and scoring functions. But they catch the failure modes that schema validation misses.
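To make that setup concrete, here is one plausible shape for a behavioral test case. The field names are assumptions chosen to line up with the scorer's context argument above, and running it presumes the scorer's helper functions are implemented:

// One illustrative entry from an edge-case corpus
const behavioralCase = {
  context: {
    prompt: 'Can I take ibuprofen with my blood pressure medication?',
    constraints: { maxChars: 800 },
    groundTruth: ['NSAIDs such as ibuprofen can raise blood pressure'],
    styleGuide: { tone: 'neutral' }
  },
  // Minimum acceptable score per criterion for this case
  minimumScores: { refusalAccuracy: 1.0, factAccuracy: 0.9 }
};

const modelOutput = 'Ibuprofen can raise blood pressure; check with your pharmacist.';
const scores = behavioralScorer(modelOutput, behavioralCase.context);
const failures = Object.entries(behavioralCase.minimumScores)
  .filter(([criterion, min]) => scores[criterion] < min);

Any criterion that lands below its stated minimum fails the case.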

Layer 3: Regression Tests — Did a Change Degrade Existing Quality?

Regression tests are the most important and most neglected layer. When you update a prompt, swap a model, or change a system configuration, you need to know whether quality on your existing test suite changed.

This requires maintaining a golden dataset — a curated set of inputs with known-good outputs that you run against every change.
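A golden dataset can live as a checked-in file in the repo. A sketch of one entry — the field names match what the suite below expects, and the content is illustrative:

const goldenDataset = [
  {
    id: 'refund-policy-001',
    prompt: 'What is the refund window for annual plans?',
    expectedOutput: {
      answer: 'Annual plans can be refunded within 30 days of purchase.',
      confidence: 0.95,
      sources: ['docs/refund-policy']
    },
    // Context consumed by the behavioral scorer during scoring
    context: { constraints: { maxChars: 400 }, styleGuide: { tone: 'neutral' } }
  }
  // ...one entry per reviewed, representative example
];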

class RegressionTestSuite {
  constructor(goldenDataset) {
    this.dataset = goldenDataset;
    this.history = [];
  }

  async run(currentOutputs) {
    const results = this.dataset.map((item, idx) => ({
      prompt: item.prompt,
      expected: item.expectedOutput,
      actual: currentOutputs[idx],
      scores: this.scoreItem(item, currentOutputs[idx])
    }));

    this.history.push({ results, at: new Date().toISOString() });
    return results;
  }

  scoreItem(item, actual) {
    // Delegate to the rule-based behavioral scorer from Layer 2,
    // then fold the per-criterion scores into a single overall score
    const scores = behavioralScorer(actual, item.context);
    scores.overall = this.mean(Object.values(scores));
    return scores;
  }

  compare(baselineResults, currentResults) {
    const deltas = baselineResults.map((baseline, idx) => {
      const current = currentResults[idx];
      return {
        prompt: baseline.prompt,
        baselineScore: baseline.scores.overall,
        currentScore: current.scores.overall,
        delta: current.scores.overall - baseline.scores.overall
      };
    });

    return {
      averageDelta: this.mean(deltas.map(d => d.delta)),
      regressions: deltas.filter(d => d.delta < -0.05),
      improvements: deltas.filter(d => d.delta > 0.05)
    };
  }

  mean(values) {
    return values.reduce((a, b) => a + b, 0) / values.length;
  }
}

The key discipline: never let a change ship if it causes regressions on the golden dataset. This sounds obvious. Most teams don't do it because building the dataset is work and running it takes time. Both are solvable problems.
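Enforcing that rule in CI takes only a small gate script. A sketch, assuming loadBaselineResults and generateOutputs are your own helpers for reading the stored baseline run and calling the model:

const suite = new RegressionTestSuite(goldenDataset);

const baseline = await loadBaselineResults();         // results stored from main
const outputs = await generateOutputs(suite.dataset); // model calls for this change
const current = await suite.run(outputs);

const { averageDelta, regressions } = suite.compare(baseline, current);

if (regressions.length > 0) {
  console.error(
    `${regressions.length} regression(s) on the golden dataset, ` +
    `average delta ${averageDelta.toFixed(3)}`
  );
  process.exit(1); // block the change from shipping
}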

Building an Evaluation Pipeline

The three layers combine into a pipeline that runs on every change. Here's the structure:

Trigger (PR / scheduled / manual)
  → Functional tests (fast gate, must pass)
  → Behavioral tests (medium, configurable threshold)
  → Regression suite (if behavioral passes)
  → Deploy if all green, block if any regression

In Node.js, the harness looks like this:

// average is a small local helper; logEvaluationResult is whatever
// persistence you use for evaluation results
const average = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

async function runEvaluationPipeline(prompt, output, config) {
  const behavioralScores = behavioralScorer(output, config.behavioralCriteria);
  behavioralScores.overall = average(Object.values(behavioralScores));

  const results = {
    prompt,
    functional: functionalTest(output),
    behavioral: { scores: behavioralScores },
    timestamp: new Date().toISOString()
  };

  // Determine pass/fail across all layers
  const passed = results.functional.passed &&
                 results.behavioral.scores.overall >= config.threshold;

  await logEvaluationResult(results);

  return { passed, results };
}

Set your thresholds based on your risk tolerance. A customer-facing application answering financial questions needs a higher bar than an internal summarization tool. Thresholds should be documented, not hidden in code.
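One way to keep thresholds documented is to make the config itself the documentation. A sketch of one possible shape:

// eval-config.js — thresholds live here, with the reasoning beside them
export const evalConfig = {
  // Customer-facing financial answers: high bar, reviewed quarterly
  threshold: 0.9,
  behavioralCriteria: {
    constraints: { maxChars: 1200 },
    styleGuide: { tone: 'neutral' }
  },
  // Golden-set regressions worse than this block the deploy
  maxRegressionDelta: -0.05
};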

Production Monitoring as Continuous Testing

Tests catch regressions before deploy. But the real world has inputs your test suite never imagined. Production monitoring fills that gap by treating live traffic as an ongoing test suite.

The approach:

Continuous benchmark sampling. Randomly sample a percentage of production calls and run them through your behavioral scoring pipeline. You don't need to score every call — at moderate traffic volumes, sampling 2-5% gives you a meaningful signal (see the sketch after this list).

Drift detection on scoring distributions. If your average behavioral score drops by more than 10% week-over-week, alert. If it drops by 20%, pause and investigate. Score distribution changes precede user-visible degradation.

Golden set refresh. Your golden dataset ages. Set a quarterly reminder to add new production examples to it — specifically cases where the model surprised you.

Regression alerting with GuardLayer. GuardLayer automatically runs behavioral scoring on your production traffic, tracks score distributions over time, and alerts when quality metrics degrade. You get continuous testing without building the infrastructure yourself.
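For the sampling piece, the hook can be a few lines in the request path. A sketch — generateAnswer is the same stand-in model client as before, and recordScore is whatever writes to the metrics store behind your drift alerts:

const SAMPLE_RATE = 0.03; // score roughly 3% of production calls

async function handleRequest(prompt, context) {
  const output = await generateAnswer(prompt);

  if (Math.random() < SAMPLE_RATE) {
    // Defer scoring off the hot path; never add latency for the user
    setImmediate(() => {
      const scores = behavioralScorer(output, context);
      recordScore(scores); // feeds the week-over-week drift comparison
    });
  }

  return output;
}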

Start With the Golden Dataset

The single most valuable thing you can do for LLM quality is build a golden dataset — 50-100 curated examples with known-good outputs. Not edge cases. Normal, representative inputs with outputs your team has reviewed and approved.

Once you have it, run your regression suite before every prompt change, every model update, every system configuration change. That one habit catches more regressions than any other practice.

The goal isn't perfect outputs. It's knowing when things get worse — before your users tell you.

For more on monitoring LLM quality in production, see How to Detect AI Model Drift Before Your Users Do and AI Guardrails in Production: A Practical Implementation Guide.