2026-06-29 · 10 min read

Stop Buying Benchmarks: Build an Honest LLM Evaluation Rig

By Diogo Hudson Dias


Another day, another leaderboard declaring a new number-one model. One post claims GLM beats Claude in their shop. Another dev shows their resume swinging from 90 to 74 to 88 when run through an open-source ATS. These aren’t contradictions. They’re a reminder that without a workload-true evaluation rig, you’re buying marketing. As a CTO, you need a benchmark that answers one question only: which model wins for your exact mix of tasks, latencies, and costs?

What vendor leaderboards miss (and why you pay for it)

Public leaderboards optimize for spectacle, not your SLOs. They rarely capture:

Distribution mismatch: Your prompts aren’t 3-sentence trivia. You run 300–2000 token contexts, mixed tool-calls, and strict JSON returns. Leaderboards average over tasks that don’t resemble yours.
Safety and refusal rates: The fastest model is worthless if it refuses 7% of your customer support prompts or silently drops required fields.
Tail latency under load: p95 and p99 explode first. Your customers feel the tail, not the mean.
Evaluator fragility: The “judge” model flips verdicts with tiny prompt changes (see the whiplash ATS scoring anecdotes). If your scorer isn’t robust, your rankings aren’t either.
Operational realities: Rate limits, transient 429/5xx spikes, API routing jitter, and SDK bugs (remember the recent bug in a popular HTTP library) skew results unless you build guardrails.

Stop renting other people’s assumptions. Build a small, honest, repeatable benchmark that reflects your world.

The five things you must measure (in this order)

Task success rate (quality): Exact match for structured outputs; programmatic acceptance for code or extraction; calibrated pairwise judging for open-ended answers.
p95 latency at target RPS: Under your concurrency and region. Warm-ups lie; measure steady-state.
Cost per resolved item: The dollar cost of input tokens, output tokens, and retries, divided by successful completions. This is the number finance remembers.
Failure/refusal rate: JSON schema breaks, tool-call misses, policy refusals, and network/API errors.
Variance/drift: Run-replay deltas and weekly drift. Don’t crown a champion that’s great only on Tuesdays.
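The first four metrics fall out of a flat list of per-call records. Here is a minimal sketch, assuming a hypothetical CallRecord shape (the field names are illustrative, not a standard) and a nearest-rank percentile:

```python
import math
from dataclasses import dataclass


@dataclass
class CallRecord:
    # Hypothetical record shape; adapt the fields to your harness.
    ok: bool           # passed the task's scorer
    failed: bool       # schema break, tool-call miss, refusal, or API error
    latency_ms: float
    cost_usd: float    # input + output tokens, retries included


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records):
    """Success rate, p95 latency, cost per resolved item, failure rate."""
    n = len(records)
    resolved = sum(r.ok for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "success_rate": resolved / n,
        "p95_latency_ms": percentile([r.latency_ms for r in records], 95),
        # All spend, including failed calls, is charged to the successes.
        "cost_per_resolved_usd": total_cost / resolved if resolved else float("inf"),
        "failure_rate": sum(r.failed for r in records) / n,
    }
```

Drift is then the delta between two such summaries from different runs of the same dataset.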

Your minimal viable LLM benchmark (MVB)

1) Curate a dataset that matches production

Size: 1,000–2,000 items is enough to separate contenders without bankrupting you.
Mix: Mirror your top 3–5 use cases. Example: 40% classification/extraction (JSON), 40% agent/tool-calls, 20% long-form reasoning. Include length bins: short (≤128 tokens), medium (129–512), long (513–2048+).
Languages/regions: If you serve the Americas, include English, Spanish, and Brazilian Portuguese. Even frontier models can drop 5–15% on non-English tasks.
Avoid contamination: No public benchmarks the models likely trained on. Use internal tickets, synthetic-but-validated variants, or time-sliced data captured after the model’s known cutoff.

2) Build a reliable scorer

Structured tasks: Validate JSON with a strict schema; use regexes for IDs; run code tests for codegen. No human in the loop needed.
Open-ended tasks: Use pairwise Elo with three independent judge models and tie-breakers. Never let a candidate judge itself. Fix the judging prompt and record its hash. Re-run 10% of items each week to detect judge drift.
Human calibration: Sample 100 items and have two domain experts grade them blindly. Compute inter-annotator agreement (Cohen’s kappa) between the experts, then between the judge and the human labels. The judge should reach kappa ≥ 0.6 against humans, or you’re grading noise.
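Both scoring primitives named above are small enough to write by hand. The sketch below uses the standard Elo update and the standard Cohen's kappa formula. The K-factor of 32 is an assumption (a common default), not something this rig prescribes.

```python
from collections import Counter


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """One pairwise Elo update. score_a is 1.0 (A wins), 0.5 (tie), 0.0 (B wins)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters on the same items, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Run cohens_kappa once on the two experts and once on judge versus human labels; the second number is the one that gates whether the judge is usable.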

3) Instrument for latency and load

Concurrency sweeps: Test at 1, 4, 16, and 64 concurrent requests. Pin client instances per model to avoid cross-model interference.
Warm-up, but don’t cheat: Send a 30–60s warm-up burst, then record for at least 10 minutes at steady RPS. Customers don’t live in your warm-up window.
Measure tail and stability: Record p50/p95/p99, timeout %, and retry counts. Tail tells you where pagers come from.
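A concurrency sweep can be sketched with asyncio and a semaphore. The code below assumes you supply `call`, a hypothetical async function that sends one item to one model; nothing here is tied to a specific vendor SDK. Latency is timed after the semaphore is acquired, so it reflects the call itself rather than client-side queueing.

```python
import asyncio
import time


async def _timed_call(call, item, sem, timeout_s):
    async with sem:
        start = time.perf_counter()
        try:
            await asyncio.wait_for(call(item), timeout=timeout_s)
            timed_out = False
        except asyncio.TimeoutError:
            timed_out = True
        return (time.perf_counter() - start) * 1000.0, timed_out


async def run_at_concurrency(call, items, concurrency, timeout_s=30.0):
    """Run all items with at most `concurrency` in flight.

    Returns (latencies_ms of completed calls, timeout rate).
    """
    items = list(items)
    sem = asyncio.Semaphore(concurrency)
    results = await asyncio.gather(
        *(_timed_call(call, item, sem, timeout_s) for item in items)
    )
    latencies = [ms for ms, timed_out in results if not timed_out]
    timeouts = sum(timed_out for _, timed_out in results)
    return latencies, timeouts / len(items)


async def sweep(call, items, levels=(1, 4, 16, 64)):
    """Repeat the run at each concurrency level from the sweep."""
    report = {}
    for level in levels:
        latencies, timeout_rate = await run_at_concurrency(call, items, level)
        report[level] = {"latencies_ms": latencies, "timeout_rate": timeout_rate}
    return report
```

This is a burst-style sketch. For steady-state numbers you would feed items at a fixed rate for the full recording window and discard the warm-up portion before computing percentiles.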

4) Make it reproducible

Hermetic env: Containerize the harness. Pin dependencies. If you’re on Python, lock with uv; if on Node, lock with a reproducible installer. Don’t ship “latest”.
Idempotent requests: Hash (model, prompt template, inputs, temperature, tools) to create a cache key. Cache raw responses with headers. On reruns, score quality from cache hits, but never report a cached call’s latency; take timings only from live calls. That lets you replay quality cheaply and measure latency honestly.
Observability: Emit OpenTelemetry traces with attributes for model, template, token counts, retry attempts, and error class. Keep traces for at least 30 days.
HTTP hardening: Use exponential backoff, jitter, and circuit breakers. Implement a dead-letter queue for pathological items so one bad prompt doesn’t stall the run. And yes—bugs in popular HTTP clients happen; pin versions and monitor for regressions.
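Two of the pieces above fit in a few lines each: the cache key and the backoff delay. This sketch assumes the inputs and tool definitions are JSON-serializable, and it uses "full jitter" (a uniform draw up to the exponential cap) as one reasonable jitter strategy among several. The function names and default values are illustrative.

```python
import hashlib
import json
import random


def cache_key(model, prompt_template, inputs, temperature, tools):
    """Stable SHA-256 over everything that determines the response."""
    payload = {
        "model": model,
        "prompt_template": prompt_template,
        "inputs": inputs,
        "temperature": temperature,
        "tools": tools,
    }
    # sort_keys makes the key independent of dict insertion order.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def backoff_delay(attempt, base_s=0.5, cap_s=30.0, rng=random):
    """Exponential backoff with full jitter; attempt starts at 0."""
    return rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))
```

If you vary other sampling settings between runs, add them to the payload so two different configurations never share a cache entry.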

5) Control non-determinism

Sampling settings: Fix temperature, top_p, and presence penalties. If the API supports a seed, set it; if not, use n=3 samples and take majority for classification or best-of by judge for generation.
Template discipline: Freeze prompt templates and tool schemas. Record their SHA-256 in results. A two-word tweak can move your win-rate by 3–7%—log it.
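For classification, the n=3 fallback reduces to a majority vote. A minimal sketch, assuming that a run with no strict majority should be treated as unresolved rather than broken by an arbitrary tie-break:

```python
from collections import Counter


def majority_label(samples):
    """Return the label chosen by more than half the samples, else None."""
    label, count = Counter(samples).most_common(1)[0]
    return label if count > len(samples) / 2 else None
```

For example, majority_label(["refund", "billing", "refund"]) resolves to "refund", while three different labels return None and should count against the model's success rate.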

Run plan: how to compare 4–6 models in a week

Day 1–2: Finalize dataset and scorer; lock templates and schemas. Dry run with a cheap baseline model to validate metrics and cost.
Day 3: Concurrency calibration. For each model, sweep concurrency to find the highest RPS that holds p95 under your SLO (e.g., 700 ms for classification, 2–6 s for long-form). Record the knee-of-the-curve where timeouts appear.
Day 4–5: Full evaluation. 1,500 items × 4 models × 3 samples = 18,000 calls. Stagger models in a Latin-square schedule to avoid time-of-day effects. Use separate API keys and regions per model where possible.
Day 6: Analysis and ablations. Re-run the top two models on 300 edge cases and with a second judge model to confirm ranking stability.
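The Latin-square schedule for Day 4–5 can be generated with a cyclic shift. In this sketch each row is a time slot and each column is the position a model runs in during that slot, so every model occupies every position exactly once. The model names in the usage note are placeholders.

```python
def latin_square_schedule(models):
    """Cyclic Latin square: row i is the run order for time slot i."""
    n = len(models)
    return [
        [models[(slot + position) % n] for position in range(n)]
        for slot in range(n)
    ]
```

With four models, latin_square_schedule(["model_a", "model_b", "model_c", "model_d"]) yields four slots; the first runs a, b, c, d and the second runs b, c, d, a, and so on.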

Cost math that won’t surprise finance

Don’t guess; budget with a simple formula:

Total tokens = items × samples × (avg prompt tokens + avg completion tokens)
Gross cost = (input tokens × input $/M) + (output tokens × output $/M)
Effective cost per resolved item = gross cost ÷ successful items

Example: 1,500 items, 3 samples, 800-token prompts, 300-token outputs ⇒ 1,500 × 3 × (800 + 300) = 4.95M tokens, of which 3.6M are input and 1.35M are output. At $2/M input and $8/M output (illustrative), cost = (3.6M × $2/M) + (1.35M × $8/M) = $7.20 + $10.80 = $18.00 for one model, before judge calls and retries. Long contexts, agent preambles, and multi-judge scoring multiply that quickly; if the total stings, reduce samples to n=2 and items to 1,000 for the first pass, then expand for finalists.
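The budget formula is easy to keep honest in code. A small sketch, where the function and parameter names are illustrative and the count of successful items is something you supply from your scorer:

```python
def eval_cost(items, samples, prompt_tokens, completion_tokens,
              input_usd_per_m, output_usd_per_m, successful_items):
    """Gross cost and effective cost per resolved item for one model."""
    calls = items * samples
    input_tokens = calls * prompt_tokens
    output_tokens = calls * completion_tokens
    gross = (input_tokens / 1e6) * input_usd_per_m \
        + (output_tokens / 1e6) * output_usd_per_m
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "gross_usd": gross,
        "usd_per_resolved": gross / successful_items if successful_items else float("inf"),
    }
```

Call it with the example inputs (1,500 items, 3 samples, 800 and 300 tokens, $2/M and $8/M) and your own success count to get the per-model figure, then multiply by the number of candidates and judges.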

The point isn’t the absolute number; it’s discipline. Know your tokens, know your exposure, and never run blind.

Reading results: how to pick winners you can ship

Segment by task, not vibes

Strict JSON tasks: Prefer models with the lowest schema break rate even if they’re 10–15% slower. Broken JSON cascades into retries and agent stalls.
Tool-heavy agents: Measure tool-call accuracy and chain step count. A model that takes 1.2 fewer steps on average often beats a counterpart that streams tokens faster.
Reasoning/long context: Rank by judge win-rate at fixed latency and by cost per passing answer. If a smaller model delivers 92% of the quality at 40% of the cost, it’s the production pick today, with an upgrade path later.

Cut lines you can live with

Latency SLOs: p95 ≤ 700 ms for classification, ≤ 2.5 s for extraction/tool-turns, ≤ 6 s for long-form. Tune to your UX.
Stability: Weekly drift in win-rate ≤ 3% and schema break deltas ≤ 1%. If a vendor’s p95 drifts 15% week-to-week, dual-source or drop.
Cost ceiling: Set a max $ per 1,000 items per task type. Back-solve to token budgets per call so product doesn’t panic later.
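Back-solving from a cost ceiling to a per-call token budget is simple arithmetic. The sketch below assumes a fixed prompt size and a known number of calls per item (samples plus expected retries); all names and the example prices are illustrative.

```python
import math


def max_output_tokens_per_call(ceiling_usd_per_1k_items, calls_per_item,
                               prompt_tokens, input_usd_per_m, output_usd_per_m):
    """Largest completion budget per call that stays under the ceiling."""
    budget_per_call = ceiling_usd_per_1k_items / 1000 / calls_per_item
    input_cost = prompt_tokens * input_usd_per_m / 1e6
    remaining = budget_per_call - input_cost
    if remaining <= 0:
        return 0  # the prompt alone already exceeds the ceiling
    # Small epsilon guards against float round-off just below an integer.
    return math.floor(remaining / (output_usd_per_m / 1e6) + 1e-9)
```

For instance, a $5 ceiling per 1,000 items with one call per item, 800-token prompts, and $2/M in, $8/M out leaves room for 425 output tokens per call.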

Pitfalls that invalidate 60% of internal benchmarks

Judging with the candidate model: It flatters itself. Always isolate the judge; rotate it quarterly.
One-shot runs: Non-determinism plus transient API variance will reorder your leaderboard. Replay 20% of items and report confidence intervals.
Warm-up illusions: Measuring only right after cache warm-ups yields fantasy p95s. Hold steady load for minutes, not seconds.
Template churn: If product edits the prompt mid-run, you’re measuring the template, not the model.
Hidden pre-token burn: Agent runtimes and MCP-style setups can spend tens of thousands of tokens before the first user token is processed. Instrument preambles and system prompts explicitly, or you’ll undercount by 10–30%.
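For the confidence intervals on replayed items, a percentile bootstrap is one simple option. The sketch below assumes outcomes are coded 1 for a win or pass and 0 otherwise; the resample count and seed are arbitrary defaults.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a mean of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = max(0, round(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]
```

If the intervals of two models overlap heavily, treat their ranking as unsettled and replay more items before crowning either.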

Governance: logs, privacy, and repeatability

Evidence-grade logs: Keep request/response payload digests, token counts, and judge decisions for 90 days. You’ll need them when product asks why you switched models—or when legal asks what the AI actually did.
PII hygiene: If you use production data, scrub or synthesize. Avoid leaking customer secrets to third-party APIs. We’ve published elsewhere on building ephemeral AI patterns; apply them here too.
Release artifact: Treat the winner as a versioned package: model name, API region, prompt template hash, tool schema version, and sampling settings. This is your rollback unit.

A pragmatic toolchain

Harness: Python + uv or Go for determinism. Avoid sprawling notebooks.
Data/metrics: Parquet for results, DuckDB or SQLite for analysis. Keep it queryable on a laptop.
Judging: Use a single judge prompt stored in Git. Optionally add a second judge for finalists.
Orchestration: A simple work queue (Celery, Sidekiq, or a lightweight Go worker) beats overengineered pipelines. Add a dead-letter topic.
Observability: OpenTelemetry spans with attributes: model, template_hash, tokens_in, tokens_out, retries, status, latency_ms.
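To show how little tooling the analysis side needs, here is a sketch using the standard-library sqlite3 module with the span attributes listed above as columns. The table name and the status value 'ok' are assumptions; use whatever status vocabulary your harness emits.

```python
import sqlite3

SCHEMA = """
CREATE TABLE results (
    model TEXT, template_hash TEXT, tokens_in INTEGER, tokens_out INTEGER,
    retries INTEGER, status TEXT, latency_ms REAL
)
"""

# One row per model: call count, success rate, token totals, retries.
PER_MODEL_SQL = """
SELECT model,
       COUNT(*)            AS calls,
       AVG(status = 'ok')  AS success_rate,
       SUM(tokens_in)      AS tokens_in,
       SUM(tokens_out)     AS tokens_out,
       SUM(retries)        AS retries
FROM results
GROUP BY model
ORDER BY model
"""


def load_results(rows):
    """Load result tuples into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn
```

Running conn.execute(PER_MODEL_SQL).fetchall() on the loaded data gives the per-model topline for a weekly review without leaving a laptop.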

When (and how) to distill or fine-tune after you benchmark

Once your rig shows a stable winner for a task, consider a cost-down path:

Prompt-only optimization: Usually worth 5–15% in quality and 10–20% fewer tokens if you tighten instructions, exemplars, and schema hints. Re-benchmark to confirm.
Fine-tune small models: If your data is proprietary and narrow, a 7–14B parameter model fine-tuned on a few thousand labeled items can match 80–90% of a frontier model at a fraction of the cost and latency. Your rig will tell you if it’s close enough.
Knowledge distillation: Use your best model as a teacher to label data for a smaller student. Keep an eye on legal and policy boundaries; benchmark the student with the same rig so you don’t ship regressions.

Why this matters for nearshore and hybrid teams

If you run distributed pods (onshore + nearshore), your benchmark becomes the lingua franca. It encodes decisions in code and numbers, not in Slack threads. A Brazil-based pod can run the same harness overnight, report deltas in the morning, and you agree on a promotion or rollback with evidence. Expect 6–8 hours of daily overlap for discussion, and asynchronous runs at scale when you sleep.

What good looks like in practice

Weekly 30-minute review: Topline from last run: win-rate by task, p95 by RPS tier, cost per resolved item, drift since last week, notable regressions.
Promotion criteria: A model graduates when it outperforms current prod by ≥5% on success rate, meets p95 SLOs, and cuts cost per resolved item by ≥10%—for two consecutive runs.
Rollback discipline: If drift exceeds 3% or p95 blows past SLO for 2 hours, automatically fail back to prior winner. Your artifact versioning makes this trivial.
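The promotion rule can be encoded as a gate so nobody relitigates it in a meeting. This sketch assumes each run is summarized as a dict with hypothetical keys success_rate, p95_ms, and cost_per_resolved, and it reads "≥5% on success rate" as percentage points; switch to a relative comparison if that is what your team means.

```python
def run_passes(prod, candidate, p95_slo_ms,
               min_success_gain=0.05, min_cost_cut=0.10):
    """True if the candidate clears all three promotion bars in one run."""
    return (
        candidate["success_rate"] - prod["success_rate"] >= min_success_gain
        and candidate["p95_ms"] <= p95_slo_ms
        and candidate["cost_per_resolved"]
        <= prod["cost_per_resolved"] * (1 - min_cost_cut)
    )


def should_promote(run_pairs, p95_slo_ms):
    """run_pairs: chronological list of (prod_summary, candidate_summary).

    Promote only if the two most recent runs both pass.
    """
    last_two = run_pairs[-2:]
    return len(last_two) == 2 and all(
        run_passes(prod, cand, p95_slo_ms) for prod, cand in last_two
    )
```

A single passing run returns False by design: the second consecutive pass is what filters out a lucky Tuesday.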

The unpopular truth

The model “best” on someone else’s chart may be your third-place finisher. That’s fine. Your customers don’t use benchmarks; they use your product. Build the rig once, and you’ll stop arguing about vibes, stop overpaying for tokens you don’t need, and start shipping upgrades with confidence.

Key Takeaways

Public leaderboards rarely reflect your prompts, latencies, or costs—build a workload-true benchmark.
Measure five things: success rate, p95 at target RPS, cost per resolved item, failure/refusal rate, and drift.
Use strict schemas for structured tasks and pairwise multi-judge Elo for open-ended ones; calibrate with humans.
Run concurrency sweeps, steady-state loads, and replay 20% of items to control non-determinism.
Version everything: model, region, prompt hash, tool schema, and sampling settings—this is your rollback unit.
Use the rig to justify prompt ops, small-model fine-tunes, or distillation when they beat frontier models on cost per resolved item.