The problem
The client's product had become an LLM-routing company by accident. Three years of "ship the feature, worry about cost later" had produced a fleet of microservices, each owning its own provider key, its own retry logic, and its own opinions about which model to use for what. Spend was up and to the right. Per-tenant cost was a guess. The CFO asked a reasonable question — how much does customer X cost us? — and nobody could answer it.
What we built
A single inference platform that every product team calls instead of the model providers directly. The platform owns:
- Routing. A request comes in tagged with
route+tenant+priority. The platform decides between self-hosted (vLLM/Triton) and provider APIs based on a per-route policy file. Policies live in a Postgres table; changes take effect in seconds. - Continuous batching. Self-hosted endpoints run vLLM with continuous batching. Same hardware as before, 3.4× throughput measured on the customer's actual traffic mix.
- Semantic and exact caching. Two layers. An exact-match cache for completion-level idempotency (re-runs, retries). A semantic cache for paraphrase hits on a curated allow-list of cacheable routes. Hit-rate dashboards are a release blocker.
- KV-cache reuse. For multi-turn or long-context routes, the platform pins the prefix on a sticky shard. Tail latency on long-context routes dropped 41% from this one change.
- Cost attribution. Every request emits a structured record to ClickHouse with
tenant,route,model,tokens_in,tokens_out,cache_hit,cents. The CFO's dashboard is one SQL view away.
The router policy that earned its keep
The cheap part of inference cost is not the model's per-token price. It's the requests you don't make and the tokens you don't generate. The router enforces, per-route:
- Token budgets.
max_tokensis not a suggestion; the router caps it. - Model fallbacks. If the primary returns a rate-limit, the router tries a configured fallback chain. No retries against the same overloaded endpoint.
- Cache eligibility. Routes opt in to caching by name. Default is off; we'd rather over-spend than serve a stale answer to a feature where freshness matters.
- Frontier-only escalation. A small "escalate to frontier" annotation in the request body lets a route opt into the expensive model only for the specific cases that need it.
Observability the on-call engineer can use at 3 a.m.
Three dashboards, each linked from the alert payload:
- Health. Per-route p50/p95/p99 latency, error rate, cache hit rate, queue depth.
- Cost. Spend by tenant, route, model, hour. Anomaly bands on each.
- Quality. Sampled-output evals against the tenant's eval suite, scored hourly. Drift alerts gate a paging policy.
Every alert that fires has a runbook entry. Every runbook entry has been replayed in a chaos drill.
How we got the migration done
We didn't ship the platform and walk away. We pair-programmed the migration with each owning team, one route at a time, starting with the cheapest and noisiest. By the third route, the platform team had a stack of patterns they reused without us.
The platform also has a strict deprecation rule: a direct provider API key is a sev-3 finding in the SOC checklist. Forcing teams off direct keys was the unfun part of the engagement; doing it route by route with measurable wins kept it survivable.
Outcome
- 71% inference cost per request at the steady state
- 4.1× traffic absorbed on the same hardware footprint (no GPU expansion through year-end)
- Cost-per-tenant dashboard live in 6 weeks; quarterly board pack now uses it
- p99 latency dropped from 4.2s to 2.1s across the migrated routes
- Zero direct provider keys in production after the migration completed
A four-person team at the client owns the platform now. We rotated off after a 60-day stabilisation window and a chaos-drill handoff. The CFO got their first answer to "how much does customer X cost us?" in week seven.