The problem
The client had been running a homegrown chatbot for two years. Ticket volume wasn't dropping, and CSAT fell in the cohorts that interacted with it. Their team was stuck between "ship a real LLM" and "stop hallucinating about our pricing."
The ask was narrower than it looked: deflect L1 tickets (password resets, billing FAQ, integration setup) without ever inventing policy.
What we built
A retrieval-grounded copilot with a hard constraint: no answer without a cited source from approved docs.
The architecture leans heavily on the model for understanding, and not at all for knowledge:
- Query understanding — classify intent, decide whether to retrieve, decide whether to escalate. Small, fast, and tunable.
- Retrieval — semantic + keyword hybrid over the help-center corpus, with a freshness prior so newly published docs are surfaced fast.
- Constrained generation — the model is given retrieved passages and instructed to answer only from them, with citation IDs inline. A post-processor strips any paragraph that lacks a verifiable citation.
- Honest fallback — when no passage clears a confidence threshold, the bot says so plainly and offers to open a ticket with context pre-filled.
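The last two stages can be sketched in a few lines. This is a minimal illustration, not the production code: the `[doc:<id>]` citation format, the `Passage` type, the 0.6 threshold, and the fallback wording are all assumptions made for the example. It also takes the stricter reading of "verifiable", where every ID cited in a paragraph must belong to a passage that cleared the threshold.

```python
import re
from dataclasses import dataclass

# Hypothetical inline citation format: [doc:<id>]. The real marker syntax is not specified.
CITATION = re.compile(r"\[doc:([A-Za-z0-9_-]+)\]")

# Hypothetical fallback wording.
FALLBACK = (
    "I couldn't find this in our approved docs. "
    "I can open a ticket with the details you've shared."
)


@dataclass
class Passage:
    doc_id: str
    text: str
    score: float  # retrieval confidence, assumed to lie in [0, 1]


def strip_uncited(answer: str, allowed_ids: set[str]) -> str:
    """Drop every paragraph that has no citation or cites an ID outside allowed_ids."""
    kept = []
    for para in answer.split("\n\n"):
        cited = set(CITATION.findall(para))
        # Keep the paragraph only if it cites something and every cited ID is allowed.
        if cited and cited <= allowed_ids:
            kept.append(para.strip())
    return "\n\n".join(kept)


def respond(draft: str, passages: list[Passage], threshold: float = 0.6) -> str:
    """Return the cleaned draft, or the fallback when nothing verifiable survives."""
    confident = {p.doc_id for p in passages if p.score >= threshold}
    if not confident:
        return FALLBACK  # no passage cleared the threshold
    return strip_uncited(draft, confident) or FALLBACK
```

In a real pipeline the threshold check would run before generation, so the model is never called when retrieval comes back empty; the draft is passed in here only to keep the sketch self-contained.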
Safety, not "guardrails"
We refused to ship "guardrails" as a layer of regex. Safety was enforced structurally:
- The model never sees pricing or billing data. Those answers come from a deterministic FAQ retrieval path with hardcoded responses.
- Account-specific questions (entitlements, current plan) are handled by a tool call to the client's API. The LLM doesn't answer them; it routes.
- Every response is logged with its retrieval set for auditability. We can replay any conversation.
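One way to picture "structural" enforcement is a router that decides which path may answer before any generation happens, plus an audit record that stores the retrieval set alongside the response. The sketch below is illustrative only: the intent labels, route names, and `AuditRecord` fields are hypothetical, and the intent is assumed to come from the upstream query-understanding step.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Route(Enum):
    FAQ = "deterministic_faq"      # hardcoded responses, no model involved
    ACCOUNT_TOOL = "account_tool"  # tool call to the client's API
    GROUNDED_LLM = "grounded_llm"  # retrieval-grounded generation


# Hypothetical intent labels; the real label set is not described here.
PRICING_INTENTS = {"pricing", "billing_faq"}
ACCOUNT_INTENTS = {"entitlements", "current_plan"}


def route(intent: str) -> Route:
    """Pick the answering path from the classified intent alone."""
    if intent in PRICING_INTENTS:
        return Route.FAQ
    if intent in ACCOUNT_INTENTS:
        return Route.ACCOUNT_TOOL
    return Route.GROUNDED_LLM


@dataclass
class AuditRecord:
    conversation_id: str
    intent: str
    route: str
    retrieval_ids: list[str]  # the retrieval set behind the response
    response: str

    def to_json(self) -> str:
        # Sorted keys keep log lines stable and easy to diff.
        return json.dumps(asdict(self), sort_keys=True)
```

Because the route is fixed before generation, a pricing question never reaches the model, regardless of how the prompt is worded.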
Eval discipline
Before launch we built an eval suite of 1,800 real prior tickets, hand-labeled with ideal outcomes (deflect, escalate-with-context, escalate-cold). Every prompt change, every retriever tweak, every model swap was scored against it. Releases were gated on no regression in escalation precision.
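The release gate reduces to a small comparison. The sketch below assumes escalation precision means the share of predicted escalations whose hand label was also an escalation (either kind); that definition, the function names, and the zero-escalation convention are assumptions for illustration.

```python
from typing import Iterable, Tuple

# Outcome labels from the eval suite; "deflect" is the third label.
ESCALATE = {"escalate-with-context", "escalate-cold"}


def escalation_precision(pairs: Iterable[Tuple[str, str]]) -> float:
    """pairs are (predicted, labeled) outcomes, one per ticket.

    Precision = predicted escalations that were labeled as escalations
                / all predicted escalations.
    """
    labels_of_predicted = [label for pred, label in pairs if pred in ESCALATE]
    if not labels_of_predicted:
        # Assumed convention: a system that never escalates scores 0.0,
        # so it cannot pass the gate trivially.
        return 0.0
    correct = sum(1 for label in labels_of_predicted if label in ESCALATE)
    return correct / len(labels_of_predicted)


def passes_gate(candidate: float, baseline: float) -> bool:
    """Allow a release only if precision did not regress against the baseline."""
    return candidate >= baseline
```

For example, a candidate that escalates four tickets, three of them labeled as escalations, scores 0.75 and would be blocked against a baseline of 0.80.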
Outcome
- 38% L1 deflection in the first three months, sustained over nine.
- +0.4 CSAT in cohorts that interacted with the copilot vs. the prior chatbot baseline.
- Zero compliance incidents — no invented policy, no leaked data, no fabricated pricing.
- p95 latency of 1.5s end-to-end, including retrieval and generation.
The copilot is now extended by the client's product team. We rotated off after handoff and a 30-day stabilization window.