Sub-second semantic search across 12M financial filings

RAG over a decade of regulatory disclosures, serving an internal research team. Cut analyst time-to-answer from minutes to seconds with measurable retrieval quality.

The problem

Analysts were spending the first 30–45 minutes of every research task locating prior filings, transcripts, and analyst notes that touched on the same topic. Their existing keyword search was strict, brittle, and dependent on knowing the right ticker / cycle / filing type up front.

They wanted grounded answers — not summaries, not chat. Find me the five most relevant passages, with citations, ranked by usefulness for the question, in under a second.

What we built

A retrieval system, not an LLM application. The model is a small piece at the end of a longer pipeline:

question
  → query rewrite (HyDE-style, 1 LLM call, cached)
  → hybrid retrieval (BM25 + dense, RRF fusion)
  → cross-encoder rerank (top-50 → top-8)
  → grounded answer (cited, no synthesis without citation)

The bulk of the engineering went into the index, not the prompt: a tiered embedding strategy that re-embeds high-value documents weekly with a stronger model, and a dedup/canonicalization pass that collapses near-identical filings (10-K amendments, restatements) into a single retrievable canonical document with provenance pointers.

What made it work

Eval-first. Before writing a line of retrieval code, we hand-labeled 200 (question, ideal-passage) pairs from real analyst tickets. Every change ran against the set. Recall@10 was the only metric that mattered, and it kept telling us when "obvious wins" actually weren't.
Hybrid beat dense-only, by 9 points of recall@10. Reranking added 6 more. Each component had to justify itself with an eval run, not vibes.
Caching at every layer. Query rewrites, embeddings, rerank scores. p95 latency was a product feature; cache hit rate was a release blocker.
No answer without a citation. The UI doesn't render an answer without inline citation chips that link back to the source passage. That forces retrieval quality to be visible to analysts on every query.

Outcome

Six weeks from kickoff to internal beta. Twelve weeks to general rollout. Six months in:

84% reduction in median analyst time-to-answer for "find prior coverage of X"
41% absolute lift in citation coverage on a held-out eval (from 0.43 to 0.84 recall@10)
p95 latency 720ms end-to-end at peak load
Costs landed at roughly $0.0008 per query at the steady state

The system has been in production for over a year, owned and extended by the client's data team.