RAG systems

Retrieval-augmented generation that retrieves the right thing. Hybrid search, reranking, and the discipline to refuse an answer when the evidence isn’t there.

When RAG fits, and when it doesn’t

RAG is the right answer when your knowledge changes faster than you can fine-tune, when you need provenance, or when domain terminology punishes general models. It’s the wrong answer when your "documents" are really structured data, or when the latency budget leaves no room for retrieval. We’ll tell you which one you have.

Heuristic. If the user’s question can be answered by a SQL query against your data, RAG is the wrong tool. Build the query path first.

Build the eval before the system: 100–300 hand-labeled (question, ideal-passage) pairs. Recall@10 on that set is the contract every later change is held to.
Embeddings and generator are independent choices. OpenAI, Voyage, or open-weight for embeddings; Claude, GPT, or Gemini for grounded generation. The eval picks the pair, not the brand.
Hybrid (BM25 plus dense vectors) beats dense-only. Almost always. Reranking adds another lift. Each one is an A/B against the eval, not a vibes call.
Caching is a product feature. Query rewrites, embeddings, and rerank scores are all cached. Missing the p95 latency budget blocks the release.
No answer without a citation. The UI is built so an analyst sees source passages inline; this forces retrieval quality to be visible on every query.
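
A minimal sketch of how the Recall@10 contract could be measured, assuming each eval pair is stored as (question, id of the ideal passage). The `retrieve` callable is a stand-in for whatever retriever is under test; both names are illustrative and not taken from any specific library.

```python
from typing import Callable, Sequence


def recall_at_k(
    eval_set: Sequence[tuple[str, str]],
    retrieve: Callable[[str, int], Sequence[str]],
    k: int = 10,
) -> float:
    """Fraction of questions whose labeled passage id appears in the top-k results.

    eval_set: hand-labeled (question, ideal_passage_id) pairs.
    retrieve: hypothetical retriever returning ranked passage ids for a question.
    """
    if not eval_set:
        raise ValueError("eval set is empty")
    hits = 0
    for question, ideal_passage_id in eval_set:
        # Truncate defensively in case the retriever returns more than k ids.
        top_k = list(retrieve(question, k))[:k]
        if ideal_passage_id in top_k:
            hits += 1
    return hits / len(eval_set)
```

Running the same function against dense-only, hybrid, and hybrid-plus-rerank retrievers gives the A/B numbers the points above rely on.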

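def ingest(doc: Document) -> None:
    """Index one document into both the lexical and the dense store."""
    canon = canonicalize(doc)            # merge restatements/amendments
    chunks = chunk(canon, target=512)    # split into chunks of roughly the target size
    vectors = embed(chunks, model="text-embedding-3-large")  # one vector per chunk
    bm25.upsert(canon.id, chunks)        # lexical (keyword) index
    pgvector.upsert(canon.id, vectors)   # dense (vector) index
    audit.write(canon.provenance)        # record where the text came from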
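
On the query side, results from the two indexes written above have to be merged into one ranking. One common way to do that is reciprocal rank fusion (RRF). The sketch below assumes RRF rather than describing the fusion used in any particular build; the function name, the `k=60` smoothing constant, and the chunk ids are all illustrative.

```python
from collections import defaultdict
from typing import Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
    top_n: int = 10,
) -> list[str]:
    """Merge several ranked lists of chunk ids into one list.

    Each chunk scores 1 / (k + rank) in every list it appears in;
    the scores are summed across lists and sorted descending.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Break ties on chunk id so the output is deterministic.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


# Hypothetical ranked results from the lexical and dense indexes.
bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Whether this fusion beats dense-only, and whether a reranker on top adds lift, is still a question for the eval to settle.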

Have a retrieval problem?

If you’re shipping search, support, research, or document analysis on your own corpus, we should talk.

contact@zhironghuang.com

See a financial-services case study →