Generative AI applications
Copilots, agents, and structured generation that ship to real customers, on Claude, OpenAI, or Gemini. Safety properties enforced by the architecture, not by regex layered on top after the fact.
In-product copilots
Help-center bots, in-app assistants, and structured workflows that route between LLM and deterministic code based on intent and confidence.
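As a rough illustration of routing on intent and confidence, the sketch below treats the LLM as one branch among several rather than the default path. The `Classification` type, the `route` function, and the 0.8 threshold are all hypothetical names and values chosen for this example, not a description of a specific production system.

```python
from dataclasses import dataclass
from typing import Callable, Dict

CONFIDENCE_THRESHOLD = 0.8  # assumed value; tune against your own evaluation set


@dataclass
class Classification:
    intent: str
    confidence: float


def route(
    message: str,
    classify: Callable[[str], Classification],
    handlers: Dict[str, Callable[[str], str]],
    llm_answer: Callable[[str], str],
    escalate: Callable[[str], str],
) -> str:
    """Pick a code path from intent and confidence."""
    result = classify(message)
    # Low confidence: hand off instead of guessing.
    if result.confidence < CONFIDENCE_THRESHOLD:
        return escalate(message)
    # Intents with a registered handler are answered by deterministic code.
    if result.intent in handlers:
        return handlers[result.intent](message)
    # Everything else may go to the model.
    return llm_answer(message)
```

Registering a handler for an intent is what removes it from the model's reach; nothing downstream needs to remember to filter it.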
Tool-using agents
Bounded-scope agents with explicit tool inventories, retry semantics, and budget enforcement. No infinite loops, no surprise spend.
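One way to picture an explicit tool inventory with retries and a hard budget is the minimal sketch below. `Budget`, `BudgetExceeded`, and `call_tool` are hypothetical names, and the assumption that tools signal transient failure with `RuntimeError` is ours for the sake of the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


class BudgetExceeded(Exception):
    """Raised when the next attempt would exceed the step or spend limit."""


@dataclass
class Budget:
    max_steps: int
    max_cost_usd: float
    steps: int = 0
    cost_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        # Check before committing so the limit is never overshot.
        if self.steps + 1 > self.max_steps or self.cost_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"steps={self.steps}, cost_usd={self.cost_usd:.2f}")
        self.steps += 1
        self.cost_usd += cost_usd


def call_tool(
    tools: Dict[str, Callable[..., Any]],
    name: str,
    args: Dict[str, Any],
    budget: Budget,
    cost_usd: float,
    max_retries: int = 2,
) -> Any:
    """Run one tool from the inventory with bounded retries, charging every attempt."""
    if name not in tools:
        raise KeyError(f"tool {name!r} is not in the inventory")
    last_error = None
    for _ in range(max_retries + 1):
        budget.charge(cost_usd)  # retries are paid for too
        try:
            return tools[name](**args)
        except RuntimeError as exc:  # assumed: tools raise RuntimeError on transient failure
            last_error = exc
    raise last_error
```

Because every attempt passes through `budget.charge`, a retry loop cannot run indefinitely or spend past the limit; it ends with either a result, the tool's last error, or `BudgetExceeded`.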
Structured generation
JSON/Pydantic-grounded outputs with schema validation and refusal paths, using each provider’s native structured-output mode (Claude tool use, OpenAI structured outputs, Gemini function calling). The model fills slots; the type system catches mistakes.
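The idea of "the model fills slots; the type system catches mistakes" can be sketched with the standard library alone. In practice a Pydantic model would likely replace the hand-written checks here; `TicketTriage`, `Refusal`, `parse_triage`, and the category list are hypothetical names invented for this example.

```python
import json
from dataclasses import dataclass
from typing import Union

ALLOWED_CATEGORIES = {"bug", "how_to", "account"}  # hypothetical slot values


@dataclass(frozen=True)
class TicketTriage:
    category: str
    summary: str


@dataclass(frozen=True)
class Refusal:
    reason: str


def parse_triage(raw: str) -> Union[TicketTriage, Refusal]:
    """Turn raw model output into a typed value or an explicit Refusal."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Refusal("output was not valid JSON")
    if not isinstance(data, dict):
        return Refusal("output was not a JSON object")
    # The model may decline explicitly; that is a valid outcome, not an error.
    if data.get("refusal"):
        return Refusal(str(data["refusal"]))
    if set(data) != {"category", "summary"}:
        return Refusal("unexpected or missing fields")
    category, summary = data["category"], data["summary"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        return Refusal("category is not in the allowed set")
    if not isinstance(summary, str) or not summary.strip():
        return Refusal("summary must be a non-empty string")
    return TicketTriage(category=category, summary=summary.strip())
```

Callers receive either a `TicketTriage` or a `Refusal`, so a half-valid payload never reaches downstream code.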
Safety properties we enforce structurally
Every "guardrail" you can name is an admission that the architecture lets unsafe states happen at all. We prefer to make the unsafe states unrepresentable.
- The model never sees secrets. Account, billing, and entitlement queries are tool calls; the LLM routes, it doesn’t answer.
- Pricing answers are deterministic. No free-form generation about prices, periods, or refunds; those answers are pulled from approved copy.
- No answer without a verifiable citation. Post-processing strips paragraphs that don’t cite an approved source.
- Honest fallback. Below the confidence threshold, the bot says so plainly and offers to escalate with conversation context pre-filled.
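The citation and fallback rules above could be sketched roughly as follows. This assumes paragraphs arrive as structured objects that carry their own source IDs; the `Paragraph` type, the fallback wording, and the 0.8 threshold are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

FALLBACK = "I'm not confident I can answer that. Would you like me to hand this to a person?"


@dataclass(frozen=True)
class Paragraph:
    text: str
    source_ids: Tuple[str, ...]  # assumed: citations arrive as structured IDs, not inline text


def keep_cited(paragraphs: List[Paragraph], approved: Set[str]) -> List[Paragraph]:
    """Drop every paragraph that does not cite at least one approved source."""
    return [p for p in paragraphs if any(s in approved for s in p.source_ids)]


def respond(
    paragraphs: List[Paragraph],
    confidence: float,
    approved: Set[str],
    threshold: float = 0.8,  # assumed value
) -> str:
    """Return only cited paragraphs, or the fallback if nothing trustworthy remains."""
    kept = keep_cited(paragraphs, approved)
    if confidence < threshold or not kept:
        return FALLBACK
    return "\n\n".join(p.text for p in kept)
```

If filtering removes every paragraph, the response degrades to the fallback rather than to an uncited answer.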
cases = load_dataset("real_tickets_v3.jsonl")  # 1,800 hand-labeled
for case in cases:
    response = pipeline(case.input)
    score = grade(
        response,
        truth=case.expected,
        checks=[deflect, escalate_with_context, escalate_cold],
    )
    metrics.observe(score)
release_gate(precision >= 0.95, recall >= 0.85)
- Build the evaluation set from real conversations. Not synthetic. Not augmented. The actual tickets.
- Every prompt change runs against it. No exceptions for "small tweaks."
- Releases gate on no regression in escalation precision. A bot that is confidently wrong is worse than one that escalates honestly.
- Replay log. Every conversation is replayable from its retrieval set, for audits and incident review.
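A "no regression in escalation precision" gate might look something like the sketch below. It assumes precision is defined the usual way (of the cases the bot escalated, the fraction the labels agree needed escalation) and that each case is labeled either "escalate" or "deflect"; both function names are hypothetical.

```python
from typing import List


def escalation_precision(predicted: List[str], expected: List[str]) -> float:
    """Of the cases the bot escalated, the fraction that the labels say needed it."""
    labels_for_escalated = [e for p, e in zip(predicted, expected) if p == "escalate"]
    if not labels_for_escalated:
        # Precision is undefined with zero escalations; fail loudly rather than pass silently.
        raise ValueError("no escalations predicted; precision is undefined")
    hits = sum(1 for e in labels_for_escalated if e == "escalate")
    return hits / len(labels_for_escalated)


def passes_gate(candidate: float, baseline: float) -> bool:
    """A release passes only if it does not regress against the current baseline."""
    return candidate >= baseline
```

Comparing against the last released baseline, rather than only a fixed number, means a prompt change that quietly trades precision for deflection rate is caught even while it stays above an absolute floor.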
Building something users will rely on?
If your roadmap has the words "copilot," "assistant," or "agent" in it, and the words "production" and "compliance" anywhere near them, we should talk.