/ about

An engineering practice shipping production AI on Claude, OpenAI, and Gemini.

Two years operating. A short list of clients. Every system we ship is built on the major model providers — Claude, OpenAI, or Gemini — chosen per engagement based on what the data, latency, and compliance constraints actually demand.

The shape of the work

We started in 2024 because too many AI projects we were watching from the inside were stuck. Working demos, no production. The gap between "the model can do it" and "the system does it reliably at scale" turned out to be a real engineering discipline, and not a hot topic in any of the typical "AI strategy" conversations. That gap is what we work in.

Our clients are usually engineering or product leaders who’ve already tried something (a prototype, a vendor pilot, an internal hack) and want to make it real. Sometimes we're brought in to start something new. More often we're brought in to ship something that’s been almost-shipping for too long.

What we’re not. Not a strategy firm. Not a staff-aug shop. Not a "fractional CTO." We’re a small engineering practice that writes production code on the major model providers, sets up evaluations, runs incidents, and hands over working software.

Multi-provider, by design

No single model wins every benchmark, and no single provider wins every constraint. We work fluently across Claude, OpenAI, and Gemini so that the choice is driven by your problem — long context, tool use, multimodal input, latency budget, deployment region, data-residency — not by which API we happen to know.

In practice that often means a routed mix: a smaller model on the hot path, a larger one on the escalation tier, embeddings from whichever provider scores best on your eval set. The architecture stays provider-agnostic; the choices stay defensible.

Claude — long context, tool use, structured generation. Default where output quality has to clear an eval gate.
OpenAI — GPT models, embeddings, fine-tuning. Default for the hot path on most engagements.
Gemini — million-token windows, multimodal inputs, Vertex deployment. The right call for GCP-native workloads or extreme context.
Routed where it pays. A small model on the hot path, a frontier model on escalation, the right embedder for your eval set — under one provider abstraction your team can read.

Evaluations come first. If you can’t measure it, you’re not engineering it. Every project starts by labeling a real evaluation set.
Small team, deep ownership. One or two engineers per engagement, end to end. No layers. No "delivery managers." If something is broken, the person who can fix it is on the call.
Frontier models are an escalation tier. Most queries can be answered by something cheaper. Reach for the biggest hammer when nothing else works, not as a default.
Citations are required. A customer-facing answer without provenance is, by policy, a bug. This forces retrieval quality to stay visible.
Handoff is the deliverable. The team you hand off to is the actual customer. Documentation, runbooks, dashboards, and on-call materials are first-class artifacts.

Want to talk?

We answer within one business day. If we’re not the right fit for your problem, we’ll tell you and try to point you somewhere that is.

contact@zhironghuang.com +1 (561) 501-1280