The problem
The client was running a 22-person offshore team to key invoices into 14 separate ERP installations across nine operating units. Each unit had its own chart of accounts, vendor master, tax rules, and approval matrix. The team was the integration layer.
It worked, until it didn't. Volume grew 40% year over year. Error rates climbed. SLA breaches generated late-payment fees that started eating into the savings the offshore team was supposed to be producing. Every "AI invoice OCR" vendor pilot stalled on the same problem: the extraction was fine, but the routing was the actual job.
What we built
A workflow engine, not a chatbot. The model is one of nine steps in a deterministic pipeline.
inbound (email, EDI, portal)
→ dedupe + canonical hash
→ OCR (printed / scanned / digital-native)
→ field extraction (vendor, PO, lines, tax, totals)
→ policy resolver (per-OU, per-vendor)
→ 3-way match (PO ↔ receipt ↔ invoice)
→ routing decision (auto-post / hold / escalate)
→ ERP write (idempotent, retried)
→ audit + replay log
Built on Temporal for durability. Every step is a versioned activity, every retry is observable, every failed run is replayable from the exact input bytes. Nothing is fire-and-forget anywhere in the system.
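To make the "deterministic pipeline, replayable from the exact input bytes" idea concrete, here is a minimal sketch in plain Python. It does not use the Temporal SDK, and the step names, the `run_pipeline` helper, and the toy `key=value` input format are all hypothetical stand-ins rather than the production code.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# (step name, version, function). All example steps are hypothetical.
Step = Tuple[str, int, Callable[[Dict], Dict]]


def dedupe(state: Dict) -> Dict:
    # Attach a content hash of the exact input bytes.
    return {**state, "input_sha256": hashlib.sha256(state["raw"]).hexdigest()}


def extract(state: Dict) -> Dict:
    # Stand-in for OCR + field extraction: parse "key=value" pairs.
    text = state["raw"].decode("utf-8")
    fields = dict(p.split("=", 1) for p in text.split(";") if "=" in p)
    return {**state, "fields": fields}


STEPS: List[Step] = [("dedupe", 1, dedupe), ("extract", 1, extract)]


def run_pipeline(raw: bytes, steps: List[Step]) -> Tuple[Dict, List[Tuple[str, int]]]:
    """Run steps in a fixed order, recording each step's name and version."""
    state: Dict = {"raw": raw}
    log: List[Tuple[str, int]] = []
    for name, version, fn in steps:
        state = fn(state)
        log.append((name, version))
    return state, log
```

Because every step is a pure function of the state it receives, feeding the same bytes through the same step versions yields the same result, which is the property that makes a failed run replayable.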
Where the LLM does (and does not) make decisions
- Yes to field extraction from messy formats: handwritten remittance notes, multi-page tax tables, vendor PDFs that change layout every quarter. Also yes to natural-language vendor policy lookups ("If vendor matches XYZ, route to OU-finance when total > $25k").
- No to the actual routing decision. That's a deterministic resolver compiled from the per-OU policy file. The model proposes; the resolver decides. Refusing to let the LLM make routing calls is what got compliance to sign off.
Every extracted field has a confidence score and a bounding box. When any field falls below the confidence threshold, the row goes to a human queue with the original PDF rendered next to the extracted values. Median review time is about 12 seconds; we measured.
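The split between "the model proposes" and "the resolver decides" can be sketched as follows. This is an illustration under assumptions, not the client's resolver: the `Field` and `Rule` types, the 0.9 threshold, and the default of auto-posting when no rule matches are all hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Field:
    value: str
    confidence: float          # extractor's confidence in [0, 1]
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) on the source page


@dataclass(frozen=True)
class Rule:
    vendor: str
    min_total: Decimal
    decision: str              # "auto-post", "hold", or "escalate"


def route(fields: Dict[str, Field], rules: List[Rule],
          threshold: float = 0.9) -> Tuple[str, List[str]]:
    """Return (decision, low-confidence field names). No model call here."""
    low = sorted(n for n, f in fields.items() if f.confidence < threshold)
    if low:
        # Any low-confidence field sends the row to the human queue.
        return "hold", low
    vendor = fields["vendor"].value
    total = Decimal(fields["total"].value)
    for rule in rules:  # first matching rule wins
        if rule.vendor == vendor and total > rule.min_total:
            return rule.decision, []
    return "auto-post", []  # assumed default when no rule matches
```

The function takes only extracted values and compiled rules as input, so the same invoice and the same policy file always produce the same decision.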
Idempotency, replay, and never posting twice
ERP double-posting was the thing that would have killed the project. Every write to every ERP carries:
- A workflow-run UUID (Temporal-issued)
- A canonical invoice hash (vendor + invoice number + total + date, normalised)
- A per-ERP idempotency key
The ERP adapters check a table of canonical hashes before writing. If the same canonical invoice has been seen before, the second attempt is a no-op with an audit log entry. We hit this in production within the first week: a flaky upstream EDI feed re-sent a batch, and zero double-posts occurred.
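A minimal sketch of the check-before-write pattern described above. The normalisation rules, the `ErpAdapter` class, and the in-memory dictionary (standing in for a durable hash table) are assumptions made for illustration; the real adapters and their normalisation may differ.

```python
import hashlib
from decimal import Decimal
from typing import Dict, List, Tuple


def canonical_invoice_hash(vendor: str, invoice_number: str,
                           total: str, invoice_date: str) -> str:
    """Hash the normalised (vendor, invoice number, total, date) tuple."""
    parts = [
        vendor.strip().upper(),
        "".join(invoice_number.split()).upper(),        # drop whitespace
        str(Decimal(total).quantize(Decimal("0.01"))),  # two decimal places
        invoice_date.strip(),                           # assumed ISO 8601
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


class ErpAdapter:
    """Hypothetical adapter: a repeat write becomes an audited no-op."""

    def __init__(self) -> None:
        self.seen: Dict[str, str] = {}   # canonical hash -> first run id
        self.posted: List[dict] = []
        self.audit_log: List[Tuple[str, str, str]] = []

    def post(self, run_id: str, invoice: dict) -> bool:
        key = canonical_invoice_hash(invoice["vendor"], invoice["number"],
                                     invoice["total"], invoice["date"])
        if key in self.seen:
            self.audit_log.append(("duplicate_skipped", run_id, key))
            return False
        self.seen[key] = run_id
        self.posted.append(invoice)
        self.audit_log.append(("posted", run_id, key))
        return True
```

In a real deployment the lookup and the insert would need to be atomic (for example a unique constraint on the hash column), otherwise two concurrent retries could both pass the check.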
What we measured before we shipped
Before launch we built a labelled set of 4,200 historic invoices covering every vendor format, OU, and edge case the offshore team had flagged in the prior 18 months. Every prompt change, every extractor tweak, every policy-resolver edit ran against it. Two metrics gated releases:
- Field-level F1 ≥ 0.985 across vendor, total, line items, tax
- Routing precision ≥ 0.99 (a wrong auto-route is worse than a hold)
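The two release gates can be expressed as a small check over counts from the labelled set. This sketch assumes the F1 threshold applies to each field individually and that counts of true positives, false positives, and false negatives are already available; the function names and the count layout are hypothetical.

```python
from typing import Dict, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP); 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def release_allowed(field_counts: Dict[str, Tuple[int, int, int]],
                    routing_tp: int, routing_fp: int,
                    min_f1: float = 0.985,
                    min_precision: float = 0.99) -> bool:
    """Both gates must pass: every field's F1 and the routing precision."""
    fields_ok = all(f1_score(*c) >= min_f1 for c in field_counts.values())
    routing_ok = precision(routing_tp, routing_fp) >= min_precision
    return fields_ok and routing_ok
```

Gating on precision rather than recall for routing matches the stated priority: a missed auto-route only costs a hold, while a wrong auto-route posts bad data.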
Outcome
- 92% touchless processing at steady state. AP automation pilots typically land near 40%.
- Mean time-to-post dropped from 3.4 days to 11 minutes.
- $2.1M annual run-rate savings after offsetting platform and LLM costs.
- Zero double-posts in 14 months of production.
- Late-payment fees down 78%.
The system has been owned by the client's finance-engineering team since handoff. We pair with them quarterly when new ERPs or vendor formats come online. The rest is theirs.