Recruiting pipeline: 6× recruiter throughput, without the spam

Inbound resume parsing, structured screening, and outreach drafting, with hard guardrails against the auto-spam failure mode that has tarred the category.

The problem

Recruiters at the client's enterprise customers were doing two jobs: deciding who to talk to, and writing 40+ outreach messages a day to the people they'd decided to talk to. The "decide" half was where their judgement lived. The "write" half was a mechanical tax that pushed median outreach quality to a generic floor. The generic floor is what makes candidates unsubscribe.

The category is full of "AI sourcing" tools that automate the spam. We were brought in because the platform team explicitly didn't want to ship one of those.

What we built

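A pipeline with three model stages (extraction, matching, drafting), each with its own model, evaluation set, and refusal path, plus a human gate after matching and another after drafting: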

inbound resume
  → structured extraction (skills, roles, dates, level, signal)
  → req-matching (scored, explained, ranked)
  → recruiter review (always — humans decide)
  → outreach drafting (only on approved candidates)
  → recruiter sends (always — humans send)

The system never sends a message. It never even drafts a message until a human has reviewed and approved a candidate. Every stage outputs a structured record; the recruiter UI presents the records, the human moves them forward.
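The approval gate described above can be sketched as a small state machine. This is an illustration under assumptions, not the production code: `Stage`, `CandidateRecord`, `approve`, `draft_outreach`, and `NotApprovedError` are hypothetical names, and `write_draft` stands in for whatever calls the drafting model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Stage(Enum):
    EXTRACTED = "extracted"
    MATCHED = "matched"
    APPROVED = "approved"   # reachable only through approve(), i.e. a recruiter
    DRAFTED = "drafted"


class NotApprovedError(Exception):
    """Raised when drafting is requested for a candidate no human has approved."""


@dataclass
class CandidateRecord:
    candidate_id: str
    stage: Stage = Stage.EXTRACTED
    approved_by: Optional[str] = None
    draft: Optional[str] = None


def approve(record: CandidateRecord, recruiter_id: str) -> None:
    """Human gate: a named recruiter moves a matched record forward."""
    if record.stage is not Stage.MATCHED:
        raise ValueError(f"cannot approve from stage {record.stage.value}")
    record.approved_by = recruiter_id
    record.stage = Stage.APPROVED


def draft_outreach(record: CandidateRecord,
                   write_draft: Callable[[CandidateRecord], str]) -> None:
    """Model stage: refuses to run unless a recruiter has approved the record."""
    if record.stage is not Stage.APPROVED or record.approved_by is None:
        raise NotApprovedError(record.candidate_id)
    record.draft = write_draft(record)
    record.stage = Stage.DRAFTED

# There is deliberately no send() here: in this sketch, sending is not a
# capability the pipeline has at all.
```

Making the refusal an exception rather than a silent no-op means a wiring mistake upstream shows up as a loud failure instead of an unreviewed draft.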

What made it work

Structured extraction over free-form parsing. The model fills a Pydantic schema (ParsedResume) with required fields, optional fields, and an unknowns: list[str] bucket for things it couldn't categorise. Anything in unknowns is a hint to the recruiter, not a guess presented as fact.
Match scoring with citations. Every match score comes with three to five "evidence anchors": concrete spans from the resume that drove the score. Recruiters can sanity-check a 92 in five seconds.
Outreach drafts that quote the resume. Drafts cite a specific line from the resume in the first sentence. We measured that this single property, provable specificity, is what moved reply rate from 11% to 13% (about 18% relative), not any of the prompt-engineering knobs we tried first.
Hard refusal on protected attributes. Name, gender markers, photos, age, and graduation years are stripped before scoring. They are re-attached only at the human-review step and are never passed to the model used for scoring.
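One way to keep evidence anchors honest is to check that each one really is a span of the resume before a score reaches the recruiter. The sketch below assumes anchors are stored as quoted strings and that matching ignores case and whitespace differences. `MatchResult`, `unsupported_anchors`, and `is_presentable` are hypothetical names, not the pipeline's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MatchResult:
    score: int                  # match score shown to the recruiter
    anchors: Tuple[str, ...]    # spans the model claims to quote from the resume


def normalize(text: str) -> str:
    """Collapse whitespace and case so line wrapping does not break matching."""
    return " ".join(text.split()).casefold()


def unsupported_anchors(resume_text: str, result: MatchResult) -> List[str]:
    """Return anchors that are empty or do not appear in the resume text."""
    haystack = normalize(resume_text)
    return [a for a in result.anchors
            if not normalize(a) or normalize(a) not in haystack]


def is_presentable(resume_text: str, result: MatchResult) -> bool:
    """Accept a score only with three to five anchors, all found in the resume."""
    return (3 <= len(result.anchors) <= 5
            and not unsupported_anchors(resume_text, result))
```

The same substring check would apply to an outreach draft's quoted resume line: if the quote is not in the resume, the draft is not specific, it is fabricated.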

What we refused to ship

Auto-send. Every team that asked for it has been told no. The category's reputation is bad enough already.
Score-and-discard. All resumes are surfaced to recruiters. The score sorts the queue; it doesn't filter it. A recruiter who wants to see rank #800 can.
Multi-step "agent" workflows. Stages are explicit, named, observable, bounded. Nothing loops. No agent decides to "do more research." If a stage retries, it's a transient error, not a model decision.
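The "retries are for transient errors, not model decisions" rule could look roughly like this. It is a sketch under assumptions: `TransientError` is a hypothetical marker for timeouts and rate limits, and the attempt cap and backoff values are illustrative defaults, not figures from the project.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for timeouts, rate limits, dropped connections."""


def run_stage(stage_fn, payload, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run one stage; retry only on TransientError, with a hard attempt cap."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return stage_fn(payload)
        except TransientError:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure, do not loop
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```

Any other exception, including a model output that fails validation, propagates on the first attempt, so the stage cannot turn "try again" into an open-ended loop.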

How we measured it

The match-scoring eval was the hard one. Working with two senior recruiters, we hand-built a set of 1,400 (resume, req, ideal-outcome) tuples, scored on a 4-point ordinal scale (pass, maybe, interview, top-of-pile). The model has to land within 1 ordinal step on the held-out set, and strictly above the prior baseline on the recruiters' own sample. Every model swap triggers a full re-run.
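The "within 1 ordinal step" criterion can be computed as follows. This is a minimal sketch: the label order comes from the scale above, while the function name and the choice to report a rate are assumptions. The text does not give the pass threshold for that rate, so none is hard-coded here.

```python
LEVELS = ("pass", "maybe", "interview", "top-of-pile")
RANK = {label: i for i, label in enumerate(LEVELS)}


def within_one_step_rate(predicted, expected):
    """Fraction of pairs whose ordinal labels differ by at most one step."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty label lists of equal length")
    hits = sum(abs(RANK[p] - RANK[e]) <= 1 for p, e in zip(predicted, expected))
    return hits / len(expected)


predicted = ["pass", "maybe", "top-of-pile", "pass"]
expected = ["maybe", "maybe", "interview", "top-of-pile"]
rate = within_one_step_rate(predicted, expected)  # step gaps are 1, 0, 1, 3
```

Here three of the four predictions land within one step, so `rate` is 0.75. The last pair, a "pass" on a top-of-pile candidate, is exactly the kind of miss the metric is meant to catch.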

For outreach drafts, the eval is offline plus online: an LLM judge scores drafts against a rubric (specificity, tone, length, no fabrications) offline, and a live A/B test on reply rate runs on a single-digit percentage of traffic.
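Gating a live test to a small slice of traffic is commonly done with deterministic hash bucketing. The sketch below shows that general technique and is not necessarily how this A/B was wired. The unit of assignment (`unit_id`), the experiment name, and the function name are all assumptions.

```python
import hashlib


def in_experiment(unit_id: str, experiment: str, percent: float) -> bool:
    """Deterministically place roughly `percent`% of units in the experiment."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100
```

Because assignment depends only on the ids, a given unit sees the same arm on every request, and salting the hash with the experiment name keeps one test's slice from lining up with another's.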

Outcome

6.1× recruiter throughput on the combined screen + outreach motion
+18% relative reply rate on outreach versus the prior templated baseline
Zero policy violations flagged by the client's compliance team across nine months
0.0% auto-send rate (by design; the system has no such code path)
$1.4M ARR uplift for the client from a tier-up of the affected SKU

We didn't build a sourcing tool. We built something that lets a recruiter do six times the work while keeping the part of the job they care about: choosing who to talk to. The platform team owns it now; we still meet once a quarter to look at the eval scores together.