Vol. I · No. 1 — Engineering Review 10-Day Take-Home · Submission

RAG. The Customer Service Issue

A field study in hybrid retrieval — policy PDFs alongside a live transactional database, blended into a single grounded answer with mandatory source attribution.
Author · Kadir Şahin
Filed · 2026 · 04 · 27
Audience · Engineering review

The Brief · Two sources, one answer
02 — On the problem

Two heterogeneous
sources, blended.

A customer-service chatbot that classifies intent, retrieves from the right source(s), and synthesises a single grounded answer with citations. Three intent classes cover everything; one of them — hybrid — is the hard one. It requires reading policy from PDFs and data from the live SQLite DB in the same request, then weaving them together cleanly.

"My withdrawal is still pending after 5 days. What should I do? I'm user mariaanderson711."
The hybrid case · Policy SLA + Maria's pending row
Non-negotiables
  1. Source attribution

    Every claim carries a marker — [SQL] or [doc.pdf p9].

  2. NL → SQL

    No hardcoded query templates.

  3. No black-box

    No end-to-end pipeline I don't own.

  4. One command

    make run — three interfaces, one entrypoint.

The Architecture · Hexagonal-lite
03 — System diagram

Composition root, three
interfaces, one use case.

The wire-up lives in src/wire.py; FastAPI, CLI, and Gradio each call the same RAGService. The router is the brain — everything around it is replaceable.

Client
api · cli · gradio
FastAPI
/ask · SSE
RAG Service
use case
TieredRouter
3 tiers
PDFRetriever
Chroma · BGE-small
Synthesizer
LLM · cited
DBRetriever
NL→SQL · SQLite
Answer
+ citations
Aux · NER
EntityExtractor (GLiNER)

Feeds usernames, dates, and amounts into the router; resolves PERSON via DB validation.
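The DB-validation step can be sketched with Python's sqlite3 against an in-memory stand-in for the live members table. The function name and the one-row fixture are hypothetical; the `SELECT 1 FROM members` probe is the same one shown in the trace of section 05.

```python
import sqlite3

def resolve_username(conn: sqlite3.Connection, candidate: str) -> bool:
    """Validate an NER-extracted PERSON/username against the members table."""
    row = conn.execute(
        "SELECT 1 FROM members WHERE username = ?", (candidate,)
    ).fetchone()
    return row is not None

# In-memory stand-in for the live DB (schema simplified for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (member_id TEXT PRIMARY KEY, username TEXT)")
conn.execute("INSERT INTO members VALUES ('m1', 'mariaanderson711')")

print(resolve_username(conn, "mariaanderson711"))  # True → strong "db" evidence
print(resolve_username(conn, "his"))               # False → unresolved PERSON
```

A resolved hit is what lets the router add db unconditionally; a miss is what feeds the unresolved-PERSON privacy guard.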

Aux · Fallback
LLM Tier-3 backstop

DeepSeek-R1, structured JSON. Activates only when embedding signal is uncertain (~11% of queries).

The Pipeline · Three signals, one decision
04 — On routing as evidence

Routing isn't classification.
It's an evidence stack.

  1. Entity-first augmentation

    GLiNER NER + DB validation. If a username resolves to a real members row, db is added unconditionally — entities are stronger evidence than vector similarity.

  2. Contrastive centroid embedding

    Per category, score the query against a positive and a negative anchor centroid. Multi-label: multiple categories may pass simultaneously.

  3. LLM Tier-3 fallback

    DeepSeek-R1, structured JSON. Activates only when embedding returns empty intents AND off_topic < 0.75. ~2s p50 — a backstop, not a default.

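The three tiers above can be sketched as one routing function. A minimal sketch with hypothetical names: the per-category thresholds (0.70 / 0.705 / 0.72) and the 0.75 off-topic gate for Tier-3 are the values quoted later in this document.

```python
THRESHOLDS = {"pdf": 0.70, "db": 0.705, "off_topic": 0.72}

def route(scores: dict, resolved_username: bool) -> dict:
    # Tier 2 — multi-label thresholding: every category that clears its
    # own threshold fires; categories are not mutually exclusive.
    intents = {cat for cat, s in scores.items()
               if cat != "off_topic" and s >= THRESHOLDS[cat]}
    # Tier 1 — entity-first augmentation: a DB-validated username is
    # stronger evidence than vector similarity, so "db" is added regardless.
    if resolved_username:
        intents.add("db")
    # Tier 3 — LLM fallback only when embeddings were silent AND the
    # off-topic score is itself uncertain (< 0.75).
    needs_llm = not intents and scores.get("off_topic", 0.0) < 0.75
    return {"intents": sorted(intents), "needs_llm_fallback": needs_llm}

print(route({"pdf": 0.702, "db": 0.747, "off_topic": 0.522},
            resolved_username=True))
# both "pdf" and "db" fire; no LLM call needed
```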
Contrastive scoring · per category, per query
# two centroids per category, computed once
score = cos(q, pos_centroid)
      − α · cos(q, neg_centroid)

α = 0.05   # empirically calibrated

if score >= threshold[cat]:
    intents.add(cat)
Multi-label means a query can fire pdf and db simultaneously; that is the whole point.
Footnote · NER evolution · spaCy custom → GLiNER zero-shot
We initially trained a custom spaCy NER on 88 hand-labelled queries (F1 = 0.824). Replaced it with urchade/gliner_medium-v2.1 for three reasons:
(1) Zero-shot · Adding a new entity label is a config change, not a re-train.
(2) Turkish · spaCy en_core_web_sm had low TR recall — usernames are mixed TR/EN.
(3) Killed regex · Semantic extraction let us drop regex heuristics from the entity layer entirely.
The Trace · One query, fully observed
05 — The central question

How is a routing decision made? A complete trace.

"My withdrawal is still pending after 5 days. What should I do? I'm user mariaanderson711."
Query under trace
i. Entity extraction · GLiNER · ~100ms
usernames = ["mariaanderson711"]
resolved ✓ via SELECT 1 FROM members
had_unresolved_person = False
ii. Contrastive scoring · embedding · ~10ms
pdf_policy: 0.776 − 0.05·0.744 = 0.702
user_data : 0.822 − 0.05·0.746 = 0.747
off_topic : 0.522
iii. Threshold check · multi-label
pdf 0.702 ≥ 0.70 → add "pdf"
db  0.747 ≥ 0.705 → add "db"
off 0.522 < 0.72 → skip
iv. Augmentation guard · no-op
resolved username + "db" already in intents → no-op ✓
v. Dominance suppression · margin = 0.07
|pdf − db| = 0.045 < 0.07 → both kept ✓
vi. Off-topic precedence · guard
off_topic 0.522 < 0.72 → no override ✓
vii. Confidence banding · max(scores) = 0.747 ∈ [0.70, 0.80]
→ confidence = "medium"
Verdict — RouteDecision
intents = ["pdf", "db"], tier = "embedding-medium-hybrid", confidence = "medium", scores = {pdf: 0.702, db: 0.747, off_topic: 0.522}
8–15ms · steps i–vii · no LLM call
If embedding returns empty intents & off_topic < 0.75 → escalate to LLM Tier-3 (DeepSeek-R1, ~2s p50, 11% of queries)
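Steps iv–vii can be sketched as a post-scoring guard pass. A hedged sketch: the 0.07 margin, the 0.72 off-topic gate, and the [0.70, 0.80) "medium" band come from the trace; band edges outside that interval are assumptions.

```python
MARGIN = 0.07          # dominance-suppression margin (step v)
OFF_TOPIC_GATE = 0.72  # off-topic precedence threshold (step vi)

def post_guards(intents: list, scores: dict) -> dict:
    intents = list(intents)
    # v. Dominance suppression: if one intent clearly dominates, drop the weaker.
    if {"pdf", "db"} <= set(intents):
        if abs(scores["pdf"] - scores["db"]) >= MARGIN:
            weaker = min(("pdf", "db"), key=lambda c: scores[c])
            intents.remove(weaker)
    # vi. Off-topic precedence: a confident off-topic score overrides everything.
    if scores.get("off_topic", 0.0) >= OFF_TOPIC_GATE:
        intents = ["off_topic"]
    # vii. Confidence banding (edges outside [0.70, 0.80) are assumed).
    top = max(scores[c] for c in intents) if intents else 0.0
    if top >= 0.80:
        confidence = "high"
    elif top >= 0.70:
        confidence = "medium"
    else:
        confidence = "low"
    return {"intents": intents, "confidence": confidence}

print(post_guards(["pdf", "db"], {"pdf": 0.702, "db": 0.747, "off_topic": 0.522}))
# both intents kept, confidence "medium", matching the verdict above
```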
The Mechanism · Contrastive centroid
06 — On negative anchors

Two centroids per category.
Subtract one from the other.

At init · per category
positive_centroid[cat] = mean(embed(p)
    for p in anchors[cat]["positive"])
negative_centroid[cat] = mean(embed(n)
    for n in anchors[cat]["negative"])
→ two 384-dim points per category in BGE space.
At inference · per query
pos_sim = cos(q, pos_centroid)
neg_sim = cos(q, neg_centroid)
score   = pos_sim − α · neg_sim   # α = 0.05
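The two-centroid construction can be made concrete in a few lines of pure Python. A self-contained sketch: toy 3-dim vectors stand in for the 384-dim BGE embeddings, and embed() itself is elided.

```python
import math

def cos(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

ALPHA = 0.05

def contrastive_score(q, pos_vectors, neg_vectors):
    pos_c, neg_c = centroid(pos_vectors), centroid(neg_vectors)
    return cos(q, pos_c) - ALPHA * cos(q, neg_c)

# Toy anchors: positives point one way, negatives another.
pos = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]
neg = [[0.0, 1.0, 0.2], [0.1, 0.9, 0.0]]
q_true = [1.0, 0.1, 0.05]      # close to pos, far from neg → small subtraction
q_confuser = [0.7, 0.7, 0.1]   # close to BOTH → large subtraction
print(contrastive_score(q_true, pos, neg) > contrastive_score(q_confuser, pos, neg))  # True
```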
Geometric intuition
  • i.
    True positive — close to pos, far from neg → small subtraction → still above threshold.
  • ii.
    False positive — close to pos AND neg → large subtraction → falls below.
  • iii.
    Hybrid — close to BOTH category positives, far from BOTH negatives → both fire.
What it produces · pdf_policy threshold = 0.70
Query                        | pos   | neg   | final | Decision
"What is withdrawal time?"   | 0.794 | 0.719 | 0.758 | fire pdf ✓
"Show me my last 5 bets"     | 0.730 | 0.847 | 0.688 | reject ✓
"Withdrawal pending… Maria"  | 0.776 | 0.744 | 0.702 | fire pdf ✓
Why α = 0.05 specifically
α    | True PDF | Wrong DB | Hybrid  | Verdict
0.0  | 0.794 ✓  | 0.730 ✗  | 0.776 ✓ | false fire
0.05 | 0.758 ✓  | 0.688 ✓  | 0.702 ✓ | Adopted
0.10 | 0.722 ✓  | 0.645 ✓  | 0.665 ✗ | hybrid lost
0.15 | 0.686 ✗  | 0.603 ✓  | 0.628 ✗ | collapses

α = 0.05 — the sweet spot. Kills false positives without killing hybrid.

The Sweep · α calibration
07 — Empirical operating point

α = 0.05 is the only safe choice.

Sweeping α from 0.05 to 0.15 across the 8-query benchmark. Past 0.07 we start over-suppressing legitimate matches; past 0.10 we lose benchmarks outright.

α    | Bench passes | LLM fallback | FP pdf | FN db | Verdict
0.05 | 8 / 8        | 29 %         | 0      | 0     | Adopted
0.07 | 8 / 8        | 38 %         | 0      | 1     | marginal
0.10 | 5 / 8        | 52 %         | 0      | 3     | breaks 3
0.12 | 3 / 8        | 63 %         | 0      | 4     | worse
0.15 | 2 / 8        | 79 %         | 0      | 5     | collapses
What α controls
final = pos − α · neg

Low α — gentle penalty. Disambiguates only the genuine collisions; legitimate matches survive.

High α — over-suppression. Real PDF queries dip below threshold; recall falls; LLM fallback rises.

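The over-suppression effect can be reproduced with the true-PDF and wrong-DB (pos, neg) pairs from the tables above. A minimal sketch, assuming a flat 0.70 threshold for illustration:

```python
# (pos, neg) cosine pairs taken from the worked examples above.
CASES = {
    "true_pdf": (0.794, 0.719),   # legitimate match: should keep firing
    "wrong_db": (0.730, 0.847),   # collision: should be rejected
}
THRESHOLD = 0.70

def fires(pos: float, neg: float, alpha: float) -> bool:
    """final = pos − α · neg, compared against the category threshold."""
    return pos - alpha * neg >= THRESHOLD

for alpha in (0.0, 0.05, 0.10, 0.15):
    print(alpha, {name: fires(p, n, alpha) for name, (p, n) in CASES.items()})
```

At α = 0.0 the collision fires falsely; at α = 0.05 only the legitimate match survives; by α = 0.15 even the legitimate match is suppressed, which is the recall collapse the sweep table records.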
Operating headroom
α = 0.05 has measurable margin on every benchmark — no query sits within 0.01 of its threshold.
The Defense · Three independent guards
08 — On privacy in depth

"Show me his bets" gets
rejected three different ways.

i. LLM CLARIFY_NO_USER sentinel · NL guard

When no resolvable user is in scope, the LLM emits the CLARIFY_NO_USER token. Catches natural-language attacks ("show me his bets", "anyone's withdrawals").

ii. GLiNER unresolved PERSON · NER guard

NER picks up "his", "her", "John" — if there is no resolution to a real members row, the query is rejected before SQL generation.

iii. Structural SQL check · AST guard

Structural regex check on the generated SQL: any SELECT … FROM members without an identity filter on username/member_id is rejected. Backstop against jailbreaks.
AST guard
Attack walkthrough
"show me his bets" · attacker · no auth context
L1 → LLM sentinel → emits CLARIFY_NO_USER. REJECT
L2 → GLiNER → "his" extracted, unresolvable. REJECT
L3 → SQL AST → no identity filter. REJECT
⚠ Known cost

L3 fires after SQL generation — we pay LLM cost on attack queries. Pre-generation short-circuit is a roadmap item.
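Guard L3 can be sketched as follows. The regex shapes are illustrative assumptions (the build may walk a SQL AST instead); only the rule is from the text: any SELECT over members without an identity filter on username/member_id is rejected.

```python
import re

# Does the statement read from the members table at all?
MEMBERS_SELECT = re.compile(r"\bselect\b.*?\bfrom\s+members\b", re.I | re.S)
# Is there an identity filter scoping the read to one user?
IDENTITY_FILTER = re.compile(r"\bwhere\b.*\b(username|member_id)\s*=", re.I | re.S)

def passes_identity_guard(sql: str) -> bool:
    """Reject any unscoped read of the members table."""
    if MEMBERS_SELECT.search(sql) and not IDENTITY_FILTER.search(sql):
        return False  # unscoped read of members → REJECT
    return True

print(passes_identity_guard("SELECT * FROM members"))                     # False
print(passes_identity_guard("SELECT 1 FROM members WHERE username = ?"))  # True
```

Because this check is purely structural, it holds even when a jailbreak talks the LLM into generating an unscoped query, which is why it works as a backstop behind the two semantic guards.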

The Numbers · Empirical evaluation
09 — Routing precision · recall · F1

53 realistic queries.
Zero wrong routes.

22/25
Realistic CS · routing perfect
6/8
Benchmark · routing perfect
0
Wrong routes · all groups
Per-intent routing metrics · n = 53
Intent    | Precision | Recall | F1
pdf       | 0.875     | 0.941  | 0.91
db        | 0.943     | 1.000  | 0.97
off_topic | 1.000     | 0.857  | 0.92
What the failures actually are
  1. B4 · Bonus T&C

    Section absent from PDF index. Routed correctly; retrieval found nothing.

  2. B5 · KYC SLA tables

    Missing from manual. Same pattern as B4.

Misses are under-coverage of the corpus — not mis-routing.

The Profile · Latency & cost
10 — Where time actually goes

Embedding path serves
89% in < 50ms.

Pre-contrastive, the LLM fallback hit on 52% of queries. Post-contrastive: 11%. That's the major UX win — most users never see the 2-second tail.

Embedding-only
89% of queries
p50 16ms · p95 47ms
LLM fallback
11% of queries
p50 2.1s · p95 2.2s
Cold start
once
~10s
Fallback rate · before vs. after contrastive
52%
Pre-contrastive
11%
Post-contrastive
Four in five users who used to wait now don't.
Why this matters

Each LLM fallback costs ~2 seconds of perceived latency. Cutting fallback from 52% to 11% removes that tail from most sessions entirely.

The Demo · Hybrid query end-to-end
11 — Live trace

Two retrievers,
one cited answer.

Sequence · request → cited answer
i. Client
POST /ask "withdrawal pending… mariaanderson711"
ii. Server
entity_extract(q) → ["mariaanderson711"]
iii. Router
resolve username ✓  ·  route → ["pdf", "db"]
iv. PDF · Chroma
retrieve top-3 → operations_manual.pdf p9
v. DB · SQLite
sql_gen + execute → withdraws row 9f666ce0…
vi. Synthesize
merge(pdf, sql) → cited answer w/ markers
vii. Client
200 OK · streamed via SSE
Final answer · excerpt
Maria, your €2,950 wire-transfer withdrawal requested on 2026-02-21 is still pending due to a manual review backlog [SQL]. Policy SLA is 4 business hours for review escalation [doc_operations_manual.pdf p9]; you may contact support to flag the case for prioritisation.
Citation · DB
[SQL] withdraws.withdraw_id=
9f666ce0-6559-4b1e-8dad-…
amount=2950, currency=EUR
status=pending, requested_at=2026-02-21
Citation · PDF
[doc_operations_manual.pdf p9]
"Manual review SLA:
4 business hours"
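The attribution contract (non-negotiable #1: every claim carries a marker) can be sketched as a tiny formatter. The provenance-record shape and function name here are assumptions, not the build's actual synthesizer API:

```python
def with_marker(claim: str, source: dict) -> str:
    """Append a [SQL] or [doc pN] attribution marker to a single claim."""
    if source["kind"] == "sql":
        return f"{claim} [SQL]"
    return f"{claim} [{source['doc']} p{source['page']}]"

# One DB-derived claim, one policy-derived claim, merged into a cited answer.
parts = [
    with_marker("your €2,950 withdrawal is still pending",
                {"kind": "sql"}),
    with_marker("policy SLA is 4 business hours",
                {"kind": "pdf", "doc": "doc_operations_manual.pdf", "page": 9}),
]
answer = "; ".join(parts)
print(answer)
```

The point of the marker format is that each sentence of the synthesized answer stays traceable to exactly one retriever, which is what makes the hybrid answer auditable.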
The Caveats · Honest list
12 — What I'd flag in a code review

Six things this build
does not yet handle.

⚠ Language
Turkish queries pay a 2s penalty

BGE-small is English-primary. Turkish queries don't hit embedding thresholds and route via LLM Tier-3 (~2s p50). Structural fix on the roadmap.

⚠ Routing
Seven over-fires after patches

Harmless (the fired intent set is a superset of the correct one), but each costs ~200ms of unnecessary PDF rerank. Logged for follow-up.

⚠ NLU
Negation handling partial

BGE under-weights "not" / "never". "I have not received my withdrawal" routes correctly only via entity signal — embedding is near-identical to the positive case.

⚠ Privacy
Privacy guard is reactive

L3 SQL check fires after generation — we pay LLM cost on attack queries. Pre-generation short-circuit is on the priority list.

⚠ State
No multi-turn memory

Each query is stateless. "What about last month?" can't reference a prior turn. Out of scope here, but a real product needs it.

⚠ Corpus
Known coverage gaps

Bonus T&C, KYC SLA tables, and the bonuses table are missing from index/schema. B4/B5 misses are this — not routing.

The Roadmap · Priorities, in order
13 — What's next

Five next
steps.

Ranked by expected impact. The first two close measurable gaps; #3 is the highest-leverage latency win; #4–5 are production hardening.

  1. SetFit classifier — MiniLM + tiny logistic head

    5ms inference, fine-tunable from existing positive/negative anchors. Replaces contrastive scoring while keeping the multi-label decision surface. Same data, better signal.

  2. Multilingual embedder — BGE-m3

    Closes the Turkish gap structurally. No more Tier-3 fallback for non-English queries. Drop-in replacement for BGE-small at ~2× the embedding cost.

  3. Domain-dictionary typo normaliser

    Most LLM fallbacks are typos / casual spelling. A small lexicon ("widthrawl" → "withdrawal") drops fallback rate from 11% → 5%. Single largest latency win available.

  4. Pre-generation privacy short-circuit

    Move structural identity check before SQL generation. Saves ~2s on attack queries and removes LLM cost from the abuse path.

  5. Caching layer + Prometheus metrics

    Query → answer cache (hot policy questions hit constantly), routing-decision histograms, p50/p95 dashboards. Production readiness, not novelty.