Vol. I · No. 1 — Engineering Review 10-Day Take-Home · Submission

RAG. The Customer Service Issue

A field study in hybrid retrieval — policy PDFs alongside a live transactional database, blended into a single grounded answer with mandatory source attribution.
Author · Kadir Şahin
Filed · 2026 · 04 · 27
Audience · Engineering review

The Brief · Two sources, one answer
02 — On the problem

Two heterogeneous
sources, blended.

A customer-service chatbot that classifies intent, retrieves from the right source(s), and synthesises a single grounded answer with citations. Three intent classes cover everything; one of them — hybrid — is the hard one. It requires reading policy from PDFs and data from the live SQLite DB in the same request, then weaving them together cleanly.

"My withdrawal is still pending after 5 days. What should I do? I'm user mariaanderson711."
The hybrid case · Policy SLA + Maria's pending row
Non-negotiables
  1. Source attribution

    Every claim carries a marker — [SQL] or [doc.pdf p9].

  2. NL → SQL

    No hardcoded query templates.

  3. No black-box

    No end-to-end pipeline I don't own.

  4. One command

    make run — three interfaces, one entrypoint.

The Architecture · Hexagonal-lite
03 — System diagram

Composition root, three
interfaces, one use case.

The wire-up lives in src/wire.py; FastAPI, CLI, and Gradio each call the same RAGService. The router is the brain — everything around it is replaceable.

Client
api · cli · gradio
FastAPI
/ask · SSE
RAG Service
use case
TieredRouter
3 tiers
PDFRetriever
Chroma · BGE-small
Synthesizer
LLM · cited
DBRetriever
NL→SQL · SQLite
Answer
+ citations
Aux · NER
EntityExtractor (GLiNER)

Feeds usernames, dates, and amounts into the router; resolves PERSON via DB validation.
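The DB-validation step can be sketched with Python's sqlite3 against an in-memory stand-in for the live members table. The function name and the one-row fixture are hypothetical; the `SELECT 1 FROM members` probe is the same one shown in the trace of section 05.

```python
import sqlite3

def resolve_username(conn: sqlite3.Connection, candidate: str) -> bool:
    """Validate an NER-extracted PERSON/username against the members table."""
    row = conn.execute(
        "SELECT 1 FROM members WHERE username = ?", (candidate,)
    ).fetchone()
    return row is not None

# In-memory stand-in for the live DB (schema simplified for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (member_id TEXT PRIMARY KEY, username TEXT)")
conn.execute("INSERT INTO members VALUES ('m1', 'mariaanderson711')")

print(resolve_username(conn, "mariaanderson711"))  # True → strong "db" evidence
print(resolve_username(conn, "his"))               # False → unresolved PERSON
```

A resolved hit is what lets the router add db unconditionally; a miss is what feeds the unresolved-PERSON privacy guard.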

Aux · Fallback
LLM Tier-3 backstop

DeepSeek-R1, structured JSON. Activates only when embedding signal is uncertain (~11% of queries).

The Pipeline · Three signals, one decision
04 — On routing as evidence

Routing isn't classification.
It's an evidence stack.

  1. Entity-first augmentation

    GLiNER NER + DB validation. If a username resolves to a real members row, db is added unconditionally — entities are stronger evidence than vector similarity.

  2. Contrastive centroid embedding

    Per category, score the query against a positive and a negative anchor centroid. Multi-label: multiple categories may pass simultaneously.

  3. LLM Tier-3 fallback

    DeepSeek-R1, structured JSON. Activates only when embedding returns empty intents AND off_topic < 0.75. ~2s p50 — a backstop, not a default.

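The three tiers above can be sketched as one routing function. A minimal sketch with hypothetical names: the per-category thresholds (0.70 / 0.705 / 0.72) and the 0.75 off-topic gate for Tier-3 are the values quoted later in this document.

```python
THRESHOLDS = {"pdf": 0.70, "db": 0.705, "off_topic": 0.72}

def route(scores: dict, resolved_username: bool) -> dict:
    # Tier 2 — multi-label thresholding: every category that clears its
    # own threshold fires; categories are not mutually exclusive.
    intents = {cat for cat, s in scores.items()
               if cat != "off_topic" and s >= THRESHOLDS[cat]}
    # Tier 1 — entity-first augmentation: a DB-validated username is
    # stronger evidence than vector similarity, so "db" is added regardless.
    if resolved_username:
        intents.add("db")
    # Tier 3 — LLM fallback only when embeddings were silent AND the
    # off-topic score is itself uncertain (< 0.75).
    needs_llm = not intents and scores.get("off_topic", 0.0) < 0.75
    return {"intents": sorted(intents), "needs_llm_fallback": needs_llm}

print(route({"pdf": 0.702, "db": 0.747, "off_topic": 0.522},
            resolved_username=True))
# both "pdf" and "db" fire; no LLM call needed
```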
Contrastive scoring · per category, per query
# two centroids per category, computed once
score = cos(q, pos_centroid)
      − α · cos(q, neg_centroid)

α = 0.05   # empirically calibrated

if score >= threshold[cat]:
    intents.add(cat)
Multi-label means a query can fire pdf and db simultaneously; that is the whole point.
Footnote · NER evolution · spaCy custom → GLiNER zero-shot
We initially trained a custom spaCy NER on 88 hand-labelled queries (F1 = 0.824). Replaced it with urchade/gliner_medium-v2.1 for three reasons:
(1) Zero-shot · Adding a new entity label is a config change, not a re-train.
(2) Turkish · spaCy en_core_web_sm had low TR recall — usernames are mixed TR/EN.
(3) Killed regex · Semantic extraction let us drop regex heuristics from the entity layer entirely.
The Trace · One query, fully observed
05 — The central question

How is a routing decision made? A complete trace.

"My withdrawal is still pending after 5 days. What should I do? I'm user mariaanderson711."
Query under trace
i. Entity extraction · GLiNER · ~100ms
usernames = ["mariaanderson711"]
resolved ✓ via SELECT 1 FROM members
had_unresolved_person = False
ii. Contrastive scoring · embedding · ~10ms
pdf_policy: 0.776 − 0.05·0.744 = 0.702
user_data : 0.822 − 0.05·0.746 = 0.747
off_topic : 0.522
iii. Threshold check · multi-label
pdf 0.702 ≥ 0.70 → add "pdf"
db  0.747 ≥ 0.705 → add "db"
off 0.522 < 0.72 → skip
iv. Augmentation guard · no-op
resolved username + "db" already in intents → no-op ✓
v. Dominance suppression · margin = 0.07
|pdf − db| = 0.045 < 0.07 → both kept ✓
vi. Off-topic precedence · guard
off_topic 0.522 < 0.72 → no override ✓
vii. Confidence banding · max(scores) = 0.747 ∈ [0.70, 0.80]
→ confidence = "medium"
Verdict — RouteDecision
intents = ["pdf", "db"], tier = "embedding-medium-hybrid", confidence = "medium", scores = {pdf: 0.702, db: 0.747, off_topic: 0.522}
8–15ms · steps i–vii · no LLM call
If embedding returns empty intents & off_topic < 0.75 → escalate to LLM Tier-3 (DeepSeek-R1, ~2s p50, 11% of queries)
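Steps iv–vii can be sketched as a post-scoring guard pass. A hedged sketch: the 0.07 margin, the 0.72 off-topic gate, and the [0.70, 0.80) "medium" band come from the trace; band edges outside that interval are assumptions.

```python
MARGIN = 0.07          # dominance-suppression margin (step v)
OFF_TOPIC_GATE = 0.72  # off-topic precedence threshold (step vi)

def post_guards(intents: list, scores: dict) -> dict:
    intents = list(intents)
    # v. Dominance suppression: if one intent clearly dominates, drop the weaker.
    if {"pdf", "db"} <= set(intents):
        if abs(scores["pdf"] - scores["db"]) >= MARGIN:
            weaker = min(("pdf", "db"), key=lambda c: scores[c])
            intents.remove(weaker)
    # vi. Off-topic precedence: a confident off-topic score overrides everything.
    if scores.get("off_topic", 0.0) >= OFF_TOPIC_GATE:
        intents = ["off_topic"]
    # vii. Confidence banding (edges outside [0.70, 0.80) are assumed).
    top = max(scores[c] for c in intents) if intents else 0.0
    if top >= 0.80:
        confidence = "high"
    elif top >= 0.70:
        confidence = "medium"
    else:
        confidence = "low"
    return {"intents": intents, "confidence": confidence}

print(post_guards(["pdf", "db"], {"pdf": 0.702, "db": 0.747, "off_topic": 0.522}))
# both intents kept, confidence "medium", matching the verdict above
```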
The Mechanism · Contrastive centroid
06 — On negative anchors

Two centroids per category.
Subtract one from the other.

At init · per category
positive_centroid[cat] = mean(embed(p)
    for p in anchors[cat]["positive"])
negative_centroid[cat] = mean(embed(n)
    for n in anchors[cat]["negative"])
→ two 384-dim points per category in BGE space.
At inference · per query
pos_sim = cos(q, pos_centroid)
neg_sim = cos(q, neg_centroid)
score   = pos_sim − α · neg_sim   # α = 0.05
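The two-centroid construction can be made concrete in a few lines of pure Python. A self-contained sketch: toy 3-dim vectors stand in for the 384-dim BGE embeddings, and embed() itself is elided.

```python
import math

def cos(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

ALPHA = 0.05

def contrastive_score(q, pos_vectors, neg_vectors):
    pos_c, neg_c = centroid(pos_vectors), centroid(neg_vectors)
    return cos(q, pos_c) - ALPHA * cos(q, neg_c)

# Toy anchors: positives point one way, negatives another.
pos = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]
neg = [[0.0, 1.0, 0.2], [0.1, 0.9, 0.0]]
q_true = [1.0, 0.1, 0.05]      # close to pos, far from neg → small subtraction
q_confuser = [0.7, 0.7, 0.1]   # close to BOTH → large subtraction
print(contrastive_score(q_true, pos, neg) > contrastive_score(q_confuser, pos, neg))  # True
```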
Geometric intuition
  • i.
    True positive — close to pos, far from neg → small subtraction → still above threshold.
  • ii.
    False positive — close to pos AND neg → large subtraction → falls below.
  • iii.
    Hybrid — close to BOTH category positives, far from BOTH negatives → both fire.
What it produces · pdf_policy threshold = 0.70
Query                        | pos   | neg   | final | Decision
"What is withdrawal time?"   | 0.794 | 0.719 | 0.758 | fire pdf ✓
"Show me my last 5 bets"     | 0.730 | 0.847 | 0.688 | reject ✓
"Withdrawal pending… Maria"  | 0.776 | 0.744 | 0.702 | fire pdf ✓
Why α = 0.05 specifically
α    | True PDF | Wrong DB | Hybrid  | Verdict
0.0  | 0.794 ✓  | 0.730 ✗  | 0.776 ✓ | false fire
0.05 | 0.758 ✓  | 0.688 ✓  | 0.702 ✓ | Adopted
0.10 | 0.722 ✓  | 0.645 ✓  | 0.665 ✗ | hybrid lost
0.15 | 0.686 ✗  | 0.603 ✓  | 0.628 ✗ | collapses

α = 0.05 — the sweet spot. Kills false positives without killing hybrid.

The Sweep · α calibration
07 — Empirical operating point

α = 0.05 is the only safe choice.

Sweeping α from 0.05 to 0.15 across the 8-query benchmark. Past 0.07 we start over-suppressing legitimate matches; past 0.10 we lose benchmarks outright.

α    | Bench passes | LLM fallback | FP pdf | FN db | Verdict
0.05 | 8 / 8        | 29 %         | 0      | 0     | Adopted
0.07 | 8 / 8        | 38 %         | 0      | 1     | marginal
0.10 | 5 / 8        | 52 %         | 0      | 3     | breaks 3
0.12 | 3 / 8        | 63 %         | 0      | 4     | worse
0.15 | 2 / 8        | 79 %         | 0      | 5     | collapses
What α controls
final = pos − α · neg

Low α — gentle penalty. Disambiguates only the genuine collisions; legitimate matches survive.

High α — over-suppression. Real PDF queries dip below threshold; recall falls; LLM fallback rises.

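The over-suppression effect can be reproduced with the true-PDF and wrong-DB (pos, neg) pairs from the tables above. A minimal sketch, assuming a flat 0.70 threshold for illustration:

```python
# (pos, neg) cosine pairs taken from the worked examples above.
CASES = {
    "true_pdf": (0.794, 0.719),   # legitimate match: should keep firing
    "wrong_db": (0.730, 0.847),   # collision: should be rejected
}
THRESHOLD = 0.70

def fires(pos: float, neg: float, alpha: float) -> bool:
    """final = pos − α · neg, compared against the category threshold."""
    return pos - alpha * neg >= THRESHOLD

for alpha in (0.0, 0.05, 0.10, 0.15):
    print(alpha, {name: fires(p, n, alpha) for name, (p, n) in CASES.items()})
```

At α = 0.0 the collision fires falsely; at α = 0.05 only the legitimate match survives; by α = 0.15 even the legitimate match is suppressed, which is the recall collapse the sweep table records.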
Operating headroom
α = 0.05 has measurable margin on every benchmark — no query sits within 0.01 of its threshold.
The Defense · Three independent guards
08 — On privacy in depth

"Show me his bets" gets
rejected three different ways.

i. LLM CLARIFY_NO_USER sentinel · NL guard

When no resolvable user is in scope, the LLM emits the CLARIFY_NO_USER token. Catches natural-language attacks ("show me his bets", "anyone's withdrawals").

ii. GLiNER unresolved PERSON · NER guard

NER picks up "his", "her", "John" — if there is no resolution to a real members row, the query is rejected before SQL generation.

iii. Structural SQL check · AST guard

Structural regex check on the generated SQL: any SELECT … FROM members without an identity filter on username/member_id is rejected. Backstop against jailbreaks.
AST guard
Attack walkthrough
"show me his bets" · attacker · no auth context
L1 → LLM sentinel → emits CLARIFY_NO_USER. REJECT
L2 → GLiNER → "his" extracted, unresolvable. REJECT
L3 → SQL AST → no identity filter. REJECT
⚠ Known cost

L3 fires after SQL generation — we pay LLM cost on attack queries. Pre-generation short-circuit is a roadmap item.
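Guard L3 can be sketched as follows. The regex shapes are illustrative assumptions (the build may walk a SQL AST instead); only the rule is from the text: any SELECT over members without an identity filter on username/member_id is rejected.

```python
import re

# Does the statement read from the members table at all?
MEMBERS_SELECT = re.compile(r"\bselect\b.*?\bfrom\s+members\b", re.I | re.S)
# Is there an identity filter scoping the read to one user?
IDENTITY_FILTER = re.compile(r"\bwhere\b.*\b(username|member_id)\s*=", re.I | re.S)

def passes_identity_guard(sql: str) -> bool:
    """Reject any unscoped read of the members table."""
    if MEMBERS_SELECT.search(sql) and not IDENTITY_FILTER.search(sql):
        return False  # unscoped read of members → REJECT
    return True

print(passes_identity_guard("SELECT * FROM members"))                     # False
print(passes_identity_guard("SELECT 1 FROM members WHERE username = ?"))  # True
```

Because this check is purely structural, it holds even when a jailbreak talks the LLM into generating an unscoped query, which is why it works as a backstop behind the two semantic guards.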

The Numbers · Empirical evaluation
09 — Routing precision · recall · F1

53 realistic queries.
Zero wrong routes.

22/25
Realistic CS · routing perfect
6/8
Benchmark · routing perfect
0
Wrong routes · all groups
Per-intent routing metrics · n = 53
Intent    | Precision | Recall | F1
pdf       | 0.875     | 0.941  | 0.91
db        | 0.943     | 1.000  | 0.97
off_topic | 1.000     | 0.857  | 0.92
What the failures actually are
  1. B4 · Bonus T&C

    Section absent from PDF index. Routed correctly; retrieval found nothing.

  2. B5 · KYC SLA tables

    Missing from manual. Same pattern as B4.

Misses are under-coverage of the corpus — not mis-routing.

The Profile · Latency & cost
10 — Where time actually goes

Embedding path serves
89% in < 50ms.

Pre-contrastive, the LLM fallback hit on 52% of queries. Post-contrastive: 11%. That's the major UX win — most users never see the 2-second tail.

Embedding-only
89% of queries
p50 16ms · p95 47ms
LLM fallback
11% of queries
p50 2.1s · p95 2.2s
Cold start
once
~10s
Fallback rate · before vs. after contrastive
52%
Pre-contrastive
11%
Post-contrastive
Four in five users who used to wait now don't.
Why this matters

Each LLM fallback costs ~2 seconds of perceived latency. Cutting fallback from 52% to 11% removes that tail from most sessions entirely.

The Demo · Hybrid query end-to-end
11 — Live trace

Two retrievers,
one cited answer.

Sequence · request → cited answer
i. Client
POST /ask "withdrawal pending… mariaanderson711"
ii. Server
entity_extract(q) → ["mariaanderson711"]
iii. Router
resolve username ✓  ·  route → ["pdf", "db"]
iv. PDF · Chroma
retrieve top-3 → operations_manual.pdf p9
v. DB · SQLite
sql_gen + execute → withdraws row 9f666ce0…
vi. Synthesize
merge(pdf, sql) → cited answer w/ markers
vii. Client
200 OK · streamed via SSE
Final answer · excerpt
Maria, your €2,950 wire-transfer withdrawal requested on 2026-02-21 is still pending due to a manual review backlog [SQL]. Policy SLA is 4 business hours for review escalation [doc_operations_manual.pdf p9]; you may contact support to flag the case for prioritisation.
Citation · DB
[SQL] withdraws.withdraw_id=
9f666ce0-6559-4b1e-8dad-…
amount=2950, currency=EUR
status=pending, requested_at=2026-02-21
Citation · PDF
[doc_operations_manual.pdf p9]
"Manual review SLA:
4 business hours"
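The attribution contract (non-negotiable #1: every claim carries a marker) can be sketched as a tiny formatter. The provenance-record shape and function name here are assumptions, not the build's actual synthesizer API:

```python
def with_marker(claim: str, source: dict) -> str:
    """Append a [SQL] or [doc pN] attribution marker to a single claim."""
    if source["kind"] == "sql":
        return f"{claim} [SQL]"
    return f"{claim} [{source['doc']} p{source['page']}]"

# One DB-derived claim, one policy-derived claim, merged into a cited answer.
parts = [
    with_marker("your €2,950 withdrawal is still pending",
                {"kind": "sql"}),
    with_marker("policy SLA is 4 business hours",
                {"kind": "pdf", "doc": "doc_operations_manual.pdf", "page": 9}),
]
answer = "; ".join(parts)
print(answer)
```

The point of the marker format is that each sentence of the synthesized answer stays traceable to exactly one retriever, which is what makes the hybrid answer auditable.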
The Caveats · Honest list
12 — What I'd flag in a code review

Six things this build
does not yet handle.

⚠ Language
Turkish queries pay a 2s penalty

BGE-small is English-primary. Turkish queries don't hit embedding thresholds and route via LLM Tier-3 (~2s p50). Structural fix on the roadmap.

⚠ Routing
Seven over-fires after patches

Harmless (the fired intent set is a superset of the correct one), but each costs ~200ms of unnecessary PDF rerank. Logged for follow-up.

⚠ NLU
Negation handling partial

BGE under-weights "not" / "never". "I have not received my withdrawal" routes correctly only via entity signal — embedding is near-identical to the positive case.

⚠ Privacy
Privacy guard is reactive

L3 SQL check fires after generation — we pay LLM cost on attack queries. Pre-generation short-circuit is on the priority list.

⚠ State
No multi-turn memory

Each query is stateless. "What about last month?" can't reference a prior turn. Out of scope here, but a real product needs it.

⚠ Corpus
Known coverage gaps

Bonus T&C, KYC SLA tables, and the bonuses table are missing from index/schema. B4/B5 misses are this — not routing.

The Roadmap · Priorities, in order
13 — What's next

Five next
steps.

Ranked by expected impact. The first two close measurable gaps; #3 is the highest-leverage latency win; #4–5 are production hardening.

  1. SetFit classifier — MiniLM + tiny logistic head

    5ms inference, fine-tunable from existing positive/negative anchors. Replaces contrastive scoring while keeping the multi-label decision surface. Same data, better signal.

  2. Multilingual embedder — BGE-m3

    Closes the Turkish gap structurally. No more Tier-3 fallback for non-English queries. Drop-in replacement for BGE-small at ~2× the embedding cost.

  3. Domain-dictionary typo normaliser

    Most LLM fallbacks are typos / casual spelling. A small lexicon ("widthrawl" → "withdrawal") drops fallback rate from 11% → 5%. Single largest latency win available.

  4. Pre-generation privacy short-circuit

    Move structural identity check before SQL generation. Saves ~2s on attack queries and removes LLM cost from the abuse path.

  5. Caching layer + Prometheus metrics

    Query → answer cache (hot policy questions hit constantly), routing-decision histograms, p50/p95 dashboards. Production readiness, not novelty.