A customer-service chatbot that classifies intent, retrieves from the right source(s), and synthesises a single grounded answer with citations. Three intent classes cover everything; one of them — hybrid — is the hard one. It requires reading policy from PDFs and data from the live SQLite DB in the same request, then weaving them together cleanly.
Every claim carries a marker — [SQL] or [doc.pdf p9].
No hardcoded query templates.
No end-to-end pipeline I don't own.
make run — three interfaces, one entrypoint.
The wire-up lives in src/wire.py; FastAPI, CLI, and Gradio each call the same RAGService. The router is the brain — everything around it is replaceable.
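A sketch of that wire-up shape. `RAGService` and `src/wire.py` are from this project; the component fields and the `answer` signature are illustrative assumptions, not the actual code:

```python
# Illustrative shape of src/wire.py; field and method names are assumptions.
from dataclasses import dataclass

@dataclass
class RAGService:
    router: object       # intent classification (entity / embedding / LLM tiers)
    retriever: object    # PDF index + live SQLite access
    synthesizer: object  # grounded answer with [SQL] / [doc.pdf pN] markers

    def answer(self, query: str) -> str:
        intents = self.router.route(query)
        evidence = self.retriever.fetch(query, intents)
        return self.synthesizer.compose(query, evidence)

# FastAPI, CLI, and Gradio each hold the same instance and call service.answer().
```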
Feeds usernames, dates, and amounts into the router; resolves PERSON via DB validation.
DeepSeek-R1, structured JSON. Activates only when embedding signal is uncertain (~11% of queries).
GLiNER NER + DB validation. If a username resolves to a real members row, db is added unconditionally — entities are stronger evidence than vector similarity.
Per category, score the query against a positive and a negative anchor centroid. Multi-label: multiple categories may pass simultaneously.
DeepSeek-R1, structured JSON. Activates only when embedding returns empty intents AND off_topic < 0.75. ~2s p50 — a backstop, not a default.
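How the three tiers compose, roughly. The helper names (`resolve_entities`, `contrastive_scores`, `llm_classify`) and the `THRESHOLD` table are illustrative; the 0.75 off-topic gate and the unconditional `db` add are from the behaviour described above:

```python
def route(query: str) -> set[str]:
    intents: set[str] = set()

    # Tier 1 - entity: GLiNER spans validated against the members table.
    # A username that resolves to a real row adds `db` unconditionally.
    if resolve_entities(query):
        intents.add("db")

    # Tier 2 - contrastive embedding scores; multi-label, may add several.
    scores, off_topic = contrastive_scores(query)
    intents |= {cat for cat, s in scores.items() if s >= THRESHOLD[cat]}

    # Tier 3 - LLM backstop (DeepSeek-R1, structured JSON, ~2s p50):
    # only when embeddings found nothing and the query isn't clearly off-topic.
    if not intents and off_topic < 0.75:
        intents = llm_classify(query)

    return intents
```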
```python
# Each category has a positive AND a negative anchor list:
#   pos = queries that ARE this category
#   neg = queries that LOOK similar but ARE NOT
# Both are averaged into a single centroid vector, once at boot.
alpha = 0.05  # empirically calibrated
score = cos(q, pos_centroid) - alpha * cos(q, neg_centroid)
if score >= threshold[cat]:
    intents.add(cat)
```
```python
# Built once at boot:
positive_centroid[cat] = mean(embed(p) for p in anchors[cat]["positive"])
negative_centroid[cat] = mean(embed(n) for n in anchors[cat]["negative"])

# Per query:
pos_sim = cos(q, positive_centroid[cat])
neg_sim = cos(q, negative_centroid[cat])
score = pos_sim - alpha * neg_sim  # alpha = 0.05
```
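A minimal runnable version, assuming sentence-transformers with BGE-small (the embedding model this system uses). The anchor lists and the single `pdf` category are toy placeholders; real anchors and thresholds are per-category and calibrated:

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
ALPHA = 0.05

anchors = {
    "pdf": {
        "positive": ["What is the withdrawal time?", "How do bonuses work?"],
        "negative": ["Show me my last 5 bets", "Show my account balance"],
    },
}

def centroid(texts):
    # Mean of normalized embeddings, re-normalized to unit length.
    vecs = model.encode(texts, normalize_embeddings=True)
    c = vecs.mean(axis=0)
    return c / np.linalg.norm(c)

centroids = {cat: (centroid(a["positive"]), centroid(a["negative"]))
             for cat, a in anchors.items()}

def score(query: str, cat: str) -> float:
    q = model.encode(query, normalize_embeddings=True)
    pos_c, neg_c = centroids[cat]
    # Normalized vectors, so dot product == cosine similarity.
    return float(q @ pos_c - ALPHA * (q @ neg_c))

print(round(score("What is withdrawal time?", "pdf"), 3))
```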
| Query | pos sim | neg sim | final score | Decision |
|---|---|---|---|---|
| "What is withdrawal time?" | 0.794 | 0.719 | 0.758 | fire pdf ✓ |
| "Show me my last 5 bets" | 0.730 | 0.847 | 0.688 | reject ✓ |
| "Withdrawal pending… Maria" | 0.776 | 0.744 | 0.702 | fire pdf ✓ |
| α | True-PDF final | Wrong-DB final | Hybrid final | Verdict |
|---|---|---|---|---|
| 0.0 | 0.794 ✓ | 0.730 ✗ | 0.776 ✓ | false fire |
| 0.05 | 0.758 ✓ | 0.688 ✓ | 0.702 ✓ | Adopted |
| 0.10 | 0.722 ✓ | 0.645 ✓ | 0.665 ✗ | hybrid lost |
| 0.15 | 0.686 ✗ | 0.603 ✓ | 0.628 ✗ | collapses |
α = 0.05 — the sweet spot. Kills false positives without killing hybrid.
Sweeping α from 0.05 to 0.15 across the 8-query benchmark. Past 0.07 we start over-suppressing legitimate matches; past 0.10 we lose benchmarks outright.
| α | Bench passes | LLM fallback | False-pos pdf | False-neg db | Verdict |
|---|---|---|---|---|---|
| 0.05 | 8 / 8 | 29 % | 0 | 0 | Adopted |
| 0.07 | 8 / 8 | 38 % | 0 | 1 | marginal |
| 0.10 | 5 / 8 | 52 % | 0 | 3 | breaks 3 |
| 0.12 | 3 / 8 | 63 % | 0 | 4 | worse |
| 0.15 | 2 / 8 | 79 % | 0 | 5 | collapses |
final = pos − α · neg
Low α — gentle penalty. Disambiguates only the genuine collisions; legitimate matches survive.
High α — over-suppression. Real PDF queries dip below threshold; recall falls; LLM fallback rises.
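The sweep itself is cheap to reproduce. A sketch, with `route_with_alpha` standing in for a re-parameterised router and `benchmark` for the 8 labelled queries (both names illustrative):

```python
# Illustrative sweep harness: benchmark pairs each query with its expected
# intent set; route_with_alpha re-runs the router at a given penalty.
def sweep(benchmark, alphas=(0.05, 0.07, 0.10, 0.12, 0.15)):
    for alpha in alphas:
        passes = sum(route_with_alpha(q, alpha) == expected
                     for q, expected in benchmark)
        print(f"alpha={alpha:.2f}  passes={passes}/{len(benchmark)}")
```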
When no resolvable user is in scope, the LLM emits the CLARIFY_NO_USER token. Catches natural-language attacks ("show me his bets", "anyone's withdrawals").
NER picks up "his", "her", "John" — if no resolution to a real members row, the query is rejected before SQL generation.
Structural regex check on the generated SQL: any SELECT … FROM members without an identity filter on username/member_id is rejected. Backstop against jailbreaks.
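A sketch of that structural check. The regexes are illustrative and deliberately conservative; a parser-based check (e.g. sqlglot) would be stricter:

```python
import re

# Reject any SELECT on `members` that lacks an identity filter.
MEMBERS_SELECT = re.compile(r"\bselect\b.*\bfrom\s+members\b", re.I | re.S)
IDENTITY_FILTER = re.compile(r"\bwhere\b.*\b(username|member_id)\s*=", re.I | re.S)

def sql_is_safe(sql: str) -> bool:
    # Unscoped read of the members table -> reject before execution.
    if MEMBERS_SELECT.search(sql) and not IDENTITY_FILTER.search(sql):
        return False
    return True

assert not sql_is_safe("SELECT * FROM members")
assert sql_is_safe("SELECT * FROM members WHERE username = :u")
```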
L3 fires after SQL generation — we pay LLM cost on attack queries. Pre-generation short-circuit is a roadmap item.
| Intent | Precision | Recall | F1 |
|---|---|---|---|
| pdf | 0.875 | 0.941 | 0.91 |
| db | 0.943 | 1.000 | 0.97 |
| off_topic | 1.000 | 0.857 | 0.92 |
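Sanity check on the F1 column: F1 = 2PR/(P+R); for db that's 2 · 0.943 · 1.000 / 1.943 ≈ 0.97, matching the table.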
Section absent from PDF index. Routed correctly; retrieval found nothing.
Missing from manual. Same pattern as B4.
Misses are under-coverage of the corpus — not mis-routing.
Pre-contrastive, the LLM fallback fired on 52% of queries; post-contrastive, 11%. Each fallback costs ~2 seconds of perceived latency, so that cut removes the 2-second tail from most sessions entirely. That's the major UX win.
BGE-small is English-primary. Turkish queries don't hit embedding thresholds and route via LLM Tier-3 (~2s p50). Structural fix on the roadmap.
Harmless (the predicted intent set is a superset of the correct one), but each spurious pdf intent costs ~200ms of unnecessary PDF rerank. Logged for follow-up.
BGE under-weights "not" / "never". "I have not received my withdrawal" routes correctly only via entity signal — embedding is near-identical to the positive case.
As noted under guardrails, the L3 SQL check fires only after generation, so we pay LLM cost on attack queries. The pre-generation short-circuit is on the roadmap below.
Each query is stateless. "What about last month?" can't reference a prior turn. Out of scope here, but a real product needs it.
Bonus T&C, KYC SLA tables, and the bonuses table are missing from the index/schema. The B4/B5 misses trace to this coverage gap, not to routing.
Ranked by expected impact. The first two close measurable gaps; #3 is the highest-leverage latency win; #4–5 are production hardening.
5ms inference, fine-tunable from existing positive/negative anchors. Replaces contrastive scoring while keeping the multi-label decision surface. Same data, better signal.
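What that head could look like, reusing the `anchors` and `model` from the scoring sketch above. scikit-learn is an assumption here, not a committed choice:

```python
# Hypothetical distilled head: one binary logistic classifier per category,
# trained on the same positive/negative anchor embeddings used today.
import numpy as np
from sklearn.linear_model import LogisticRegression

heads = {}
for cat, a in anchors.items():
    X = np.vstack([model.encode(t, normalize_embeddings=True)
                   for t in a["positive"] + a["negative"]])
    y = np.array([1] * len(a["positive"]) + [0] * len(a["negative"]))
    heads[cat] = LogisticRegression(max_iter=1000).fit(X, y)

def classify(query: str) -> set[str]:
    q = model.encode(query, normalize_embeddings=True).reshape(1, -1)
    # Independent per-category decisions keep the multi-label surface.
    return {cat for cat, clf in heads.items()
            if clf.predict_proba(q)[0, 1] >= 0.5}
```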
Closes the Turkish gap structurally. No more Tier-3 fallback for non-English queries. Drop-in replacement for BGE-small at ~2× the embedding cost.
Most LLM fallbacks are typos / casual spelling. A small lexicon ("widthrawl" → "withdrawal") drops fallback rate from 11% → 5%. Single largest latency win available.
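A sketch of that normalisation pass. Only "widthrawl" comes from observed logs; the other entries are placeholders, and the real lexicon would be mined from logged fallback queries:

```python
# Illustrative pre-embedding normalisation; entries beyond "widthrawl"
# are hypothetical examples.
LEXICON = {"widthrawl": "withdrawal", "withdrawl": "withdrawal",
           "bonuss": "bonus"}

def normalize(query: str) -> str:
    return " ".join(LEXICON.get(tok.lower(), tok) for tok in query.split())
```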
Move structural identity check before SQL generation. Saves ~2s on attack queries and removes LLM cost from the abuse path.
Query → answer cache (hot policy questions hit constantly), routing-decision histograms, p50/p95 dashboards. Production readiness, not novelty.