Ask questions about any codebase or document collection and get cited answers in seconds. Refuses to hallucinate.
Solve Track 06 · Retrieval. Naive RAG — chunk, embed, cosine search, stuff top-5 into prompt — fails on exact keywords and surfaces poor-quality context. This engine runs hybrid search (BM25 keyword + Qdrant vector), fuses both ranked lists via Reciprocal Rank Fusion, then reranks the candidate pool with a cross-encoder. The result: consistently better context before Claude ever sees the question. Four independently testable stages. 20 tests, all mocked, no live services needed to run the suite.
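The Ingestor reads any file directory, applies word-level chunking with overlap, embeds with all-MiniLM-L6-v2, and upserts into Qdrant. Chunks are 512 words long with a 64-word overlap, so consecutive chunks start 448 words apart. The overlap guarantees that any passage of up to 64 words appears whole in at least one chunk, so a short statement that straddles a chunk boundary is never split in every chunk that touches it. Note that all-MiniLM-L6-v2 truncates input beyond 256 word pieces by default, so the tail of a 512-word chunk is stored in the payload but does not influence its embedding.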
```python
from src import RAGEngine

engine = RAGEngine()
chunks_stored = engine.ingest("./my-codebase")
# Walks all .txt, .md, .py, .js, .ts, .json files recursively
# Returns total number of chunks upserted to Qdrant
```
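The word-level chunking with overlap described above can be sketched as follows. This is a minimal illustration, not the project's Ingestor: the function name `chunk_words` and its whitespace tokenisation are assumptions, and only the 512/64 defaults come from the text.

```python
def chunk_words(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of `size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()          # assumption: words are whitespace-separated
    stride = size - overlap       # 448 with the defaults
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break                 # this chunk already reaches the end of the text
    return chunks


# A 1000-word document yields chunks starting at words 0, 448 and 896.
doc = " ".join(f"w{i}" for i in range(1000))
parts = chunk_words(doc)
print(len(parts))
```

The early `break` avoids emitting a final chunk that would consist only of words already covered by the previous one.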
Each chunk gets an MD5 ID derived from file_path + "::" + chunk_idx. Re-ingesting the same file is idempotent: upsert overwrites by ID rather than duplicating. Payload stores the original file path and chunk index alongside the text, which the Answerer uses for source citations.
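A deterministic ID of this kind might be derived as in the sketch below. The helper name `chunk_id` is hypothetical, and formatting the digest as a UUID string is an assumption: Qdrant point IDs must be unsigned integers or UUIDs, and a 128-bit MD5 digest maps onto a UUID directly, but the source does not say which representation the project stores.

```python
import hashlib
import uuid


def chunk_id(file_path: str, chunk_idx: int) -> str:
    """Stable ID for a chunk: the same path and index always give the same ID."""
    digest = hashlib.md5(f"{file_path}::{chunk_idx}".encode("utf-8")).hexdigest()
    # Assumption: present the 32-hex-character digest in UUID form for Qdrant.
    return str(uuid.UUID(hex=digest))


first = chunk_id("src/auth/middleware.py", 0)
again = chunk_id("src/auth/middleware.py", 0)
print(first == again)  # same input, same ID, so an upsert overwrites in place
```

Because the ID depends only on the path and the chunk index, editing a file and re-ingesting it replaces the old chunks at the same positions instead of adding new points.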
Cosine-only retrieval misses exact keyword queries. BM25-only retrieval misses semantic paraphrases. Running both and fusing the ranked lists catches what either alone misses. Reciprocal Rank Fusion needs no score normalisation: it uses only rank positions, so mixing cosine similarities with BM25 scores is safe:
```python
# From retriever.py: the RRF implementation
@staticmethod
def _rrf(vec_hits, bm25_hits, rrf_k=60):
    rrf_scores = {}
    seen = {}  # (file, chunk) -> doc, so each chunk appears once in the output
    for hits in (vec_hits, bm25_hits):
        for rank, doc in enumerate(hits):
            k = (doc["file"], doc["chunk"])
            seen.setdefault(k, doc)
            # rank is 0-based, so rank + 1 is the 1-based position in the list
            rrf_scores[k] = rrf_scores.get(k, 0.0) + 1.0 / (rrf_k + rank + 1)
    # A doc found by both retrievers accumulates two contributions
    return sorted(
        seen.values(),
        key=lambda d: rrf_scores[(d["file"], d["chunk"])],
        reverse=True,
    )
```
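To make the arithmetic concrete, here is a small standalone run of the same fusion rule. The function name `rrf_merge` and the sample hits are illustrative only; the `file` and `chunk` keys and the `rrf_k=60` default follow the excerpt above.

```python
def rrf_merge(vec_hits, bm25_hits, rrf_k=60):
    """Fuse two ranked lists using rank positions only (no raw scores)."""
    scores, docs = {}, {}
    for hits in (vec_hits, bm25_hits):
        for rank, doc in enumerate(hits):
            key = (doc["file"], doc["chunk"])
            docs.setdefault(key, doc)
            scores[key] = scores.get(key, 0.0) + 1.0 / (rrf_k + rank + 1)
    return sorted(
        docs.values(),
        key=lambda d: scores[(d["file"], d["chunk"])],
        reverse=True,
    )


vec = [{"file": "a.md", "chunk": 0}, {"file": "b.md", "chunk": 2}]
bm25 = [{"file": "c.md", "chunk": 1}, {"file": "b.md", "chunk": 2}]

merged = rrf_merge(vec, bm25)
# b.md is second in both lists: 1/62 + 1/62 ≈ 0.0323
# a.md and c.md are first in one list each: 1/61 ≈ 0.0164
print(merged[0]["file"])  # b.md
```

Neither retriever ranked `b.md` first, yet it wins the fused ranking because both retrievers agree on it.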
The BM25 index is built lazily on first hybrid search by scrolling all Qdrant payloads — no separate index file, no rebuild step. Force a rebuild by setting retriever._bm25 = None.
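The lazy-build pattern can be sketched as below. This is not the project's retriever: the class name `LazyBM25` and the `load_texts` callable (a stand-in for scrolling Qdrant payloads) are hypothetical, and a compact pure-Python Okapi BM25 scorer is swapped in for rank-bm25 so the example has no dependencies.

```python
import math
from collections import Counter


class LazyBM25:
    """Builds the keyword index on first search; set `_bm25 = None` to rebuild."""

    def __init__(self, load_texts, k1=1.5, b=0.75):
        self._load_texts = load_texts  # hypothetical: returns every chunk's text
        self._bm25 = None
        self.k1, self.b = k1, b

    def _build(self):
        docs = [Counter(t.lower().split()) for t in self._load_texts()]
        lengths = [sum(d.values()) for d in docs]
        df = Counter(term for d in docs for term in d)  # document frequency
        n = len(docs)
        idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1.0) for t, c in df.items()}
        avgdl = sum(lengths) / n if n else 0.0
        return docs, lengths, idf, avgdl

    def search(self, query, top_k=5):
        if self._bm25 is None:          # first call pays the build cost
            self._bm25 = self._build()
        docs, lengths, idf, avgdl = self._bm25
        scored = []
        for i, (tf, dl) in enumerate(zip(docs, lengths)):
            score = 0.0
            for term in query.lower().split():
                f = tf.get(term, 0)
                if f:
                    norm = f + self.k1 * (1 - self.b + self.b * dl / avgdl)
                    score += idf[term] * f * (self.k1 + 1) / norm
            if score > 0:
                scored.append((i, score))
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:top_k]           # (chunk position, BM25 score) pairs


index = LazyBM25(lambda: ["token refresh flow", "circuit breaker opens after failures"])
print(index.search("circuit breaker")[0][0])  # 1
```

Calling `search` once at startup with any query is one way to move the build cost out of the first user request.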
After RRF, the top-50 candidates go to the Reranker. Unlike the bi-encoder used during retrieval (query and passage encoded separately, then cosine compared), a cross-encoder sees both query and passage together through full attention. Far better relevance judgment, but too slow to run on the full index — that's why you fetch 50 and rerank to 5.
```python
# From reranker.py
from sentence_transformers import CrossEncoder


class Reranker:
    def __init__(self, model="cross-encoder/ms-marco-MiniLM-L6-v2"):
        self.model = CrossEncoder(model)

    def rerank(self, query, candidates, top_k=5):
        if not candidates:
            return []  # nothing to score
        pairs = [(query, c["text"]) for c in candidates]
        scores = self.model.predict(pairs)  # joint query+passage scoring
        ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
        results = []
        for score, candidate in ranked[:top_k]:
            out = candidate.copy()
            out["rerank_score"] = round(float(score), 4)
            results.append(out)
        return results  # a list, so callers can slice, count, and re-iterate it
```
The Answerer passes the reranked chunks to Claude with an explicit contract: answer only from the provided context, cite every claim with [1][2] notation, and return the exact string "Not found in the provided documents." when the context doesn't support an answer. The "not found" path is critical — without it, the LLM produces plausible-sounding hallucinations. With it, you get an honest signal that retrieval failed, which you can act on (lower score threshold, expand chunk size, reindex).
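One way such a contract could be encoded is sketched below. The prompt wording and the names `build_prompt` and `retrieval_failed` are assumptions for illustration; only the citation notation, the `file` and `text` payload keys, and the exact "not found" string come from the text.

```python
NOT_FOUND = "Not found in the provided documents."


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number each chunk so the model can cite it as [1], [2], ..."""
    context = "\n\n".join(
        f"[{i}] ({c['file']})\n{c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered context below. "
        "Cite every claim with bracketed source numbers such as [1][2]. "
        f'If the context does not support an answer, reply exactly: "{NOT_FOUND}"\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def retrieval_failed(answer: str) -> bool:
    """True when the model returned the sentinel instead of an answer."""
    return answer.strip() == NOT_FOUND


chunks = [{"file": "docs/auth-design.md", "text": "Sessions are validated in middleware."}]
print(build_prompt("How are sessions validated?", chunks))
```

Matching on an exact sentinel string keeps the failure path machine-checkable, so the caller can retry retrieval with different settings instead of showing a guess.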
```python
result = engine.ask("How does the authentication flow work?")
# {
#   "answer": "The authentication flow uses Clerk v7... [1][3]",
#   "sources": ["src/auth/middleware.py", "docs/auth-design.md"],
#   "model": "claude-opus-4-8-20260528",
#   "tokens": {"input": 1240, "output": 180}
# }
```
Retrieval quality improvements need measurement. The Evaluator implements three metrics covering different failure modes. Pass evaluate=True to engine.ask() to get them inline:
| Metric | What it catches | How computed |
|---|---|---|
| faithfulness | Hallucinated claims | LLM judge: "What fraction of claims in this answer are grounded in this context?" |
| answer_relevance | Off-topic answers | cosine_sim(embed(query), embed(answer)) using all-MiniLM-L6-v2 |
| context_precision | Poor retrieval quality | mean(rerank_scores) of the returned chunks — higher = more relevant candidates surfaced |
```python
result = engine.ask("What does the circuit breaker do?", evaluate=True)
# result["eval"] = {
#   "faithfulness": 0.95,        # 95% of claims grounded in context
#   "answer_relevance": 0.88,    # answer closely addresses the question
#   "context_precision": 0.81,   # reranker returned high-quality chunks
# }
```
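The two non-LLM metrics reduce to simple arithmetic. The sketch below shows that arithmetic on plain Python lists; the function names are hypothetical, and in the real Evaluator the vectors would come from all-MiniLM-L6-v2 embeddings, not hand-written lists.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def context_precision(chunks):
    """Mean rerank_score of the returned chunks; 0.0 when nothing was returned."""
    scores = [c["rerank_score"] for c in chunks]
    return sum(scores) / len(scores) if scores else 0.0


# answer_relevance: identical embeddings score 1.0, orthogonal ones 0.0
print(cosine_sim([1.0, 0.0], [1.0, 0.0]))
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))

# context_precision: average of the reranker scores attached to each chunk
print(context_precision([{"rerank_score": 0.5}, {"rerank_score": 1.0}]))
```

A low `context_precision` alongside a high `faithfulness` points at retrieval, not generation: the answer stayed grounded, but in weak context.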
```bash
# 1. Start Qdrant
docker run -p 6333:6333 qdrant/qdrant

# 2. Install
git clone https://github.com/shubham0086/rag-knowledge-engine
cd rag-knowledge-engine
pip install -r requirements.txt

# 3. Set API key
export ANTHROPIC_API_KEY=sk-ant-...

# 4. Run tests (no live services needed, all mocked)
pytest tests/ -v
```
| Test file | Count | What's covered |
|---|---|---|
| test_ingestor.py | 6 | Chunking with overlap, doc ID determinism, file ingestion, directory walk |
| test_retriever.py | 6 | Vector search, RRF merge of two ranked lists, hybrid BM25 build, reranker attachment |
| test_reranker.py | 3 | Score sort order, top-k slice, empty candidate list |
| test_answerer.py | 2 | Citation answer on populated context, "not found" on empty context |
| test_evaluator.py | 3 | Faithfulness from LLM judge, answer_relevance on identical strings ~1.0, context_precision uses rerank_score |
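A mocked reranker test in the spirit of the table above could look like the sketch below. It is not taken from the project's test suite: the `Reranker` here is a hypothetical variant that accepts an injected model so the example runs without downloading a cross-encoder, and the test names are illustrative.

```python
from unittest.mock import MagicMock


class Reranker:
    """Hypothetical variant: the scoring model is injected instead of loaded."""

    def __init__(self, model):
        self.model = model

    def rerank(self, query, candidates, top_k=5):
        if not candidates:
            return []
        scores = self.model.predict([(query, c["text"]) for c in candidates])
        ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
        return [{**c, "rerank_score": round(float(s), 4)} for s, c in ranked[:top_k]]


def test_rerank_sorts_and_slices():
    model = MagicMock()
    model.predict.return_value = [0.1, 0.9, 0.5]   # canned scores, no inference
    candidates = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
    out = Reranker(model).rerank("q", candidates, top_k=2)
    assert [c["text"] for c in out] == ["b", "c"]
    assert out[0]["rerank_score"] == 0.9
    model.predict.assert_called_once()


def test_rerank_empty_candidates():
    model = MagicMock()
    assert Reranker(model).rerank("q", []) == []
    model.predict.assert_not_called()
```

With a class that constructs `CrossEncoder` in its own `__init__`, the same effect is usually achieved by patching the constructor with `unittest.mock.patch` before instantiating it.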
Why rank-bm25 and not Qdrant's native sparse vectors? Qdrant's sparse vector support requires building a sparse vector at ingest time. The rank-bm25 approach builds the keyword index lazily from the text payloads already stored alongside the dense vectors, so adding hybrid search to an existing collection requires no re-ingestion. For a new project, Qdrant's native sparse mode is cleaner; for retrofitting, scrolling payloads is pragmatic.
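Why all-MiniLM-L6-v2 and not a larger model? 384 dimensions, CPU inference, fast enough for local development on any machine. The embedding model constant is isolated in src/ingestor.py as EMBED_MODEL, so swapping to a larger sentence-transformers model is a one-line change. A hosted model such as text-embedding-3-large additionally needs an API-based embedding client in place of the local one. The trade-off either way: re-ingest everything after changing the model, since vectors from different models are incompatible and the collection's vector size must match the new dimension.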
Why Claude for faithfulness evaluation? Faithfulness is a claim-grounding judgment that requires reading two documents (the answer and the context) and reasoning about entailment. An embedding cosine score can't do this. Claude with a structured prompt is more reliable than a fine-tuned NLI classifier for the small batches typical in RAG eval loops.
RAG Knowledge Engine is Solve Track 06. The production retriever engine was extracted from AgentKernel (equilibrium). The Research Agent (Solve 05 pipeline track) uses a similar web-retrieval pipeline but fetches from URLs rather than a local document store. The deep write-up on production RAG patterns, including the stat that 80% of RAG failures originate in chunking and not the LLM, is in the blog: RAG in Production: What Actually Works in 2026.
BM25 scores degrade when the corpus contains very short chunks or highly repetitive text — the IDF weights flatten out and keyword discrimination drops. The lazy scroll to build the BM25 index means the first hybrid search on a large collection has noticeable latency (scrolling 100k+ chunks takes seconds). For large collections, pre-build the index at server startup. The cross-encoder adds 200-400ms per query batch on CPU — acceptable for knowledge-base Q&A, too slow for sub-100ms latency requirements.