RAG Knowledge Engine | Shubham Prajapati

Stage 1: Ingestion and chunking

The Ingestor walks a directory, applies word-level chunking with overlap, embeds each chunk with all-MiniLM-L6-v2, and upserts the vectors into Qdrant. The 512-word / 64-word-overlap parameters are tuned for this embedding model's token budget. The overlap guarantees that any span of up to 64 words in the source appears intact in at least one chunk, so short passages that straddle a chunk boundary are not cut in half.

from src import RAGEngine

engine = RAGEngine()
chunks_stored = engine.ingest("./my-codebase")
# Walks all .txt, .md, .py, .js, .ts, .json files recursively
# Returns total number of chunks upserted to Qdrant
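
The chunking step described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the function name chunk_words and the whitespace tokenisation are assumptions, and only the 512/64 defaults come from the text.

```python
def chunk_words(text, size=512, overlap=64):
    """Split text into word-level chunks that share `overlap` words (hypothetical helper)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()          # assumption: plain whitespace tokenisation
    step = size - overlap         # each chunk starts 448 words after the previous one
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break                 # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(sample)
print(len(chunks))  # 3
```

With these defaults a 1,000-word file yields three chunks, and the last 64 words of each chunk are repeated as the first 64 words of the next.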

Each chunk gets an MD5 ID derived from file_path + "::" + chunk_idx. Re-ingesting the same file is idempotent: upsert overwrites by ID rather than duplicating. Payload stores the original file path and chunk index alongside the text, which the Answerer uses for source citations.
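
A deterministic ID of this kind might look like the sketch below. Qdrant accepts only unsigned integers or UUIDs as point IDs, so this version formats the 128-bit MD5 digest as a UUID; whether the repository does exactly that is an assumption, and chunk_id is a hypothetical name.

```python
import hashlib
import uuid


def chunk_id(file_path, chunk_idx):
    """Derive a stable point ID from the file path and chunk index (hypothetical helper)."""
    digest = hashlib.md5(f"{file_path}::{chunk_idx}".encode("utf-8")).hexdigest()
    # Assumption: the 32-hex-character digest is presented to Qdrant as a UUID string.
    return str(uuid.UUID(hex=digest))


first = chunk_id("src/auth/middleware.py", 0)
again = chunk_id("src/auth/middleware.py", 0)
other = chunk_id("src/auth/middleware.py", 1)
print(first == again, first == other)  # True False
```

Because the same path and index always map to the same ID, a second ingest of an unchanged file overwrites its existing points instead of adding new ones.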

Stage 2: Hybrid retrieval with RRF

Cosine-only retrieval can miss exact keyword queries. BM25-only retrieval misses semantic paraphrases. Running both and fusing the ranked lists catches what either alone misses. Reciprocal Rank Fusion uses only rank positions, so it needs no score normalisation and can safely mix cosine similarities with BM25 scores:

# From retriever.py: the RRF implementation
@staticmethod
def _rrf(vec_hits, bm25_hits, rrf_k=60):
    rrf_scores = {}
    seen = {}  # (file, chunk) -> first hit seen for that chunk
    for hits in (vec_hits, bm25_hits):
        for rank, doc in enumerate(hits):
            k = (doc["file"], doc["chunk"])
            # Each list contributes 1 / (rrf_k + rank), with rank counted from 1
            rrf_scores[k] = rrf_scores.get(k, 0.0) + 1.0 / (rrf_k + rank + 1)
            seen.setdefault(k, doc)
    # A doc in both lists scores higher than a doc at a similar rank in only one list
    return sorted(
        seen.values(),
        key=lambda d: rrf_scores[(d["file"], d["chunk"])],
        reverse=True,
    )
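
To make the fusion arithmetic concrete, here is a toy standalone version over plain string IDs. The function name rrf_merge and the document IDs are illustrative only; the scoring rule is the same 1 / (rrf_k + rank) sum used above.

```python
def rrf_merge(ranked_lists, rrf_k=60):
    """Fuse several ranked lists of IDs with Reciprocal Rank Fusion (illustrative helper)."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank + 1)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vec_ids = ["a", "b", "c"]    # hypothetical vector-search order
bm25_ids = ["b", "d", "a"]   # hypothetical BM25 order
fused = rrf_merge([vec_ids, bm25_ids])
for doc_id, score in fused:
    print(doc_id, round(score, 4))
# b 0.0325
# a 0.0323
# d 0.0161
# c 0.0159
```

"b" and "a" appear in both lists and end up with roughly double the score of "d" and "c", which each appear in one list only.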

The BM25 index is built lazily on first hybrid search by scrolling all Qdrant payloads — no separate index file, no rebuild step. Force a rebuild by setting retriever._bm25 = None.
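
The lazy-build behaviour can be illustrated with a self-contained sketch. Assumptions, stated loudly: the project builds its index by scrolling Qdrant payloads, whereas this version implements a small BM25 scorer in plain Python and takes a hypothetical load_payloads callable in place of the scroll. The class name LazyBM25 and the k1/b values are illustrative. Only the convention of resetting _bm25 to None comes from the text.

```python
import math
from collections import Counter


class LazyBM25:
    """Builds a small in-memory BM25 index the first time search() is called."""

    def __init__(self, load_payloads, k1=1.5, b=0.75):
        self._load_payloads = load_payloads  # hypothetical: returns [{"text": ...}, ...]
        self.k1, self.b = k1, b
        self._bm25 = None  # None means "not built yet"; reset to None to force a rebuild

    def _build(self):
        docs = self._load_payloads()
        tokens = [d["text"].lower().split() for d in docs]
        n = len(docs)
        # Document frequency: in how many chunks each term occurs
        df = Counter(term for doc in tokens for term in set(doc))
        idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1.0) for t, c in df.items()}
        self._bm25 = {
            "docs": docs,
            "tf": [Counter(t) for t in tokens],
            "len": [len(t) for t in tokens],
            "idf": idf,
            "avgdl": sum(len(t) for t in tokens) / n if n else 0.0,
        }

    def search(self, query, top_k=5):
        if self._bm25 is None:
            self._build()  # the first call pays the build cost
        ix = self._bm25
        scored = []
        for doc, tf, dl in zip(ix["docs"], ix["tf"], ix["len"]):
            score = 0.0
            for term in query.lower().split():
                f = tf.get(term, 0)
                if f == 0:
                    continue
                norm = f + self.k1 * (1 - self.b + self.b * dl / ix["avgdl"])
                score += ix["idf"][term] * f * (self.k1 + 1) / norm
            if score > 0:
                scored.append((score, doc))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:top_k]]
```

Calling search() twice loads the payloads once; setting _bm25 back to None makes the next search() load and index them again.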

Stage 2 continued: cross-encoder reranking

After RRF, the top 50 candidates go to the Reranker. Unlike the bi-encoder used during retrieval (query and passage encoded separately, then compared by cosine similarity), a cross-encoder reads the query and passage together with full attention across both. That gives far better relevance judgments, but it is too slow to run over the full index, which is why you fetch 50 and rerank down to 5.

# From reranker.py
from sentence_transformers import CrossEncoder

class Reranker:
    def __init__(self, model="cross-encoder/ms-marco-MiniLM-L6-v2"):
        self.model = CrossEncoder(model)

    def rerank(self, query, candidates, top_k=5):
        pairs = [(query, c["text"]) for c in candidates]
        scores = self.model.predict(pairs)   # joint query+passage scoring
        ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
        for score, candidate in ranked[:top_k]:
            out = candidate.copy()
            out["rerank_score"] = round(float(score), 4)
            yield out

Stage 3: Answerer with anti-hallucination contract

The Answerer passes the reranked chunks to Claude with an explicit contract: answer only from the provided context, cite every claim with [1][2] notation, and return the exact string "Not found in the provided documents." when the context doesn't support an answer. The "not found" path is critical — without it, the LLM produces plausible-sounding hallucinations. With it, you get an honest signal that retrieval failed, which you can act on (lower score threshold, expand chunk size, reindex).

result = engine.ask("How does the authentication flow work?")
# {
#   "answer": "The authentication flow uses Clerk v7... [1][3]",
#   "sources": ["src/auth/middleware.py", "docs/auth-design.md"],
#   "model": "claude-opus-4-8-20260528",
#   "tokens": {"input": 1240, "output": 180}
# }
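
The contract could be encoded roughly as follows. This is a sketch under assumptions: the prompt wording and the helper names build_prompt, retrieval_failed, and cited_sources are hypothetical, and the chunk keys "file" and "text" are borrowed from the retriever and reranker snippets above. Only the sentinel string and the [1][2] citation style come from the text.

```python
import re

NOT_FOUND = "Not found in the provided documents."


def build_prompt(question, chunks):
    """Return (system, user) strings that number each chunk so it can be cited."""
    context = "\n\n".join(
        f"[{i}] ({c['file']})\n{c['text']}" for i, c in enumerate(chunks, start=1)
    )
    system = (
        "Answer only from the numbered context passages. "
        "Cite every claim with bracketed passage numbers such as [1][2]. "
        f'If the context does not support an answer, reply with exactly: "{NOT_FOUND}"'
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return system, user


def retrieval_failed(answer):
    """True when the model returned the exact sentinel, signalling a retrieval miss."""
    return answer.strip() == NOT_FOUND


def cited_sources(answer, chunks):
    """Map [n] citations in the answer back to file paths, in first-cited order."""
    files = []
    for num in re.findall(r"\[(\d+)\]", answer):
        idx = int(num) - 1
        if 0 <= idx < len(chunks) and chunks[idx]["file"] not in files:
            files.append(chunks[idx]["file"])
    return files
```

Matching on an exact sentinel string keeps the failure check trivial: the caller compares strings instead of trying to classify a free-form refusal.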

Stage 4: RAGAS-style evaluation

Retrieval quality improvements need measurement. The Evaluator implements three metrics covering different failure modes. Pass evaluate=True to engine.ask() to get them inline:

Metric	What it catches	How computed
faithfulness	Hallucinated claims	LLM judge: "What fraction of claims in this answer are grounded in this context?"
answer_relevance	Off-topic answers	cosine_sim(embed(query), embed(answer)) using all-MiniLM-L6-v2
context_precision	Poor retrieval quality	mean(rerank_scores) of the returned chunks — higher = more relevant candidates surfaced

result = engine.ask("What does the circuit breaker do?", evaluate=True)
# result["eval"] = {
#   "faithfulness":       0.95,   # 95% of claims grounded in context
#   "answer_relevance":   0.88,   # answer closely addresses the question
#   "context_precision":  0.81,   # reranker returned high-quality chunks
# }
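
The two non-LLM metrics reduce to simple arithmetic. The sketch below uses plain Python lists in place of real embeddings; cosine_sim and context_precision are hypothetical helper names, and the "rerank_score" key follows the reranker snippet above.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def context_precision(chunks):
    """Mean rerank score of the returned chunks; 0.0 when nothing was returned."""
    scores = [c["rerank_score"] for c in chunks]
    return sum(scores) / len(scores) if scores else 0.0


# Toy vectors standing in for embed(query) and embed(answer)
print(cosine_sim([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

In the real pipeline the vectors would come from the same all-MiniLM-L6-v2 model used at ingest time, so query and answer live in the same embedding space.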

How to run it

# 1. Start Qdrant
docker run -p 6333:6333 qdrant/qdrant

# 2. Install
git clone https://github.com/shubham0086/rag-knowledge-engine
cd rag-knowledge-engine
pip install -r requirements.txt

# 3. Set API key
export ANTHROPIC_API_KEY=sk-ant-...

# 4. Run tests (no live services needed — all mocked)
pytest tests/ -v

Test coverage breakdown

Test file	Count	What's covered
test_ingestor.py	6	Chunking with overlap, doc ID determinism, file ingestion, directory walk
test_retriever.py	6	Vector search, RRF merge of two ranked lists, hybrid BM25 build, reranker attachment
test_reranker.py	3	Score sort order, top-k slice, empty candidate list
test_answerer.py	2	Citation answer on populated context, "not found" on empty context
test_evaluator.py	3	Faithfulness from LLM judge, answer_relevance on identical strings ~1.0, context_precision uses rerank_score

Design decisions

Why build BM25 outside Qdrant instead of using its native sparse vectors? Qdrant's sparse vector support requires building a sparse vector at ingest time. The rank-bm25 approach builds the keyword index lazily from the text payloads already stored alongside the dense vectors, so adding hybrid search to an existing collection requires no re-ingestion. For a new project, Qdrant's native sparse mode is cleaner; for retrofitting, scrolling payloads is pragmatic.

Why all-MiniLM-L6-v2 and not a larger model? It produces 384-dimensional vectors, runs on CPU, and is fast enough for local development on most machines. The embedding model constant is isolated in src/ingestor.py as EMBED_MODEL, so swapping to a larger model such as text-embedding-3-large for production starts with a one-line change. The trade-off: everything must be re-ingested after changing the model, since vectors from different models are incompatible, and a different vector dimension also means a new collection.

Why Claude for faithfulness evaluation? Faithfulness is a claim-grounding judgment that requires reading two documents (the answer and the context) and reasoning about entailment. An embedding cosine score can't do this. Claude with a structured prompt is more reliable than a fine-tuned NLI classifier for the small batches typical in RAG eval loops.

Where this fits

RAG Knowledge Engine is Solve Track 06. The production retriever engine was extracted from AgentKernel (equilibrium). The Research Agent (Solve 05 pipeline track) uses a similar web-retrieval pipeline but fetches from URLs rather than a local document store.

Honest framing

BM25 scores degrade when the corpus contains very short chunks or highly repetitive text — the IDF weights flatten out and keyword discrimination drops. The lazy scroll to build the BM25 index means the first hybrid search on a large collection has noticeable latency (scrolling 100k+ chunks takes seconds). For large collections, pre-build the index at server startup. The cross-encoder adds 200-400ms per query batch on CPU — acceptable for knowledge-base Q&A, too slow for sub-100ms latency requirements.