Naive RAG — chunk documents, embed them, cosine-similarity search, stuff the top-5 into the prompt — was always a prototype, not a production system. By 2026, the pattern looks very different. Enterprise RAG is now the single most-requested AI engineering skill in job postings, and the gap between what teams think RAG is and what production RAG requires is where most projects fail.
This is what I've learned building the RAG Knowledge Engine and applying retrieval in production agent pipelines.
1. Where 80% of RAG Failures Come From
Teams spend weeks tuning prompts and swapping LLMs when the actual problem is upstream. According to production post-mortems analyzed in 2026 RAG audit reports, 80% of retrieval quality failures trace back to the ingestion and chunking layer — not the model.
2. Chunking Strategy: The Foundation
Fixed-size character chunking (e.g., every 1000 chars) is the default in most tutorials and the worst option for real documents. The right strategy depends on your source material:
- Markdown / structured docs: Chunk at headers. Each section is a semantic unit. Don't split mid-section.
- Code: Chunk at function or class boundaries. A function is the natural retrieval unit for code search.
- Dense prose (research, legal): Sentence-level chunking with 20-30% overlap. Overlap preserves context at chunk boundaries.
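- General text: 512-token chunks with 64-token overlap. A common starting point for all-MiniLM-L6-v2 and most embedding models; confirm the chunk fits within your model's maximum sequence length, since anything beyond it is truncated before embedding.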
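def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    # Splits on whitespace, so size and overlap count words, a rough proxy for tokens.
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i : i + size]))
        if i + size >= len(words):
            break  # this chunk reached the end; avoid a redundant tail chunk
        i += size - overlap  # step forward, keeping `overlap` words of context
    return chunks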
The overlap matters. Without it, a sentence split across two chunks can fail retrieval for both, because the embedding of either half doesn't capture the full meaning. With a 64-token overlap and 512-token chunks, any span of up to 64 tokens in the source is fully contained in at least one chunk, so most sentences survive intact.
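The structure-aware strategies from the list above (chunk Markdown at headers, chunk code at function or class boundaries) can be sketched with the standard library alone. This is a minimal illustration, not the chunker used in the RAG Knowledge Engine; the function names `chunk_markdown_by_header` and `chunk_python_by_definition` are hypothetical, and the code chunker assumes the source is valid Python.

```python
import ast
import re

FENCE = "`" * 3  # Markdown code-fence marker


def chunk_markdown_by_header(text: str) -> list[str]:
    """Split Markdown into one chunk per header-delimited section."""
    chunks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in text.splitlines():
        # Track fenced code so a '#' comment inside code is not read as a header.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        is_header = not in_fence and re.match(r"#{1,6}\s", line) is not None
        if is_header and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def chunk_python_by_definition(source: str) -> list[str]:
    """Return each top-level function or class as its own chunk.

    Decorator lines are not included, because the node position starts
    at the `def` / `class` keyword.
    """
    tree = ast.parse(source)
    chunks: list[str] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment:
                chunks.append(segment)
    return chunks
```

Sections that exceed the embedding model's input limit would still need a second pass, for example by running the fixed-size chunker inside each oversized section.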
3. Hybrid Search: Why Cosine Alone Fails
Cosine similarity on dense embeddings handles semantic queries well but fails on exact terms — product names, error codes, version numbers, specific function names. BM25 (keyword search) handles exact terms but misses paraphrase and intent. Hybrid search runs both in parallel and fuses the results.
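def reciprocal_rank_fusion(vector_results, bm25_results, k=60):
    # Each input is a ranked list (best first) of objects with an `id` attribute.
    scores = {}
    for rank, doc in enumerate(vector_results):
        # rank is 0-based, so the top hit contributes 1 / (k + 1)
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)
    for rank, doc in enumerate(bm25_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)
    # Returns (doc_id, fused_score) pairs, highest score first.
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)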
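Qdrant supports hybrid search natively: its Query API (query_points) accepts prefetch sub-queries over dense and sparse vectors and fuses them server-side, with RRF as a built-in fusion option. Pinecone's hybrid mode uses an alpha parameter to weight vector vs. keyword scores. Either way, hybrid retrieval typically outperforms cosine-only on recall across real-world document collections, especially for queries that contain exact identifiers.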
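The alpha weighting mentioned above can be sketched as a convex combination of normalized scores. This illustrates the idea only and is not Pinecone's actual implementation; `min_max_normalize` and `weighted_fusion` are hypothetical names, and both inputs are assumed to be dicts mapping document IDs to raw scores.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and BM25 scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as a full match.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(
    dense: dict[str, float],
    sparse: dict[str, float],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Fuse two score sets: alpha=1.0 is dense-only, alpha=0.0 is keyword-only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    fused = {}
    for doc_id in dense_n.keys() | sparse_n.keys():
        # A document missing from one result list contributes 0 from that side.
        fused[doc_id] = (
            alpha * dense_n.get(doc_id, 0.0)
            + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        )
    # Highest score first; ties broken by ID for deterministic output.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

Unlike reciprocal rank fusion, this approach depends on the raw score distributions, so the normalization step matters; RRF only needs ranks, which is one reason it is often the safer default.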
4. Reranking: The Final Quality Gate
Hybrid search gives you a better candidate pool. Reranking gives you a smaller, higher-quality context window for the LLM. The production pattern:
- Retrieve the top 50-100 candidates via hybrid search.
- Pass those candidates to a cross-encoder reranker (Cohere Rerank 3.5, or BGE-Reranker for self-hosted).
- Take the top 5-10 reranked results into the LLM context window.
Cross-encoders process the query and each candidate together, which is much slower than bi-encoder embedding but gives significantly better relevance judgments. You run the reranker only on the filtered candidate set, so latency is bounded.
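The retrieve-then-rerank step above can be sketched with a pluggable scoring function. In production, `score_fn` would wrap a cross-encoder call (hosted or self-hosted); here a toy token-overlap scorer stands in so the example runs without any model. The names `rerank` and `token_overlap_score` are hypothetical.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> list[tuple[str, float]]:
    """Score every (query, candidate) pair and keep the top_n best."""
    scored = [(text, score_fn(query, text)) for text in candidates]
    # Stable sort: candidates with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def token_overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: share of query tokens in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)
```

Because the scorer is called once per candidate, the cost grows linearly with the size of the candidate pool, which is why the pool is capped at 50-100 before this step.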
5. Vector Database Selection in 2026
The two dominant choices for production RAG are Qdrant and Pinecone. They're optimized for different constraints.
The one gotcha on Pinecone Serverless: cold-start latency. If your RAG endpoint isn't queried for 15+ minutes, the first query can take 3-6 seconds. For user-facing apps with low query frequency, the dedicated tier is mandatory.
6. The Answerer: Anti-Hallucination Contract
The LLM's job in a RAG system is narrow: synthesize an answer from the provided chunks, cite sources, and refuse to go outside the context. This contract must be explicit in the system prompt.
SYSTEM = """You are a knowledge-base assistant.
Rules:
1. Answer ONLY from the document excerpts provided below.
2. Every claim must cite its source using [1], [2], etc.
3. If the answer is not in the excerpts, respond exactly:
"Not found in the provided documents."
4. Never speculate or use prior knowledge."""
The "not found" escape hatch is critical. Without it, the LLM will hallucinate an answer that sounds plausible. With it, you get an honest signal that your retrieval failed — which you can act on (expand search, lower score threshold, reindex).
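To make the contract above checkable, the retrieved chunks can be numbered to match the [1], [2] citation format, and the response inspected for the refusal sentinel and for out-of-range citations. This is a sketch under assumptions: the helper names and the "Document excerpts:" message layout are hypothetical, not taken from the RAG Knowledge Engine.

```python
import re

NOT_FOUND = "Not found in the provided documents."


def build_user_message(question: str, excerpts: list[str]) -> str:
    """Number the excerpts from 1 so they line up with [1], [2] citations."""
    numbered = "\n\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(excerpts, start=1)
    )
    return f"Document excerpts:\n{numbered}\n\nQuestion: {question}"


def retrieval_failed(answer: str) -> bool:
    """True when the model used the refusal sentinel (with or without quotes)."""
    return answer.strip().strip('"') == NOT_FOUND


def invalid_citations(answer: str, num_excerpts: int) -> list[int]:
    """Return cited numbers that do not correspond to any provided excerpt."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_excerpts)
```

A caller might treat `retrieval_failed(...)` as the trigger for a wider search, and a non-empty `invalid_citations(...)` result as a sign the answer should be rejected or regenerated.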
7. Evaluation: How You Know It's Working
A RAG system without evals is blind. The minimum viable eval set:
- Faithfulness: Is every claim in the answer grounded in the retrieved chunks? Check with an LLM judge comparing answer to retrieved context.
- Answer relevance: Does the answer address the question? Another LLM judge call.
- Context recall: Did the retriever surface all chunks needed to answer the question? Requires a labeled dataset of question → relevant doc mapping.
RAGAS (an open-source RAG eval framework) automates all three. Run it on a curated 50-100 question set before every significant change to your chunking strategy, embedding model, or prompts. If faithfulness drops below 0.85, the most likely cause is noise overwhelming the context window: reduce top-k or tighten the reranker cutoff.
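Of the three metrics, context recall is the one that can be computed without an LLM judge once the labeled question-to-chunk mapping exists. The sketch below is a simple ID-based version, not the RAGAS implementation; `context_recall`, `mean_context_recall`, and the `retrieve(question, top_k)` callable are hypothetical names standing in for whatever retriever is under test.

```python
from typing import Callable, Iterable, Sequence


def context_recall(retrieved_ids: Iterable[str], relevant_ids: Iterable[str]) -> float:
    """Fraction of the labeled relevant chunks that the retriever returned."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each eval question needs at least one relevant chunk ID")
    return len(relevant & set(retrieved_ids)) / len(relevant)


def mean_context_recall(
    eval_set: Sequence[tuple[str, list[str]]],
    retrieve: Callable[[str, int], list[str]],
    top_k: int = 5,
) -> float:
    """Average context recall over (question, relevant_ids) pairs."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    scores = [
        context_recall(retrieve(question, top_k), relevant_ids)
        for question, relevant_ids in eval_set
    ]
    return sum(scores) / len(scores)
```

Tracking this number per change makes it easy to tell whether a drop in answer quality comes from the retriever or from the answerer.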
What I Built
My RAG Knowledge Engine implements the core three-stage pipeline: ingestor (word-level chunker + SentenceTransformer embeddings + Qdrant upsert), retriever (semantic search with citation numbering), and answerer (Claude with strict anti-hallucination prompt). 20 tests, all mocked — no live services needed to run the suite. It's the foundation I use for any project requiring grounded knowledge retrieval.