Building a RAG pipeline that works in a demo is straightforward. Building one that works reliably in production, on real documents, under latency constraints, is a different problem entirely.
This post covers three areas where naive RAG implementations consistently fall short: chunking strategy, retrieval architecture, and context window positioning. Everything here comes from building and iterating on a production system. No Novum AI name, just the patterns.
Part 1: Why chunking strategy determines everything downstream
Most RAG tutorials start with a simple approach: split your documents into fixed-size chunks of 500 or 800 characters, embed them, store them in a vector database. It works well enough to demo. It falls apart in production.
The core problem is that fixed-size chunking ignores document structure entirely. A chunk boundary might land in the middle of a table, split a numbered list across two chunks, or merge two semantically unrelated sections because they happened to be adjacent. The retrieval system then operates on these incoherent fragments and produces incoherent results.
Layout-aware chunking
The better approach is to let the document's own structure define chunk boundaries. Most business documents -- PDFs, reports, policy documents, product specs -- are already structured with headings, subheadings, tables, and paragraphs. That structure is meaningful. A section under the heading "Pricing" contains pricing information. A section under "Returns Policy" contains returns information. Splitting at heading boundaries produces semantically coherent chunks.
In practice, this means extracting the document as structured Markdown first. PyMuPDF4LLM does this well for PDFs -- it preserves heading hierarchy, table structure, and paragraph boundaries as Markdown elements. From there you split on H1 and H2 headings to create parent chunks at the section level.
Hierarchical chunking
The problem with section-level chunks is that sections can be long. A section with 2,000 tokens is too large to embed meaningfully -- the embedding averages over too much content and loses precision on specific facts within the section.
The solution is hierarchical chunking: parent chunks at the section level, child chunks at the paragraph level (around 300 tokens with 50-token overlap). During retrieval you search over child chunks for precision, then expand to the parent chunk for context. This is sometimes called parent-child retrieval or small-to-big retrieval.
The intuition is: find the specific paragraph that answers the question, then give the LLM the full section for context. You get the precision of small chunks and the coherence of large chunks.
Tables are a special case
Tables should always be kept as standalone chunks and never split across chunk boundaries. A table split in half is worse than no table at all -- the LLM receives partial information and may treat it as complete.
The same logic applies to code blocks, numbered lists, and any other structured element where the parts only make sense together.
Part 2: Why dense search alone fails, and what to do about it
Dense vector search works by embedding both the query and the documents into a shared vector space, then finding documents closest to the query by cosine similarity. This handles semantic similarity well. "How do I handle a pricing objection" retrieves "when customers push back on cost" even though there is no keyword overlap.
The limitation becomes obvious when users ask about specific things: a product name, a pricing tier, a model number, a specific clause number. Dense search looks for semantic neighbors, not exact matches. A document that mentions "Professional Plan" multiple times may not surface as the top result for a query about "Professional Plan" if there are other documents that are semantically similar to the query topic.
Sparse retrieval: BM25
BM25 (Best Match 25) is a keyword-based ranking algorithm that has been a backbone of information retrieval for decades. It scores documents based on term frequency and inverse document frequency -- essentially, how often the query terms appear in a document weighted by how rare those terms are across the corpus.
BM25 is excellent at exactly the thing dense search is bad at: finding documents that contain specific keywords. It is bad at the thing dense search is good at: handling semantic variation.
Hybrid retrieval
The solution is to run both simultaneously and combine the scores.
A hybrid query hits both the dense index and the sparse index in parallel, retrieves candidates from each, and then normalizes and combines their scores into a single ranking. Pinecone supports hybrid retrieval natively with a single query parameter controlling the weighting between dense and sparse results.
The weighting matters. In a system where documents contain both narrative content and specific factual information (product names, prices, IDs), a reasonable starting point is 0.7 weight on dense similarity and 0.3 on sparse keyword match. This gives the semantic understanding primacy while ensuring exact matches surface reliably.
In practice, hybrid retrieval eliminates a whole class of failures where the system cannot find information that is clearly in the document, just described with different language than the query.
Cross-encoder re-ranking
After hybrid retrieval returns the top 20 candidates, there is an additional step worth discussing: cross-encoder re-ranking.
A bi-encoder (what powers dense search) embeds the query and document separately, then compares them. It is fast because document embeddings are pre-computed. A cross-encoder takes both the query and the document as a single input, processes them together through the full transformer, and outputs a relevance score. It is more accurate because it sees the relationship between the query and the document in full context, not just as two separate vectors.
Cross-encoders are too slow to use for initial retrieval across a large corpus -- you cannot run a full forward pass for every document at query time. But they are practical for re-ranking a small candidate set. Running a cross-encoder on the top 20 results from hybrid retrieval to produce a final top 5 meaningfully improves result quality at acceptable latency cost.
The model I used is ms-marco-MiniLM-L-6-v2, a 22M parameter cross-encoder trained on the MS MARCO passage ranking dataset. It is small enough to self-host in a Lambda layer and fast enough to add minimal latency to the pipeline.
Part 3: Context window position changes the answer
This is perhaps the least obvious insight in RAG engineering.
There is a paper called Lost in the Middle (Liu et al., 2023) that documents a consistent pattern across LLMs: models attend more strongly to content at the beginning and end of the context window than to content in the middle. When relevant information is placed in the middle of a long context, recall drops significantly compared to placing it at the beginning or end.
The practical implication for RAG is direct: after ranking retrieved chunks by relevance, place the highest-scoring chunk last in the prompt, immediately before the generation instruction. The LLM will attend to it most strongly.
This is a zero-cost change. It requires reordering an array. In testing, it improved factual accuracy on specific queries noticeably -- the LLM was less likely to ignore the most relevant chunk in favor of earlier content.
Noise in context hurts more than missing context
A related insight: when the retrieval system returns a low-confidence result, it is better to tell the LLM that no relevant context was found than to inject a low-relevance chunk.
The naive implementation always injects the top result regardless of score. This means that on queries where nothing relevant exists in the knowledge base, the LLM receives unrelated context and either hallucinates or confuses it with the query.
Setting a relevance threshold (0.3 works as a starting point) and explicitly instructing the LLM to acknowledge when no relevant context is available produces more honest and useful outputs. A system that says "I do not have specific information on that" is more trustworthy than one that confabulates an answer from loosely related context.
What this looks like end to end
A production RAG pipeline incorporating these ideas looks roughly like this:
- Documents are ingested with layout-aware hierarchical chunking. Parent chunks at section level, child chunks at paragraph level, tables as standalone chunks.
- At query time, hybrid retrieval runs dense and sparse search in parallel, combining scores into a ranked candidate list.
- A cross-encoder re-ranker reduces the top 20 candidates to the top 5.
- Results below a relevance threshold are dropped rather than injected.
- Surviving chunks are assembled with the highest-scoring chunk positioned last in the prompt.
- The LLM generates against this context.
Each of these steps addresses a specific failure mode of naive RAG. None of them are dramatically complex to implement. The compounding effect on recall and reliability is significant.
Further reading
- Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023)
- ms-marco-MiniLM-L-6-v2 on Hugging Face
- Pinecone hybrid search documentation
- PyMuPDF4LLM documentation