Your RAG pipeline retrieves irrelevant chunks 40% of the time. You've tried better embedding models, rerankers, and metadata filtering — but the root problem is that your chunks are the wrong size. Too small and each chunk lacks enough context to be meaningful. Too large and one relevant sentence gets buried in 900 tokens of noise.
Here's how to diagnose and fix your chunking strategy systematically.
Why Chunk Size Matters More Than Your Embedding Model
Embeddings capture semantic meaning — but only within the text they encode. A 512-token chunk about database indexing will embed differently from a 128-token excerpt of the same paragraph. The embedding isn't wrong; the input unit is.
The retrieval accuracy problem compounds at query time: if your query is 12 tokens and your chunks are 1,024 tokens, the similarity score gets diluted by the 1,012 tokens that aren't relevant to the query. Smaller chunks produce sharper embeddings for specific queries; larger chunks capture broader conceptual themes.
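The dilution effect can be simulated with random vectors standing in for real embeddings — a toy model, not a benchmark. A chunk embedding is approximated as the mean of one on-topic vector and some number of off-topic ones; the more off-topic content, the lower the query similarity:

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

random.seed(0)
dim = 64
query = [random.gauss(0, 1) for _ in range(dim)]

def chunk_embedding(on_topic, n_off_topic):
    """Simulated chunk embedding: mean of one on-topic vector and n off-topic ones."""
    vecs = [on_topic] + [[random.gauss(0, 1) for _ in range(dim)]
                         for _ in range(n_off_topic)]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

small = chunk_embedding(query, 1)    # short chunk: mostly the relevant passage
large = chunk_embedding(query, 15)   # long chunk: relevant passage buried in noise
print(f"small chunk: {cosine(query, small):.2f}, large chunk: {cosine(query, large):.2f}")
```

The small chunk scores markedly higher against the query even though both contain the same relevant content — that gap is the dilution described above.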
Benchmarks from LlamaIndex and LangChain's 2025 evals consistently show that 256-512 token chunks outperform 1,024-token chunks by 15-25% on precision@3 for question-answering tasks over technical documentation.
The Tradeoff Table
| Chunk Size | Strengths | Weaknesses | Best For |
|---|---|---|---|
| 128 tokens | Sharp embeddings, precise retrieval | Loses multi-sentence context | FAQ lookups, single-fact retrieval |
| 256 tokens | Good balance, fast indexing | May split mid-explanation | API docs, code comments |
| 512 tokens | Captures paragraph-level context | Slightly diluted embeddings | Technical articles, blog posts |
| 1,024 tokens | Preserves section-level meaning | Slow indexing, noisy embeddings | Legal contracts, long-form reports |
| 2,048 tokens | Full sections intact | Diluted embeddings, poor retrieval precision | Almost never the right choice |
For most RAG applications over technical documentation, 512 tokens is the empirically best default. Start there, measure, then adjust.
Overlap: The Number You're Probably Getting Wrong
Chunk overlap ensures that information near chunk boundaries doesn't disappear. If a key sentence falls between chunk N and chunk N+1, zero overlap means neither chunk captures it fully.
Standard recommendation: 10-20% overlap.
For 512-token chunks, that means 51-102 tokens of overlap. Concretely:
- 0% overlap: fast to index, high risk of missing boundary content
- 10% overlap (51 tokens): catches most boundary cases, minimal index size increase
- 20% overlap (102 tokens): safer for dense technical content, ~25% more vectors to store
Beyond 20% overlap, you're paying storage costs for diminishing retrieval gains. At 50% overlap, you're essentially chunking twice and getting no benefit.
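The storage cost is easy to estimate. Assuming a sliding-window chunker that advances by chunk_size − overlap tokens each step, the index grows by roughly a factor of 1/(1 − overlap fraction):

```python
def overlap_stats(corpus_tokens: int, chunk_size: int, overlap_frac: float):
    """Estimate overlap size and index growth for a sliding-window chunker."""
    overlap = int(chunk_size * overlap_frac)
    stride = chunk_size - overlap                        # tokens advanced per chunk
    n_chunks = -(-(corpus_tokens - overlap) // stride)   # ceiling division
    baseline = -(-corpus_tokens // chunk_size)           # chunk count at 0% overlap
    return overlap, n_chunks, n_chunks / baseline

# A 1M-token corpus with 512-token chunks:
for frac in (0.0, 0.10, 0.20):
    overlap, n, growth = overlap_stats(1_000_000, 512, frac)
    print(f"{frac:.0%} overlap -> {overlap} tokens, {n} chunks, {growth:.2f}x index size")
```

At 10% overlap the index grows by about 11%; at 20%, by about 25% — modest prices for not losing boundary content.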
Fixed-Size vs. Recursive vs. Semantic Chunking
Fixed-size chunking splits on character or token count regardless of sentence or paragraph boundaries. It's fast and predictable, but produces chunks that start or end mid-sentence. Most production systems shouldn't use pure fixed-size chunking.
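A minimal sketch of what pure fixed-size splitting does (character-based here for simplicity) — note how the cut lands mid-word, mid-sentence:

```python
def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Naive fixed-size splitter: slices on character count, ignoring all boundaries."""
    stride = size - overlap
    return [text[i:i + size] for i in range(0, len(text), stride)]

doc = "Indexes speed up reads. They slow down writes. Choose columns carefully."
for chunk in fixed_size_chunks(doc, size=30):
    print(repr(chunk))  # first chunk ends mid-word: '...They s'
```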
Recursive character splitting (the default in LangChain's RecursiveCharacterTextSplitter) first tries to split on `\n\n` (paragraph breaks), then `\n` (line breaks), then `". "` (sentence boundaries), then spaces (word boundaries), resorting to character-level splitting only as a last resort. This preserves natural boundaries while respecting a maximum chunk size. For most use cases, this is the right approach.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
# (newer releases: from langchain_text_splitters import RecursiveCharacterTextSplitter)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # characters here, because length_function=len
    chunk_overlap=51,     # ~10% of chunk_size
    length_function=len,  # swap in a token counter to size by tokens in production
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document)
```
Semantic chunking groups sentences by embedding similarity rather than character count. Adjacent sentences with high cosine similarity stay in the same chunk; a semantic "break" creates a new chunk. This produces the most coherent chunks but is 5-10x slower to index and produces variable-length chunks that can be hard to reason about.
Use semantic chunking when: your documents have dense, multi-topic sections that fixed-size splitting consistently breaks poorly. Avoid it for: high-volume pipelines where indexing speed matters, or documents with already clear paragraph structure.
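The core of semantic chunking fits in a few lines. This is a minimal sketch with the embedding function left pluggable — the toy `embed` lookup below is an illustration, not a real model:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, threshold=0.7):
    """Group consecutive sentences; start a new chunk when similarity drops below threshold."""
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:  # semantic break detected
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks

# Toy embeddings: "A" and "B" are similar, "C" is a topic shift.
toy = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
print(semantic_chunks(["A", "B", "C"], lambda s: toy[s]))  # → ['A B', 'C']
```

The 5-10x indexing slowdown comes from the per-sentence embedding calls, which this sketch makes explicit: one `embed` call per sentence rather than none at all.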
How to Diagnose Your Current Chunking
Run this evaluation on 50-100 representative queries from your production logs:
- For each query, retrieve the top 5 chunks
- Have a human (or GPT-4o) rate each retrieved chunk: 0 (irrelevant), 1 (partially relevant), 2 (directly answers the query)
- Calculate mean relevance score across all queries at your current chunk size
- Re-chunk the same document corpus at 256 and 512 tokens, repeat the evaluation
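The scoring step above reduces to a few lines. The ratings here are hypothetical placeholders for the human or LLM judgments:

```python
def mean_relevance(ratings_per_query):
    """ratings_per_query: per-query lists of 0/1/2 relevance ratings for the top-k chunks."""
    per_query = [sum(r) / len(r) for r in ratings_per_query]
    return sum(per_query) / len(per_query)

# Hypothetical ratings for three queries at two chunk sizes:
ratings_1024 = [[2, 1, 0, 0, 0], [1, 1, 1, 0, 0], [2, 0, 0, 0, 0]]
ratings_512  = [[2, 2, 1, 1, 0], [2, 1, 1, 1, 0], [2, 2, 1, 0, 0]]
print(f"1,024-token chunks: {mean_relevance(ratings_1024):.2f}")
print(f"  512-token chunks: {mean_relevance(ratings_512):.2f}")
```

Run this against your real logs, not synthetic queries — chunking problems only show up on the questions your users actually ask.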
If your 1,024-token chunks score 1.1 mean relevance and 512-token chunks score 1.6, that's a 45% relative improvement in retrieval quality — achievable without changing your embedding model or adding a reranker.
Metadata Filtering Compounds Chunking Quality
Good chunk size alone won't save you if you're retrieving across unrelated documents. Add metadata to every chunk at index time: document ID, section title, content type, date. Filter at query time:
```python
retriever = vectorstore.as_retriever(
    search_kwargs={
        "filter": {"content_type": "technical_docs"},
        "k": 5,
    }
)
```
Metadata filtering narrows the candidate set before similarity search, eliminating entire categories of false positives outright. Pair it with correctly sized chunks and you can often hit 85%+ relevance@3 on focused corpora.
Practical Sizing by Document Type
- Support tickets and chat logs: 128-256 tokens, 0% overlap (each message is self-contained)
- API documentation: 256-512 tokens, 10% overlap
- Long-form articles and blog posts: 512 tokens, 15% overlap
- Legal documents and contracts: 512-1,024 tokens, 20% overlap (clause context is critical)
- Code files: chunk by function or class boundary, not token count — a 50-line function is one semantic unit
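For Python sources, chunking by function or class boundary can be sketched with the standard-library `ast` module (Python 3.8+ for `end_lineno`):

```python
import ast

def chunk_python_by_function(source: str) -> list[str]:
    """Split a Python file into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

src = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''
print(len(chunk_python_by_function(src)))  # → 2: add() and Greeter
```

A production version would also want to keep module-level docstrings and imports as their own chunk, and attach the enclosing class name as metadata on method chunks.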
Use the RAG Chunk Size Calculator to get specific recommendations for your embedding model's context window and your target document type.