Your RAG pipeline retrieves irrelevant chunks 40% of the time. You've tried better embedding models, rerankers, and metadata filtering — but the root problem is that your chunks are the wrong size. Too small and each chunk lacks enough context to be meaningful. Too large and one relevant sentence gets buried in 900 tokens of noise.
Here's how to diagnose and fix your chunking strategy systematically.
Why Chunk Size Matters More Than Your Embedding Model
Embeddings capture semantic meaning — but only within the text they encode. A 512-token chunk about database indexing will embed differently from a 128-token excerpt of the same paragraph. The embedding isn't wrong; the input unit is.
The retrieval accuracy problem compounds at query time: if your query is 12 tokens and your chunks are 1,024 tokens, the similarity score gets diluted by the 1,012 tokens that aren't relevant to the query. Smaller chunks produce sharper embeddings for specific queries; larger chunks capture broader conceptual themes.
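The dilution effect can be simulated with random vectors standing in for real embeddings — a toy model, not a benchmark. A chunk embedding is approximated as the mean of one on-topic vector and some number of off-topic ones; the more off-topic content, the lower the query similarity:

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

random.seed(0)
dim = 64
query = [random.gauss(0, 1) for _ in range(dim)]

def chunk_embedding(on_topic, n_off_topic):
    """Simulated chunk embedding: mean of one on-topic vector and n off-topic ones."""
    vecs = [on_topic] + [[random.gauss(0, 1) for _ in range(dim)]
                         for _ in range(n_off_topic)]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

small = chunk_embedding(query, 1)    # short chunk: mostly the relevant passage
large = chunk_embedding(query, 15)   # long chunk: relevant passage buried in noise
print(f"small chunk: {cosine(query, small):.2f}, large chunk: {cosine(query, large):.2f}")
```

The small chunk scores markedly higher against the query even though both contain the same relevant content — that gap is the dilution described above.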
Benchmarks from LlamaIndex and LangChain's 2025 evals consistently show that 256-512 token chunks outperform 1,024-token chunks by 15-25% on precision@3 for question-answering tasks over technical documentation.
The Tradeoff Table
| Chunk Size | Strengths | Weaknesses | Best For |
|---|---|---|---|
| 128 tokens | Sharp embeddings, precise retrieval | Loses multi-sentence context | FAQ lookups, single-fact retrieval |
| 256 tokens | Good balance, fast indexing | May split mid-explanation | API docs, code comments |
| 512 tokens | Captures paragraph-level context | Slightly diluted embeddings | Technical articles, blog posts |
| 1,024 tokens | Preserves section-level meaning | Slow indexing, noisy embeddings | Legal contracts, long-form reports |
| 2,048 tokens | Full sections intact | Diluted embeddings, poor retrieval precision | Almost never the right choice |
For most RAG applications over technical documentation, 512 tokens is the empirically best default. Start there, measure, then adjust.
Overlap: The Number You're Probably Getting Wrong
Chunk overlap ensures that information near chunk boundaries doesn't disappear. If a key sentence falls between chunk N and chunk N+1, zero overlap means neither chunk captures it fully.
Standard recommendation: 10-20% overlap.
For 512-token chunks, that means 51-102 tokens of overlap. Concretely:
- 0% overlap: fast to index, high risk of missing boundary content
- 10% overlap (51 tokens): catches most boundary cases, minimal index size increase
- 20% overlap (102 tokens): safer for dense technical content, ~25% more vectors to store
Beyond 20% overlap, you're paying storage costs for diminishing retrieval gains. At 50% overlap, you're essentially chunking twice and getting no benefit.
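The storage cost is easy to estimate. Assuming a sliding-window chunker that advances by chunk_size − overlap tokens each step, the index grows by roughly a factor of 1/(1 − overlap fraction):

```python
def overlap_stats(corpus_tokens: int, chunk_size: int, overlap_frac: float):
    """Estimate overlap size and index growth for a sliding-window chunker."""
    overlap = int(chunk_size * overlap_frac)
    stride = chunk_size - overlap                        # tokens advanced per chunk
    n_chunks = -(-(corpus_tokens - overlap) // stride)   # ceiling division
    baseline = -(-corpus_tokens // chunk_size)           # chunk count at 0% overlap
    return overlap, n_chunks, n_chunks / baseline

# A 1M-token corpus with 512-token chunks:
for frac in (0.0, 0.10, 0.20):
    overlap, n, growth = overlap_stats(1_000_000, 512, frac)
    print(f"{frac:.0%} overlap -> {overlap} tokens, {n} chunks, {growth:.2f}x index size")
```

At 10% overlap the index grows by about 11%; at 20%, by about 25% — modest prices for not losing boundary content.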
Fixed-Size vs. Recursive vs. Semantic Chunking
Fixed-size chunking splits on character or token count regardless of sentence or paragraph boundaries. It's fast and predictable, but produces chunks that start or end mid-sentence. Most production systems shouldn't use pure fixed-size chunking.
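A minimal sketch of what pure fixed-size splitting does (character-based here for simplicity) — note how the cut lands mid-word, mid-sentence:

```python
def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Naive fixed-size splitter: slices on character count, ignoring all boundaries."""
    stride = size - overlap
    return [text[i:i + size] for i in range(0, len(text), stride)]

doc = "Indexes speed up reads. They slow down writes. Choose columns carefully."
for chunk in fixed_size_chunks(doc, size=30):
    print(repr(chunk))  # first chunk ends mid-word: '...They s'
```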
Recursive character splitting (the default in LangChain's RecursiveCharacterTextSplitter) first tries to split on `\n\n` (paragraph breaks), then `\n` (line breaks), then `". "` (sentence boundaries), then spaces (word boundaries), resorting to character-level splitting only as a last resort. This preserves natural boundaries while respecting a maximum chunk size. For most use cases, this is the right approach.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
# (newer releases: from langchain_text_splitters import RecursiveCharacterTextSplitter)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # characters here, because length_function=len
    chunk_overlap=51,     # ~10% of chunk_size
    length_function=len,  # swap in a token counter to size by tokens in production
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document)
```
Semantic chunking groups sentences by embedding similarity rather than character count. Adjacent sentences with high cosine similarity stay in the same chunk; a semantic "break" creates a new chunk. This produces the most coherent chunks but is 5-10x slower to index and produces variable-length chunks that can be hard to reason about.
Use semantic chunking when: your documents have dense, multi-topic sections that fixed-size splitting consistently breaks poorly. Avoid it for: high-volume pipelines where indexing speed matters, or documents with already clear paragraph structure.
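The core of semantic chunking fits in a few lines. This is a minimal sketch with the embedding function left pluggable — the toy `embed` lookup below is an illustration, not a real model:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, threshold=0.7):
    """Group consecutive sentences; start a new chunk when similarity drops below threshold."""
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:  # semantic break detected
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks

# Toy embeddings: "A" and "B" are similar, "C" is a topic shift.
toy = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
print(semantic_chunks(["A", "B", "C"], lambda s: toy[s]))  # → ['A B', 'C']
```

The 5-10x indexing slowdown comes from the per-sentence embedding calls, which this sketch makes explicit: one `embed` call per sentence rather than none at all.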
How to Diagnose Your Current Chunking
Run this evaluation on 50-100 representative queries from your production logs:
- For each query, retrieve the top 5 chunks
- Have a human (or GPT-4o) rate each retrieved chunk: 0 (irrelevant), 1 (partially relevant), 2 (directly answers the query)
- Calculate mean relevance score across all queries at your current chunk size
- Re-chunk the same document corpus at 256 and 512 tokens, repeat the evaluation
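The scoring step above reduces to a few lines. The ratings here are hypothetical placeholders for the human or LLM judgments:

```python
def mean_relevance(ratings_per_query):
    """ratings_per_query: per-query lists of 0/1/2 relevance ratings for the top-k chunks."""
    per_query = [sum(r) / len(r) for r in ratings_per_query]
    return sum(per_query) / len(per_query)

# Hypothetical ratings for three queries at two chunk sizes:
ratings_1024 = [[2, 1, 0, 0, 0], [1, 1, 1, 0, 0], [2, 0, 0, 0, 0]]
ratings_512  = [[2, 2, 1, 1, 0], [2, 1, 1, 1, 0], [2, 2, 1, 0, 0]]
print(f"1,024-token chunks: {mean_relevance(ratings_1024):.2f}")
print(f"  512-token chunks: {mean_relevance(ratings_512):.2f}")
```

Run this against your real logs, not synthetic queries — chunking problems only show up on the questions your users actually ask.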
If your 1,024-token chunks score 1.1 mean relevance and 512-token chunks score 1.6, that's a 45% relative improvement in retrieval quality — achievable without changing your embedding model or adding a reranker.
Metadata Filtering Compounds Chunking Quality
Good chunk size alone won't save you if you're retrieving across unrelated documents. Add metadata to every chunk at index time: document ID, section title, content type, date. Filter at query time:
```python
retriever = vectorstore.as_retriever(
    search_kwargs={
        "filter": {"content_type": "technical_docs"},
        "k": 5,
    }
)
```
Metadata filtering narrows the candidate set before similarity search, eliminating entire categories of false positives outright. Pair it with correctly sized chunks and you can often hit 85%+ relevance@3 on focused corpora.
Practical Sizing by Document Type
- Support tickets and chat logs: 128-256 tokens, 0% overlap (each message is self-contained)
- API documentation: 256-512 tokens, 10% overlap
- Long-form articles and blog posts: 512 tokens, 15% overlap
- Legal documents and contracts: 512-1,024 tokens, 20% overlap (clause context is critical)
- Code files: chunk by function or class boundary, not token count — a 50-line function is one semantic unit
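For Python sources, chunking by function or class boundary can be sketched with the standard-library `ast` module (Python 3.8+ for `end_lineno`):

```python
import ast

def chunk_python_by_function(source: str) -> list[str]:
    """Split a Python file into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

src = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''
print(len(chunk_python_by_function(src)))  # → 2: add() and Greeter
```

A production version would also want to keep module-level docstrings and imports as their own chunk, and attach the enclosing class name as metadata on method chunks.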
Use the RAG Chunk Size Calculator to get specific recommendations for your embedding model's context window and your target document type.