RAG Architectures: What Nobody Tells You About Chunking
The vector search is the easy part. Getting chunking right is where most RAG implementations quietly fail.
Everyone building RAG systems focuses on the retrieval model — the embedding quality, the vector store, the similarity metric. These matter. But in my experience, chunking strategy has more impact on output quality than any retrieval parameter, and it gets a fraction of the attention.
Fixed-size chunking (512 tokens, 20% overlap) is a reasonable baseline and works fine for dense technical documentation. It breaks down on narrative content — legal contracts, meeting transcripts — where meaning spans paragraphs. For those, I use semantic chunking: split on natural boundaries, then merge small adjacent chunks until you hit a token budget.
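The merge-until-budget step is easy to get wrong, so here's a minimal sketch of the approach, under stated assumptions: paragraph boundaries are blank lines, and token count is approximated by whitespace-split words (swap in your real tokenizer).

```python
# Semantic chunking sketch: split on natural boundaries, then greedily merge
# small adjacent chunks until a token budget is hit.

def semantic_chunks(text: str, budget: int = 512) -> list[str]:
    # Split on natural boundaries (blank lines between paragraphs).
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        tokens = len(para.split())  # crude proxy for a real tokenizer
        # Flush the running chunk once adding this paragraph would bust the budget.
        if current and current_tokens + tokens > budget:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Note the greedy merge never splits a paragraph: a single paragraph larger than the budget becomes its own oversized chunk, which in practice you'd hand to a fixed-size splitter as a fallback.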
The other under-discussed lever is query rewriting. A user query like 'what does the contract say about termination' is semantically distant from the clause it's asking about. A simple LLM call that expands the query into 3–5 hypothetical answer fragments before retrieval (HyDE) dramatically improves recall — and with a small, fast model for the expansion, the extra latency is easy to keep below what users notice.
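To make the shape of that pipeline concrete, here's a toy sketch. Everything here is illustrative: `llm` stands in for your model client, and the bag-of-words `embed` and cosine scoring stand in for a real embedding model and vector store — only the HyDE flow (generate hypotheticals, pool embeddings, retrieve) is the point.

```python
# HyDE-style retrieval sketch: expand the query into hypothetical answers,
# pool their embeddings with the query's, then rank the corpus by similarity.

from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyde_search(query: str, corpus: list[str], llm, k: int = 3) -> list[str]:
    # 1. Ask the LLM for a few hypothetical answer fragments.
    hypotheticals = llm(query)
    # 2. Pool the hypothetical embeddings with the raw query embedding.
    pooled = embed(query)
    for h in hypotheticals:
        pooled += embed(h)
    # 3. Retrieve by similarity to the pooled representation.
    ranked = sorted(corpus, key=lambda d: cosine(pooled, embed(d)), reverse=True)
    return ranked[:k]
```

The hypothetical fragments share vocabulary with the target clause ("terminate", "notice") that the raw question lacks, which is exactly why recall improves.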
Metadata filtering deserves its own mention. In a multi-tenant knowledge base, namespace-level filters cut retrieval time in half and eliminate cross-tenant bleed. Always store doc_id, section, date, and source as filterable metadata fields — you will need them.
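As a sketch of why the namespace filter eliminates cross-tenant bleed: filtering happens before scoring, so other tenants' records are never candidates at all. The record layout below mirrors the fields above; the dot-product scoring and the `filtered_search` name are illustrative stand-ins for your vector store's ANN query with a metadata filter.

```python
# In-memory sketch of namespace + metadata filtering applied before
# similarity scoring.

from dataclasses import dataclass

@dataclass
class Record:
    tenant: str
    doc_id: str
    section: str
    date: str
    source: str
    embedding: list[float]
    text: str

def filtered_search(index: list[Record], query_emb: list[float],
                    tenant: str, **filters) -> list[Record]:
    # Namespace filter first: only this tenant's records are ever scored,
    # so cross-tenant bleed is structurally impossible.
    candidates = [r for r in index
                  if r.tenant == tenant
                  and all(getattr(r, k) == v for k, v in filters.items())]
    # Score survivors by dot product (stand-in for the real similarity metric).
    candidates.sort(
        key=lambda r: sum(a * b for a, b in zip(query_emb, r.embedding)),
        reverse=True)
    return candidates
```

The same `**filters` path covers the other fields, e.g. `filtered_search(index, q, "acme", source="contracts")` to scope retrieval to one source within a tenant.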