RAG & Retrieval
Augment LLMs with external knowledge using Retrieval-Augmented Generation — vector databases, chunking strategies, hybrid search, and reranking.
Overview
Large language models are frozen snapshots. Once training ends, their knowledge is fixed at a cutoff date — they cannot access your company's internal documentation, yesterday's news, or any data that was private during training. Asking a model about these topics without additional context produces confident-sounding but incorrect answers: hallucinations.
Retrieval-Augmented Generation (RAG) solves this by turning the question "what do you know?" into "let me look that up." Before generating a response, the system retrieves relevant documents from an external knowledge base and injects them into the prompt as context. The model then answers using the retrieved evidence, not just its parametric memory.
This approach has three practical advantages: knowledge can be updated without retraining, the model can cite sources, and factual accuracy on domain-specific questions improves substantially. RAG has become the default architecture for production LLM systems that need to work with real, current, or private data.
The RAG Pipeline
RAG has two distinct phases: indexing (run once, or on document updates) and querying (run on every user request).
During indexing, each document is split into smaller chunks, each chunk is converted to an embedding vector using an embedding model, and the vectors are stored alongside the original text in a vector database.
During querying, the user's question is embedded with the same model, the vector store finds the chunks whose embeddings are closest to the query vector (approximate nearest neighbor search), and the top-K retrieved chunks are prepended to the LLM prompt as context.
Document Chunking Strategies
How you split documents has a larger impact on retrieval quality than most teams expect. A chunk that cuts a sentence in half, or that mixes two unrelated topics, produces an embedding that does not cleanly represent either topic — and the vector search will retrieve it unreliably.
Fixed-Size Chunking
Split every N characters (or tokens), with an optional overlap between adjacent chunks. Simple to implement and fast. It completely ignores document structure, so a chunk may start mid-sentence and end mid-paragraph.
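A minimal sketch of fixed-size chunking with overlap (character-based for simplicity; a token-based version would count tokenizer tokens instead):

```python
def fixed_size_chunks(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of `size` characters, with `overlap`
    characters shared between adjacent chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # advance by size minus overlap each time
    return [text[i:i + size] for i in range(0, len(text), step)]
```

The overlap means a sentence cut at one chunk boundary still appears whole in the neighboring chunk, at the cost of some duplicated storage.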
Recursive Character Splitting
Split on paragraph boundaries first, then sentence boundaries if the paragraph is still too long, then word boundaries if needed. This is the default strategy in LangChain's RecursiveCharacterTextSplitter. It produces more semantically coherent chunks than fixed-size splitting at the cost of variable chunk sizes.
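The core idea can be sketched without any library: try the coarsest separator first, and only fall back to finer ones for pieces that are still too long. (LangChain's actual implementation additionally merges small pieces back up toward the target size, which this sketch omits.)

```python
def recursive_split(
    text: str,
    max_len: int = 300,
    seps: tuple[str, ...] = ("\n\n", ". ", " "),
) -> list[str]:
    """Recursively split on the coarsest separator that yields
    small-enough pieces; hard-split if no separators remain."""
    if len(text) <= max_len:
        return [text]
    if not seps:
        # No separators left: fall back to a hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = seps[0], seps[1:]
    chunks: list[str] = []
    for part in text.split(sep):
        if len(part) <= max_len:
            chunks.append(part)
        else:
            chunks.extend(recursive_split(part, max_len, rest))
    return chunks
```

Note the splitting drops the separator itself (e.g. the trailing period when splitting on ". "); production splitters usually reattach it.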
Semantic Chunking
Embed consecutive sentences and split wherever the cosine similarity between adjacent sentence embeddings drops sharply — indicating a topic change. Produces the most semantically cohesive chunks but requires a full pass of embeddings before chunking, making it slower and more expensive.
| Strategy | Pros | Cons | Best for |
|---|---|---|---|
| Fixed-size | Simple, fast, predictable | Ignores semantics, cuts sentences | Quick prototypes, structured data |
| Recursive character | Respects natural boundaries, good default | Variable sizes can complicate batching | General prose, documentation |
| Semantic | Most coherent chunks, topic-aligned | Slow and expensive to compute | Long-form content, research papers |
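The similarity-drop rule behind semantic chunking can be illustrated with toy bag-of-words "embeddings" standing in for a real embedding model — only the split logic is the point here:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Group consecutive sentences; start a new chunk whenever the
    similarity between adjacent sentences drops below threshold."""
    if not sentences:
        return []
    # Toy stand-in for a real sentence-embedding model.
    embeddings = [Counter(s.lower().split()) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sent in zip(embeddings, embeddings[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sent])  # similarity dropped: topic change
        else:
            chunks[-1].append(sent)
    return chunks
```

With real embeddings, the threshold is often set adaptively (e.g. a percentile of the observed similarity drops) rather than as a fixed constant.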
Chunk size is a hyperparameter
There is no universally correct chunk size. Smaller chunks (128–256 tokens) give precise retrieval but lose surrounding context. Larger chunks (512–1024 tokens) preserve context but dilute the embedding signal with off-topic content. A practical starting point is 512 tokens with a 50-token overlap, then tune based on retrieval evaluation metrics like recall@K.
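Tuning against recall@K requires computing it; a minimal version, assuming each retrieved and relevant item is identified by a document ID string:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the
    top-K retrieved results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)
```

Averaging this over a labeled set of (query, relevant-docs) pairs at each candidate chunk size gives a direct signal for the tuning loop.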
Vector Databases
A vector database stores embedding vectors alongside their source documents and supports approximate nearest neighbor (ANN) search — finding the K vectors most similar to a query vector in milliseconds, even across millions of stored embeddings.
The main options differ in deployment model and feature set:
- Pinecone — fully managed cloud service; no infrastructure to operate; strong SDK support; not open-source
- Qdrant — open-source, self-hosted or managed cloud; built in Rust; supports payload filtering alongside vector search
- pgvector — a Postgres extension; adds a vector column type and ANN index to any existing Postgres database; ideal if you already run Postgres and want to avoid a separate service
- Chroma — open-source, embeds in-process (no server required); the easiest option for local development and testing
All vector databases expose the same two core operations: upsert (store a vector with an ID and optional metadata) and query (return the K nearest vectors to a given query vector, optionally filtered by metadata).
Similarity Search
The most common similarity measure for embeddings is cosine similarity:
similarity(a, b) = (a · b) / (||a|| × ||b||)

Cosine similarity measures the angle between two vectors, ignoring their magnitude. This matters because embedding models can produce vectors with different norms — a larger norm does not mean a "stronger" concept, it just reflects numeric scaling. Cosine similarity normalizes this out and returns a score between -1 and 1, where 1 means identical direction.
Dot product (unnormalized) is sometimes used instead, particularly when embeddings are trained with dot product as the similarity objective. If you normalize embeddings to unit length before storing them — which most embedding APIs do by default — dot product and cosine similarity are equivalent.
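The equivalence is easy to verify directly — after unit-normalization, the dot product of two vectors is exactly their cosine similarity:

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

a, b = [3.0, 4.0], [1.0, 2.0]
# Dot product of unit vectors equals cosine similarity of the originals.
assert abs(dot(normalize(a), normalize(b)) - cosine_similarity(a, b)) < 1e-9
```

This is why many indexes store pre-normalized vectors: the cheaper dot product can then be used at query time with no loss of correctness.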
Hybrid Search
Pure vector search has a well-known failure mode: it excels at semantic similarity but can miss exact keyword matches. If a user asks about "PCI-DSS compliance" or a specific product SKU like "XR-7700B", vector search may rank semantically related documents above the document that contains the exact term.
Hybrid search combines vector search with BM25 keyword search (a probabilistic relevance model derived from TF-IDF) and merges the two ranked lists using Reciprocal Rank Fusion (RRF).
RRF assigns each document a score of 1 / (k + rank) from each retriever (where k is a smoothing constant, typically 60), then sums the scores across retrievers. Documents that rank well in both retrievers score highest. The formula is robust to different score scales — you never need to normalize BM25 scores against cosine similarity scores.
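The fusion step described above fits in a few lines:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs: each document scores
    1 / (k + rank) in each list it appears in, summed across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note that only ranks enter the formula — the raw BM25 and cosine scores are discarded, which is exactly why no cross-retriever score normalization is needed.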
Reranking
ANN search is fast because it uses approximate algorithms and compressed indexes. This speed comes at a cost: the top-50 results from vector search are good candidates, but not necessarily in the best possible order. A cross-encoder reranker can correct this.
The two-stage approach works as follows:
- Stage 1 (retrieve) — fast ANN search returns top-50 candidate documents from the vector store
- Stage 2 (rerank) — a cross-encoder model scores each (query, document) pair jointly, attending to both texts simultaneously, and re-orders the 50 candidates; you keep only the top-5
Cross-encoders are significantly more accurate than bi-encoders (the embedding models used in stage 1) because they see both texts at once rather than comparing pre-computed vectors. They are also significantly slower — which is why you only apply them to the small candidate set, not the full index.
Popular reranking models include cross-encoder/ms-marco-MiniLM-L-6-v2 (open-source, fast) and Cohere Rerank (managed API).
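Structurally, stage 2 is just "score every pair, sort, truncate." The sketch below uses a toy word-overlap scorer as a stand-in for a real cross-encoder (with sentence-transformers, `score_pair` would wrap a `CrossEncoder` model's predictions instead):

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Stage 2: score each (query, doc) pair jointly, keep the best top_n."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def overlap_score(query: str, doc: str) -> float:
    """Toy scorer: fraction of query words present in the document.
    A stand-in for a real cross-encoder's relevance score."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)
```

The cost model is the key design point: the scorer runs once per candidate, so stage 1 must keep the candidate set small (tens, not thousands) for reranking to stay within a latency budget.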
Code Examples
The example below builds a minimal end-to-end RAG pipeline: it indexes a set of documents, then answers a question using retrieved context.
```python
# pip install openai chromadb
from openai import OpenAI
import chromadb

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("docs")


def index_documents(docs: list[str]) -> None:
    """Embed and store a list of document strings."""
    response = client.embeddings.create(
        input=docs,
        model="text-embedding-3-small",
    )
    embeddings = [item.embedding for item in response.data]
    collection.add(
        documents=docs,
        embeddings=embeddings,
        ids=[f"doc_{i}" for i in range(len(docs))],
    )


def rag_query(question: str, top_k: int = 3) -> str:
    """Retrieve relevant docs and generate an answer."""
    # Embed the question with the same model used for indexing
    query_response = client.embeddings.create(
        input=[question],
        model="text-embedding-3-small",
    )
    query_embedding = query_response.data[0].embedding

    # Find the top-K most similar document chunks
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
    )
    context = "\n\n".join(results["documents"][0])

    # Generate an answer grounded in the retrieved context
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"Answer the question using only the context below.\n\n{context}",
            },
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content


# --- Demo ---
sample_docs = [
    "RAG stands for Retrieval-Augmented Generation. It combines a retrieval system with a language model.",
    "Vector databases store embeddings and support approximate nearest neighbor search.",
    "BM25 is a keyword-based ranking algorithm derived from TF-IDF used in hybrid search.",
    "Chunking splits documents into smaller pieces before embedding to improve retrieval precision.",
    "Reranking is a two-stage approach: fast ANN retrieval followed by a cross-encoder reranker.",
]

index_documents(sample_docs)
answer = rag_query("What is the purpose of reranking in RAG?")
print(answer)
```

RAG vs Fine-tuning
RAG and fine-tuning are often presented as competing approaches, but they solve different problems. Choosing the wrong one wastes significant time and compute.
Decision guide: RAG or fine-tuning?
Use RAG when:
- The knowledge changes frequently (news, documentation, internal data)
- The data is private and cannot be included in training
- You need the model to cite sources or attribute claims
- Factual recall accuracy is the primary concern
Use fine-tuning when:
- You need the model to adopt a specific style, tone, or persona
- The task involves domain-specific vocabulary the base model tokenizes poorly
- You need a specific output format the model repeatedly fails to follow
- Inference latency matters and you want a smaller specialized model
The two approaches are also composable: a fine-tuned model that has learned a domain's vocabulary and output format can be combined with RAG for factual grounding. Many production systems use both.
Advanced Patterns
As RAG systems mature beyond the basic pipeline, several advanced techniques address specific failure modes.
HyDE (Hypothetical Document Embeddings) — instead of embedding the raw query, prompt the LLM to generate a hypothetical answer document first, then embed that document for retrieval. Hypothetical documents are often closer in embedding space to real relevant documents than short, keyword-style queries are.
Self-RAG — a model trained with a special workflow where it generates retrieval tokens mid-generation, decides when to retrieve, evaluates the relevance and accuracy of retrieved passages, and can critique its own output. Introduced by Asai et al. (2023); useful when the decision of whether to retrieve at all is non-trivial.
RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) — clusters document chunks, generates abstractive summaries of each cluster, then clusters and summarizes again recursively. The result is a tree of summaries at multiple levels of granularity, enabling retrieval at both fine and coarse levels depending on the question type.
Key Papers & Resources
"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" — Lewis et al., 2020 (arXiv:2005.11401). The paper that named and formalized RAG. Introduced the RAG-Token and RAG-Sequence models using DPR for retrieval and BART for generation. Essential reading for understanding the theoretical foundation.
"Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" — Asai et al., 2023 (arXiv:2310.11511). Introduces adaptive retrieval — the model learns when to retrieve and how to evaluate retrieved passages.
"RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval" — Sarthi et al., 2024 (arXiv:2401.18059). Hierarchical summarization for long-context retrieval.
LangChain RAG documentation — python.langchain.com/docs/use_cases/question_answering. Practical implementation guide covering splitters, vector stores, and retrieval chains.
LlamaIndex — docs.llamaindex.ai. A framework focused specifically on RAG and data indexing for LLMs; includes built-in support for many chunking strategies, vector stores, and rerankers.
Connections
Tokenization & Embeddings — embeddings are the core data structure in every RAG system. The quality of the embedding model sets a hard ceiling on retrieval recall. Understanding how sentence embeddings are produced clarifies why embedding model choice matters as much as the vector database choice.
Agents & Tool Use — in agentic systems, RAG is often implemented as a retrieval tool that the agent calls when it needs external information. The agent decides when to retrieve rather than retrieving on every turn, which avoids unnecessary latency and context window consumption.
Prompt Engineering — retrieved context is injected into the system prompt or user message. How that context is formatted — whether documents are labeled, truncated, or reordered — affects generation quality. Prompt engineering skills apply directly to the context assembly step of RAG.