Chunk · Embed · Rerank · Ground
An LLM knows what it was trained on — and nothing else. Ask it about recent events, proprietary documents, or your own codebase, and it will either guess wrong or say it doesn't know. Retrieval-Augmented Generation (RAG) solves this by giving the LLM access to external data at query time.
The idea is simple: when a question comes in, search a knowledge base for relevant documents, stuff them into the LLM's context window, and let the LLM answer from that material. The LLM doesn't need to know the answer — it just needs to read.
RAG follows a series of stages: Query → Retrieve → Rerank → Ground → Answer.
First, the user's question may be rewritten to improve search recall. Next, keyword or vector search pulls candidate chunks from a document store. A second-stage relevance model then reranks those candidates so the most relevant come first. The top chunks are formatted with citations so the LLM can reference them. Finally, the LLM reads the grounded context and produces an answer with source attribution.
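The stages above can be sketched as a single function that wires pluggable steps together. This is a minimal illustration, not a reference implementation: `answer_with_rag`, `format_context`, the `retrieve`/`rerank`/`generate`/`rewrite` callables, and the `{"text", "source"}` chunk shape are all hypothetical names chosen for this example.

```python
from typing import Callable

Chunk = dict  # assumed shape for this sketch: {"text": str, "source": str}


def format_context(chunks: list[Chunk]) -> str:
    """Number each chunk so the model can cite it as [1], [2], ..."""
    return "\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, start=1)
    )


def answer_with_rag(
    question: str,
    retrieve: Callable[[str], list[Chunk]],
    rerank: Callable[[str, list[Chunk]], list[Chunk]],
    generate: Callable[[str], str],
    rewrite: Callable[[str], str] = lambda q: q,  # query rewriting is optional
    top_n: int = 3,
) -> str:
    query = rewrite(question)                 # Query
    candidates = retrieve(query)              # Retrieve
    best = rerank(query, candidates)[:top_n]  # Rerank
    context = format_context(best)            # Ground
    prompt = (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return generate(prompt)                   # Answer
```

Passing each stage in as a callable keeps the sketch independent of any particular search index, reranker, or LLM client, so each piece can be swapped or stubbed out in tests.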
Initial retrieval (especially from vector search) returns candidates ranked by similarity, but the top results aren't always the most relevant. A reranker applies a more expensive, more accurate model to re-score the top-K candidates.
Score every candidate, sort by score descending, take the top N. The reranker looks at the full query-chunk pair together, giving more accurate relevance judgments than embedding similarity alone.
Concrete Example
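def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into word-based chunks that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # last chunk reached; avoids a redundant overlap-only tail
        start = end - overlap  # step back so adjacent chunks share words
    return chunks


def rerank(query, candidates, reranker, top_n):
    """Score each candidate against the query and keep the top_n best."""
    # reranker(query, text) is any callable that returns a relevance score
    scored = [(reranker(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda x: x[0], reverse=True)  # highest score first
    return [c for _, c in scored[:top_n]]

Chunking splits documents into overlapping pieces, so a passage that straddles a chunk boundary (up to the overlap length) still appears intact in one chunk. Reranking scores each candidate with a cross-encoder, sorts by relevance, and returns the top N. Together they help fit the most relevant content into the LLM's context window.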
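To see `rerank` in action without a real model, the sketch below plugs in a toy scorer. `overlap_score` is a hypothetical stand-in that counts shared words; it is not a cross-encoder, and the sample candidates are made up for illustration. `rerank` is repeated so the snippet runs on its own.

```python
def rerank(query, candidates, reranker, top_n):
    # Same logic as the rerank function in the example, repeated for a standalone run.
    scored = [(reranker(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [c for _, c in scored[:top_n]]


def overlap_score(query, text):
    """Toy scorer: fraction of distinct query words that also appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / max(len(query_words), 1)


candidates = [
    {"text": "Vector search finds nearest neighbors"},
    {"text": "A reranker scores each query and chunk pair"},
    {"text": "Chunk overlap keeps boundary sentences intact"},
]

top = rerank("how does a reranker score a chunk", candidates, overlap_score, top_n=2)
for c in top:
    print(c["text"])
```

Because the scorer is passed in as a plain callable, swapping the toy function for a real cross-encoder should only require a function with the same `(query, text) -> score` shape.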
Split documents into overlapping pieces that fit the LLM's context window.
Convert chunks and queries to vectors, find nearest neighbors by similarity.
Apply a second-stage relevance model to sharpen precision at the top of results.
Reformulate vague queries for better retrieval recall.
Include retrieved chunks with source citations for attributable answers.
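The embedding step in the list above (convert chunks and queries to vectors, then find nearest neighbors) can be illustrated with a toy example. The sketch assumes a bag-of-words `embed` function in place of a learned embedding model and a brute-force scan in place of a vector index; `embed`, `cosine`, and `retrieve` are hypothetical names for this illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "rerankers score query and chunk pairs",
    "chunks overlap at their boundaries",
    "citations point back to source documents",
]
hits = retrieve("source citations", chunks, k=1)
print(hits)
```

Word counts only match exact tokens, which is one reason first-stage results can miss relevant text and benefit from a reranking pass.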