One-line definition
TF-IDF scores a term in a document as term frequency × inverse document frequency — frequent-in-this-doc but rare-across-the-corpus terms score highest. BM25 is the saturated, length-normalized refinement of the same idea and is the standard lexical retrieval baseline in search and hybrid RAG.
Why it matters
Despite the rise of dense embeddings, lexical retrieval has not gone away: hybrid retrieval (BM25 + dense) often beats either alone, especially for rare terms, exact identifiers, codes, and out-of-domain queries where embeddings generalize poorly. Any RAG or search interview expects you to know BM25 as the baseline, why it works, and where it fails. It’s also a cheap win: a tuned BM25 stage in front of a reranker is hard to beat per dollar.
Term frequency and its problem
Raw count is a poor signal on its own: a term appearing 10× isn’t 10× as relevant as a term appearing once, because relevance saturates. So TF is damped before being used, e.g. logarithmically as $\log(1 + \mathrm{tf})$. This saturation idea is exactly what BM25 formalizes.
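To make the damping concrete, here is a minimal sketch that assumes the logarithmic form `log(1 + count)`. The function name `damped_tf` is illustrative, and other damping functions (such as a square root) are equally plausible.

```python
import math


def damped_tf(count: int) -> float:
    """Log-damped term frequency: log(1 + count). One common choice, not the only one."""
    return math.log1p(count)


# Compare how the damped value grows as the raw count grows by 10x each step.
for count in (1, 10, 100):
    print(count, round(damped_tf(count), 3))
```

Under this assumption, going from 1 to 10 occurrences raises the damped value by roughly 3.5×, not 10×.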
Inverse document frequency
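Rare terms are more discriminative. IDF down-weights common terms (“the”, “data”) and up-weights rare ones:
$$\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}$$
where $N$ is the number of documents and $\mathrm{df}(t)$ is how many contain term $t$. TF-IDF is then $\mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$, summed over query terms.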
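As a rough illustration of TF-IDF scoring, the sketch below scores a tokenized query against a toy corpus. The log-damped TF, the unsmoothed `log(N / df)` IDF, the function name `tfidf_score`, and the corpus itself are all assumptions made for the example; real systems differ in tokenization and weighting details.

```python
import math
from collections import Counter


def tfidf_score(query: list[str], doc: list[str], corpus: list[list[str]]) -> float:
    """Sum of damped-TF x IDF over query terms (illustrative variant)."""
    n_docs = len(corpus)
    counts = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        if df == 0 or counts[term] == 0:
            continue  # term absent from corpus or from this document
        tf = math.log1p(counts[term])  # assumed damping: log(1 + count)
        idf = math.log(n_docs / df)    # zero when the term is in every document
        score += tf * idf
    return score


corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the quantum cat paradox".split(),
]
query = ["the", "quantum"]
scores = [tfidf_score(query, d, corpus) for d in corpus]
print(scores)
```

Because "the" occurs in every document, its IDF is zero and it contributes nothing; only the document containing the rare term "quantum" receives a non-zero score.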
BM25 — the production formula
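BM25 (Best Match 25, from the Okapi system) adds two things TF-IDF lacks: bounded TF saturation and document-length normalization:
$$\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t, d)\,(k_1 + 1)}{\mathrm{tf}(t, d) + k_1 \left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}$$
- $\mathrm{tf}(t, d)$: term frequency in the document.
- $k_1$ (≈1.2–2.0): controls TF saturation. As $\mathrm{tf}(t, d) \to \infty$ the term’s contribution asymptotes to $\mathrm{idf}(t)\,(k_1 + 1)$, so spamming a keyword has diminishing returns.
- $b$ (≈0.75): controls length normalization via $|d| / \mathrm{avgdl}$ (document length over average document length), so long documents don’t win just by containing more words.
- BM25 typically uses a smoothed IDF, e.g. $\mathrm{idf}(t) = \log\left(1 + \frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}\right)$.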
BM25 is the default scorer in Lucene / Elasticsearch / OpenSearch.
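The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not how any search engine implements it: whitespace tokenization, the toy corpus, the function name `bm25_scores`, and the particular smoothed IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))` are assumptions for the example.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.2, b=0.75):
    """Score every tokenized document in `corpus` against a tokenized `query`."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in corpus for term in set(d))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        # Length-dependent part of the denominator.
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score = 0.0
        for term in query:
            tf = counts[term]
            if tf == 0:
                continue
            # Assumed smoothed IDF variant; always positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores


corpus = [
    "error code E4021 in payment service".split(),
    "payment service overview and payment flows and payment retries".split(),
    "general troubleshooting guide".split(),
]
scores = bm25_scores(["E4021", "payment"], corpus)
print(scores)
```

In this toy corpus the first document ranks highest because it matches the rare identifier, while the second document's three occurrences of "payment" are both saturated and penalized for above-average length.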
TF-IDF vs BM25
| | TF saturation | Length normalization | Tunable | Status |
|---|---|---|---|---|
| TF-IDF | logarithmic, unbounded-ish | implicit (cosine on normalized vectors) | no | teaching baseline |
| BM25 | explicit, bounded by $k_1$ | explicit via $b$ | yes ($k_1$, $b$) | production lexical baseline |
When people say “lexical baseline,” they mean BM25, not raw TF-IDF.
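The saturation difference between the two can be seen numerically. The sketch below compares the BM25 TF factor for an average-length document (so the length term equals 1) against a log-damped TF; both function names and the choice of `log(1 + tf)` for the TF-IDF side are assumptions for illustration.

```python
import math


def bm25_tf_component(tf: float, k1: float = 1.2) -> float:
    """BM25 TF factor when document length equals the average length."""
    return tf * (k1 + 1) / (tf + k1)


def log_tf(tf: float) -> float:
    """Log-damped TF, one common TF-IDF variant."""
    return math.log1p(tf)


# The BM25 factor approaches k1 + 1 = 2.2; the log keeps growing.
for tf in (1, 10, 100, 1000):
    print(tf, round(bm25_tf_component(tf), 3), round(log_tf(tf), 3))
```

The BM25 factor never exceeds $k_1 + 1$ no matter how often the term repeats, whereas the logarithm grows slowly but without bound.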
Lexical vs dense (and why hybrid wins)
| | Lexical (BM25) | Dense (embeddings) |
|---|---|---|
| Matches | exact tokens | semantic similarity |
| Rare terms / IDs / codes | strong | weak |
| Synonyms / paraphrase | weak | strong |
| Out-of-domain | robust | degrades |
| Index | inverted index, cheap | ANN over vectors |
Hybrid retrieval fuses the two — commonly Reciprocal Rank Fusion (RRF) over the two ranked lists, or a weighted score combination — then a cross-encoder reranks the top candidates. This is the standard production retrieval stack.
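Reciprocal Rank Fusion is simple enough to sketch directly: each document earns `1 / (k + rank)` from every list it appears in, and the sums are sorted. The constant `k = 60` is a commonly used default rather than a requirement, and the document IDs and the function name `reciprocal_rank_fusion` below are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; ranks are 1-based."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda d: scores[d], reverse=True)


bm25_ranking = ["d3", "d1", "d7"]   # hypothetical lexical results
dense_ranking = ["d1", "d5", "d3"]  # hypothetical dense results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

RRF uses only ranks, so it sidesteps the problem that BM25 scores and embedding similarities live on different scales; a weighted score combination would need some normalization first.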
What an interviewer expects you to say
- Define TF-IDF = TF × IDF, and explain why TF is damped (relevance saturates) and why IDF weights rare terms (discriminativeness).
- Explain that BM25 adds bounded TF saturation ($k_1$) and length normalization ($b$), and that it, not TF-IDF, is the real lexical baseline.
- Compare lexical vs dense and argue for hybrid retrieval + reranking, calling out that lexical wins on rare terms, IDs, and out-of-domain queries.
- Bonus: mention RRF for fusion and that BM25 is the Lucene/Elasticsearch default.
Common confusions
- “Embeddings made BM25 obsolete.” No. BM25 can still beat dense retrieval on exact-match-heavy queries and out-of-domain corpora, which is why hybrid is standard.
- “TF-IDF and BM25 are the same.” BM25’s saturation and length normalization matter a lot in practice; raw TF-IDF over-rewards keyword stuffing and long documents.
- “IDF measures importance.” It measures rarity / discriminativeness in the corpus, which correlates with but isn’t the same as semantic importance.
- “Higher term frequency is always better.” BM25 explicitly caps the benefit; that’s the point of $k_1$.
- “BM25 understands meaning.” It’s purely lexical — no synonyms, no semantics. That gap is exactly what dense retrieval fills.
Related: RAG overview, Designing a RAG system that actually works, Approximate nearest neighbors, Word embeddings, Two-tower retrieval.