TF-IDF and BM25

The classical lexical retrieval scores that still anchor production search and RAG. Why term frequency needs damping, why inverse document frequency weights rare terms, and why BM25 — not raw TF-IDF — is the lexical baseline you must beat.

Reviewed May 31, 2026 · 4 min read

One-line definition

TF-IDF scores a term in a document as term frequency × inverse document frequency — frequent-in-this-doc but rare-across-the-corpus terms score highest. BM25 is the saturated, length-normalized refinement of the same idea and is the standard lexical retrieval baseline in search and hybrid RAG.

Why it matters

Despite dense embeddings, lexical retrieval has not gone away. Hybrid retrieval (BM25 + dense) usually beats either alone, especially for rare terms, exact identifiers, codes, and out-of-domain queries where embeddings generalize poorly. Any RAG or search interview expects you to know BM25 as the baseline, why it works, and where it fails. It’s also a cheap, large win: a tuned BM25 stage in front of a reranker is hard to beat per dollar.

Term frequency and its problem

Raw counts overstate relevance: a term appearing 10× isn’t 10× as relevant as one appearing once, because relevance saturates. So TF is damped, e.g. $1 + \log(\text{count})$ for a nonzero count, before being used. BM25 formalizes the same idea with a bounded saturation curve.
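To make the damping concrete, here is a minimal sketch of logarithmic TF. The function name `damped_tf` and the choice of the natural log are assumptions for illustration; other log bases only rescale the values.

```python
import math

def damped_tf(count: int) -> float:
    """Sublinear TF: 0 for an absent term, 1 + ln(count) otherwise."""
    return 1.0 + math.log(count) if count > 0 else 0.0

# Growth slows as the raw count rises: 10 occurrences score about 3.3x
# a single occurrence, not 10x.
for count in (1, 2, 10, 100):
    print(count, round(damped_tf(count), 2))
```

Note that this curve still grows without bound as the count increases; it only grows slowly.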

Inverse document frequency

Rare terms are more discriminative. IDF down-weights common terms (“the”, “data”) and up-weights rare ones:

$$\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)},$$

where $N$ is the number of documents and $\mathrm{df}(t)$ is how many contain term $t$. TF-IDF is then $\mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$, summed over query terms.
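The scoring rule above can be sketched in a few lines. This is an illustrative version under stated assumptions: the toy corpus, whitespace tokenization, and the helper names `idf` and `tfidf_score` are all hypothetical, and TF uses the $1 + \log(\text{count})$ damping mentioned earlier.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the quarterly data report".split(),
]

def idf(term, docs):
    """log(N / df); returns 0.0 for a term that appears in no document."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_score(query, doc, docs):
    """Sum of damped tf * idf over the query terms present in doc."""
    counts = Counter(doc)
    score = 0.0
    for term in query:
        if counts[term]:
            score += (1.0 + math.log(counts[term])) * idf(term, docs)
    return score

query = "the cat".split()
scores = [tfidf_score(query, d, docs) for d in docs]
```

Because "the" occurs in every document, its IDF is $\log(3/3) = 0$ and it contributes nothing; only "cat" separates the first two documents from the third.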

BM25 — the production formula

BM25 (Best Match 25, from the Okapi system) adds two things TF-IDF lacks, bounded TF saturation and document-length normalization:

$$\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{f(t, d)\,(k_{1} + 1)}{f(t, d) + k_{1}\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)},$$

$f(t, d)$: frequency of term $t$ in document $d$.
$k_{1}$ (≈1.2–2.0): controls TF saturation. As $f \to \infty$ the term’s contribution approaches $\mathrm{idf}(t) \cdot (k_{1} + 1)$, so spamming a keyword has diminishing returns.
$b$ (≈0.75): controls length normalization via $|d| / \mathrm{avgdl}$, where $|d|$ is the document length in tokens and $\mathrm{avgdl}$ is the average document length in the corpus, so long documents don’t win just by containing more words.
BM25 typically uses a smoothed IDF, $\log \frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}$.
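The formula and parameters above can be sketched directly. This is a minimal in-memory version, not how an inverted-index engine computes it; the function name `bm25_scores`, the toy corpus, and the defaults `k1=1.2`, `b=0.75` (picked from the ranges above) are assumptions for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        # Length-normalized part of the denominator.
        norm = k1 * (1.0 - b + b * len(doc) / avgdl)
        score = 0.0
        for term in query:
            f = counts[term]
            if f == 0:
                continue
            # Smoothed IDF as written above. It turns negative when a term
            # is in more than half the documents; some implementations add
            # 1 inside the log to avoid that.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * f * (k1 + 1.0) / (f + norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus: the second document repeats the keyword 3x.
docs = [
    "bm25 ranks documents".split(),
    "bm25 bm25 bm25".split(),
    "dense vectors capture meaning".split(),
    "inverted index lookup".split(),
    "cross encoder reranker".split(),
]
scores = bm25_scores(["bm25"], docs)
```

On this corpus the keyword-stuffed document scores higher than the single-mention one, but by well under 3×, and its score stays below the ceiling $\mathrm{idf}(t) \cdot (k_{1} + 1)$.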

BM25 is the default scorer in Lucene / Elasticsearch / OpenSearch.

TF-IDF vs BM25

	TF saturation	Length normalization	Tunable	Status
TF-IDF	logarithmic damping, still unbounded	implicit (cosine on normalized vectors)	no	teaching baseline
BM25	explicit, bounded by $k_{1}$	explicit via $b$	$k_{1}, b$	production lexical baseline

When people say “lexical baseline,” they mean BM25, not raw TF-IDF.

Lexical vs dense (and why hybrid wins)

	Lexical (BM25)	Dense (embeddings)
Matches	exact tokens	semantic similarity
Rare terms / IDs / codes	strong	weak
Synonyms / paraphrase	weak	strong
Out-of-domain	robust	degrades
Index	inverted index, cheap	ANN over vectors

Hybrid retrieval fuses the two — commonly Reciprocal Rank Fusion (RRF) over the two ranked lists, or a weighted score combination — then a cross-encoder reranks the top candidates. This is the standard production retrieval stack.
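As a rough sketch of the fusion step, RRF gives each document the sum of $1 / (k + \text{rank})$ over the lists it appears in. The function name, the document ids, and the constant `k=60` (a commonly used default, not something this article specifies) are assumptions for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; ranks are 1-based, best first."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical top-3 lists from a BM25 stage and a dense stage.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d5", "d3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

RRF uses only ranks, so it avoids having to calibrate BM25 scores against cosine similarities, which live on different scales. Documents found by both retrievers (here `d1` and `d3`) rise above those found by only one.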

What an interviewer expects you to say

Define TF-IDF = TF × IDF, and explain why TF is damped (relevance saturates) and why IDF weights rare terms (discriminativeness).
Explain that BM25 adds bounded TF saturation ($k_{1}$) and length normalization ($b$), and that BM25, not TF-IDF, is the real lexical baseline.
Compare lexical vs dense and argue for hybrid retrieval + reranking, calling out that lexical wins on rare terms, IDs, and out-of-domain queries.
Bonus: mention RRF for fusion and that BM25 is the Lucene/Elasticsearch default.

Common confusions

“Embeddings made BM25 obsolete.” No. BM25 often still beats dense retrieval on exact-match-heavy queries and out-of-domain corpora, which is why hybrid is standard.
“TF-IDF and BM25 are the same.” BM25’s saturation and length normalization matter a lot in practice; raw TF-IDF over-rewards keyword stuffing and long documents.
“IDF measures importance.” It measures rarity / discriminativeness in the corpus, which correlates with but isn’t the same as semantic importance.
“Higher term frequency is always better.” BM25 explicitly caps the benefit; that’s the point of $k_{1}$.
“BM25 understands meaning.” It’s purely lexical — no synonyms, no semantics. That gap is exactly what dense retrieval fills.