One-line definition
Naive Bayes models the joint $p(x, y) = p(y)\prod_{j} p(x_j \mid y)$, assuming features are conditionally independent given the class $y$. Trained by counting (closed-form MLE for each conditional).
Why it matters
Naive Bayes is the cheapest possible probabilistic classifier: closed-form MLE, $O(Nd)$ training (a single counting pass over $N$ examples with $d$ features), $O(Kd)$ prediction over $K$ classes, no hyperparameter tuning. Despite the obviously wrong independence assumption, it works remarkably well as a baseline on:
- Text classification (spam filtering, topic categorization, sentiment): bag-of-words features.
- Tiny-data classification where logistic regression overfits.
- Initial baselines that should be beaten before claiming victory with a fancier model.
It is also conceptually important: the canonical example of a generative classifier (model $p(x, y)$) versus the discriminative logistic regression (model $p(y \mid x)$ directly).
The model
By Bayes’ rule:

$$p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)} \propto p(x \mid y)\,p(y).$$

The “naive” assumption: $p(x \mid y) = \prod_{j=1}^{d} p(x_j \mid y)$. Then for prediction:

$$\hat{y} = \arg\max_{y} \Big[ \log p(y) + \sum_{j=1}^{d} \log p(x_j \mid y) \Big].$$
Sum logs to avoid underflow.
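A minimal NumPy sketch of this prediction rule for the multinomial case (array names and shapes like `log_prior` and `log_lik` are illustrative assumptions, not fixed by the text):

```python
import numpy as np

def predict(x_counts, log_prior, log_lik):
    """Return argmax_y [ log p(y) + sum_w count(w) * log p(w | y) ].

    x_counts : (V,) token counts for one document
    log_prior: (K,) log p(y) for each of K classes
    log_lik  : (K, V) log p(word | y) per class and vocabulary word
    """
    # Summing log-probabilities instead of multiplying probabilities
    # keeps the score in a numerically safe range (no underflow).
    scores = log_prior + log_lik @ x_counts
    return int(np.argmax(scores))
```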
Variants
| Variant | Likelihood $p(x_j \mid y)$ | Use case |
|---|---|---|
| Multinomial | Multinomial over token counts | Text (bag-of-words) |
| Bernoulli | Bernoulli (binary) per feature | Text (presence/absence) |
| Gaussian | Gaussian per feature | Continuous features |
| Categorical | Categorical (multinoulli) | Discrete features |
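In scikit-learn each variant is its own estimator; a minimal sketch on made-up count data (the arrays here are purely illustrative):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, GaussianNB, MultinomialNB

# Toy data: 4 "documents" over a 3-word vocabulary, 2 classes.
X_counts = np.array([[2, 0, 1],
                     [3, 1, 0],
                     [0, 2, 2],
                     [0, 3, 1]])
y = np.array([0, 0, 1, 1])

MultinomialNB(alpha=1.0).fit(X_counts, y)        # token counts (bag-of-words)
BernoulliNB(alpha=1.0).fit((X_counts > 0), y)    # presence/absence per word
GaussianNB().fit(X_counts.astype(float), y)      # treat columns as continuous
CategoricalNB(alpha=1.0).fit(X_counts, y)        # each column as a discrete value
```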
Training
Just count.
For multinomial NB on text:
- $\hat{p}(y = c) = \dfrac{N_c}{N}$ (fraction of training documents in class $c$).
- $\hat{p}(w \mid c) = \dfrac{\mathrm{count}(w, c) + \alpha}{\sum_{w'} \mathrm{count}(w', c) + \alpha V}$ (with Laplace / additive smoothing $\alpha$, typically 1.0; $V$ = vocabulary size).

Without smoothing, any unseen word in a class gives $\hat{p}(w \mid c) = 0$ and $\log \hat{p}(w \mid c) = -\infty$, breaking inference.
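A minimal counting-based fit matching the formulas above (the function name, shapes, and return convention are assumptions for illustration):

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Closed-form MLE with additive (Laplace) smoothing.

    X: (N, V) matrix of token counts, y: (N,) integer class labels.
    Returns log p(y) of shape (K,) and log p(w | y) of shape (K, V).
    """
    classes = np.unique(y)
    V = X.shape[1]
    log_prior = np.empty(len(classes))
    log_lik = np.empty((len(classes), V))
    for i, c in enumerate(classes):
        Xc = X[y == c]
        log_prior[i] = np.log(len(Xc) / len(X))   # p(y=c) = N_c / N
        counts = Xc.sum(axis=0)                   # count(w, c) for every word w
        # p(w|c) = (count(w,c) + alpha) / (sum_w' count(w',c) + alpha*V)
        log_lik[i] = np.log((counts + alpha) / (counts.sum() + alpha * V))
    return log_prior, log_lik
```

This pairs directly with the log-space prediction sketch earlier.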
Why the independence assumption isn’t fatal
Even though words are clearly correlated, naive Bayes can still rank classes correctly. The independence assumption gives biased, overconfident probability estimates (predicted probabilities tend to pile up near 0 or 1), but the argmax is often right.
For pure classification accuracy, NB is competitive. For calibrated probabilities, prefer logistic regression with proper regularization.
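A quick way to see the overconfidence is to compare how much probability mass each model pushes toward 0 or 1 on data with redundant (correlated) features; the synthetic setup below is only a toy illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Redundant features are linear combinations of informative ones, i.e. correlated.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
    p = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    extreme = np.mean((p < 0.01) | (p > 0.99))   # mass piled near 0 or 1
    print(f"{type(model).__name__:>18}: acc={model.score(X_te, y_te):.3f}, "
          f"frac extreme probs={extreme:.2f}")
```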
Generative vs. discriminative
Naive Bayes models the joint $p(x, y)$. Logistic regression models $p(y \mid x)$ directly. Asymptotic results (Ng & Jordan, 2002):
- For small $n$, NB usually wins (less variance from the strong assumption).
- For large $n$, logistic regression catches up and surpasses NB (the assumption hurts at large scale); see the sketch below.
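A rough way to see the crossover on synthetic continuous data (a toy setup, not a reproduction of Ng & Jordan's experiments; Gaussian NB is used here because the features are continuous):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=25000, n_features=30, n_informative=10,
                           n_redundant=10, random_state=0)
X_pool, X_te, y_pool, y_te = train_test_split(X, y, test_size=5000, random_state=0)

# Train both models on growing slices of the pool and compare held-out accuracy.
for n in (20, 100, 1000, 10000):
    nb = GaussianNB().fit(X_pool[:n], y_pool[:n])
    lr = LogisticRegression(max_iter=2000).fit(X_pool[:n], y_pool[:n])
    print(f"n={n:>6}  NB acc={nb.score(X_te, y_te):.3f}  LR acc={lr.score(X_te, y_te):.3f}")
```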
Where it shows up in 2026
- Spam filters in low-resource embedded systems.
- Quick text baselines before training a transformer.
- Document filtering in retrieval pipelines (cheap pre-filter).
For most modern NLP, neural classifiers dominate. NB persists in resource-constrained settings and as a reliable benchmark.
Common pitfalls
- Forgetting to smooth. Without Laplace smoothing, any test document with a vocabulary token never seen in a class gets that class’s posterior set to 0.
- Using NB on highly correlated features. Probabilities become very poorly calibrated; predicted class can still be okay but never trust the probability.
- Mixing variants. Multinomial NB on continuous features is wrong; use Gaussian NB (or discretize first).
- Comparing NB on raw counts vs. tf-idf vs. binarized features. Different preprocessing changes the model class; compare apples to apples (see the toy sketch below).
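One way to keep such comparisons honest is to hold everything else fixed and swap only the vectorizer/variant pair, as in this toy sketch (the documents and labels are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["free money now", "meeting at noon", "free offer click now", "lunch meeting today"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = ham (toy labels)

pipelines = {
    "counts + MultinomialNB": make_pipeline(CountVectorizer(), MultinomialNB()),
    "binary + BernoulliNB":   make_pipeline(CountVectorizer(binary=True), BernoulliNB()),
    "tf-idf + MultinomialNB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}
for name, pipe in pipelines.items():
    pipe.fit(docs, labels)
    print(name, pipe.predict(["free lunch now"]))
```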