One-line definition
Naive Bayes models the joint $p(x, y) = p(y)\prod_{j} p(x_j \mid y)$, assuming features are conditionally independent given the class $y$. Trained by counting (closed-form MLE for each conditional).
Why it matters
Naive Bayes is the cheapest possible probabilistic classifier: closed-form MLE, $O(Nd)$ training (a single counting pass over $N$ examples with $d$ features), $O(Kd)$ prediction over $K$ classes, no hyperparameter tuning. Despite the obviously wrong independence assumption, it works remarkably well as a baseline on:
- Text classification (spam filtering, topic categorization, sentiment): bag-of-words features.
- Tiny-data classification where logistic regression overfits.
- Initial baselines that should be beaten before claiming victory with a fancier model.
It is also conceptually important: the canonical example of a generative classifier (model $p(x, y)$) versus the discriminative logistic regression (model $p(y \mid x)$ directly).
The model
By Bayes’ rule:

$$p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)} \propto p(x \mid y)\,p(y).$$

The “naive” assumption: $p(x \mid y) = \prod_{j=1}^{d} p(x_j \mid y)$. Then for prediction:

$$\hat{y} = \arg\max_{y} \Big[ \log p(y) + \sum_{j=1}^{d} \log p(x_j \mid y) \Big].$$
Sum logs to avoid underflow.
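A minimal NumPy sketch of this prediction rule for the multinomial case (array names and shapes like `log_prior` and `log_lik` are illustrative assumptions, not fixed by the text):

```python
import numpy as np

def predict(x_counts, log_prior, log_lik):
    """Return argmax_y [ log p(y) + sum_w count(w) * log p(w | y) ].

    x_counts : (V,) token counts for one document
    log_prior: (K,) log p(y) for each of K classes
    log_lik  : (K, V) log p(word | y) per class and vocabulary word
    """
    # Summing log-probabilities instead of multiplying probabilities
    # keeps the score in a numerically safe range (no underflow).
    scores = log_prior + log_lik @ x_counts
    return int(np.argmax(scores))
```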
Variants
| Variant | Likelihood $p(x_j \mid y)$ | Use case |
|---|---|---|
| Multinomial | Multinomial over token counts | Text (bag-of-words) |
| Bernoulli | Bernoulli (binary) per feature | Text (presence/absence) |
| Gaussian | Gaussian per feature | Continuous features |
| Categorical | Categorical (multinoulli) | Discrete features |
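In scikit-learn each variant is its own estimator; a minimal sketch on made-up count data (the arrays here are purely illustrative):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, GaussianNB, MultinomialNB

# Toy data: 4 "documents" over a 3-word vocabulary, 2 classes.
X_counts = np.array([[2, 0, 1],
                     [3, 1, 0],
                     [0, 2, 2],
                     [0, 3, 1]])
y = np.array([0, 0, 1, 1])

MultinomialNB(alpha=1.0).fit(X_counts, y)        # token counts (bag-of-words)
BernoulliNB(alpha=1.0).fit((X_counts > 0), y)    # presence/absence per word
GaussianNB().fit(X_counts.astype(float), y)      # treat columns as continuous
CategoricalNB(alpha=1.0).fit(X_counts, y)        # each column as a discrete value
```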
Training
Just count.
For multinomial NB on text:
- $\hat{p}(y = c) = \dfrac{N_c}{N}$ (fraction of training documents in class $c$).
- $\hat{p}(w \mid c) = \dfrac{\mathrm{count}(w, c) + \alpha}{\sum_{w'} \mathrm{count}(w', c) + \alpha V}$ (with Laplace / additive smoothing $\alpha$, typically 1.0; $V$ = vocabulary size).

Without smoothing, any unseen word in a class gives $\hat{p}(w \mid c) = 0$ and $\log \hat{p}(w \mid c) = -\infty$, breaking inference.
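A minimal counting-based fit matching the formulas above (the function name, shapes, and return convention are assumptions for illustration):

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Closed-form MLE with additive (Laplace) smoothing.

    X: (N, V) matrix of token counts, y: (N,) integer class labels.
    Returns log p(y) of shape (K,) and log p(w | y) of shape (K, V).
    """
    classes = np.unique(y)
    V = X.shape[1]
    log_prior = np.empty(len(classes))
    log_lik = np.empty((len(classes), V))
    for i, c in enumerate(classes):
        Xc = X[y == c]
        log_prior[i] = np.log(len(Xc) / len(X))   # p(y=c) = N_c / N
        counts = Xc.sum(axis=0)                   # count(w, c) for every word w
        # p(w|c) = (count(w,c) + alpha) / (sum_w' count(w',c) + alpha*V)
        log_lik[i] = np.log((counts + alpha) / (counts.sum() + alpha * V))
    return log_prior, log_lik
```

This pairs directly with the log-space prediction sketch earlier.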
Why the independence assumption isn’t fatal
Even though words are clearly correlated, naive Bayes can still rank classes correctly. The independence assumption gives biased, overconfident probability estimates (predicted probabilities tend to pile up near 0 or 1), but the argmax is often right.
For pure classification accuracy, NB is competitive. For calibrated probabilities, prefer logistic regression with proper regularization.
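A quick way to see the overconfidence is to compare how much probability mass each model pushes toward 0 or 1 on data with redundant (correlated) features; the synthetic setup below is only a toy illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Redundant features are linear combinations of informative ones, i.e. correlated.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
    p = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    extreme = np.mean((p < 0.01) | (p > 0.99))   # mass piled near 0 or 1
    print(f"{type(model).__name__:>18}: acc={model.score(X_te, y_te):.3f}, "
          f"frac extreme probs={extreme:.2f}")
```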
Generative vs. discriminative
Naive Bayes models the joint $p(x, y)$. Logistic regression models $p(y \mid x)$ directly. Asymptotic results (Ng & Jordan, 2002):
- For small $n$, NB usually wins (less variance from the strong assumption).
- For large $n$, logistic regression catches up and surpasses NB (the assumption hurts at large scale); see the sketch below.
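A rough way to see the crossover on synthetic continuous data (a toy setup, not a reproduction of Ng & Jordan's experiments; Gaussian NB is used here because the features are continuous):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=25000, n_features=30, n_informative=10,
                           n_redundant=10, random_state=0)
X_pool, X_te, y_pool, y_te = train_test_split(X, y, test_size=5000, random_state=0)

# Train both models on growing slices of the pool and compare held-out accuracy.
for n in (20, 100, 1000, 10000):
    nb = GaussianNB().fit(X_pool[:n], y_pool[:n])
    lr = LogisticRegression(max_iter=2000).fit(X_pool[:n], y_pool[:n])
    print(f"n={n:>6}  NB acc={nb.score(X_te, y_te):.3f}  LR acc={lr.score(X_te, y_te):.3f}")
```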
Where it shows up in 2026
- Spam filters in low-resource embedded systems.
- Quick text baselines before training a transformer.
- Document filtering in retrieval pipelines (cheap pre-filter).
For most modern NLP, neural classifiers dominate. NB persists in resource-constrained settings and as a reliable benchmark.
Common pitfalls
- Forgetting to smooth. Without Laplace smoothing, any test document with a vocabulary token never seen in a class gets that class’s posterior set to 0.
- Using NB on highly correlated features. Probabilities become very poorly calibrated; predicted class can still be okay but never trust the probability.
- Mixing variants. Multinomial NB on continuous features is wrong; use Gaussian NB (or discretize first).
- Comparing NB on raw counts vs. tf-idf vs. binarized features. Different preprocessing changes the model class; compare apples to apples (see the toy sketch below).
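One way to keep such comparisons honest is to hold everything else fixed and swap only the vectorizer/variant pair, as in this toy sketch (the documents and labels are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["free money now", "meeting at noon", "free offer click now", "lunch meeting today"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = ham (toy labels)

pipelines = {
    "counts + MultinomialNB": make_pipeline(CountVectorizer(), MultinomialNB()),
    "binary + BernoulliNB":   make_pipeline(CountVectorizer(binary=True), BernoulliNB()),
    "tf-idf + MultinomialNB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}
for name, pipe in pipelines.items():
    pipe.fit(docs, labels)
    print(name, pipe.predict(["free lunch now"]))
```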