One-line definition
A random forest (Breiman, 2001) trains an ensemble of decision trees on bootstrap samples of the data (“bagging”) with each split restricted to a random subset of features, then averages predictions (regression) or takes a majority vote (classification).
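For concreteness, a minimal usage sketch with scikit-learn's RandomForestClassifier on a synthetic dataset (the dataset and parameter values here are illustrative, not recommendations):

```python
# Minimal sketch: fit a random forest classifier and score it on a held-out split.
# Dataset and hyperparameter values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0, n_jobs=-1)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```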
Why it matters
Single decision trees are high-variance: small data shifts cause large changes in splits. Random forests average over B trees, dropping variance roughly as 1/B (for nearly independent trees). They are:
- A reliable out-of-the-box baseline for tabular data. Minimal tuning, strong performance.
- Hard to overfit with enough trees and reasonable depth limits.
- Parallel-friendly: trees train independently.
GBDT (xgboost, lightgbm, catboost) usually beats RF in modern tabular benchmarks, but RF remains useful as a fast baseline and for problems where you want low variance with minimal tuning.
The two randomizations
- Bagging (Bootstrap Aggregating): each tree trains on a bootstrap sample (sample with replacement, same size as original). Each tree sees ~63% of the unique training rows.
- Random feature subsets per split: at each split, only m randomly chosen features are considered as candidates. Standard defaults are m = √p for classification and m = p/3 for regression, where p is the total number of features.
Both reduce correlation between trees. Bagging alone gives “tree bagging”; the second randomization is what makes it a random forest.
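A from-scratch sketch of how the two randomizations combine: the bootstrap loop is written out explicitly, while the per-split feature restriction is delegated to DecisionTreeClassifier's max_features argument. Function names and defaults are illustrative.

```python
# Sketch: random forest = bootstrap sampling of rows + random feature subset at each split.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, max_features="sqrt", seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                  # bagging: sample n rows with replacement
        tree = DecisionTreeClassifier(max_features=max_features)  # random feature subset per split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees])       # shape (n_trees, n_samples)
    # Majority vote per column (assumes integer class labels).
    return np.apply_along_axis(lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```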
Out-of-bag (OOB) estimation
Each tree’s bootstrap sample omits ~37% of the data. Predict each row using only trees that didn’t see it (its out-of-bag trees) and average. This gives a free held-out estimate of generalization error. No separate validation set needed.
OOB estimates are typically very close to k-fold CV estimates. Useful for hyperparameter search without a separate split.
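A sketch of the OOB estimate in scikit-learn (oob_score=True) compared against a 5-fold CV estimate on a synthetic dataset; the numbers are illustrative, not a benchmark:

```python
# Sketch: the OOB estimate comes for free from the fitted forest; compare it to 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0, n_jobs=-1)
rf.fit(X, y)
print("OOB accuracy:     ", rf.oob_score_)
print("5-fold CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```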
Variance reduction analysis
For an ensemble of B trees with pairwise correlation ρ and per-tree variance σ², the variance of the averaged prediction is

Var = ρσ² + ((1 − ρ)/B) · σ²

As B → ∞, variance approaches ρσ². The two randomizations work by reducing ρ; without them the trees are too similar and the ρσ² floor dominates.
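A quick numeric check of the formula, simulating "tree predictions" as equicorrelated Gaussians; ρ, σ², and B are arbitrary illustrative values.

```python
# Check: Var(mean of B equicorrelated variables) = rho*sigma2 + (1 - rho)*sigma2 / B.
import numpy as np

rho, sigma2, B, n_sims = 0.3, 1.0, 200, 100_000
rng = np.random.default_rng(0)

# Equicorrelated Gaussians built from a shared component plus independent noise.
shared = rng.normal(scale=np.sqrt(rho * sigma2), size=(n_sims, 1))
indep = rng.normal(scale=np.sqrt((1 - rho) * sigma2), size=(n_sims, B))
ensemble_mean = (shared + indep).mean(axis=1)

print("empirical variance:", ensemble_mean.var())
print("theory:            ", rho * sigma2 + (1 - rho) * sigma2 / B)
```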
Bias-variance picture
- Each tree is grown deep (low bias, high variance).
- Averaging cuts variance.
- Bias of the ensemble ≈ bias of a single tree (averaging does not change the expected prediction).
- Bigger forests almost never overfit; you might get diminishing returns past 200–500 trees but rarely get worse.
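An illustrative sweep of forest size using warm_start to reuse already-grown trees; on most datasets the OOB error curve flattens out rather than turning back up. Dataset and tree counts are illustrative.

```python
# Sketch: OOB error typically plateaus as trees are added.
# warm_start=True keeps existing trees and only grows the new ones on each fit.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0, n_jobs=-1)
for n in (25, 50, 100, 200, 400):
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    print(n, "trees -> OOB error:", 1 - rf.oob_score_)
```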
Hyperparameters that matter
| Parameter | Default | Notes |
|---|---|---|
| n_estimators | 100–500 | More is usually better; saturates around 200–500. |
| max_features | √p (clf), p/3 (reg) | Smaller = more decorrelation, more bias. |
| max_depth / min_samples_leaf | None / 1 | Often left unbounded; constrain on huge data. |
| bootstrap | True | False = each tree trains on the full dataset; decorrelation then comes only from the per-split feature subsampling. |
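A sketch of using the OOB score to compare max_features settings without a separate validation split; the dataset and candidate values are illustrative.

```python
# Sketch: OOB-based comparison of max_features settings.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=40, n_informative=10, random_state=0)

for mf in ("sqrt", "log2", 0.5, None):   # None = consider all features at every split
    rf = RandomForestClassifier(n_estimators=300, max_features=mf,
                                oob_score=True, random_state=0, n_jobs=-1)
    rf.fit(X, y)
    print(f"max_features={mf}: OOB accuracy = {rf.oob_score_:.3f}")
```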
When to use vs. alternatives
- Tabular baseline: RF is the fastest “good enough” model.
- GBDT (xgboost, lightgbm): usually 1–3% better but needs more tuning.
- Logistic regression: better when the truth is approximately linear or interpretability matters.
- Neural nets: rarely beat tree-based models on small/medium tabular data.
Common pitfalls
- Comparing RF only against a single decision tree. The forest wins that comparison trivially; the interesting baselines are GBDT and a tuned linear model.
- Treating feature importances as causal. They are correlational; correlated features split the importance among themselves.
- Tuning on training accuracy. Use OOB or CV; trees can fit training perfectly.
- Skipping permutation importance. Built-in importance favors high-cardinality features; permutation importance is more honest.
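A sketch contrasting the built-in impurity importances with permutation importance computed on a held-out split, using scikit-learn's permutation_importance; the dataset is synthetic and illustrative.

```python
# Sketch: impurity-based vs permutation importance.
# Permutation importance is computed on a test split, so features that only help
# the model fit noise (e.g. high-cardinality IDs) get little credit.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0, n_jobs=-1).fit(X_train, y_train)

print("impurity-based:", rf.feature_importances_.round(3))
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print("permutation:   ", perm.importances_mean.round(3))
```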
Related
- Decision trees. The base learner.
- Gradient boosting. Sequential boosting alternative.