Random forests

Bag deep decision trees and restrict each split to a random feature subset. Averaging many trees cuts variance below any single tree; the dominant out-of-the-box ensemble before GBDT.


One-line definition

A random forest (Breiman, 2001) trains an ensemble of decision trees on bootstrap samples of the data (“bagging”) with each split restricted to a random subset of features, then averages predictions (regression) or takes a majority vote (classification).
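As a minimal sketch of the idea in scikit-learn (synthetic data standing in for a real tabular problem; the parameter values are illustrative, not prescriptive):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic tabular data standing in for a real problem.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 300 deep trees; each split considers a random sqrt(p)-sized feature subset.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```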

Why it matters

Single decision trees are high-variance: small data shifts cause large changes in splits. Random forests average over many trees, dropping variance roughly as $1/B$ for $B$ trees in the uncorrelated limit (made precise below). They are:

  • A reliable out-of-the-box baseline for tabular data. Minimal tuning, strong performance.
  • Hard to overfit with enough trees and reasonable depth limits.
  • Parallel-friendly: trees train independently.

GBDT (xgboost, lightgbm, catboost) usually beats RF in modern tabular benchmarks, but RF remains useful as a fast baseline and for problems where you want low variance with minimal tuning.

The two randomizations

  1. Bagging (Bootstrap Aggregating): each tree trains on a bootstrap sample (sample with replacement, same size as original). Each tree sees ~63% of the unique training rows.
  2. Random feature subsets per split: at each split, only $m$ of the $p$ features, chosen at random, are considered as candidates. Standard defaults: $m = \sqrt{p}$ for classification, $m = p/3$ for regression.

Both reduce correlation between trees. Bagging alone gives “tree bagging”; the second randomization is what makes it a random forest.
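A from-scratch sketch of both randomizations, assuming numpy arrays and scikit-learn's DecisionTreeClassifier as the base learner (the per-split feature subset is delegated to the tree via max_features):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def fit_forest(X, y, n_trees=100):
    trees = []
    n = len(X)
    for _ in range(n_trees):
        # Randomization 1: bootstrap sample (with replacement, same size).
        idx = rng.integers(0, n, size=n)
        # Randomization 2: a fresh random sqrt(p) feature subset at every
        # split, handled internally by the tree via max_features.
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(2**31)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict(trees, X):
    # Majority vote across trees (assumes binary 0/1 labels).
    votes = np.stack([t.predict(X) for t in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

Note that the feature subset is redrawn at every split, not fixed once per tree; that is how library implementations behave, and it decorrelates trees more than a fixed per-tree subset would.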

Out-of-bag (OOB) estimation

Each tree’s bootstrap sample omits ~37% of the data. Predict each row using only trees that didn’t see it (its out-of-bag trees) and average. This gives a free held-out estimate of generalization error. No separate validation set needed.

OOB estimates are typically very close to k-fold CV estimates. Useful for hyperparameter search without a separate split.
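In scikit-learn this is one flag (a sketch; the dataset is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# oob_score=True scores each row using only the trees whose bootstrap
# sample omitted it, so no validation split is needed.
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy estimate:", rf.oob_score_)
```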

Variance reduction analysis

For an ensemble of $B$ trees with pairwise correlation $\rho$ and per-tree variance $\sigma^2$:

$$\operatorname{Var}\left[\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right] = \frac{1-\rho}{B}\,\sigma^2 + \rho\,\sigma^2$$

As $B \to \infty$, variance approaches $\rho\sigma^2$. The two randomizations work by reducing $\rho$; without them the trees are too similar and the second term dominates.
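A quick numpy simulation makes the $\rho\sigma^2$ floor visible (equicorrelated Gaussian draws stand in for tree predictions; the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
B, rho, sigma2 = 100, 0.3, 1.0

# Equicorrelated "tree predictions": every pair has correlation rho.
cov = sigma2 * ((1 - rho) * np.eye(B) + rho * np.ones((B, B)))
draws = rng.multivariate_normal(np.zeros(B), cov, size=20000)

empirical = draws.mean(axis=1).var()
predicted = (1 - rho) / B * sigma2 + rho * sigma2
print(empirical, predicted)  # both ≈ 0.307; the rho*sigma2 floor dominates
```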

Bias-variance picture

  • Each tree is grown deep (low bias, high variance).
  • Averaging cuts variance.
  • Bias of the ensemble = bias of a single tree (averaging identically distributed predictions leaves the expectation, hence the bias, unchanged).
  • Bigger forests almost never overfit; you might get diminishing returns past 200–500 trees but rarely get worse.

Hyperparameters that matter

| Parameter | Default | Notes |
| --- | --- | --- |
| n_estimators | 100–500 | More is usually better; saturates around 200–500. |
| max_features | $\sqrt{p}$ (clf), $p/3$ (reg) | Smaller = more decorrelation, more bias. |
| max_depth / min_samples_leaf | None / 1 | Often left unbounded; constrain on huge data. |
| bootstrap | True | False = random subspace method, lighter on memory. |
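Since OOB error is nearly free, a tuning loop can compare settings without a validation split (a sketch over max_features only; the candidate values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Compare candidate feature-subset sizes by OOB accuracy alone.
for mf in ["sqrt", "log2", 0.3, 0.6]:
    rf = RandomForestClassifier(n_estimators=300, max_features=mf,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    print(mf, round(rf.oob_score_, 4))
```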

When to use vs. alternatives

  • Tabular baseline: RF is the fastest “good enough” model.
  • GBDT (xgboost, lightgbm): usually 1–3% better but needs more tuning.
  • Logistic regression: better when the truth is approximately linear or interpretability matters.
  • Neural nets: rarely beat tree-based models on small/medium tabular data.

Common pitfalls

  • Comparing RF only against a single tree. RF wins that comparison trivially; benchmark against tuned GBDT or a strong linear baseline instead.
  • Treating feature importances as causal. They are correlational; correlated features split the importance among themselves.
  • Tuning on training accuracy. Use OOB or CV; trees can fit training perfectly.
  • Skipping permutation importance. Built-in impurity-based importance favors high-cardinality features; permutation importance is more honest (see the sketch below).
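Permutation importance with scikit-learn, computed on held-out data so it reflects generalization rather than training fit (a sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data; the accuracy drop
# is that feature's importance.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean.round(3))
```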