One-line definition
A decision tree partitions the feature space by a sequence of axis-aligned threshold tests on individual features, assigning a constant prediction (class probability for classification, mean target for regression) to each leaf. It is trained greedily by choosing splits that maximize impurity reduction (equivalently, information gain when the impurity is entropy).
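A minimal sketch with scikit-learn (assuming it is installed); `export_text` prints the learned axis-aligned rules and the leaf predictions:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Each internal node is a test "feature <= threshold"; each leaf
# stores the class counts of the training samples routed to it.
print(export_text(tree, feature_names=load_iris().feature_names))
```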
Why it matters
Single trees are rarely the production model. Variance is too high. But they are the building block of the dominant tabular learners: gradient boosting (xgboost, lightgbm, catboost) and random forests. Understanding tree training is the prerequisite for using or tuning either.
Trees naturally handle missing values, mixed-type features, non-linear interactions, and require essentially no feature scaling. This is why they remain the strongest baseline on heterogeneous tabular data.
The algorithm (CART, Breiman et al., 1984)
For each node:
- For each feature $j$ and each candidate threshold $t$:
- Split data into $\{x : x_j \le t\}$ (left child) and $\{x : x_j > t\}$ (right child).
- Compute the impurity of each child (Gini, entropy, or MSE).
- Compute the weighted impurity reduction.
- Pick the $(j, t)$ with maximum reduction.
- Recurse on each child until a stopping criterion (max depth, min samples per leaf, no further reduction).
The greedy choice is locally optimal but not globally; finding the globally optimal tree is NP-hard.
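A from-scratch sketch of the inner loop, assuming numeric features and Gini impurity (names like `best_split` are illustrative, not any library's API):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector: 1 - sum_k p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive greedy search over (feature, threshold) pairs."""
    n, d = X.shape
    parent = gini(y)
    best = (None, None, 0.0)  # (feature, threshold, impurity reduction)
    for j in range(d):
        # Candidate thresholds: midpoints between sorted unique values.
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= t
            # Weighted impurity of the two children.
            child = (left.mean() * gini(y[left])
                     + (1 - left.mean()) * gini(y[~left]))
            if parent - child > best[2]:
                best = (j, t, parent - child)
    return best
```

Real implementations avoid rescanning at every threshold by sorting each feature once and sweeping, updating class counts incrementally.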
Split criteria
For classification at a node with class proportions $p_1, \dots, p_K$:
- Gini impurity: $G = \sum_k p_k (1 - p_k) = 1 - \sum_k p_k^2$. Equals the expected error of labeling a random sample according to the class proportions $p_k$.
- Entropy: $H = -\sum_k p_k \log_2 p_k$. Information gain = parent entropy minus weighted child entropies.
Gini and entropy give nearly identical splits in practice; Gini is slightly cheaper.
For regression: variance of the target within the node, equivalent to MSE under constant prediction.
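For concreteness, the three criteria on a toy node (a sketch; entropy in bits):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])       # class proportions at a node
gini = 1.0 - np.sum(p ** 2)         # ~0.46
entropy = -np.sum(p * np.log2(p))   # ~1.16 bits

y = np.array([2.0, 3.0, 10.0])      # regression targets in a node
mse = np.var(y)                     # impurity = variance around the mean
```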
Stopping and pruning
Stop splitting when:
- Max depth reached.
- Number of samples at the node falls below a threshold.
- Best impurity reduction is below a threshold.
- All samples have the same target.
Pruning: train deep, then collapse branches that don’t reduce held-out error. Cost-complexity pruning (CCP) minimizes $R(T) + \alpha |T|$, trading training error $R(T)$ against a penalty $\alpha$ on the number of leaves $|T|$.
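In scikit-learn, CCP is exposed as `ccp_alpha`; a sketch of choosing $\alpha$ on held-out data:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Candidate alphas come from the pruning path of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Refit at each alpha; keep the one with the best held-out accuracy.
scores = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
          .score(X_val, y_val) for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
```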
In practice with ensembles, individual trees are kept shallow for boosting (max depth around 6–8) and grown deep or full for forests, where bagging averages away the variance.
Strengths and weaknesses
Strengths:
- Handles non-linear interactions for free.
- No feature scaling needed.
- Handles missing values (with surrogate splits or built-in handling in xgboost/lightgbm).
- Handles mixed numeric / categorical (with appropriate encoding).
- Interpretable: every prediction is a path of explicit rules.
Weaknesses:
- High variance: small data perturbations change splits dramatically.
- Greedy: locally optimal, not globally.
- Axis-aligned splits make it hard to capture rotated decision boundaries (need oblique trees or feature engineering).
- Single trees are weak. Usually combined into ensembles.
Categorical features
Two main approaches:
- One-hot: each level becomes a binary feature. Slow for high-cardinality features; each split can only peel off one level, so trees grow many near-useless branches.
- Native handling (lightgbm, catboost): pre-sort levels by target mean, then split as if the feature were numeric. Much faster and often more accurate.
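With lightgbm (assuming it and pandas are installed), pandas `category` columns are handled natively, no one-hot needed:

```python
import pandas as pd
import lightgbm as lgb

df = pd.DataFrame({
    "city": pd.Categorical(["nyc", "sf", "sf", "la", "nyc", "la"] * 20),
    "income": range(120),
})
y = (df["income"] % 3 == 0).astype(int)  # toy target, for illustration only

# 'category' dtype columns are detected automatically; at each split,
# levels are grouped by their gradient statistics rather than being
# one-hot expanded.
model = lgb.LGBMClassifier(min_child_samples=5).fit(df, y)
```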
Common pitfalls
- Letting trees grow unbounded on small data. Memorizes training set; high variance.
- Treating categorical features as numeric without one-hot or native handling. Tree splits are ordered; “encoding” a categorical with arbitrary integer codes imposes a meaningless order.
- Comparing single trees against ensembles unfairly. A single tree should not be expected to beat a random forest or a boosted ensemble; check you are evaluating the right model class.
- Reading feature importance from a single split. Importance from a single tree is noisy; use ensemble averages or permutation importance.
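For the last point, permutation importance with scikit-learn (a sketch):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Importance = mean drop in held-out score when one feature is shuffled,
# averaged over n_repeats shuffles; far more stable than the split-based
# importance of a single tree.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)
```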
Related
- Random forests. Bagged ensembles of trees.
- Gradient boosting. Boosted ensembles of trees.