Gradient boosting (xgboost, lightgbm, catboost)

Train trees sequentially, each one fitting the negative gradient of the loss with respect to the current ensemble's prediction. The dominant tabular learner in 2026.


One-line definition

Gradient boosting builds an ensemble by repeatedly fitting a weak learner (typically a small decision tree) to the negative gradient of the loss with respect to the current ensemble’s prediction, and adding it with a small step size (the learning rate).

Why it matters

Gradient-boosted decision trees (GBDT) are the dominant model class for tabular data in 2026. xgboost, lightgbm, and catboost win the majority of Kaggle tabular competitions and are heavily used in production at scale (search ranking, ad CTR, fraud, credit risk). Knowing the algorithm at a level that distinguishes you from “I called xgboost.fit” is a core senior-ML expectation.

The algorithm (Friedman, 2001)

Initialize with a constant prediction F_0 (the mean target for regression, the log-odds prior for classification). Then for m = 1, …, M:

  1. Compute the negative gradient (pseudo-residuals) at each training point: r_im = −[∂L(y_i, F(x_i)) / ∂F(x_i)], evaluated at F = F_{m−1}.
  2. Fit a regression tree h_m to the pairs (x_i, r_im).
  3. Optimize the leaf values to minimize the loss of the new ensemble (a line search per leaf).
  4. Update: F_m(x) = F_{m−1}(x) + ν · h_m(x), where ν is the learning rate.

For squared error, r_im = y_i − F_{m−1}(x_i): literal residuals. For other losses (logistic, Huber, ranking) the pseudo-residuals are the corresponding loss gradients.
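The loop above can be sketched from scratch for squared error, where the pseudo-residuals are literal residuals. This is an illustrative toy (decision stumps stand in for the small trees; all names are made up here, nothing from a library):

```python
# Minimal gradient boosting for squared error: fit a depth-1 tree (stump)
# to the residuals y - F at each round, then take a damped step.
import numpy as np

def fit_stump(x, r):
    """Find the single threshold split on x that best fits residuals r (min SSE)."""
    best = None
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    for i in range(1, len(xs)):
        left, right = rs[:i].mean(), rs[i:].mean()
        sse = ((rs[:i] - left) ** 2).sum() + ((rs[i:] - right) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2, left, right)
    _, thr, left, right = best
    return lambda q: np.where(q <= thr, left, right)

def boost(x, y, n_trees=50, lr=0.1):
    F = np.full_like(y, y.mean())   # step 0: constant prediction
    for _ in range(n_trees):
        r = y - F                   # pseudo-residuals = negative gradient
        h = fit_stump(x, r)         # fit the weak learner to the residuals
        F = F + lr * h(x)           # damped update F_m = F_{m-1} + nu * h_m
    return F

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)
F = boost(x, y)
mse = ((y - F) ** 2).mean()
```

Even with stumps, the training MSE falls well below that of the constant baseline, which is the self-correcting behavior described below.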

Newton boosting (xgboost)

xgboost uses second-order information: each new tree minimizes a Taylor expansion of the loss including both the gradient and Hessian (diagonal). This gives faster convergence and tighter leaf-value updates than first-order GBDT.
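For logistic loss with labels y in {0, 1} and current margin F, the per-point gradient is g_i = p_i − y_i and the (diagonal) Hessian is h_i = p_i(1 − p_i), and the Newton-optimal weight for a leaf is w* = −Σg / (Σh + λ). A small sketch of that leaf-value formula (function name is illustrative):

```python
# Second-order (Newton) leaf-value update as used in xgboost-style boosting:
# w* = -sum(gradients) / (sum(hessians) + lambda) per leaf.
import math

def newton_leaf_weight(y, margins, lam=1.0):
    p = [1 / (1 + math.exp(-f)) for f in margins]  # current probabilities
    g = [pi - yi for pi, yi in zip(p, y)]          # first-order terms
    h = [pi * (1 - pi) for pi in p]                # second-order terms
    return -sum(g) / (sum(h) + lam)                # Newton step for the leaf

# A leaf holding mostly positives, all currently predicted at margin 0 (p = 0.5):
w = newton_leaf_weight(y=[1, 1, 1, 0], margins=[0.0, 0.0, 0.0, 0.0])
# g sums to -1.0, h sums to 1.0, so w = 1.0 / (1.0 + 1.0) = 0.5
```

The Hessian term acts as per-leaf curvature-aware damping, which is why the updates are tighter than a first-order line search.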

Why it works

GBDT has a self-correcting property: each tree fixes the mistakes of the current ensemble. Combined with a small learning rate (typically 0.01 to 0.1), this gives gradual, stable improvement.

The bias-variance picture flips compared to random forests:

  • RF: deep, low-bias (but high-variance) trees; averaging drives the variance down.
  • GBDT: shallow (high-bias) trees, but each adapts to the current residuals → the ensemble achieves low bias.

The four implementations

| Library | Distinguishing feature |
| --- | --- |
| xgboost | Mature; great parallelization; sparsity-aware splits; default for many shops. |
| lightgbm | Histogram-based splits → much faster on large data; native categorical handling; leaf-wise growth. |
| catboost | Best out-of-the-box on categorical-heavy data; ordered boosting reduces target leakage in categorical encodings. |
| sklearn GradientBoostingClassifier | Simple; slow on large data; mostly for teaching. |

For new projects in 2026: lightgbm for raw speed, catboost for categorical-heavy data, xgboost for everything else.

Hyperparameters that matter

| Parameter | Typical | Effect |
| --- | --- | --- |
| learning_rate | 0.05–0.1 | Smaller → more trees, better generalization, more compute. |
| n_estimators (or num_round) | 500–2000 | Use early stopping on validation. |
| max_depth | 4–8 | Most important regularizer. |
| min_child_weight / min_data_in_leaf | varies | Prevents overfitting to small leaves. |
| subsample | 0.7–0.9 | Stochastic gradient boosting; row sampling. |
| colsample_bytree | 0.5–1.0 | Feature subsampling per tree. |
| reg_alpha, reg_lambda | 0–10 | L1 / L2 penalties on leaf weights (xgboost). |
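As one concrete starting point, the parameters above map onto an xgboost-style configuration like the following. The specific values are illustrative defaults to tune from, not recommendations:

```python
# A plausible xgboost starting configuration; every value here is a
# starting point to tune against validation loss, not a tuned result.
params = {
    "learning_rate": 0.05,    # smaller -> more trees, better generalization
    "n_estimators": 2000,     # upper bound; rely on early stopping
    "max_depth": 6,           # the most important regularizer
    "min_child_weight": 1.0,  # guards against overfitting tiny leaves
    "subsample": 0.8,         # stochastic boosting: row sampling per tree
    "colsample_bytree": 0.8,  # feature subsampling per tree
    "reg_alpha": 0.0,         # L1 penalty on leaf weights
    "reg_lambda": 1.0,        # L2 penalty on leaf weights
}
```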

Use early stopping on a validation set: train until validation loss stops improving for a fixed number of rounds (commonly 50–100).
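The stopping rule itself is simple: track the best validation loss so far and halt once it has not improved for `patience` consecutive rounds, keeping the model from the best round. A sketch (function name is illustrative; real libraries expose this as an early-stopping option):

```python
# Round-based early stopping: return the index of the best round, halting
# the scan once `patience` rounds pass without a new best validation loss.
def best_round(val_losses, patience=50):
    best, best_i = float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_i = loss, i
        elif i - best_i >= patience:
            break                      # no improvement for `patience` rounds
    return best_i                      # truncate the ensemble here

# Validation loss improves, bottoms out at round 3, then creeps back up:
losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.67, 0.68, 0.69, 0.70]
stop = best_round(losses, patience=3)  # -> 3
```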

When GBDT wins

  • Mixed numeric + categorical features.
  • Non-linear interactions matter.
  • Modest sample size (roughly 10^3 to 10^7 rows).
  • Heterogeneous feature scales.

When GBDT loses

  • High-dimensional unstructured data (text, images, audio): use neural nets.
  • Truly tiny data (a few hundred rows or fewer): logistic / linear models with strong priors.
  • Ranking with millions of items per query: dedicated learning-to-rank stacks (often still GBDT under the hood with pairwise / listwise losses).

Common pitfalls

  • No early stopping → overfit. Always stop on validation.
  • Tuning learning rate without retuning n_estimators. They trade off; halving LR roughly doubles needed trees.
  • Default categorical handling = one-hot. For lightgbm/catboost, declare categorical features explicitly to use native splits.
  • Comparing against a single tree. That’s not the right baseline; compare against RF and a strong logistic.