Expected Calibration Error (ECE)

How well do predicted probabilities match empirical frequencies? Bin predictions by confidence, compare bin-mean confidence to bin-accuracy.

One-line definition

Expected Calibration Error measures how well a model’s predicted probabilities match empirical accuracies. Bin predictions by predicted confidence, and compute the weighted average of the per-bin gaps between confidence and accuracy:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,\bigl|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\bigr|$$

Why it matters

A classifier that scores well on accuracy can still produce wildly miscalibrated probabilities, e.g. predicting “90% confident” when only 60% of such predictions are correct. Calibration matters whenever:

  • The probability is used downstream (decision thresholds, expected-cost calculations, risk scoring).
  • A human reads the probability (medical diagnosis, fraud alerts).
  • The model is combined with other signals (Bayesian fusion).

ECE is the standard single-number calibration metric.

The mechanism

For a binary classifier producing scores $\hat{p}_i \in [0, 1]$ on $n$ examples with true labels $y_i \in \{0, 1\}$:

  1. Bin the predictions by confidence. Standard: $M$ equal-width bins (typically 10–15) covering $[0, 1]$.
  2. For each bin $B_m$:
    • $\mathrm{conf}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \hat{p}_i$ (average predicted probability).
    • $\mathrm{acc}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} y_i$ (empirical accuracy: the fraction of positives in the bin).
  3. Aggregate: $\mathrm{ECE} = \sum_m \frac{|B_m|}{n}\,\bigl|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\bigr|$, the weighted average of bin gaps (a minimal sketch follows this list).
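
A minimal sketch of this procedure in NumPy, under the binary setup above; the function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE for a binary classifier.

    probs:  predicted P(y = 1) for each example, values in [0, 1].
    labels: true labels in {0, 1}.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, ece = len(probs), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins [lo, hi); the last bin also includes 1.0.
        in_bin = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo)
        if in_bin.any():
            conf = probs[in_bin].mean()   # average predicted probability
            acc = labels[in_bin].mean()   # empirical frequency of positives
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```

Empty bins contribute nothing, which matches the weighted-average definition.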

A perfectly calibrated model has ECE = 0: every bin’s empirical accuracy equals its average predicted confidence. Modern deep classifiers commonly have ECE of 0.05–0.20; predicted confidence is systematically inflated.

Reliability diagram

The visual companion to ECE: plot bin accuracy vs. bin confidence. Perfect calibration is the diagonal $y = x$. Above-diagonal: under-confident. Below-diagonal: over-confident (the typical deep-net failure).

Always plot the reliability diagram when reporting ECE: a single number can hide dramatic per-bin issues.
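
A matching sketch of the plot with matplotlib, reusing the same binary setup (names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(probs, labels, n_bins=10):
    """Plot per-bin empirical accuracy against per-bin mean confidence."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    conf, acc = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo)
        if in_bin.any():
            conf.append(probs[in_bin].mean())
            acc.append(labels[in_bin].mean())
    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")  # the diagonal y = x
    plt.plot(conf, acc, "o-", label="model")
    plt.xlabel("mean predicted confidence (per bin)")
    plt.ylabel("empirical accuracy (per bin)")
    plt.legend()
    plt.show()
```

Points below the dashed diagonal are the over-confident bins.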

Variants

  • Maximum Calibration Error (MCE): $\max_m \bigl|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\bigr|$. Worst-case bin gap.
  • Adaptive ECE: equal-frequency bins instead of equal-width. More stable when predictions concentrate near 0 or 1 (see the sketch after this list).
  • Class-wise ECE: per-class calibration; matters in multi-class.
  • Top-label ECE (multi-class): compute ECE on the predicted-class probability only.
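
As an example of the adaptive variant, equal-frequency bins can be sketched by splitting the sorted predictions into chunks of equal size (assumed names, same binary setup as before):

```python
import numpy as np

def adaptive_ece(probs, labels, n_bins=10):
    """ECE with equal-frequency (quantile) bins instead of equal-width ones."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    order = np.argsort(probs)
    n, ece = len(probs), 0.0
    # Each chunk holds roughly n / n_bins examples, so no bin is empty
    # even when predictions pile up near 0 or 1.
    for idx in np.array_split(order, n_bins):
        if len(idx):
            gap = abs(labels[idx].mean() - probs[idx].mean())
            ece += (len(idx) / n) * gap
    return ece
```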

Why deep nets are miscalibrated

Modern neural networks (Guo et al., 2017) are typically overconfident:

  • Trained on cross-entropy, which keeps pushing logits toward $\pm\infty$ on training data.
  • Standard regularization (weight decay, dropout) helps generalization but doesn’t fix calibration.
  • Modern architectures (ResNets, transformers) are more miscalibrated than older small models.

Calibration methods

Post-hoc rescaling of predicted probabilities, learned on a held-out set:

  • Temperature scaling (Guo et al., 2017): divide logits by a single learned scalar $T$. Cheap; preserves accuracy; usually halves ECE. The default modern choice (sketched below).
  • Platt scaling: fit a logistic regression on the logits. Used historically with SVMs.
  • Isotonic regression: fit a non-parametric monotonic mapping. More flexible; can overfit.
  • Vector / matrix scaling: per-class temperature. More parameters; risk of overfitting if calibration set is small.

Calibration by a monotonic transformation such as temperature scaling is lossless on accuracy (it preserves the argmax). There’s no reason not to do it.
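
A minimal sketch of temperature scaling in PyTorch, assuming held-out logits and integer labels as tensors; the function name and optimizer settings are illustrative:

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels):
    """Learn a single scalar T on a held-out set by minimizing NLL.

    logits: (N, K) tensor of pre-softmax model outputs.
    labels: (N,) tensor of integer class labels.
    """
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# At inference: softmax(logits / T). Dividing by a positive T never changes
# the argmax, so accuracy is untouched; only the confidences move.
```

A fitted $T > 1$ softens over-confident predictions toward the uniform distribution.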

Limits of ECE

  • Binning artifact: ECE depends on bin choice. Adaptive ECE is more stable.
  • Confidence vs. probability of correctness: ECE on multi-class typically uses only the top-class probability, ignoring whether the full distribution is well-calibrated.
  • Doesn’t measure sharpness: a model that predicts a constant $p$ for everything has ECE = 0 if the base rate is $p$, but is useless. Combine ECE with proper scoring rules (Brier score, log loss); see the snippet below.
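
For instance, the proper scoring rules mentioned above are one-liners in scikit-learn (the toy arrays are purely illustrative):

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

labels = np.array([0, 1, 1, 0, 1])           # toy ground truth
probs = np.array([0.1, 0.8, 0.6, 0.3, 0.9])  # predicted P(y = 1)

# Proper scoring rules are minimized in expectation only by the true
# probabilities, so they reward calibration and sharpness together.
print("Brier score:", brier_score_loss(labels, probs))
print("Log loss:", log_loss(labels, probs))
```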

Common pitfalls

  • Computing ECE on training data. Always use held-out data.
  • Reporting ECE without a reliability diagram. Per-bin asymmetries can be invisible in the single number.
  • Skipping calibration in production. Temperature scaling is one line of code and often halves ECE.
  • Confusing calibration with accuracy. A 95%-accuracy model with ECE 0.20 is still untrustworthy whenever the probability matters.