Confusion matrix and classification metrics

The 2x2 (or KxK) table of predictions vs. truth that every classification metric is computed from. The Rosetta stone of binary classification.


One-line definition

A confusion matrix is a table where rows are true classes and columns are predicted classes (or vice versa, depending on convention). Each cell counts how many examples had that (true, predicted) pair. Every standard classification metric (accuracy, precision, recall, specificity, F1, balanced accuracy) is a function of the confusion-matrix counts.

Why it matters

Reporting a single classification metric is almost always insufficient. The confusion matrix:

  • Shows which errors the model makes (e.g., “confuses cat with dog 30% of the time”).
  • Reveals class imbalance issues that aggregate metrics hide.
  • Is the source of truth for choosing decision thresholds.

A senior ML practitioner should always look at the confusion matrix before reporting a metric.

The binary confusion matrix

           Predicted 0                Predicted 1
Actual 0   TN (True Neg)              FP (False Pos, Type I)
Actual 1   FN (False Neg, Type II)    TP (True Pos)
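A minimal pure-Python sketch of tallying those four cells from parallel lists of true and predicted labels (the helper name `binary_confusion` is illustrative, not a library function):

```python
def binary_confusion(y_true, y_pred):
    """Tally the four cells of a binary confusion matrix."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

# Example: 2 TN, 1 FP, 1 FN, 2 TP
print(binary_confusion([0, 0, 0, 1, 1, 1], [0, 0, 1, 0, 1, 1]))  # (2, 1, 1, 2)
```

scikit-learn's `confusion_matrix` does the same job (and generalizes to K classes); the hand-rolled version just makes the counting explicit.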

All binary classification metrics are derived from these four numbers:

Metric                      Formula                     Meaning
Accuracy                    (TP+TN) / (TP+TN+FP+FN)     Overall correctness
Precision (PPV)             TP / (TP+FP)                Of predicted positives, fraction correct
Recall (TPR, Sensitivity)   TP / (TP+FN)                Of actual positives, fraction caught
Specificity (TNR)           TN / (TN+FP)                Of actual negatives, fraction correctly rejected
F1                          2PR / (P+R)                 Harmonic mean of precision (P) and recall (R)
FPR                         FP / (FP+TN)                False alarm rate; 1 − specificity
FNR                         FN / (FN+TP)                Miss rate; 1 − recall
NPV                         TN / (TN+FN)                Negative predictive value
Balanced accuracy           (TPR + TNR) / 2             Average of recall on each class
Matthews Corr Coef (MCC)    see below                   Single-number summary that handles imbalance
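As a sketch, the core metrics as plain functions of the four counts (the function name and the example counts are made up for illustration):

```python
def metrics(tp, fp, fn, tn):
    """Standard binary metrics computed from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),          # 1 - specificity
        "fnr": fn / (fn + tp),          # 1 - recall
        "balanced_accuracy": (recall + specificity) / 2,
    }

m = metrics(tp=40, fp=10, fn=20, tn=30)
print(round(m["precision"], 2), round(m["recall"], 2))  # 0.8 0.67
```

Note how the same four counts produce very different-looking numbers: here accuracy is 0.70 while recall is only 0.67, which is exactly why a single metric can mislead.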

Matthews Correlation Coefficient (MCC):

MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))

Range: [−1, 1]. 0 = random; +1 = perfect; −1 = inverse-perfect. Robust to class imbalance in a way accuracy is not.
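A sketch of MCC from the four counts (using the convention that MCC is 0 when the denominator is 0, as scikit-learn's `matthews_corrcoef` does):

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc(tp=50, fp=0, fn=0, tn=50))   # 1.0  (perfect classifier)
print(mcc(tp=0, fp=50, fn=50, tn=0))   # -1.0 (inverse-perfect)
```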

Multi-class confusion matrix

For K classes: a K×K matrix. Diagonal entries are correct predictions; off-diagonal entries are confusions.

Per-class metrics: treat class k as positive and all others as negative; compute binary metrics from that class's row and column sums.

Aggregate to system metrics by:

  • Macro: average per-class metric (treats all classes equally).
  • Micro: aggregate counts across classes (dominated by majority).
  • Weighted: macro weighted by support.
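A sketch of per-class recall plus macro and micro averaging over a made-up 3×3 matrix (rows = true, columns = predicted):

```python
# Hypothetical 3x3 confusion matrix; rows = true class, cols = predicted class.
cm = [
    [50, 2, 3],   # class 0: 55 true examples
    [5, 30, 5],   # class 1: 40 true examples
    [0, 1, 4],    # class 2: only 5 true examples (minority)
]
K = len(cm)

# Per-class recall: diagonal cell divided by its row sum.
per_class_recall = [cm[k][k] / sum(cm[k]) for k in range(K)]

# Macro: unweighted mean over classes (minority class counts equally).
macro = sum(per_class_recall) / K

# Micro: pool counts first; for multi-class recall this equals accuracy.
micro = sum(cm[k][k] for k in range(K)) / sum(map(sum, cm))

print([round(r, 2) for r in per_class_recall])  # [0.91, 0.75, 0.8]
print(round(macro, 2), round(micro, 2))  # 0.82 0.84
```

Here micro (0.84) exceeds macro (0.82) because the majority class is the easiest; with a badly-handled minority class the gap can be far larger.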

Visualizing the confusion matrix

Always visualize on multi-class problems:

  • Heat-map with diagonal highlighted.
  • Normalize by row (recall per class) or by column (precision per class). Choose based on what you need to debug.
  • Sort rows/columns by frequency or by similarity for easier reading.
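Row normalization, the usual first step before plotting, is a one-liner (the helper name is illustrative):

```python
def row_normalize(cm):
    """Divide each row by its sum: cell (i, j) becomes the fraction of
    true class i that was predicted as class j; the diagonal is recall."""
    return [[c / sum(row) for c in row] for row in cm]

print(row_normalize([[8, 2], [1, 9]]))  # [[0.8, 0.2], [0.1, 0.9]]
```

Normalizing by column instead would put per-class precision on the diagonal.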

Threshold dependence

For probabilistic classifiers, the confusion matrix depends on the decision threshold. As you sweep the threshold from 0 to 1:

  • TP and FP both decrease (fewer predicted positives).
  • The confusion matrix sweeps through different (TP, FP, FN, TN) tuples.
  • This is what produces the ROC and PR curves.
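A sketch of that sweep, with made-up scores; plotting TPR against FPR at each threshold traces the ROC curve:

```python
# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]

def tp_fp_at(thr):
    """TP and FP counts when predicting positive for score >= thr."""
    preds = [1 if s >= thr else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    return tp, fp

for thr in (0.0, 0.5, 1.1):
    print(thr, tp_fp_at(thr))
# 0.0 -> (3, 3): everything predicted positive
# 0.5 -> (2, 1)
# 1.1 -> (0, 0): nothing predicted positive
```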

For deployment, pick the threshold that optimizes your real-world objective (precision at a recall floor, expected utility, etc.).

What to report

A complete classification report should include:

  1. Confusion matrix (raw counts and row-normalized).
  2. Per-class precision, recall, F1.
  3. Macro-averaged metrics for overall performance.
  4. One operating-point metric (e.g., accuracy at threshold 0.5, or P@R=0.95).
  5. Threshold-free curves: ROC-AUC and PR-AUC.

scikit-learn’s classification_report and confusion_matrix produce most of this; pair with roc_auc_score and average_precision_score.

Common pitfalls

  • Reporting a single number for a multi-class problem. Always include the per-class breakdown.
  • Confusion matrix rows-vs-columns confusion. Different libraries / textbooks use different conventions; explicitly label “rows = true” or “rows = predicted” in plots.
  • Computing precision/recall on training data. Always evaluate on held-out data.
  • Choosing the threshold by maximizing F1 on the test set. That's tuning on test data; choose the threshold on a validation set instead.
  • Treating per-class F1 as comparable across classes with very different prevalences. Macro-average can mask huge per-class disparities; look at the table.