Confusion matrix and classification metrics

The 2x2 (or KxK) table of predictions vs. truth that every classification metric is computed from. The Rosetta stone of binary classification.


One-line definition

A confusion matrix is a table where rows are true classes and columns are predicted classes (or vice versa, depending on convention). Each cell counts how many examples had that (true, predicted) pair. Every standard classification metric (accuracy, precision, recall, specificity, F1, balanced accuracy) is a function of the confusion-matrix counts.

Why it matters

Reporting a single classification metric is almost always insufficient. The confusion matrix:

  • Shows which errors the model makes (e.g., “confuses cat with dog 30% of the time”).
  • Reveals class imbalance issues that aggregate metrics hide.
  • Is the source of truth for choosing decision thresholds.

A senior ML practitioner should always look at the confusion matrix before reporting a metric.

The binary confusion matrix

           Predicted 0                Predicted 1
Actual 0   TN (True Neg)              FP (False Pos, Type I)
Actual 1   FN (False Neg, Type II)    TP (True Pos)
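A minimal pure-Python sketch of tallying those four cells from parallel lists of true and predicted labels (the helper name `binary_confusion` is illustrative, not a library function):

```python
def binary_confusion(y_true, y_pred):
    """Tally the four cells of a binary confusion matrix."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

# Example: 2 TN, 1 FP, 1 FN, 2 TP
print(binary_confusion([0, 0, 0, 1, 1, 1], [0, 0, 1, 0, 1, 1]))  # (2, 1, 1, 2)
```

scikit-learn's `confusion_matrix` does the same job (and generalizes to K classes); the hand-rolled version just makes the counting explicit.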

All binary classification metrics are derived from these four numbers:

Metric                      Formula                     Meaning
Accuracy                    (TP+TN) / (TP+TN+FP+FN)     Overall correctness
Precision (PPV)             TP / (TP+FP)                Of predicted positives, fraction correct
Recall (TPR, Sensitivity)   TP / (TP+FN)                Of actual positives, fraction caught
Specificity (TNR)           TN / (TN+FP)                Of actual negatives, fraction correctly rejected
F1                          2PR / (P+R)                 Harmonic mean of precision (P) and recall (R)
FPR                         FP / (FP+TN)                False alarm rate; 1 − specificity
FNR                         FN / (FN+TP)                Miss rate; 1 − recall
NPV                         TN / (TN+FN)                Negative predictive value
Balanced accuracy           (TPR + TNR) / 2             Average of recall on each class
Matthews Corr Coef (MCC)    see below                   Single-number summary that handles imbalance
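As a sketch, the core metrics as plain functions of the four counts (the function name and the example counts are made up for illustration):

```python
def metrics(tp, fp, fn, tn):
    """Standard binary metrics computed from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),          # 1 - specificity
        "fnr": fn / (fn + tp),          # 1 - recall
        "balanced_accuracy": (recall + specificity) / 2,
    }

m = metrics(tp=40, fp=10, fn=20, tn=30)
print(round(m["precision"], 2), round(m["recall"], 2))  # 0.8 0.67
```

Note how the same four counts produce very different-looking numbers: here accuracy is 0.70 while recall is only 0.67, which is exactly why a single metric can mislead.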

Matthews Correlation Coefficient (MCC):

MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))

Range: [−1, 1]. 0 = random; +1 = perfect; −1 = inverse-perfect. Robust to class imbalance in a way accuracy is not.
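A sketch of MCC from the four counts (using the convention that MCC is 0 when the denominator is 0, as scikit-learn's `matthews_corrcoef` does):

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc(tp=50, fp=0, fn=0, tn=50))   # 1.0  (perfect classifier)
print(mcc(tp=0, fp=50, fn=50, tn=0))   # -1.0 (inverse-perfect)
```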

Multi-class confusion matrix

For K classes: a K×K matrix. Diagonal entries are correct predictions; off-diagonal entries are confusions.

Per-class metrics: treat class k as positive and all others as negative; compute binary metrics from that class's row and column sums.

Aggregate to system metrics by:

  • Macro: average per-class metric (treats all classes equally).
  • Micro: aggregate counts across classes (dominated by majority).
  • Weighted: macro weighted by support.
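A sketch of per-class recall plus macro and micro averaging over a made-up 3×3 matrix (rows = true, columns = predicted):

```python
# Hypothetical 3x3 confusion matrix; rows = true class, cols = predicted class.
cm = [
    [50, 2, 3],   # class 0: 55 true examples
    [5, 30, 5],   # class 1: 40 true examples
    [0, 1, 4],    # class 2: only 5 true examples (minority)
]
K = len(cm)

# Per-class recall: diagonal cell divided by its row sum.
per_class_recall = [cm[k][k] / sum(cm[k]) for k in range(K)]

# Macro: unweighted mean over classes (minority class counts equally).
macro = sum(per_class_recall) / K

# Micro: pool counts first; for multi-class recall this equals accuracy.
micro = sum(cm[k][k] for k in range(K)) / sum(map(sum, cm))

print([round(r, 2) for r in per_class_recall])  # [0.91, 0.75, 0.8]
print(round(macro, 2), round(micro, 2))  # 0.82 0.84
```

Here micro (0.84) exceeds macro (0.82) because the majority class is the easiest; with a badly-handled minority class the gap can be far larger.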

Visualizing the confusion matrix

Always visualize on multi-class problems:

  • Heat-map with diagonal highlighted.
  • Normalize by row (recall per class) or by column (precision per class). Choose based on what you need to debug.
  • Sort rows/columns by frequency or by similarity for easier reading.
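Row normalization, the usual first step before plotting, is a one-liner (the helper name is illustrative):

```python
def row_normalize(cm):
    """Divide each row by its sum: cell (i, j) becomes the fraction of
    true class i that was predicted as class j; the diagonal is recall."""
    return [[c / sum(row) for c in row] for row in cm]

print(row_normalize([[8, 2], [1, 9]]))  # [[0.8, 0.2], [0.1, 0.9]]
```

Normalizing by column instead would put per-class precision on the diagonal.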

Threshold dependence

For probabilistic classifiers, the confusion matrix depends on the decision threshold. As you sweep the threshold from 0 to 1:

  • TP and FP both decrease (fewer predicted positives).
  • The confusion matrix sweeps through different (TP, FP, FN, TN) tuples.
  • This is what produces the ROC and PR curves.
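A sketch of that sweep, with made-up scores; plotting TPR against FPR at each threshold traces the ROC curve:

```python
# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]

def tp_fp_at(thr):
    """TP and FP counts when predicting positive for score >= thr."""
    preds = [1 if s >= thr else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    return tp, fp

for thr in (0.0, 0.5, 1.1):
    print(thr, tp_fp_at(thr))
# 0.0 -> (3, 3): everything predicted positive
# 0.5 -> (2, 1)
# 1.1 -> (0, 0): nothing predicted positive
```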

For deployment, pick the threshold that optimizes your real-world objective (precision at a recall floor, expected utility, etc.).

What to report

A complete classification report should include:

  1. Confusion matrix (raw counts and row-normalized).
  2. Per-class precision, recall, F1.
  3. Macro-averaged metrics for overall performance.
  4. One operating-point metric (e.g., accuracy at threshold 0.5, or P@R=0.95).
  5. Threshold-free curves: ROC-AUC and PR-AUC.

scikit-learn’s classification_report and confusion_matrix produce most of this; pair with roc_auc_score and average_precision_score.

Common pitfalls

  • Reporting a single number for a multi-class problem. Always include the per-class breakdown.
  • Confusion matrix rows-vs-columns confusion. Different libraries / textbooks use different conventions; explicitly label “rows = true” or “rows = predicted” in plots.
  • Computing precision/recall on training data. Always evaluate on held-out data.
  • Choosing the threshold by maximizing F1 on the test set. That's tuning on test data; choose the threshold on a validation set instead.
  • Treating per-class F1 as comparable across classes with very different prevalences. Macro-average can mask huge per-class disparities; look at the table.