One-line definition
For a binary classifier producing scores, the ROC curve plots true positive rate (TPR) vs. false positive rate (FPR) as the decision threshold sweeps from its highest value to its lowest. The PR curve plots precision vs. recall over the same sweep. The area under each curve (AUC) is a single-number summary independent of any threshold choice.
Why it matters
A confusion matrix at one threshold tells you about that threshold; the curves and their AUCs summarize the classifier’s separability across all thresholds. Picking the wrong curve for your imbalance regime gives you misleading optimism (ROC-AUC near 1 on heavy imbalance is common but uninformative).
The four base counts
| | Predicted positive | Predicted negative |
|---|---|---|
| Actually positive | TP | FN |
| Actually negative | FP | TN |
From these:
- TPR (recall) = TP / (TP + FN)
- FPR = FP / (FP + TN)
- Precision = TP / (TP + FP)
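A minimal sketch of the same arithmetic (the counts are made up for illustration):

```python
# Turn the four confusion-matrix counts into the rates used below.
def rates(tp: int, fp: int, fn: int, tn: int) -> dict:
    return {
        "tpr_recall": tp / (tp + fn),   # TPR = recall = TP / (TP + FN)
        "fpr": fp / (fp + tn),          # FPR = FP / (FP + TN)
        "precision": tp / (tp + fp),    # precision = TP / (TP + FP)
    }

print(rates(tp=80, fp=40, fn=20, tn=860))
# {'tpr_recall': 0.8, 'fpr': 0.044..., 'precision': 0.666...}
```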
ROC curve and ROC-AUC
- X-axis: FPR (false alarm rate among true negatives).
- Y-axis: TPR (recall on positives).
- Random classifier: diagonal from (0,0) to (1,1), AUC = 0.5.
- Perfect classifier: top-left corner, AUC = 1.0.
Probabilistic interpretation: ROC-AUC is the probability that a uniformly random positive example scores higher than a uniformly random negative example.
ROC-AUC is invariant to class balance because TPR and FPR are normalized within each class. Sounds great. But this is also its failure mode. On heavy imbalance (1% positives) a model that’s “good at not flagging the 99% obvious negatives” gets a high ROC-AUC even when its top-1000 predictions are mostly wrong.
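That pairwise reading is easy to check directly. A brute-force sketch (O(positives × negatives), fine for toy data; the scores and labels below are invented):

```python
# ROC-AUC as P(score of a random positive > score of a random negative);
# ties count as half a win.
def roc_auc_pairwise(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,   0,   1,   0,   1,    0,   0,   1,   0,   0]
print(roc_auc_pairwise(scores, labels))  # 17/24 ≈ 0.708
```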
PR curve and PR-AUC
- X-axis: Recall.
- Y-axis: Precision.
- Random classifier: horizontal line at π, the positive class prior P(y = 1). AUC = π.
- Perfect classifier: top-right corner.
Interpretation: PR-AUC is usually computed as average precision (AP), the precision at each threshold weighted by the increase in recall gained there.
PR-AUC depends explicitly on class balance. That’s why it is more honest than ROC-AUC under imbalance. Halving the positive prior halves the random-baseline PR-AUC.
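A sketch of that computation on invented toy data (this is the step-wise AP definition, the same one scikit-learn uses for average_precision_score):

```python
# PR-AUC approximated as average precision: walk down the ranking and add
# precision * (increase in recall) each time a positive is encountered.
def average_precision(scores, labels):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp = fp = 0
    ap = 0.0
    for i in order:
        if labels[i] == 1:
            tp += 1
            ap += (tp / (tp + fp)) * (1 / n_pos)  # precision * recall step
        else:
            fp += 1
    return ap

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,   0,   1,   0,   1,    0,   0,   1,   0,   0]
print(average_precision(scores, labels))  # ≈ 0.69 on this toy ranking
```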
Which to report
| Setting | Default metric |
|---|---|
| Balanced or near-balanced classification | ROC-AUC |
| Imbalanced (positives < 5%) | PR-AUC, plus precision@k |
| Search / retrieval / recsys | PR-AUC, recall@k, NDCG |
| Fraud / anomaly detection | PR-AUC, precision@high-recall point |
| Medical screening | PR-AUC + sensitivity at fixed specificity |
For interview answers: if you give ROC-AUC alone for a fraud problem (1% positives), expect a follow-up.
Numerical example
A model on a 1%-positive dataset:
- Random baseline: ROC-AUC = 0.50, PR-AUC = 0.01.
- “Good” model: ROC-AUC = 0.98, PR-AUC = 0.30.
The ROC-AUC of 0.98 looks impressive but is routine under heavy imbalance; the PR-AUC of 0.30 (30× the 0.01 baseline, yet far from 1) is the more honest score. Always report both when imbalance is significant.
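A sketch of that contrast on synthetic data (assumes scikit-learn is installed; the exact numbers will vary with the seed and model):

```python
# Same model, same test set, scored both ways on a ~1%-positive problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.99], random_state=0)  # ~1% positives
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC-AUC:", roc_auc_score(y_te, scores))            # typically high
print("PR-AUC :", average_precision_score(y_te, scores))  # typically much lower
```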
Relationship between the two
- Shape: the ROC curve is monotonically non-decreasing; the PR curve can be jagged and non-monotonic.
- Dominance carries over: one model’s curve dominates another’s in ROC space iff it dominates in PR space (Davis & Goadrich, 2006).
- The AUCs are not directly comparable across the two curves: the same model can have very different ROC-AUC and PR-AUC values.
Common pitfalls
- Reporting ROC-AUC on a 99/1 dataset and calling 0.95 “great.” It’s likely not.
- Comparing PR-AUC across datasets with different positive priors. They live on different baselines.
- Treating high AUC as a deployment-ready signal. AUC measures separability across all thresholds; production needs one threshold, picked from a confusion matrix at the operating point (see the sketch after this list).
- Confusing ROC-AUC with “accuracy.” They are different quantities; accuracy is threshold-dependent, AUC is not.
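One way to pick that operating threshold, sketched as the highest-precision point that still meets a recall floor (the 0.80 floor is an invented requirement; assumes scikit-learn):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, scores, min_recall=0.80):
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have one more entry than thresholds; drop the end point.
    meets_floor = recall[:-1] >= min_recall
    if not meets_floor.any():
        raise ValueError("no threshold reaches the recall floor")
    best = np.argmax(np.where(meets_floor, precision[:-1], -np.inf))
    return thresholds[best], precision[best], recall[best]
```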
Related
- Calibration. Predicted probabilities should match empirical frequencies.
- Class imbalance. Handling imbalanced training data.