One-line definition
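Mixup (Zhang et al., 2018) trains the model on convex combinations of pairs of training examples: $\tilde{x} = \lambda x_i + (1 - \lambda) x_j$ and $\tilde{y} = \lambda y_i + (1 - \lambda) y_j$ with $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. CutMix (Yun et al., 2019) instead pastes a rectangular patch from $x_j$ onto $x_i$ and mixes labels by the area ratio.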
Why it matters
Both techniques regularize by training on examples between the original training points. Empirically:
- Improve top-1 accuracy on ImageNet by roughly 1–2 percentage points over a baseline trained without them.
- Improve calibration (predicted probabilities track accuracy better).
- Improve robustness to label noise and adversarial perturbations.
- Standard in modern image classification recipes (timm, ConvNeXt, ViT-style training).
Both are less common in NLP (mixing discrete tokens is non-trivial) and in large-scale pretraining (large datasets already cover the input space well). They are sometimes used for audio (mixing waveforms or spectrograms) and tabular data (interpolating features).
Mechanism
Mixup
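For each batch, sample $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ once (or once per sample) with small $\alpha$ (typically 0.2–0.4). For paired examples $(x_i, y_i)$ and $(x_j, y_j)$:
$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j$$
Train normally on $(\tilde{x}, \tilde{y})$ with cross-entropy. The label $\tilde{y}$ is a soft target.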
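The Mixup step above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes one-hot labels, draws a single $\lambda$ per batch, and pairs each example with a random partner from the same batch. The names `mixup_batch` and `soft_cross_entropy` are hypothetical helpers introduced here.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Mix a batch with a shuffled copy of itself (one lambda per batch)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)        # lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))      # partner index j for each example i
    # The same lambda and the same pairing are applied to inputs and labels.
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy against soft targets, averaged over the batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(soft_targets * log_probs).sum(axis=1).mean())
```

In a training loop, `x_mix` would be fed to the model and `soft_cross_entropy(model_logits, y_mix)` used as the loss. Drawing one $\lambda$ per sample instead would only require passing `size=len(x)` to `rng.beta` and reshaping it so it broadcasts over the feature dimensions.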
CutMix
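Sample $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. Pick a random rectangle in $x_i$ covering a fraction $1 - \lambda$ of the image area (e.g., width and height each $\sqrt{1 - \lambda}$ times those of the image). Paste the corresponding region from $x_j$ into $x_i$. Mix labels by the area ratio: $\tilde{y} = \lambda y_i + (1 - \lambda) y_j$.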
The resulting image has a clear local boundary (no blending). Models trained on CutMix often produce more localized class activations.
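The CutMix procedure can be sketched the same way. The code below is an illustrative NumPy version under a few assumptions not stated in the text: images are laid out as `(N, C, H, W)`, labels are one-hot, one box is shared by the whole batch, and the box centre is drawn uniformly so the box may be clipped at the image border. Because of that clipping, $\lambda$ is recomputed from the area actually pasted. The function names are hypothetical.

```python
import numpy as np

def rand_bbox(height, width, lam, rng):
    """Random box covering about (1 - lam) of the image, clipped to the border."""
    cut_ratio = np.sqrt(1.0 - lam)      # side scale so that area is ~ (1 - lam)
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)
    cy, cx = rng.integers(height), rng.integers(width)   # box centre
    top, bottom = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, height)
    left, right = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, width)
    return int(top), int(bottom), int(left), int(right)

def cutmix_batch(x, y_onehot, alpha=1.0, rng=None):
    """Paste one rectangle from a shuffled copy of the batch into the batch."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))      # partner index j for each example i
    _, _, h, w = x.shape
    top, bottom, left, right = rand_bbox(h, w, lam, rng)
    x_mix = x.copy()                    # leave the caller's batch untouched
    x_mix[:, :, top:bottom, left:right] = x[perm][:, :, top:bottom, left:right]
    # Recompute lambda from the pasted area, since clipping can shrink the box.
    lam_adj = 1.0 - (bottom - top) * (right - left) / (h * w)
    y_mix = lam_adj * y_onehot + (1.0 - lam_adj) * y_onehot[perm]
    return x_mix, y_mix, lam_adj
```

The returned `lam_adj` is the fraction of each image that still comes from the original example, so the label weights match the pixels the model actually sees.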
Choosing $\alpha$
| Setting | Mixup $\alpha$ | CutMix $\alpha$ |
|---|---|---|
| ImageNet from scratch | 0.2 | 1.0 |
| Small datasets | 0.2 (more aggressive Mixup hurts) | 1.0 |
| ViT training | 0.2 (used together with CutMix) | 1.0 (used together with Mixup) |
$\alpha \to 0$ gives near-original samples (almost no mixing); $\alpha \to \infty$ gives $\lambda = 0.5$ (always equally mixed). $\alpha$ between 0.2 and 1 is the empirical sweet spot.
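To make the effect of $\alpha$ concrete, the sketch below estimates how much weight the minor example of each pair receives on average, $\mathbb{E}[\min(\lambda, 1 - \lambda)]$, for $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. The function name and this particular summary statistic are choices made here for illustration, not something prescribed by the Mixup or CutMix papers.

```python
import numpy as np

def mixing_strength(alpha, n=100_000, seed=0):
    """Monte Carlo estimate of E[min(lambda, 1 - lambda)] for Beta(alpha, alpha)."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha, size=n)
    # Weight given to the minor example: 0 means no mixing, 0.5 means equal mixing.
    return float(np.minimum(lam, 1.0 - lam).mean())

if __name__ == "__main__":
    for alpha in (0.05, 0.2, 1.0, 100.0):
        print(f"alpha={alpha:>6}: mean minor weight = {mixing_strength(alpha):.3f}")
```

The estimate approaches 0 for very small $\alpha$, equals 0.25 in expectation for $\alpha = 1$ (uniform $\lambda$), and approaches 0.5 for large $\alpha$, which matches the two limits described above.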
Why it works (intuition)
- Vicinal risk minimization (Chapelle et al., 2001): training on a vicinity around each point regularizes the decision boundary.
- Empirically: smoother decision functions, better calibration, less overconfidence on out-of-distribution inputs.
- In the linear case, Mixup can be interpreted as an implicit form of weight regularization.
Common pitfalls
- Mixing labels but not inputs. Some implementations mix targets without mixing inputs; this is just label noise, not Mixup.
- Combining with strong cropping. Mixup + RandomResizedCrop + label smoothing + AutoAugment is the modern recipe but can over-regularize small datasets.
- Using on detection / segmentation directly. Class labels mix easily; bounding boxes do not. Variants like Mosaic (YOLOv4) handle this.
- Forgetting to disable for evaluation. Eval should use clean images.
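Two of the pitfalls above (mixing labels without inputs, and leaving mixing on at evaluation time) can be avoided by routing both through one small wrapper. The class below is a hypothetical helper, not part of any library: it applies the same $\lambda$ and the same pairing to inputs and one-hot labels, and returns the batch unchanged when its `training` flag is off.

```python
import numpy as np

class BatchMixer:
    """Hypothetical helper: applies Mixup in training mode, nothing in eval mode."""

    def __init__(self, alpha=0.2, seed=None):
        self.alpha = alpha
        self.training = True
        self.rng = np.random.default_rng(seed)

    def __call__(self, x, y_onehot):
        if not self.training:
            return x, y_onehot          # evaluation sees clean images and labels
        lam = self.rng.beta(self.alpha, self.alpha)
        perm = self.rng.permutation(len(x))
        # Inputs and labels share lam and perm, so they can never get out of sync.
        x_mix = lam * x + (1.0 - lam) * x[perm]
        y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
        return x_mix, y_mix
```

A training loop would call `mixer(x, y)` on every batch and set `mixer.training = False` before validation, mirroring how dropout is switched off for evaluation.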
Related
- Label smoothing. Another way to soften targets.
- Dropout. Stochastic activation regularization.
- Regularization. Overview.