Mixup and CutMix

Two data-augmentation schemes that train on convex combinations of pairs of inputs and their labels. Strong regularizers for image classification; sometimes used for audio and tabular data.

One-line definition

Mixup (Zhang et al., 2018) trains the model on convex combinations of pairs of training examples: $\tilde{x} = \lambda x_i + (1-\lambda)\,x_j$ and $\tilde{y} = \lambda y_i + (1-\lambda)\,y_j$, with $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. CutMix (Yun et al., 2019) instead pastes a rectangular patch from $x_j$ onto $x_i$ and mixes labels by the area ratio.

Why it matters

Both techniques regularize by training on examples between the original training points. Empirically:

  • Improve top-1 accuracy on ImageNet by ~1–2% over baseline.
  • Improve calibration (predicted probabilities track accuracy better).
  • Improve robustness to label noise and adversarial perturbations.
  • Standard in modern image classification recipes (timm, ConvNeXt, ViT-style training).

Less common in NLP (token mixing is non-trivial) and in pretraining (large data already covers the input space well). Sometimes used in audio (mix waveforms or spectrograms) and tabular (interpolate features).

Mechanism

Mixup

For each batch, sample $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ once (or per sample), with small $\alpha$ (typically 0.2–0.4). For paired examples $(x_i, y_i)$ and $(x_j, y_j)$:

$$\tilde{x} = \lambda x_i + (1-\lambda)\,x_j, \qquad \tilde{y} = \lambda y_i + (1-\lambda)\,y_j$$

Train normally on $(\tilde{x}, \tilde{y})$ with cross-entropy. The label $\tilde{y}$ is a soft target.
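
A minimal PyTorch sketch of batch-level Mixup, pairing each example with a shuffled copy of the batch (the common in-batch trick, since it avoids a second data loader). The function `mixup_batch` and its arguments are illustrative, not from any particular library:

```python
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha=0.2, num_classes=1000):
    """Mixup over one batch: x is (B, C, H, W) images, y is (B,) int labels.

    Returns mixed inputs and soft-label targets.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))           # random pairing within the batch
    x_mixed = lam * x + (1 - lam) * x[perm]    # convex combination of inputs
    y_onehot = F.one_hot(y, num_classes).float()
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]  # soft targets
    return x_mixed, y_mixed
```

The soft targets drop straight into `F.cross_entropy`, which accepts class probabilities as targets in PyTorch ≥ 1.10; older versions need the equivalent two-term form `lam * ce(out, y) + (1 - lam) * ce(out, y[perm])`.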

CutMix

Sample $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. Pick a random rectangle in $x_i$ of area $(1-\lambda)$ times the image (e.g., width and height each $\sqrt{1-\lambda}$ times the image's). Paste the corresponding region from $x_j$ into $x_i$. Mix labels by the area ratio: $\tilde{y} = \lambda y_i + (1-\lambda)\,y_j$.

The resulting image has a clear local boundary (no blending). Models trained on CutMix often produce more localized class activations.
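
Under the same assumptions as the Mixup sketch above, a CutMix sketch. The label weight is re-derived from the actual pasted area after the box is clipped to the image, matching the area-ratio rule:

```python
import torch
import torch.nn.functional as F

def cutmix_batch(x, y, alpha=1.0, num_classes=1000):
    """CutMix over one batch: paste a random box from a shuffled copy."""
    B, _, H, W = x.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(B)

    # Box of area (1 - lam) * H * W, centered at a random point, clipped to the image.
    cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)

    x_mixed = x.clone()
    # Hard paste, no blending: the boundary stays sharp.
    x_mixed[:, :, top:bottom, left:right] = x[perm][:, :, top:bottom, left:right]

    # Recompute lam from the clipped box so the label mix matches the visible area.
    lam = 1 - (bottom - top) * (right - left) / (H * W)
    y_onehot = F.one_hot(y, num_classes).float()
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed
```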

Choosing $\alpha$

| Setting | Mixup $\alpha$ | CutMix $\alpha$ |
| --- | --- | --- |
| ImageNet from scratch | 0.2 | 1.0 |
| Small datasets | 0.2 (more aggressive Mixup hurts) | 1.0 |
| ViT training | 0.2 | 1.0 (used together with Mixup) |

$\alpha \to 0$ gives near-original samples (almost no mixing); $\alpha \to \infty$ gives $\lambda \to 0.5$ (always equally mixed). $\alpha$ between 0.2 and 1 is the empirical sweet spot.
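
A throwaway sketch (using PyTorch's `Beta` distribution) makes the effect of $\alpha$ visible: small $\alpha$ puts most of the mass near 0 and 1, large $\alpha$ concentrates it at 0.5.

```python
import torch

for alpha in (0.1, 0.2, 1.0, 10.0):
    lam = torch.distributions.Beta(alpha, alpha).sample((100_000,))
    near_edge = ((lam < 0.05) | (lam > 0.95)).float().mean()  # almost-unmixed draws
    print(f"alpha={alpha:>4}: mean={lam.mean():.2f}  P(lam near 0 or 1)={near_edge:.2f}")
```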

Why it works (intuition)

  • Vicinal risk minimization (Chapelle et al., 2001): training on a vicinity around each point regularizes the decision boundary.
  • Empirically: smoother decision functions, better calibration, less overconfidence on out-of-distribution inputs.
  • Equivalent to an implicit form of weight regularization in the linear case.

Common pitfalls

  • Mixing labels but not inputs. Some implementations mix targets without mixing inputs; this is just label noise, not Mixup.
  • Combining with strong cropping. Mixup + RandomResizedCrop + label smoothing + AutoAugment is the modern recipe but can over-regularize small datasets.
  • Using it directly for detection / segmentation. Class labels mix easily; bounding boxes do not. Variants like Mosaic (YOLOv4) handle this.
  • Forgetting to disable mixing at evaluation. Eval should use clean images; see the sketch below.
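
To make the last pitfall concrete, a hedged sketch of where mixing belongs in a training loop: train batches only, clean images at eval. `mixup_batch` is the helper sketched earlier; `model`, `loader`, `optimizer`, and `device` are assumed to exist.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer, device, num_classes):
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        # Mixing happens here, and only here.
        x, y_soft = mixup_batch(x, y, alpha=0.2, num_classes=num_classes)
        loss = F.cross_entropy(model(x), y_soft)  # soft targets: PyTorch >= 1.10
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

@torch.no_grad()
def evaluate(model, loader, device):
    model.eval()
    correct = total = 0
    for x, y in loader:  # no Mixup/CutMix here: evaluate on clean images
        pred = model(x.to(device)).argmax(dim=1)
        correct += (pred == y.to(device)).sum().item()
        total += y.numel()
    return correct / total
```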