Mixup and CutMix

Two data-augmentation schemes that train on convex combinations of pairs of inputs and their labels. Strong regularizers for image classification; sometimes used for audio and tabular data.

One-line definition

Mixup (Zhang et al., 2018) trains the model on convex combinations of pairs of training examples: $\tilde{x} = \lambda x_i + (1-\lambda)\,x_j$ and $\tilde{y} = \lambda y_i + (1-\lambda)\,y_j$, with $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. CutMix (Yun et al., 2019) instead pastes a rectangular patch from $x_j$ onto $x_i$ and mixes labels by the area ratio.

Why it matters

Both techniques regularize by training on examples between the original training points. Empirically:

  • Improve top-1 accuracy on ImageNet by ~1–2% over baseline.
  • Improve calibration (predicted probabilities track accuracy better).
  • Improve robustness to label noise and adversarial perturbations.
  • Standard in modern image classification recipes (timm, ConvNeXt, ViT-style training).

Less common in NLP (token mixing is non-trivial) and in pretraining (large data already covers the input space well). Sometimes used in audio (mix waveforms or spectrograms) and tabular (interpolate features).

Mechanism

Mixup

For each batch, sample $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ once (or per sample), with small $\alpha$ (typically 0.2–0.4). For paired examples $(x_i, y_i)$ and $(x_j, y_j)$:

$$\tilde{x} = \lambda x_i + (1-\lambda)\,x_j, \qquad \tilde{y} = \lambda y_i + (1-\lambda)\,y_j$$

Train normally on $(\tilde{x}, \tilde{y})$ with cross-entropy. The label $\tilde{y}$ is a soft target.
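
A minimal PyTorch sketch of batch-level Mixup, pairing each example with a shuffled copy of the batch (the common in-batch trick, since it avoids a second data loader). The function `mixup_batch` and its arguments are illustrative, not from any particular library:

```python
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha=0.2, num_classes=1000):
    """Mixup over one batch: x is (B, C, H, W) images, y is (B,) int labels.

    Returns mixed inputs and soft-label targets.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))           # random pairing within the batch
    x_mixed = lam * x + (1 - lam) * x[perm]    # convex combination of inputs
    y_onehot = F.one_hot(y, num_classes).float()
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]  # soft targets
    return x_mixed, y_mixed
```

The soft targets drop straight into `F.cross_entropy`, which accepts class probabilities as targets in PyTorch ≥ 1.10; older versions need the equivalent two-term form `lam * ce(out, y) + (1 - lam) * ce(out, y[perm])`.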

CutMix

Sample $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. Pick a random rectangle in $x_i$ of area $(1-\lambda)$ times the image (e.g., width and height each $\sqrt{1-\lambda}$ times the image's). Paste the corresponding region from $x_j$ into $x_i$. Mix labels by the area ratio: $\tilde{y} = \lambda y_i + (1-\lambda)\,y_j$.

The resulting image has a clear local boundary (no blending). Models trained on CutMix often produce more localized class activations.
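
Under the same assumptions as the Mixup sketch above, a CutMix sketch. The label weight is re-derived from the actual pasted area after the box is clipped to the image, matching the area-ratio rule:

```python
import torch
import torch.nn.functional as F

def cutmix_batch(x, y, alpha=1.0, num_classes=1000):
    """CutMix over one batch: paste a random box from a shuffled copy."""
    B, _, H, W = x.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(B)

    # Box of area (1 - lam) * H * W, centered at a random point, clipped to the image.
    cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)

    x_mixed = x.clone()
    # Hard paste, no blending: the boundary stays sharp.
    x_mixed[:, :, top:bottom, left:right] = x[perm][:, :, top:bottom, left:right]

    # Recompute lam from the clipped box so the label mix matches the visible area.
    lam = 1 - (bottom - top) * (right - left) / (H * W)
    y_onehot = F.one_hot(y, num_classes).float()
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed
```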

Choosing $\alpha$

| Setting | Mixup $\alpha$ | CutMix $\alpha$ |
| --- | --- | --- |
| ImageNet from scratch | 0.2 | 1.0 |
| Small datasets | 0.2 (more aggressive Mixup hurts) | 1.0 |
| ViT training | 0.2 | 1.0 (used together with Mixup) |

$\alpha \to 0$ gives near-original samples (almost no mixing); $\alpha \to \infty$ gives $\lambda \to 0.5$ (always equally mixed). $\alpha$ between 0.2 and 1 is the empirical sweet spot.
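
A throwaway sketch (using PyTorch's `Beta` distribution) makes the effect of $\alpha$ visible: small $\alpha$ puts most of the mass near 0 and 1, large $\alpha$ concentrates it at 0.5.

```python
import torch

for alpha in (0.1, 0.2, 1.0, 10.0):
    lam = torch.distributions.Beta(alpha, alpha).sample((100_000,))
    near_edge = ((lam < 0.05) | (lam > 0.95)).float().mean()  # almost-unmixed draws
    print(f"alpha={alpha:>4}: mean={lam.mean():.2f}  P(lam near 0 or 1)={near_edge:.2f}")
```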

Why it works (intuition)

  • Vicinal risk minimization (Chapelle et al., 2001): training on a vicinity around each point regularizes the decision boundary.
  • Empirically: smoother decision functions, better calibration, less overconfidence on out-of-distribution inputs.
  • Equivalent to an implicit form of weight regularization in the linear case.

Common pitfalls

  • Mixing labels but not inputs. Some implementations mix targets without mixing inputs; this is just label noise, not Mixup.
  • Combining with strong cropping. Mixup + RandomResizedCrop + label smoothing + AutoAugment is the modern recipe but can over-regularize small datasets.
  • Using it directly for detection / segmentation. Class labels mix easily; bounding boxes do not. Variants like Mosaic (YOLOv4) handle this.
  • Forgetting to disable mixing at evaluation. Eval should use clean images; see the sketch below.
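
To make the last pitfall concrete, a hedged sketch of where mixing belongs in a training loop: train batches only, clean images at eval. `mixup_batch` is the helper sketched earlier; `model`, `loader`, `optimizer`, and `device` are assumed to exist.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer, device, num_classes):
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        # Mixing happens here, and only here.
        x, y_soft = mixup_batch(x, y, alpha=0.2, num_classes=num_classes)
        loss = F.cross_entropy(model(x), y_soft)  # soft targets: PyTorch >= 1.10
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

@torch.no_grad()
def evaluate(model, loader, device):
    model.eval()
    correct = total = 0
    for x, y in loader:  # no Mixup/CutMix here: evaluate on clean images
        pred = model(x.to(device)).argmax(dim=1)
        correct += (pred == y.to(device)).sum().item()
        total += y.numel()
    return correct / total
```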