One-line definition
During training, set each activation in a chosen layer to zero independently with probability $p$, and scale the surviving activations by $1/(1-p)$ so the expected output is unchanged. At inference, dropout is off; activations are used as-is.
Why it matters
Dropout (Srivastava et al., 2014) is a cheap stochastic regularizer that empirically reduces overfitting in many settings. It can be interpreted as approximate ensembling over an exponentially large collection of subnetworks: each training step samples a different subnetwork; inference averages by using the full network with calibrated activations.
Modern usage is more selective than in 2015. Large transformer pretraining (Llama, GPT) uses dropout sparingly or not at all because (a) the data is essentially infinite relative to model capacity and (b) concerns about training stability outweigh the need for extra regularization. Dropout is still standard in vision (CNN and ViT FFNs), in fine-tuning small models, and in any setup where the train/test gap is large.
The mechanism (inverted dropout)
For an input $h$:
- Sample a binary mask $m$ with $m_i \sim \mathrm{Bernoulli}(1-p)$ independently.
- Compute $\tilde{h} = \frac{m \odot h}{1-p}$.
The $1/(1-p)$ scaling at training time means inference can run with no scaling: just use $h$ unchanged. ("Inverted dropout" is now the standard implementation; the original 2014 paper left activations unscaled during training and scaled at inference instead.)
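A minimal sketch of this forward pass in NumPy; the function name, `p`, and the array shapes are illustrative choices, not part of any particular library API:

```python
import numpy as np

def dropout_forward(h, p, rng, training=True):
    """Inverted dropout: scale survivors by 1/(1-p) at train time,
    so inference needs no correction."""
    if not training or p == 0.0:
        return h  # inference: use activations as-is
    mask = rng.random(h.shape) >= p      # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)          # rescale so the expected output equals h

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))
out = dropout_forward(h, p=0.3, rng=rng, training=True)
```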
Where to apply
| Architecture | Where dropout typically goes | Typical $p$ |
|---|---|---|
| CNN classifier | After FC layers, after some conv blocks | 0.5 (FC), 0.1–0.3 (conv) |
| Transformer (small / fine-tune) | Inside FFN, on attention weights, on residual outputs | 0.1 |
| Transformer (large pretrain, e.g. Llama) | Typically none | 0.0 |
| RNN | Between layers (not within timesteps); see Variational Dropout | 0.2–0.5 |
| Embedding | Sometimes on input embeddings | 0.1 |
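As a placement illustration, here is a transformer-style FFN sub-block in PyTorch with dropout at the two usual spots from the table (inside the FFN and on the residual branch output), using the typical $p = 0.1$. The module and dimension names are made up for the example, not taken from any specific codebase:

```python
import torch
import torch.nn as nn

class FFNBlock(nn.Module):
    """Pre-norm transformer FFN sub-block with typical p = 0.1 dropout placements."""
    def __init__(self, d_model=512, d_ff=2048, p=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(p),               # inside the FFN
            nn.Linear(d_ff, d_model),
        )
        self.resid_drop = nn.Dropout(p)  # on the residual branch output

    def forward(self, x):
        return x + self.resid_drop(self.ff(self.norm(x)))

block = FFNBlock()
x = torch.randn(2, 16, 512)   # (batch, seq, d_model)
y = block(x)                  # dropout active: block.training is True by default
```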
Variants
- Spatial dropout (Tompson et al., 2015): drop entire feature maps in CNNs instead of single activations.
- DropConnect (Wan et al., 2013): drop weights instead of activations.
- Variational dropout for RNNs (Gal & Ghahramani, 2016): same dropout mask across all timesteps in a sequence.
- Stochastic Depth (Huang et al., 2016): drop entire residual blocks. Used in deep ResNets and ViTs.
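Two of the variants above are easy to sketch in PyTorch: spatial dropout corresponds to `nn.Dropout2d` (it zeroes whole channels), and a residual block can implement stochastic depth by randomly skipping its body at train time. This is an illustrative sketch under those assumptions, not code from the cited papers:

```python
import torch
import torch.nn as nn

# Spatial dropout: nn.Dropout2d zeroes entire feature maps (channels),
# not individual activations, on (N, C, H, W) inputs.
spatial_drop = nn.Dropout2d(p=0.2)
feat = spatial_drop(torch.randn(8, 64, 32, 32))

class StochasticDepthBlock(nn.Module):
    """Residual block that drops its entire body with probability p_drop
    at train time, with inverted scaling so expectations match at inference."""
    def __init__(self, body: nn.Module, p_drop: float = 0.1):
        super().__init__()
        self.body = body
        self.p_drop = p_drop

    def forward(self, x):
        if self.training and torch.rand(()) < self.p_drop:
            return x                                        # skip the block entirely
        if self.training:
            return x + self.body(x) / (1.0 - self.p_drop)   # inverted scaling
        return x + self.body(x)
```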
Relation to other regularization
Dropout is roughly equivalent to weight noise plus a small effective weight decay. It is not a substitute for weight decay, batch/layer norm, or data augmentation; the regularization effects compose.
Common pitfalls
- Forgetting to call `model.eval()` at inference. PyTorch's `nn.Dropout` is only deactivated in eval mode; leaving it active at inference adds noise to predictions (see the snippet after this list).
- Using a high $p$ in a transformer without a regularization need. $p = 0.5$ in a transformer FFN cripples a well-tuned model.
- Applying dropout to the same activations twice (e.g., both inside a sub-layer and on its output), which compounds the effective drop rate.
- Stacking dropout with batch norm can interact badly; the ResNet authors recommend BN inside, dropout outside the residual block.
- Treating dropout as a substitute for more data. It tightens the train-test gap; it doesn’t make a fundamentally over-parameterized model generalize.
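The first pitfall is easy to demonstrate. With the small, made-up model below left in training mode, two forward passes on the same input typically disagree; after `model.eval()` they match:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 10), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(10, 1)
)
x = torch.randn(4, 10)

model.train()                                # default mode: dropout is active
print(torch.allclose(model(x), model(x)))    # usually False: a fresh mask per call

model.eval()                                 # dropout becomes the identity
print(torch.allclose(model(x), model(x)))    # True: deterministic predictions
```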