One-line definition
During training, set each activation in a chosen layer to zero independently with probability $p$, and scale the surviving activations by $1/(1-p)$ so the expected output is unchanged. At inference, dropout is off; activations are used as-is.
Why it matters
Dropout (Srivastava et al., 2014) is a cheap stochastic regularizer that empirically reduces overfitting in many settings. It can be interpreted as approximate ensembling over an exponentially large collection of subnetworks: each training step samples a different subnetwork; inference averages by using the full network with calibrated activations.
Modern usage is more selective than in 2015. Large transformer pretraining (Llama, GPT) uses dropout sparingly or not at all because (a) the data is essentially infinite relative to model capacity and (b) concerns about training stability outweigh the need for extra regularization. Dropout is still standard in vision (CNN and ViT FFNs), in fine-tuning small models, and in any setup where the train/test gap is large.
The mechanism (inverted dropout)
For an input $h$:
- Sample a binary mask $m$ with $m_i \sim \mathrm{Bernoulli}(1-p)$ independently.
- Compute $\tilde{h} = \frac{m \odot h}{1-p}$.
The $1/(1-p)$ scaling at training time means inference can run with no scaling: just use $h$ unchanged. ("Inverted dropout" is now the standard implementation; the original 2014 paper left activations unscaled during training and scaled at inference instead.)
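A minimal sketch of this forward pass in NumPy; the function name, `p`, and the array shapes are illustrative choices, not part of any particular library API:

```python
import numpy as np

def dropout_forward(h, p, rng, training=True):
    """Inverted dropout: scale survivors by 1/(1-p) at train time,
    so inference needs no correction."""
    if not training or p == 0.0:
        return h  # inference: use activations as-is
    mask = rng.random(h.shape) >= p      # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)          # rescale so the expected output equals h

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))
out = dropout_forward(h, p=0.3, rng=rng, training=True)
```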
Where to apply
| Architecture | Where dropout typically goes | Typical $p$ |
|---|---|---|
| CNN classifier | After FC layers, after some conv blocks | 0.5 (FC), 0.1–0.3 (conv) |
| Transformer (small / fine-tune) | Inside FFN, on attention weights, on residual outputs | 0.1 |
| Transformer (large pretrain, e.g. Llama) | Typically none | 0.0 |
| RNN | Between layers (not within timesteps); see Variational Dropout | 0.2–0.5 |
| Embedding | Sometimes on input embeddings | 0.1 |
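As a placement illustration, here is a transformer-style FFN sub-block in PyTorch with dropout at the two usual spots from the table (inside the FFN and on the residual branch output), using the typical $p = 0.1$. The module and dimension names are made up for the example, not taken from any specific codebase:

```python
import torch
import torch.nn as nn

class FFNBlock(nn.Module):
    """Pre-norm transformer FFN sub-block with typical p = 0.1 dropout placements."""
    def __init__(self, d_model=512, d_ff=2048, p=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(p),               # inside the FFN
            nn.Linear(d_ff, d_model),
        )
        self.resid_drop = nn.Dropout(p)  # on the residual branch output

    def forward(self, x):
        return x + self.resid_drop(self.ff(self.norm(x)))

block = FFNBlock()
x = torch.randn(2, 16, 512)   # (batch, seq, d_model)
y = block(x)                  # dropout active: block.training is True by default
```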
Variants
- Spatial dropout (Tompson et al., 2015): drop entire feature maps in CNNs instead of single activations.
- DropConnect (Wan et al., 2013): drop weights instead of activations.
- Variational dropout for RNNs (Gal & Ghahramani, 2016): same dropout mask across all timesteps in a sequence.
- Stochastic Depth (Huang et al., 2016): drop entire residual blocks. Used in deep ResNets and ViTs.
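Two of the variants above are easy to sketch in PyTorch: spatial dropout corresponds to `nn.Dropout2d` (it zeroes whole channels), and a residual block can implement stochastic depth by randomly skipping its body at train time. This is an illustrative sketch under those assumptions, not code from the cited papers:

```python
import torch
import torch.nn as nn

# Spatial dropout: nn.Dropout2d zeroes entire feature maps (channels),
# not individual activations, on (N, C, H, W) inputs.
spatial_drop = nn.Dropout2d(p=0.2)
feat = spatial_drop(torch.randn(8, 64, 32, 32))

class StochasticDepthBlock(nn.Module):
    """Residual block that drops its entire body with probability p_drop
    at train time, with inverted scaling so expectations match at inference."""
    def __init__(self, body: nn.Module, p_drop: float = 0.1):
        super().__init__()
        self.body = body
        self.p_drop = p_drop

    def forward(self, x):
        if self.training and torch.rand(()) < self.p_drop:
            return x                                        # skip the block entirely
        if self.training:
            return x + self.body(x) / (1.0 - self.p_drop)   # inverted scaling
        return x + self.body(x)
```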
Relation to other regularization
Dropout is roughly equivalent to weight noise plus a small effective weight decay. It is not a substitute for weight decay, batch/layer norm, or data augmentation; the regularization effects compose.
Common pitfalls
- Forgetting to call `model.eval()` at inference. PyTorch's `nn.Dropout` is only deactivated in eval mode; leaving it active at inference adds noise to predictions (see the snippet after this list).
- Using a high $p$ in a transformer without a regularization need. $p = 0.5$ in a transformer FFN cripples a well-tuned model.
- Applying dropout to the same activations twice (e.g., both inside a sub-layer and on its output), which compounds the effective drop rate.
- Stacking dropout with batch norm can interact badly; the ResNet authors recommend BN inside, dropout outside the residual block.
- Treating dropout as a substitute for more data. It tightens the train-test gap; it doesn’t make a fundamentally over-parameterized model generalize.
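The first pitfall is easy to demonstrate. With the small, made-up model below left in training mode, two forward passes on the same input typically disagree; after `model.eval()` they match:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 10), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(10, 1)
)
x = torch.randn(4, 10)

model.train()                                # default mode: dropout is active
print(torch.allclose(model(x), model(x)))    # usually False: a fresh mask per call

model.eval()                                 # dropout becomes the identity
print(torch.allclose(model(x), model(x)))    # True: deterministic predictions
```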