Dropout

Randomly zero out a fraction of activations during training. The simplest stochastic regularizer; still standard in vision and many NLP architectures.


One-line definition

During training, set each activation in a chosen layer to zero independently with probability p, and scale the surviving activations by 1/(1 − p) so the expected output is unchanged. At inference, dropout is off and activations are used as-is.
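Spelled out for a single activation (a minimal check, using the same drop probability p as in the mechanism section below, that the expectation really is preserved):

```latex
% Inverted dropout on one activation x_i with drop probability p
m_i \sim \mathrm{Bernoulli}(1-p), \qquad y_i = \frac{m_i \, x_i}{1-p}
% Expected training-time output equals the inference-time value:
\mathbb{E}[y_i] = \frac{\mathbb{E}[m_i]\, x_i}{1-p} = \frac{(1-p)\, x_i}{1-p} = x_i
```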

Why it matters

Dropout (Srivastava et al., 2014) is a cheap stochastic regularizer that empirically reduces overfitting in many settings. It can be interpreted as approximate ensembling over an exponentially large collection of subnetworks: each training step samples a different subnetwork; inference averages by using the full network with calibrated activations.

Modern usage is more selective than in 2015. Large transformer pretraining (Llama, GPT) uses dropout sparingly or not at all because (a) the data is essentially infinite relative to model capacity and (b) training instability concerns dominate over regularization. Dropout is still standard in vision (CNN and ViT FFNs), in fine-tuning small models, and in any setup where the train/test gap is large.

The mechanism (inverted dropout)

For an input vector x:

  1. Sample a binary mask m with m_i ~ Bernoulli(1 − p) independently.
  2. Compute y = (m ⊙ x) / (1 − p).

The 1/(1 − p) scaling at training time means inference can run with no scaling: just y = x. (“Inverted dropout” is now the standard implementation; the original 2014 paper instead scaled by the keep probability 1 − p at inference.)
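A minimal sketch of the two steps above, assuming PyTorch (the helper name inverted_dropout is illustrative, not a library function):

```python
import torch

def inverted_dropout(x: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    """Zero activations with probability p, rescale survivors by 1/(1 - p)."""
    if not training or p == 0.0:
        return x                                      # inference: identity, no scaling
    keep = 1.0 - p
    mask = (torch.rand_like(x) < keep).to(x.dtype)    # 1 with prob 1 - p, else 0
    return x * mask / keep                            # rescale so E[output] == x
```

In practice, torch.nn.Dropout(p) and torch.nn.functional.dropout(x, p, training=...) do this same bookkeeping, including the train/eval switch.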

Where to apply

Typical placement and drop probability p by architecture:

  • CNN classifier: after FC layers and after some conv blocks; p ≈ 0.5 (FC), 0.1–0.3 (conv).
  • Transformer (small / fine-tune): inside the FFN, on attention weights, on residual outputs; p ≈ 0.1.
  • Transformer (large pretrain, e.g. Llama): none (p = 0.0).
  • RNN: between layers (not within timesteps), see Variational Dropout; p ≈ 0.2–0.5.
  • Embedding: sometimes on the input embeddings; p ≈ 0.1.
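As a concrete sketch of the small-transformer row (PyTorch; FFNBlock and the dimensions are illustrative choices, not taken from any particular codebase):

```python
import torch
import torch.nn as nn

class FFNBlock(nn.Module):
    """Transformer feed-forward sublayer with the typical p=0.1 dropout placements."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048, p: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(p),               # dropout inside the FFN
            nn.Linear(d_ff, d_model),
        )
        self.resid_drop = nn.Dropout(p)  # dropout on the residual branch output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.resid_drop(self.ff(self.norm(x)))
```

Attention-weight dropout is usually set through the attention module's own argument, e.g. nn.MultiheadAttention(d_model, n_heads, dropout=0.1).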

Variants

Relation to other regularization

Dropout is roughly equivalent to weight noise plus a small effective weight decay. It is not a substitute for weight decay, batch/layer norm, or data augmentation; the regularization effects compose.

Common pitfalls

  • Forgetting to call model.eval() at inference. PyTorch’s nn.Dropout only deactivates in eval mode; leaving it active at inference adds noise to predictions (see the sketch after this list).
  • Using a high p in a transformer without a genuine need for regularization. A large drop probability in the FFN cripples a well-tuned model.
  • Dropping the same activations twice. Applying dropout both inside a sublayer and again on its output compounds the effective drop rate.
  • Mixing dropout with batch norm. The two can interact badly; the ResNet authors recommend BN inside, dropout outside the residual block.
  • Treating dropout as a substitute for more data. It tightens the train-test gap; it doesn’t make a fundamentally over-parameterized model generalize.
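A quick sketch of the first pitfall (PyTorch; the toy model is arbitrary): in training mode the same input gives different outputs on every pass, and only model.eval() makes predictions deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 1))
x = torch.randn(2, 8)

model.train()                              # dropout active: outputs vary run to run
noisy_a, noisy_b = model(x), model(x)
print(torch.allclose(noisy_a, noisy_b))    # almost surely False

model.eval()                               # dropout disabled: deterministic predictions
clean_a, clean_b = model(x), model(x)
print(torch.allclose(clean_a, clean_b))    # True
```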