Activation functions

ReLU, GELU, swish, sigmoid, tanh. What each does, why GELU/swish replaced ReLU in transformers, and when to use which.

Reviewed · 3 min read

One-line definition

An activation function is a (usually) elementwise nonlinearity applied between linear layers in a neural network. Without it, stacking linear layers collapses to a single linear layer (no expressive power gain). The choice of activation shapes optimization, gradient flow, and final accuracy.
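To make the "collapses to a single linear layer" point concrete, here is a minimal NumPy sketch (the shapes are arbitrary): two linear layers with no activation in between are exactly equivalent to one merged linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                       # batch of 4, feature dim 8 (arbitrary)
W1, b1 = rng.normal(size=(8, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 8)), rng.normal(size=8)

# Two "layers" with no activation in between...
stacked = (x @ W1 + b1) @ W2 + b2

# ...collapse to a single linear layer with merged weights.
W, b = W1 @ W2, b1 @ W2 + b2
single = x @ W + b
assert np.allclose(stacked, single)

# With a nonlinearity in between, no such merged (W, b) exists in general.
nonlinear = np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```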

The standard family

  • Sigmoid: σ(x) = 1 / (1 + e^(−x)); range (0, 1). Output of binary classifier; gates in LSTMs/GRUs. Hidden layers: avoid (saturating gradients).
  • Tanh: tanh(x); range (−1, 1). RNN hidden (legacy); zero-centered version of sigmoid.
  • ReLU: max(0, x); range [0, ∞). Default for CNNs and MLPs; cheap, fast.
  • Leaky ReLU: x for x > 0, αx otherwise (α small, e.g. 0.01); range (−∞, ∞). Avoids “dying ReLU” by leaking negative values.
  • ELU: x for x > 0, α(e^x − 1) otherwise; range (−α, ∞). Smooth and zero-centered. Slightly slower than ReLU.
  • GELU: x · Φ(x), where Φ is the standard normal CDF; range ≈ (−0.17, ∞). Default in transformers (BERT, GPT-1/2/3).
  • Swish / SiLU: x · σ(x); range ≈ (−0.28, ∞). Default in modern decoder LLMs (Llama, Mistral).
  • Softmax: softmax(x)_i = e^(x_i) / Σ_j e^(x_j); range: the probability simplex. Output of multi-class classifier; not used in hidden layers.
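For reference, a plain-NumPy sketch of the activations above (the α defaults and the exact erf-based GELU are conventional choices, not mandated by the table):

```python
import numpy as np
from scipy.special import erf   # used for the exact GELU; tanh is just np.tanh

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def gelu(x):
    # exact form: x * Phi(x), Phi = standard normal CDF
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def swish(x):
    # a.k.a. SiLU: x * sigmoid(x)
    return x * sigmoid(x)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)
```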

Why ReLU won (then why GELU/swish replaced it)

ReLU (Nair & Hinton, 2010) revived deep learning by solving vanishing gradients in deep CNNs:

  • Gradient is exactly 1 in the active region (no saturation).
  • Computationally trivial: a single max(0, x) comparison.
  • Sparse activations (~half are zero): biological intuition + computational efficiency.

But ReLU has the dying ReLU problem: a neuron whose pre-activation stays negative outputs 0, has gradient 0 forever, and never recovers.

GELU (Hendrycks & Gimpel, 2016) and swish / SiLU (Ramachandran et al., 2017) are smooth versions of ReLU with non-zero gradient everywhere. Empirically, they improve transformer training over ReLU; the gain is small (~1% perplexity) but consistent.
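A small PyTorch sketch of that gradient argument (illustrative inputs only): for negative pre-activations, ReLU's gradient is exactly 0, while GELU and SiLU still pass a small gradient, so the unit can recover.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)

for name, fn in [("relu", F.relu), ("gelu", F.gelu), ("silu", F.silu)]:
    y = fn(x).sum()
    (grad,) = torch.autograd.grad(y, x)
    print(name, grad.tolist())

# relu: gradient is exactly 0.0 for every negative input (the "dying ReLU" regime)
# gelu / silu: small but non-zero gradients for negative inputs
```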

In 2026:

  • CNNs / MLPs: still mostly ReLU.
  • Transformers: GELU (BERT-era) or SwiGLU (modern Llama-style decoders).
  • RNNs: tanh / sigmoid for gates (legacy); RNNs are largely deprecated for new work.

SwiGLU and gated activations

Modern decoder LLMs (Llama 1/2/3, Mistral, Qwen) use SwiGLU in the FFN:

Two parallel linear projections, one passed through swish, then elementwise product, then a third linear projection. Slightly more parameters per FFN block than the original design, but better training dynamics. GLU = “gated linear unit.” Now standard.
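A minimal PyTorch sketch of a SwiGLU FFN block as described above; the bias-free projections and the example sizes follow Llama-style conventions and are assumptions here, not a spec.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """FFN block: (swish(x W_gate) * x W_up) W_down."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w_up   = nn.Linear(d_model, d_hidden, bias=False)  # value projection
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Hypothetical sizes: Llama-style models shrink d_hidden (roughly 2/3 of 4 * d_model)
# to keep the parameter count close to a plain two-matrix FFN.
ffn = SwiGLUFFN(d_model=512, d_hidden=1376)
out = ffn(torch.randn(2, 16, 512))   # (batch, seq, d_model)
```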

When to use which output activation

  • Binary classification: Sigmoid
  • Multi-class classification: Softmax
  • Multi-label classification: Sigmoid (independent binary heads)
  • Regression (unbounded): Identity (no activation)
  • Regression (bounded to [0, 1]): Sigmoid
  • Probability over a discrete distribution: Softmax
  • Embedding output: Identity, then L2-normalize
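A PyTorch sketch of the output-head choices above (the layer sizes are placeholders); in practice the activation is usually folded into the loss during training and applied explicitly only at inference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = nn.Linear(128, 10)(torch.randn(4, 128))   # placeholder head with 10 outputs

probs_multiclass = F.softmax(logits, dim=-1)        # multi-class: softmax over classes
probs_multilabel = torch.sigmoid(logits)            # multi-label: independent sigmoids
regression       = logits                           # unbounded regression: identity
bounded          = torch.sigmoid(logits)            # regression bounded to [0, 1]
embedding        = F.normalize(logits, dim=-1)      # embedding: identity, then L2-normalize
# Binary classification is the single-logit case of the sigmoid line above.

# For training, prefer losses that take raw logits:
# F.cross_entropy (softmax folded in), F.binary_cross_entropy_with_logits (sigmoid folded in).
```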

Common pitfalls

  • Putting ReLU on the output by habit. Only a regression with a non-negative target range should use ReLU or softplus on the output; for general regression, use no activation.
  • Sigmoid in hidden layers of deep nets. Saturates → vanishing gradients → no learning past a few layers.
  • Picking exotic activations to chase 0.5% accuracy. The activation choice rarely matters compared to data, regularization, and architecture.
  • Forgetting GELU has two forms. Exact (CDF-based) and approximate (tanh-based polynomial). They give slightly different outputs; check which one your framework and pretrained checkpoint use.
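For the last pitfall, a quick PyTorch check of the two GELU forms; which one a pretrained checkpoint expects is a property of that model's config, so verify rather than assume.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, steps=9)
exact  = F.gelu(x)                        # exact, erf-based form (PyTorch default)
approx = F.gelu(x, approximate="tanh")    # tanh-based polynomial approximation

print((exact - approx).abs().max())       # small but non-zero difference
```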