One-line definition
An activation function is a (usually) elementwise nonlinearity applied between linear layers in a neural network. Without it, stacking linear layers collapses to a single linear layer (no expressive power gain). The choice of activation shapes optimization, gradient flow, and final accuracy.
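A two-line check of the collapse claim (a sketch in PyTorch; the matrices are arbitrary):

```python
import torch

# Two stacked linear layers with no activation between them...
W1, b1 = torch.randn(4, 3), torch.randn(4)
W2, b2 = torch.randn(2, 4), torch.randn(2)
x = torch.randn(3)

stacked = W2 @ (W1 @ x + b1) + b2
# ...are exactly one linear layer with W = W2 W1 and b = W2 b1 + b2.
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
assert torch.allclose(stacked, collapsed, atol=1e-5)
```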
The standard family
| Activation | Formula | Range | Use today |
|---|---|---|---|
| Sigmoid | $\sigma(x) = \frac{1}{1+e^{-x}}$ | $(0, 1)$ | Output of binary classifier; gates in LSTMs/GRUs. Hidden layers: avoid (saturating gradients). |
| Tanh | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $(-1, 1)$ | RNN hidden (legacy); zero-centered version of sigmoid. |
| ReLU | $\max(0, x)$ | $[0, \infty)$ | Default for CNNs and MLPs; cheap, fast. |
| Leaky ReLU | $\max(\alpha x, x)$, $\alpha \approx 0.01$ | $(-\infty, \infty)$ | Avoids “dying ReLU” by leaking negative values. |
| ELU | $x$ for $x > 0$, $\alpha(e^x - 1)$ for $x \le 0$ | $(-\alpha, \infty)$ | Smooth and zero-centered. Slightly slower than ReLU. |
| GELU | $x \cdot \Phi(x)$ where $\Phi$ is the standard normal CDF | $\approx (-0.17, \infty)$ | Default in transformers (BERT, GPT-1/2/3). |
| Swish / SiLU | $x \cdot \sigma(x)$ | $\approx (-0.28, \infty)$ | Default in modern decoder LLMs (Llama, Mistral). |
| Softmax | $\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$ | simplex | Output of multi-class classifier; not used in hidden layers. |
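Every row of the table is a one-liner in PyTorch; a quick sketch for orientation (input values are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, 7)
torch.sigmoid(x)                      # squashed into (0, 1)
torch.tanh(x)                         # zero-centered, (-1, 1)
F.relu(x)                             # negatives clamped to 0
F.leaky_relu(x, negative_slope=0.01)  # negatives leaked, not killed
F.elu(x)                              # smooth; approaches -1 for large negatives
F.gelu(x)                             # x * Phi(x)
F.silu(x)                             # swish: x * sigmoid(x)
F.softmax(x, dim=0)                   # non-negative, sums to 1
```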
Why ReLU won (then why GELU/swish replaced it)
ReLU (Nair & Hinton, 2010) helped revive deep learning by mitigating vanishing gradients in deep CNNs:
- Gradient is exactly 1 in the active region (no saturation).
- Computationally trivial: a single $\max(0, x)$.
- Sparse activations (~half are zero): biological intuition + computational efficiency.
But ReLU has the dying ReLU problem: a neuron whose pre-activation is always negative outputs 0, gets gradient 0 forever, and never recovers.
GELU (Hendrycks & Gimpel, 2016) and swish / SiLU (Ramachandran et al., 2017) are smooth alternatives to ReLU with non-zero gradient everywhere. Empirically, they improve transformer training over ReLU; the gain is small (~1% perplexity) but consistent.
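The difference is easiest to see in the gradients. A minimal PyTorch sketch at a negative pre-activation:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0], requires_grad=True)

F.relu(x).backward()
print(x.grad)   # tensor([0.]) -- dead region: no gradient, no recovery

x.grad = None
F.gelu(x).backward()
print(x.grad)   # small but non-zero, so the neuron can still move
```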
In 2026:
- CNNs / MLPs: still mostly ReLU.
- Transformers: GELU (BERT-era) or SwiGLU (modern Llama-style decoders).
- RNNs: tanh / sigmoid for gates (legacy); RNNs are largely deprecated for new work.
SwiGLU and gated activations
Modern decoder LLMs (Llama 1/2/3, Mistral, Qwen) use SwiGLU in the FFN:
Two parallel linear projections: one is passed through swish, the two are multiplied elementwise, and a third linear projection maps back down. That is three weight matrices per FFN block instead of the original two (the hidden width is typically shrunk to roughly compensate), with better training dynamics. GLU = “gated linear unit.” Now standard.
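A minimal sketch of that block in PyTorch. The class and projection names (gate/up/down) are illustrative, though they mirror common Llama-style implementations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)  # goes through swish
        self.up = nn.Linear(d_model, d_hidden, bias=False)    # parallel projection
        self.down = nn.Linear(d_hidden, d_model, bias=False)  # third projection back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # swish(gate(x)) multiplied elementwise with up(x), then projected down
        return self.down(F.silu(self.gate(x)) * self.up(x))

# Hidden width is often ~(8/3)·d_model so parameter count roughly matches a
# two-matrix FFN with 4·d_model; real models round to a hardware-friendly multiple.
ffn = SwiGLUFFN(d_model=512, d_hidden=1365)
y = ffn(torch.randn(2, 16, 512))   # (batch, seq, d_model) in, same shape out
```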
When to use which output activation
| Task | Output activation |
|---|---|
| Binary classification | Sigmoid |
| Multi-class classification | Softmax |
| Multi-label classification | Sigmoid (independent binary heads) |
| Regression (unbounded) | Identity (no activation) |
| Regression (bounded to $(0, 1)$) | Sigmoid |
| Probability over a discrete distribution | Softmax |
| Embedding output | Identity, then L2-normalize |
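One practical wrinkle: in PyTorch, the classification rows of this table are usually implemented by leaving the head linear and folding the activation into the loss for numerical stability. A sketch:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 5)           # raw linear-head outputs, no activation
targets = torch.randint(0, 5, (8,))

# Multi-class: cross_entropy applies log-softmax internally; never softmax twice.
loss = F.cross_entropy(logits, targets)
probs = F.softmax(logits, dim=-1)    # apply softmax explicitly only at inference

# Binary / multi-label: same pattern, with the sigmoid folded into the loss.
ml_logits = torch.randn(8, 3)
ml_targets = torch.randint(0, 2, (8, 3)).float()
ml_loss = F.binary_cross_entropy_with_logits(ml_logits, ml_targets)
```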
Common pitfalls
- Putting ReLU on the output by default. A regression with a non-negative target range can use ReLU or softplus on the output; for general regression, use no activation.
- Sigmoid in hidden layers of deep nets. Saturates → vanishing gradients → no learning past a few layers.
- Picking exotic activations to chase 0.5% accuracy. The activation choice rarely matters compared to data, regularization, and architecture.
- Forgetting GELU has two forms. Exact (erf/CDF-based) and approximate (tanh-based polynomial). They produce slightly different outputs, which matters when porting weights across frameworks; check which form yours uses (see the sketch below).
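Checking that last pitfall is a three-liner, assuming PyTorch ≥ 1.12 (where the `approximate` argument to `F.gelu` landed):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, 101)
exact = F.gelu(x)                        # erf/CDF-based form
approx = F.gelu(x, approximate="tanh")   # tanh polynomial approximation
print((exact - approx).abs().max())      # small but non-zero discrepancy
```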
Related
- Weight initialization. Initialization is activation-dependent.
- Exploding and vanishing gradients. What motivates ReLU and gated activations.