## One-line definition
ResNet (He et al., 2015) introduced residual connections: the input of a block is added to its output, so each block computes F(x) + x rather than F(x). This solved the degradation problem of very deep networks and enabled the first widely trainable 50–152-layer CNNs.
## Why it matters
Before ResNet, networks past ~20 layers showed worse training accuracy than shallower ones: not from overfitting but from optimization difficulty. ResNet set a new ImageNet state of the art with its 152-layer models (3.57% top-5 error for the ensemble) and won ILSVRC 2015. The residual idea is now ubiquitous: every transformer, every modern CNN, every U-Net.
ResNet-50 remains the default transfer-learning backbone in 2026. Pretrained checkpoints widely available, well-understood behavior, fast inference.
## The residual block
The building block of ResNet:
```
input x
  ↓
Conv 3x3 → BN → ReLU
  ↓
Conv 3x3 → BN
  ↓
+ x   (identity skip)
  ↓
ReLU
```
That + x is the residual connection. If the spatial or channel dimensions change, x is projected through a 1×1 conv (with matching stride) first.
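In PyTorch, the block above can be sketched as follows (a minimal sketch, simplified from rather than identical to the torchvision implementation; the `proj` path is the 1×1 projection used when dimensions change):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convs with an identity (or projected) skip connection."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Project the input with a 1x1 conv when shape changes, else keep identity
        self.proj = None
        if stride != 1 or in_ch != out_ch:
            self.proj = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        identity = x if self.proj is None else self.proj(x)
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + identity)  # the "+ x" residual connection

x = torch.randn(1, 64, 56, 56)
print(BasicBlock(64, 64)(x).shape)             # → torch.Size([1, 64, 56, 56])
print(BasicBlock(64, 128, stride=2)(x).shape)  # → torch.Size([1, 128, 28, 28])
```

The second call exercises the projected-skip case: stride 2 halves the spatial size and the channel count doubles, so the identity path must be projected to match.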
## Bottleneck block (ResNet-50/101/152)
To make very deep networks compute-efficient:
```
input x
  ↓
Conv 1x1 → BN → ReLU   (reduce channels: 256 → 64)
  ↓
Conv 3x3 → BN → ReLU   (compute at low channel count)
  ↓
Conv 1x1 → BN          (expand channels: 64 → 256)
  ↓
+ x
  ↓
ReLU
```
Computing the 3×3 conv at one quarter of the channel width cuts compute sharply: two 3×3 convs at the full 256-channel width cost about 1.18M multiplies per spatial position, while the bottleneck costs about 70k, roughly a 17× reduction.
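A minimal PyTorch sketch of the bottleneck (channel widths from the diagram above; not the torchvision implementation), plus the multiply-count arithmetic per spatial position, ignoring BN and ReLU:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, with an identity skip."""
    def __init__(self, ch=256, mid=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, ch, 1, bias=False), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)

x = torch.randn(1, 256, 56, 56)
print(Bottleneck()(x).shape)  # → torch.Size([1, 256, 56, 56])

# Multiplies per spatial position, ignoring BN/ReLU:
basic = 2 * (3 * 3 * 256 * 256)                   # two 3x3 convs at 256 channels
bottle = 1*1*256*64 + 3*3*64*64 + 1*1*64*256      # 1x1 reduce + 3x3 + 1x1 expand
print(round(basic / bottle, 1))                   # → 16.9
```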
## Why residuals work
Three intertwined explanations:
- Easier to learn the identity. If a layer’s optimal contribution is “do nothing,” it’s easier to drive F(x) toward zero than to learn the identity through a stack of conv + BN + ReLU.
- Gradient highway. Backprop through y = x + F(x) has Jacobian I + ∂F/∂x. The identity term lets gradients flow back uninterrupted, preventing vanishing across depth.
- Implicit ensemble (Veit et al., 2016): a depth-n ResNet behaves like an ensemble of 2^n paths of varying depth. Shallow paths learn quickly and provide signal; deep paths refine.
See residual connections for the more detailed gradient analysis.
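The gradient-highway point can be illustrated with a scalar toy model in plain Python (each "layer" is just multiplication by a small weight w, so the per-layer Jacobians are w and 1 + w respectively):

```python
depth, w = 50, 0.1

# Plain chain: each layer computes y = w * x, so the end-to-end gradient
# dy/dx is the product of 50 small per-layer Jacobians.
plain_grad = w ** depth

# Residual chain: each layer computes y = x + w * x, so each per-layer
# Jacobian is (1 + w); the identity term keeps the product from collapsing.
residual_grad = (1.0 + w) ** depth

print(plain_grad)     # ~1e-50: vanished
print(residual_grad)  # ~117: signal survives
```

This is exactly the I + ∂F/∂x structure above, collapsed to one dimension.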
## ResNet variants
| Variant | Depth | Block type | Notes |
|---|---|---|---|
| ResNet-18, ResNet-34 | 18, 34 | Basic (two 3×3 convs) | Smaller models |
| ResNet-50 | 50 | Bottleneck | Most common; ~25.6M params |
| ResNet-101, ResNet-152 | 101, 152 | Bottleneck | Higher accuracy; slower |
| ResNeXt | varies | Grouped conv bottleneck | More parallelism within blocks |
| Wide ResNet | 16-50 | Basic, wider channels | Depth-vs-width tradeoff |
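Where the depths in the table come from: each bottleneck block contributes 3 convs, and the per-stage block counts ([3, 4, 6, 3] for ResNet-50, etc.) are from He et al. (2015). A quick sanity check:

```python
def resnet_depth(blocks_per_stage, convs_per_block=3):
    # 1 conv stem + (bottleneck convs) + 1 final fc layer
    return 1 + sum(blocks_per_stage) * convs_per_block + 1

print(resnet_depth([3, 4, 6, 3]))   # → 50   ResNet-50
print(resnet_depth([3, 4, 23, 3]))  # → 101  ResNet-101
print(resnet_depth([3, 8, 36, 3]))  # → 152  ResNet-152
```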
## Pre-activation ResNet
He et al. (2016) revisited the block ordering and proposed pre-activation:
```
x → BN → ReLU → Conv → BN → ReLU → Conv → + x
```
Normalization and activation come before the conv. Cleaner identity mapping, slightly better training. Used in some later models; classic ResNet-50 is post-activation.
The same pre/post norm question recurs in transformers (see residual connections).
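A minimal PyTorch sketch of a pre-activation block (simplified; no projection path):

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation ordering (He et al., 2016): BN-ReLU-Conv twice, then add x."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
        )

    def forward(self, x):
        # No activation after the sum: the skip path stays a pure identity.
        return x + self.body(x)

x = torch.randn(1, 64, 32, 32)
print(PreActBlock(64)(x).shape)  # → torch.Size([1, 64, 32, 32])
```

Note the design difference from the post-activation block: nothing is applied after the sum, so stacked blocks form an unbroken identity path from input to output.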
## ResNet’s legacy in 2026
Despite ViT and ConvNeXt being SoTA on ImageNet, ResNet-50 is still:
- The default backbone for object detection (Faster R-CNN, Mask R-CNN), segmentation (DeepLab, Mask R-CNN), pose estimation.
- The default transfer-learning starting point for low-data domains (medical imaging, satellite, manufacturing).
- The default comparison baseline in vision papers.
Why: pre-trained weights are widely available, behavior is well-understood, and the inductive biases (translation equivariance, locality) help when data is limited.
## Common pitfalls
- Skipping the projection on dimension changes. When channels or stride change, you must project x through a 1×1 conv before adding.
- Using ResNet without pretrained weights for small data. Training from scratch on small datasets gives poor results; almost always start from ImageNet-pretrained weights.
- Adding BN after the residual sum. The residual sum should usually come before the final activation/normalization in a block; placement matters.
- Treating ResNet-50 as outdated. In 2026 it’s still the practical default for many transfer-learning workloads.
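The first pitfall in concrete terms: a toy shape check showing why the naive add fails after a stride/channel change, and how the 1×1 projection fixes it.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)
f = nn.Conv2d(64, 128, 3, stride=2, padding=1, bias=False)
out = f(x)                      # shape (1, 128, 28, 28): channels and stride changed

try:
    _ = out + x                 # naive skip: shapes no longer match
except RuntimeError as e:
    print("naive add fails:", type(e).__name__)

proj = nn.Conv2d(64, 128, 1, stride=2, bias=False)  # 1x1 projection shortcut
print((out + proj(x)).shape)    # → torch.Size([1, 128, 28, 28])
```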
## Related
- Residual connections. The conceptual foundation.
- CNN architecture. Broader CNN context.
- Vision transformers. Modern alternative.