## One-line definition
ResNet (He et al., 2015) introduced residual connections: the input of a block is added to its output, so each block computes F(x) + x rather than F(x). This solved the degradation problem of very deep networks and enabled the first widely trainable 50–152-layer CNNs.
## Why it matters
Before ResNet, networks past ~20 layers showed worse training accuracy than shallower ones: not from overfitting but from optimization difficulty. ResNet set a new ImageNet state of the art with its 152-layer models (3.57% top-5 error for the ensemble) and won ILSVRC 2015. The residual idea is now ubiquitous: every transformer, every modern CNN, every U-Net.
ResNet-50 remains the default transfer-learning backbone in 2026. Pretrained checkpoints widely available, well-understood behavior, fast inference.
## The residual block
The building block of ResNet:
```
input x
  ↓
Conv 3x3 → BN → ReLU
  ↓
Conv 3x3 → BN
  ↓
+ x   (identity skip)
  ↓
ReLU
```
That + x is the residual connection. If the spatial or channel dimensions change, x is projected through a 1×1 conv (with matching stride) first.
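In PyTorch, the block above can be sketched as follows (a minimal sketch, simplified from rather than identical to the torchvision implementation; the `proj` path is the 1×1 projection used when dimensions change):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convs with an identity (or projected) skip connection."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Project the input with a 1x1 conv when shape changes, else keep identity
        self.proj = None
        if stride != 1 or in_ch != out_ch:
            self.proj = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        identity = x if self.proj is None else self.proj(x)
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + identity)  # the "+ x" residual connection

x = torch.randn(1, 64, 56, 56)
print(BasicBlock(64, 64)(x).shape)             # → torch.Size([1, 64, 56, 56])
print(BasicBlock(64, 128, stride=2)(x).shape)  # → torch.Size([1, 128, 28, 28])
```

The second call exercises the projected-skip case: stride 2 halves the spatial size and the channel count doubles, so the identity path must be projected to match.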
## Bottleneck block (ResNet-50/101/152)
To make very deep networks compute-efficient:
```
input x
  ↓
Conv 1x1 → BN → ReLU   (reduce channels: 256 → 64)
  ↓
Conv 3x3 → BN → ReLU   (compute at low channel count)
  ↓
Conv 1x1 → BN          (expand channels: 64 → 256)
  ↓
+ x
  ↓
ReLU
```
Computing the 3×3 conv at one quarter of the channel width cuts compute sharply: two 3×3 convs at the full 256-channel width cost about 1.18M multiplies per spatial position, while the bottleneck costs about 70k, roughly a 17× reduction.
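A minimal PyTorch sketch of the bottleneck (channel widths from the diagram above; not the torchvision implementation), plus the multiply-count arithmetic per spatial position, ignoring BN and ReLU:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, with an identity skip."""
    def __init__(self, ch=256, mid=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, ch, 1, bias=False), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)

x = torch.randn(1, 256, 56, 56)
print(Bottleneck()(x).shape)  # → torch.Size([1, 256, 56, 56])

# Multiplies per spatial position, ignoring BN/ReLU:
basic = 2 * (3 * 3 * 256 * 256)                   # two 3x3 convs at 256 channels
bottle = 1*1*256*64 + 3*3*64*64 + 1*1*64*256      # 1x1 reduce + 3x3 + 1x1 expand
print(round(basic / bottle, 1))                   # → 16.9
```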
## Why residuals work
Three intertwined explanations:
- Easier to learn the identity. If a layer’s optimal contribution is “do nothing,” it’s easier to drive F(x) toward zero than to learn the identity through a stack of conv + BN + ReLU.
- Gradient highway. Backprop through y = x + F(x) has Jacobian I + ∂F/∂x. The identity term lets gradients flow back uninterrupted, preventing vanishing across depth.
- Implicit ensemble (Veit et al., 2016): a depth-n ResNet behaves like an ensemble of 2^n paths of varying depth. Shallow paths learn quickly and provide signal; deep paths refine.
See residual connections for the more detailed gradient analysis.
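The gradient-highway point can be illustrated with a scalar toy model in plain Python (each "layer" is just multiplication by a small weight w, so the per-layer Jacobians are w and 1 + w respectively):

```python
depth, w = 50, 0.1

# Plain chain: each layer computes y = w * x, so the end-to-end gradient
# dy/dx is the product of 50 small per-layer Jacobians.
plain_grad = w ** depth

# Residual chain: each layer computes y = x + w * x, so each per-layer
# Jacobian is (1 + w); the identity term keeps the product from collapsing.
residual_grad = (1.0 + w) ** depth

print(plain_grad)     # ~1e-50: vanished
print(residual_grad)  # ~117: signal survives
```

This is exactly the I + ∂F/∂x structure above, collapsed to one dimension.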
## ResNet variants
| Variant | Depth | Block type | Notes |
|---|---|---|---|
| ResNet-18, ResNet-34 | 18, 34 | Basic (two 3×3 convs) | Smaller models |
| ResNet-50 | 50 | Bottleneck | Most common; ~25.6M params |
| ResNet-101, ResNet-152 | 101, 152 | Bottleneck | Higher accuracy; slower |
| ResNeXt | varies | Grouped conv bottleneck | More parallelism within blocks |
| Wide ResNet | 16-50 | Basic, wider channels | Depth-vs-width tradeoff |
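Where the depths in the table come from: each bottleneck block contributes 3 convs, and the per-stage block counts ([3, 4, 6, 3] for ResNet-50, etc.) are from He et al. (2015). A quick sanity check:

```python
def resnet_depth(blocks_per_stage, convs_per_block=3):
    # 1 conv stem + (bottleneck convs) + 1 final fc layer
    return 1 + sum(blocks_per_stage) * convs_per_block + 1

print(resnet_depth([3, 4, 6, 3]))   # → 50   ResNet-50
print(resnet_depth([3, 4, 23, 3]))  # → 101  ResNet-101
print(resnet_depth([3, 8, 36, 3]))  # → 152  ResNet-152
```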
## Pre-activation ResNet
He et al. (2016) revisited the block ordering and proposed pre-activation:
```
x → BN → ReLU → Conv → BN → ReLU → Conv → + x
```
Normalization and activation come before the conv. Cleaner identity mapping, slightly better training. Used in some later models; classic ResNet-50 is post-activation.
The same pre/post norm question recurs in transformers (see residual connections).
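A minimal PyTorch sketch of a pre-activation block (simplified; no projection path):

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation ordering (He et al., 2016): BN-ReLU-Conv twice, then add x."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
        )

    def forward(self, x):
        # No activation after the sum: the skip path stays a pure identity.
        return x + self.body(x)

x = torch.randn(1, 64, 32, 32)
print(PreActBlock(64)(x).shape)  # → torch.Size([1, 64, 32, 32])
```

Note the design difference from the post-activation block: nothing is applied after the sum, so stacked blocks form an unbroken identity path from input to output.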
## ResNet’s legacy in 2026
Despite ViT and ConvNeXt being SoTA on ImageNet, ResNet-50 is still:
- The default backbone for object detection (Faster R-CNN, Mask R-CNN), segmentation (DeepLab, Mask R-CNN), pose estimation.
- The default transfer-learning starting point for low-data domains (medical imaging, satellite, manufacturing).
- The default comparison baseline in vision papers.
Why: pre-trained weights are widely available, behavior is well-understood, and the inductive biases (translation equivariance, locality) help when data is limited.
## Common pitfalls
- Skipping the projection on dimension changes. When channels or stride change, you must project x through a 1×1 conv before adding.
- Using ResNet without pretrained weights for small data. Training from scratch on small datasets gives poor results; almost always start from ImageNet-pretrained weights.
- Adding BN after the residual sum. The residual sum should usually come before the final activation/normalization in a block; placement matters.
- Treating ResNet-50 as outdated. In 2026 it’s still the practical default for many transfer-learning workloads.
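The first pitfall in concrete terms: a toy shape check showing why the naive add fails after a stride/channel change, and how the 1×1 projection fixes it.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)
f = nn.Conv2d(64, 128, 3, stride=2, padding=1, bias=False)
out = f(x)                      # shape (1, 128, 28, 28): channels and stride changed

try:
    _ = out + x                 # naive skip: shapes no longer match
except RuntimeError as e:
    print("naive add fails:", type(e).__name__)

proj = nn.Conv2d(64, 128, 1, stride=2, bias=False)  # 1x1 projection shortcut
print((out + proj(x)).shape)    # → torch.Size([1, 128, 28, 28])
```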
## Related
- Residual connections. The conceptual foundation.
- CNN architecture. Broader CNN context.
- Vision transformers. Modern alternative.