CNN architecture

Convolutions encode translation equivariance and locality: the structural inductive bias that powered the deep learning revolution in vision.


One-line definition

A convolutional neural network stacks convolutional layers (sliding-window linear operators with shared weights), non-linearities, and pooling / downsampling to map images to feature maps that grow in semantic abstraction and shrink in spatial resolution with depth.

Why it matters

CNNs powered the deep-learning revolution in computer vision (AlexNet 2012, VGG 2014, ResNet 2015). Their structural priors (translation equivariance via weight sharing, local receptive fields, and hierarchical composition) match the structure of natural images and gave them a huge sample-efficiency advantage over fully-connected networks. Even in the transformer era, modern CNNs (ConvNeXt) remain competitive on standard benchmarks.

The building block: convolutional layer

Apply a small filter (e.g., 3×3×C_in) at every spatial position of the input, producing one output channel. Repeat with C_out filters → output of shape H×W×C_out.

Per output position (i, j), with a k×k filter w over C_in input channels:

y[i, j] = Σ_{u,v,c} w[u, v, c] · x[i+u, j+v, c] + b

Critical properties:

  • Weight sharing: the same filter is applied at every position. Vastly fewer parameters than fully-connected.
  • Translation equivariance: shifting the input shifts the output by the same amount. Hard-coded inductive bias.
  • Locality: each output depends only on a small spatial neighborhood of the input.
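The properties above are easy to verify directly. Below is a minimal numpy sketch (the function name `conv2d` is ours, not from any library) of a 'valid' convolution for one filter, followed by a numerical check of translation equivariance:

```python
import numpy as np

def conv2d(x, w, b=0.0):
    """Naive 'valid' 2D convolution (cross-correlation, as in deep learning).

    x: input of shape (H, W, C_in)
    w: one filter of shape (k, k, C_in), producing one output channel
    """
    H, W, _ = x.shape
    k = w.shape[0]
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same shared weights w are applied at every (i, j): weight sharing.
            # Each output depends only on a k x k neighborhood: locality.
            out[i, j] = np.sum(x[i:i + k, j:j + k, :] * w) + b
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 3))
w = rng.normal(size=(3, 3, 3))

y = conv2d(x, w)                         # shape (6, 6): 'valid' conv shrinks 8 -> 6
y_shift = conv2d(np.roll(x, 1, axis=0), w)
# Translation equivariance: shifting the input down one row shifts the output
# down one row (excluding the wrapped border row introduced by np.roll).
print(np.allclose(y_shift[1:], y[:-1]))  # True
```

Real frameworks vectorize this loop, but the arithmetic per output position is exactly the sum in the equation above.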

Standard CNN ingredients

  • 3×3 conv: workhorse; captures local features.
  • ReLU / GELU: pointwise non-linearity.
  • Batch normalization: stabilizes training, allows higher LRs.
  • Max pooling / average pooling: downsample by taking max / average over windows.
  • Strided convolution: alternative downsampling that learns the filter.
  • Global average pooling: reduce H×W×C to 1×1×C before the classifier head.
  • 1×1 convolution: per-pixel linear projection across channels; cheap channel mixing.
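Three of these ingredients reduce to a few lines of numpy, which makes their shapes concrete. A sketch (function names are ours for illustration):

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution == the same linear map across channels at every pixel.

    x: (H, W, C_in), w: (C_in, C_out). No spatial mixing at all.
    """
    return x @ w  # matmul broadcasts over the two spatial axes

def max_pool(x, window=2):
    """Non-overlapping max pooling over window x window patches."""
    H, W, C = x.shape
    x = x[:H - H % window, :W - W % window]  # crop to a multiple of window
    return x.reshape(H // window, window, W // window, window, C).max(axis=(1, 3))

def global_avg_pool(x):
    """Reduce (H, W, C) to a length-C vector for the classifier head."""
    return x.mean(axis=(0, 1))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 16))
print(conv1x1(x, rng.normal(size=(16, 32))).shape)  # (8, 8, 32)
print(max_pool(x).shape)                            # (4, 4, 16)
print(global_avg_pool(x).shape)                     # (16,)
```

The `conv1x1` one-liner is why 1×1 convolutions are described as "cheap channel mixing": they are just a per-pixel matrix multiply.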

Architectural eras

Era    Architecture           Key idea
2012   AlexNet                First major win; ReLU + dropout + GPU
2014   VGG                    All 3×3 convs, very deep
2014   GoogLeNet / Inception  Multi-scale modules, 1×1 dim reduction
2015   ResNet                 Residual connections enable 50+ layers
2016   DenseNet               Dense feature reuse
2017   MobileNet              Depthwise separable convs for efficiency
2019   EfficientNet           Compound scaling depth × width × resolution
2020   ViT                    Transformers replace CNN backbones
2022   ConvNeXt               Modernized ResNet matching ViT performance

In 2026, ViT and ConvNeXt are the dominant ImageNet-class backbones; classic ResNet-50 remains ubiquitous in transfer-learning pipelines.

Receptive field

The receptive field of a unit is the spatial extent of the input that influences it. Stacking n layers of k×k conv with stride 1 gives receptive field n(k−1) + 1 (so 2n + 1 for 3×3 convs). Pooling and strided convolution multiply the effective stride, growing the receptive field exponentially.
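The linear-versus-exponential growth is easy to see with the standard receptive-field recurrence. A small sketch (the helper name `receptive_field` is ours):

```python
def receptive_field(layers):
    """Receptive field of one output unit after a stack of conv/pool layers.

    layers: list of (kernel_size, stride) pairs, applied in order.
    Recurrence: rf += (k - 1) * jump, where jump is the product of strides so far.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Four 3x3 convs at stride 1: linear growth, rf = n(k-1) + 1 = 9.
print(receptive_field([(3, 1)] * 4))  # 9
# Four 3x3 convs at stride 2: each layer's contribution doubles, rf = 31.
print(receptive_field([(3, 2)] * 4))  # 31
```

This is why a few stride-2 stages early in a network buy far more receptive field per layer than any number of stride-1 convs.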

For dense prediction (segmentation), large receptive field matters; for classification, global pooling at the end aggregates over all spatial locations.

ConvNeXt and the modern CNN

ConvNeXt (Liu et al., 2022) modernized ResNet-50 by adopting transformer-era design choices:

  • LayerNorm instead of BatchNorm.
  • GELU instead of ReLU.
  • Larger kernels (7×7 depthwise).
  • Inverted bottleneck (channels-up then down).

Result: matches or beats ViT on ImageNet at the same compute. CNNs are not obsolete; transformers won by being better-designed, not by inherent architectural superiority.

Common pitfalls

  • Forgetting padding. Without padding, each k×k conv layer shrinks spatial dims by k−1 (a 3×3 conv maps H×W to (H−2)×(W−2)). Use padding='same' to preserve spatial dimensions.
  • Channels-first vs. channels-last. PyTorch defaults to channels-first (N, C, H, W); TensorFlow / Keras to channels-last (N, H, W, C). Conversions are common bug sources.
  • Skipping batch normalization. Deep CNNs without BN are very hard to train.
  • Using max-pool too aggressively. Halving spatial resolution at every layer destroys fine detail; stride-2 conv blocks let you control it.
  • Treating CNNs as universally outperformed by transformers. They are competitive on ImageNet at scale; on small datasets, CNNs often beat ViT due to inductive biases.
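The padding pitfall above reduces to one formula worth memorizing: out = ⌊(n + 2·pad − k) / stride⌋ + 1. A quick sketch (the helper name `conv_out_size` is ours):

```python
def conv_out_size(n, k, stride=1, pad=0):
    """Output spatial size of a conv layer along one dimension."""
    return (n + 2 * pad - k) // stride + 1

# No padding: each 3x3 conv shrinks the input by k - 1 = 2.
print(conv_out_size(32, 3))                   # 30
# 'same' padding for odd k is pad = k // 2: spatial size preserved.
print(conv_out_size(32, 3, pad=3 // 2))       # 32
# Stride-2 conv with pad=1 halves the resolution: controlled downsampling.
print(conv_out_size(32, 3, stride=2, pad=1))  # 16
```

Running the formula layer by layer is the fastest way to debug a shape mismatch in a deep stack.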