Convolution as matrix multiplication (im2col)

A 2D convolution is a matmul in disguise. Unfold the input into columns, multiply by a flattened filter matrix. The reason CNNs run fast on the same hardware as transformers.

One-line definition

im2col rearranges a convolution input so each spatial location’s receptive field becomes a column of a matrix. The convolution then reduces to a single matmul: Y_col = W_flat · X_col.

Why it matters

Modern hardware (GPUs, TPUs) is optimized for dense matmul. A naive 2D convolution loop is the wrong shape for that hardware: nested loops over spatial positions, channels, and kernel offsets, with poor memory locality. im2col turns the same arithmetic into a single GEMM call that lands on the highly tuned BLAS path.
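For contrast, here is a minimal sketch of that naive loop in NumPy. The sizes and the name conv2d_naive are illustrative, not any library’s API; stride 1 and no padding are assumed throughout:

```python
import numpy as np

def conv2d_naive(x, w):
    """x: (C_in, H, W), w: (C_out, C_in, K, K) -> y: (C_out, H-K+1, W-K+1)."""
    C_in, H, W = x.shape
    C_out, _, K, _ = w.shape
    H_out, W_out = H - K + 1, W - K + 1
    y = np.zeros((C_out, H_out, W_out))
    # Six nested loops: output channel, two spatial positions, input
    # channel, two kernel offsets. Scalar work, scattered memory access.
    for co in range(C_out):
        for i in range(H_out):
            for j in range(W_out):
                for ci in range(C_in):
                    for di in range(K):
                        for dj in range(K):
                            y[co, i, j] += w[co, ci, di, dj] * x[ci, i + di, j + dj]
    return y

rng = np.random.default_rng(0)
y = conv2d_naive(rng.standard_normal((3, 8, 8)),     # input (C_in, H, W)
                 rng.standard_normal((4, 3, 3, 3)))  # kernel -> y: (4, 6, 6)
```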

Every major kernel library (cuDNN, MKL-DNN, XNNPACK) implements convolution as some variant of this idea. Understanding it explains why CNN inference cost scales like matmul, why grouped convolutions are cheap, and why depthwise-separable convolutions split into two matmuls.

The mechanism

For input X of shape (C_in, H, W), kernel W of shape (C_out, C_in, K, K), output Y of shape (C_out, H_out, W_out):

  1. im2col: for each output position (i, j), extract the C_in·K·K values in its receptive field and stack them as a column. The result is a matrix X_col of shape (C_in·K·K, H_out·W_out).
  2. Flatten kernel: reshape W to W_flat of shape (C_out, C_in·K·K).
  3. GEMM: Y_col = W_flat · X_col, shape (C_out, H_out·W_out).
  4. col2im: reshape Y_col back to (C_out, H_out, W_out). A runnable sketch of all four steps follows this list.
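
A minimal runnable sketch of the four steps, assuming stride 1 and no padding; names like im2col, X_col, and W_flat follow the text and are not any framework’s actual implementation:

```python
import numpy as np

def im2col(x, K):
    """x: (C_in, H, W) -> X_col: (C_in*K*K, H_out*W_out)."""
    C_in, H, W = x.shape
    H_out, W_out = H - K + 1, W - K + 1
    cols = np.empty((C_in * K * K, H_out * W_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = x[:, i:i + K, j:j + K]          # receptive field of (i, j)
            cols[:, i * W_out + j] = patch.ravel()  # one column per position
    return cols

rng = np.random.default_rng(0)
C_in, C_out, H, W, K = 3, 8, 10, 10, 3
x = rng.standard_normal((C_in, H, W))
w = rng.standard_normal((C_out, C_in, K, K))
H_out, W_out = H - K + 1, W - K + 1

X_col = im2col(x, K)                    # step 1: (C_in*K*K, H_out*W_out)
W_flat = w.reshape(C_out, -1)           # step 2: (C_out, C_in*K*K)
Y_col = W_flat @ X_col                  # step 3: the single GEMM
y = Y_col.reshape(C_out, H_out, W_out)  # step 4: col2im

# Check against a direct convolution: gather all receptive fields with
# stride tricks, then contract over channel and kernel offsets.
windows = np.lib.stride_tricks.sliding_window_view(x, (K, K), axis=(1, 2))
ref = np.einsum('ockl,cijkl->oij', w, windows)
assert np.allclose(y, ref)
```

Note that the column ordering produced by im2col must match the flattening order of W_flat; NumPy’s default row-major ravel and reshape keep the two consistent here.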

Memory cost: X_col duplicates each input pixel up to K² times. For a 3×3 kernel, that is a 9× blowup of the activation tensor.

Variants

  • Implicit GEMM: avoid materializing X_col in memory. Compute the matmul tile by tile, indexing back into the original input on the fly. cuDNN’s default for most conv shapes.
  • Winograd: trade matmul FLOPs for additions via polynomial transforms. Faster for small kernels (e.g. 3×3) on certain hardware. Lower numerical precision.
  • FFT convolution: by the convolution theorem, Y = IFFT(FFT(X) ⊙ FFT(W)). Wins for large kernels (rare in modern CNNs).
  • Depthwise convolution: each input channel has its own filter, so W_flat is block-diagonal. The matmul splits into tiny independent per-channel matmuls, much cheaper; see the sketch after this list.
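
To make the block-diagonal claim concrete, here is a sketch of the depthwise case, where one K×K filter per channel turns the big GEMM into C tiny matmuls. All names and sizes are illustrative:

```python
import numpy as np

def im2col_1ch(x, K):
    """x: (H, W) -> (K*K, H_out*W_out) for a single channel."""
    H, W = x.shape
    H_out, W_out = H - K + 1, W - K + 1
    cols = np.empty((K * K, H_out * W_out))
    for i in range(H_out):
        for j in range(W_out):
            cols[:, i * W_out + j] = x[i:i + K, j:j + K].ravel()
    return cols

rng = np.random.default_rng(0)
C, H, W, K = 4, 10, 10, 3
x = rng.standard_normal((C, H, W))
w = rng.standard_normal((C, K, K))  # one K x K filter per channel
H_out, W_out = H - K + 1, W - K + 1

# C independent (1, K*K) @ (K*K, H_out*W_out) matmuls: the block-diagonal
# W_flat never multiplies across channels, so the cross-channel work vanishes.
y = np.stack([
    (w[c].reshape(1, -1) @ im2col_1ch(x[c], K)).reshape(H_out, W_out)
    for c in range(C)
])
assert y.shape == (C, H_out, W_out)
```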

Why this matters for the senior interview

If asked “how does convolution actually run on a GPU,” the expected answer is: it is a matmul. Then walk through the im2col reshape, the GEMM call, and the memory blowup. Bonus points for noting that the flattened kernel has shape (C_out, C_in·K·K), so the FLOP count is 2 · C_out · C_in · K² · H_out · W_out, the same formula you see in every model card.
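
For a concrete number, here is that formula applied to a ResNet-style 3×3 layer (the layer sizes are illustrative):

```python
# FLOPs for one conv layer: 2 * C_out * C_in * K^2 * H_out * W_out
# (the factor of 2 counts one multiply plus one add per MAC).
C_in, C_out, K, H_out, W_out = 64, 64, 3, 56, 56
flops = 2 * C_out * C_in * K * K * H_out * W_out
print(f"{flops / 1e6:.0f} MFLOPs")  # ~231 MFLOPs
```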

Common pitfalls

  • Forgetting the memory cost. The im2col matrix can be K² times larger than the activation it came from. Implicit GEMM exists for this reason.
  • Conflating convolution with cross-correlation. Deep learning frameworks implement cross-correlation: the kernel is not flipped. The mathematician’s convolution flips the kernel. This almost never matters in practice.
  • Treating depthwise and pointwise as a single op. They are two distinct matmuls with very different shapes. Profile separately.