One-line definition
im2col rearranges a convolution input so each spatial location’s receptive field becomes a column of a matrix. The convolution then reduces to a single matmul: Y_mat = W_mat · X_col (flattened kernel times the column matrix).
Why it matters
Modern hardware (GPUs, TPUs) is optimized for dense matmul. A naive 2D convolution loop is the wrong shape for that hardware: nested loops over spatial positions, channels, and kernel offsets, with poor memory locality. im2col turns the same arithmetic into a single GEMM call that lands on the highly tuned BLAS path.
Every major kernel library (cuDNN, oneDNN/MKL-DNN, XNNPACK) implements convolution as some variant of this idea. Understanding it explains why CNN inference cost scales like matmul, why grouped convolutions are cheap, and why depthwise-separable convolutions split into two matmuls.
The mechanism
For input X of shape C_in × H × W, kernel W of shape C_out × C_in × K_h × K_w, and output Y of shape C_out × H_out × W_out:
- im2col: for each output position (i, j), extract the C_in · K_h · K_w values in its receptive field and stack them as a column. The result is a matrix X_col of shape (C_in · K_h · K_w) × (H_out · W_out).
- Flatten kernel: reshape W to W_mat of shape C_out × (C_in · K_h · K_w).
- GEMM: Y_mat = W_mat · X_col, shape C_out × (H_out · W_out).
- col2im: reshape Y_mat back to C_out × H_out × W_out.
Memory cost: the column matrix duplicates each input pixel up to K_h · K_w times. For a 3×3 kernel, that is a 9× blowup of the activation tensor.
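The steps above can be sketched in NumPy. This is a minimal illustration, assuming stride 1 and no padding; the function name `im2col` and the shapes are for this example only, and it is checked against a direct convolution loop:

```python
import numpy as np

def im2col(x, kh, kw):
    """Rearrange a (C_in, H, W) input into a (C_in*kh*kw, H_out*W_out) matrix.

    Stride 1, no padding, kept deliberately simple.
    """
    c, h, w = x.shape
    h_out, w_out = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, h_out * w_out), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i:i+kh, j:j+kw]            # receptive field at (i, j)
            cols[:, i * w_out + j] = patch.ravel()  # one column per output position
    return cols

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))       # C_in=3, H=W=8
k = rng.standard_normal((4, 3, 3, 3))    # C_out=4, C_in=3, 3x3 kernel

cols = im2col(x, 3, 3)                   # X_col: (27, 36)
w_mat = k.reshape(4, -1)                 # W_mat: (4, 27)
y = (w_mat @ cols).reshape(4, 6, 6)      # GEMM, then col2im back to (C_out, 6, 6)

# Reference: direct (cross-correlation) loop over output positions
y_ref = np.zeros_like(y)
for i in range(6):
    for j in range(6):
        y_ref[:, i, j] = np.tensordot(k, x[:, i:i+3, j:j+3], axes=3)
print(np.allclose(y, y_ref))  # True
```

Note the loop computes cross-correlation (no kernel flip), matching what deep learning frameworks call convolution.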
Variants
- Implicit GEMM: avoid materializing X_col in memory. Compute the matmul tile by tile, indexing back into the input on the fly. cuDNN’s default for most conv shapes.
- Winograd: trade matmul FLOPs for additions via polynomial transforms. Faster for small kernels (e.g. 3×3) on certain hardware. Lower numerical precision.
- FFT convolution: Y = IFFT(FFT(X) ⊙ FFT(W)), i.e. pointwise multiplication in the frequency domain, O(HW log HW) instead of O(HW · K_h · K_w). Wins for large kernels (rare in modern CNNs).
- Depthwise convolution: each input channel has its own filter, so W_mat is block-diagonal. The matmul splits into tiny independent matmuls, much cheaper.
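The block-diagonal split can be made concrete with a small sketch: instead of one large GEMM, a depthwise convolution is C independent (1 × K_h·K_w) @ (K_h·K_w × H_out·W_out) matmuls. Stride 1 and no padding assumed; shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 6, 6))   # C=3 channels
k = rng.standard_normal((3, 3, 3))   # one 3x3 filter per channel
h_out = w_out = 4                    # 6 - 3 + 1

# Depthwise conv: one tiny matmul per channel instead of a single big GEMM.
y = np.empty((3, h_out, w_out))
for c in range(3):
    cols = np.empty((9, h_out * w_out))       # per-channel im2col matrix
    for i in range(h_out):
        for j in range(w_out):
            cols[:, i * w_out + j] = x[c, i:i+3, j:j+3].ravel()
    y[c] = (k[c].ravel() @ cols).reshape(h_out, w_out)  # (1x9) @ (9x16)
```

Each channel touches only 1/C of the weights and 1/C of the input, which is why depthwise layers are so much cheaper than dense convolutions of the same spatial size.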
Why this matters for the senior interview
If asked “how does convolution actually run on a GPU,” the expected answer is: it is a matmul. Then walk through the im2col reshape, the GEMM call, and the memory blowup. Bonus points for noting that the flattened kernel has shape C_out × (C_in · K_h · K_w), so the FLOP count is 2 · C_out · C_in · K_h · K_w · H_out · W_out, the same formula you see in every model card.
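Plugging hypothetical layer shapes into the FLOP formula as a sanity check (the factor of 2 counts one multiply plus one add per MAC):

```python
# FLOPs for one conv layer: 2 * C_out * C_in * K_h * K_w * H_out * W_out.
# Example shapes are illustrative (a mid-network 3x3 layer at 56x56 resolution).
c_in, c_out, kh, kw = 64, 128, 3, 3
h_out = w_out = 56
flops = 2 * c_out * c_in * kh * kw * h_out * w_out
print(flops)  # 462422016, i.e. ~0.46 GFLOPs for a single forward pass
```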
Common pitfalls
- Forgetting the memory cost. The im2col matrix can be up to K_h · K_w times larger than the activation it came from. Implicit GEMM exists for this reason.
- Conflating convolution with cross-correlation. Deep learning frameworks implement cross-correlation; the kernel is not flipped. Mathematicians’ convolution flips the kernel. The distinction almost never matters in practice.
- Treating depthwise and pointwise as a single op. They are two distinct matmuls with very different shapes. Profile separately.
Related
- CNN architecture.
- FlashAttention. Same idea: rearrange for hardware.