Pruning: structured vs unstructured sparsity

Set unimportant weights to zero, recover most of the accuracy. Unstructured pruning shrinks model size; structured pruning shrinks inference time. They solve different problems.

Reviewed · 3 min read

One-line definition

Pruning removes weights from a trained network and fine-tunes to recover accuracy. Unstructured pruning zeros individual weights. Structured pruning removes entire neurons, channels, heads, or layers.

Why it matters

Modern networks are massively over-parameterized. The lottery ticket hypothesis (Frankle & Carbin, 2019) suggests that within a trained dense network there is a small subnetwork (often 10 to 20 percent of the weights) that, retrained from its original initialization, matches the full network's accuracy. Pruning is the practical exploitation of that observation.

Two distinct goals:

  • Smaller model on disk and in memory: unstructured pruning + sparse storage. Useful for distribution and memory-bound deployment.
  • Faster inference on real hardware: structured pruning. Removes whole tensors so the remaining computation is dense and matches GEMM kernels.

The two regimes

Unstructured pruning

Zero individual weights below some magnitude threshold. Common criterion: magnitude pruning (zero every weight with |w| below a threshold τ), often combined with weight decay during fine-tuning.

  • Sparsity achievable: 90 to 95 percent on over-parameterized models
  • Storage benefit: real (CSR / CSC sparse formats)
  • Speed benefit: none on standard GPUs
  • Why no speedup: sparse matmul kernels are rarely faster than dense matmul until past roughly 90 percent sparsity, and only with specialized hardware support (NVIDIA 2:4 sparsity, custom accelerators)
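
A minimal sketch of global magnitude pruning using PyTorch's built-in `torch.nn.utils.prune`; the toy model and the 90 percent amount are placeholder choices:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Global magnitude pruning: zero the 90 percent of weights with the
# smallest |w| across all listed tensors, not 90 percent per layer.
params = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]
prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=0.9)

# Masks are applied on the fly during forward; bake them in before export.
for module, name in params:
    prune.remove(module, name)

sparsity = (model[0].weight == 0).float().mean().item()
print(f"layer-0 sparsity: {sparsity:.1%}")  # ~90% globally; varies per layer
```

Note the zeros are still stored densely after this; the storage win only materializes once the tensor is converted to a sparse format (e.g. `Tensor.to_sparse_csr()`).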

Structured pruning

Remove entire structures: convolutional channels, transformer heads, FFN neurons, sometimes whole layers.

  • Sparsity achievable: 30 to 70 percent typical
  • Storage benefit: real
  • Speed benefit: real, proportional to sparsity
  • Why it works on hardware: the output is a smaller dense tensor that runs through standard GEMM kernels
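
To get that speedup, the pruned structures have to be physically removed from the tensors, not just masked. A minimal sketch for a linear layer, scoring output neurons by L2 norm (`prune_linear_out_features` is a hypothetical helper, not a library API):

```python
import torch
import torch.nn as nn

def prune_linear_out_features(layer: nn.Linear, keep_ratio: float) -> nn.Linear:
    """Drop the lowest-importance output neurons; return a smaller dense layer."""
    importance = layer.weight.norm(p=2, dim=1)        # L2 norm per output neuron
    n_keep = max(1, int(layer.out_features * keep_ratio))
    keep = importance.topk(n_keep).indices.sort().values

    smaller = nn.Linear(layer.in_features, n_keep, bias=layer.bias is not None)
    with torch.no_grad():
        smaller.weight.copy_(layer.weight[keep])
        if layer.bias is not None:
            smaller.bias.copy_(layer.bias[keep])
    return smaller

layer = nn.Linear(512, 512)
print(prune_linear_out_features(layer, keep_ratio=0.5))
# Linear(in_features=512, out_features=256, bias=True)
```

Whatever consumes this layer's output must have its input dimension sliced with the same `keep` indices, which is why structured pruning tooling has to trace the whole graph.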

The 2:4 semi-structured compromise

NVIDIA Ampere and later support 2:4 sparsity: in every group of 4 consecutive weights, exactly 2 are zero. Sparse Tensor Cores exploit this fixed pattern for up to a 2x speedup on the affected matmuls. It is a compromise between unstructured pruning (more flexible, no speedup) and structured pruning (less flexible, full speedup).
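
A sketch of imposing the 2:4 pattern by masking (keep the 2 largest magnitudes in each group of 4). Masking alone gives no speedup; that requires kernels that understand the format, e.g. cuSPARSELt-backed sparse matmuls:

```python
import torch

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude weights in every contiguous group of 4."""
    w = weight.reshape(-1, 4)
    topk = w.abs().topk(2, dim=1).indices              # 2 survivors per group
    mask = torch.zeros_like(w, dtype=torch.bool).scatter_(1, topk, True)
    return (w * mask).reshape(weight.shape)

w = torch.randn(4, 8)
print(apply_2_4_sparsity(w))  # exactly 2 nonzeros in every group of 4
```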

Pipeline

  1. Train the dense model normally.
  2. Score weights or structures by importance (magnitude, gradient-magnitude, Hessian-based, Fisher).
  3. Prune below a target sparsity.
  4. Fine-tune to recover accuracy. Often iterative: prune-finetune-prune-finetune (see the sketch after this list).
  5. Optional: lottery-ticket rewind. Reset weights to an early-training checkpoint, train the sparse mask from there.
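
The prune-finetune loop of steps 2 to 4 might look like this PyTorch sketch. Repeated calls to `l1_unstructured` compose: each round prunes 50 percent of the still-unpruned weights, so sparsity ramps 50, 75, 87.5 percent. The toy model and random data stand in for the real task:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
linears = [m for m in model.modules() if isinstance(m, nn.Linear)]

for _round in range(3):
    # Steps 2-3: score by magnitude, prune 50% of the remaining weights.
    for m in linears:
        prune.l1_unstructured(m, name="weight", amount=0.5)

    # Step 4: fine-tune to recover accuracy (placeholder data and loss).
    for _ in range(100):
        x, y = torch.randn(16, 32), torch.randint(0, 2, (16,))
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Bake the final masks into the weight tensors.
for m in linears:
    prune.remove(m, "weight")
```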

Tradeoffs vs other compression

  • Pruning vs quantization: orthogonal. Combine both. INT8 + 50 percent structured sparsity is common in production.
  • Pruning vs distillation: distillation trains a smaller model from scratch with a teacher’s soft targets. Pruning starts dense and shrinks. Distillation often produces better small models but needs the teacher’s training data.
  • Pruning at training time (e.g. RigL, Sparse Transfer): prune and regrow connections during training. Avoids the separate prune-finetune cycle but is harder to tune.

Common pitfalls

  • Reporting unstructured-sparsity speedups on standard GPUs. Almost always misleading. Sparse storage is not sparse compute.
  • Pruning before training is done. Importance scores taken mid-training reflect the optimization trajectory, not the final model, and pruning on them often hurts.
  • Treating “90 percent sparse” as a quality measure. What matters is task performance at the achieved compute or memory cost.
  • Forgetting batchnorm / layernorm. Structured pruning must also slice the normalization parameters and running statistics to match the kept channels (see the sketch below).
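
A sketch of that normalization fix for batchnorm: rebuild the layer with per-channel parameters and running statistics sliced to the kept channels (`slice_batchnorm` and the `keep` indices are illustrative):

```python
import torch
import torch.nn as nn

def slice_batchnorm(bn: nn.BatchNorm2d, keep: torch.Tensor) -> nn.BatchNorm2d:
    """Rebuild a BatchNorm2d so its per-channel state matches the kept channels."""
    smaller = nn.BatchNorm2d(len(keep), eps=bn.eps, momentum=bn.momentum)
    with torch.no_grad():
        smaller.weight.copy_(bn.weight[keep])
        smaller.bias.copy_(bn.bias[keep])
        smaller.running_mean.copy_(bn.running_mean[keep])
        smaller.running_var.copy_(bn.running_var[keep])
    return smaller

bn = nn.BatchNorm2d(64)
keep = torch.tensor([0, 3, 17, 42])   # channels that survived the prune
print(slice_batchnorm(bn, keep))      # BatchNorm2d(4, ...)
```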