One-line definition
Pruning removes weights from a trained network and fine-tunes to recover accuracy. Unstructured pruning zeros individual weights. Structured pruning removes entire neurons, channels, heads, or layers.
Why it matters
Modern networks are massively over-parameterized. The lottery ticket hypothesis (Frankle & Carbin, 2019) suggests that within a trained dense network there is a small subnetwork (often 10 to 20 percent of the weights) that reaches the same accuracy when retrained in isolation from its original initialization. Pruning is the practical exploitation of that observation.
Two distinct goals:
- Smaller model on disk and in memory: unstructured pruning + sparse storage. Useful for distribution and memory-bound deployment.
- Faster inference on real hardware: structured pruning. Removes whole tensors so the remaining computation is dense and matches GEMM kernels.
The two regimes
Unstructured pruning
Zero individual weights below some magnitude threshold. Common criterion: magnitude pruning (prune a weight w when |w| falls below a threshold), often combined with weight decay during fine-tuning. A code sketch follows the table below.
| Aspect | Unstructured pruning |
| --- | --- |
| Sparsity achievable | 90 to 95 percent on overparameterized models |
| Storage benefit | Real (CSR / CSC formats) |
| Speed benefit | None on standard GPUs |
| Why no speedup | General sparse matmul kernels are rarely faster than dense matmul until sparsity exceeds roughly 90 percent; the exceptions are hardware-supported patterns (NVIDIA 2:4 sparsity, custom accelerators) |
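A minimal sketch of global magnitude pruning, assuming PyTorch and its torch.nn.utils.prune utilities; the toy model and the 90 percent target are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Collect every weight tensor to be pruned jointly.
to_prune = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]

# Zero the 90% of weights with the smallest absolute value, scored globally.
prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=0.9)

# ... fine-tune here; the masks are re-applied on every forward pass ...

# Bake the masks into the weights and drop the reparameterization.
for module, name in to_prune:
    prune.remove(module, name)

sparsity = sum((m.weight == 0).sum().item() for m, _ in to_prune) / \
           sum(m.weight.numel() for m, _ in to_prune)
print(f"global sparsity: {sparsity:.1%}")
```

The result is still a dense tensor full of zeros; the storage benefit only appears once it is exported to a sparse format.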
Structured pruning
Remove entire structures: convolutional channels, transformer heads, FFN neurons, sometimes whole layers. A code sketch of neuron removal follows the table below.
| Aspect | Structured pruning |
| --- | --- |
| Sparsity achievable | 30 to 70 percent typical |
| Storage benefit | Real |
| Speed benefit | Real, proportional to sparsity |
| Why it works on hardware | Output is a smaller dense tensor; runs through standard GEMM |
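A minimal sketch of pruning FFN neurons from a two-layer MLP block; the layer names (fc1, fc2), the L2-norm importance score, and the 50 percent keep ratio are illustrative assumptions:

```python
import torch
import torch.nn as nn

def prune_ffn_neurons(fc1: nn.Linear, fc2: nn.Linear, keep_ratio: float = 0.5):
    """Return smaller dense layers that keep only the highest-scoring hidden neurons."""
    # Score each hidden neuron by the L2 norm of its incoming weights.
    scores = fc1.weight.norm(p=2, dim=1)                  # shape: [hidden]
    n_keep = max(1, int(keep_ratio * scores.numel()))
    keep = scores.topk(n_keep).indices.sort().values      # indices of kept neurons

    new_fc1 = nn.Linear(fc1.in_features, n_keep, bias=fc1.bias is not None)
    new_fc2 = nn.Linear(n_keep, fc2.out_features, bias=fc2.bias is not None)
    with torch.no_grad():
        new_fc1.weight.copy_(fc1.weight[keep])            # keep output rows of fc1
        if fc1.bias is not None:
            new_fc1.bias.copy_(fc1.bias[keep])
        new_fc2.weight.copy_(fc2.weight[:, keep])         # keep input columns of fc2
        if fc2.bias is not None:
            new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

fc1, fc2 = nn.Linear(512, 2048), nn.Linear(2048, 512)
fc1, fc2 = prune_ffn_neurons(fc1, fc2, keep_ratio=0.5)    # 2048 -> 1024 hidden units
```

The point of the rebuild is that the surviving computation is an ordinary smaller dense matmul, which is why the speedup is real on any hardware.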
The 2:4 semi-structured compromise
NVIDIA Ampere and later support 2:4 sparsity: in every group of 4 weights, exactly 2 are zero. Sparse Tensor Cores exploit this fixed pattern for up to a 2x matmul speedup. It is a compromise between unstructured pruning (more flexible, no speedup) and structured pruning (less flexible, full speedup).
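A minimal sketch of projecting a weight matrix onto the 2:4 pattern by magnitude; the helper name is illustrative, and realizing the speedup additionally requires the sparse tensor-core path (e.g. torch.sparse.to_sparse_semi_structured in recent PyTorch on supported GPUs):

```python
import torch

def apply_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in every contiguous group of 4."""
    out_f, in_f = weight.shape
    assert in_f % 4 == 0, "input dim must be divisible by 4"
    groups = weight.abs().reshape(out_f, in_f // 4, 4)
    # Indices of the 2 largest entries in each group of 4.
    top2 = groups.topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, top2, True)
    return weight * mask.reshape(out_f, in_f)

w = torch.randn(128, 256)
w_24 = apply_2_to_4(w)
assert (w_24.reshape(128, -1, 4) != 0).sum(-1).max() <= 2  # at most 2 nonzero per group
```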
Pipeline
- Train the dense model normally.
- Score weights or structures by importance (magnitude, gradient-magnitude, Hessian-based, Fisher).
- Remove the lowest-scoring weights or structures until the target sparsity is reached.
- Fine-tune to recover accuracy. Often iterative: prune-finetune-prune-finetune.
- Optional: lottery-ticket rewind. Reset the surviving weights to an early-training checkpoint and retrain with the sparse mask applied.
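A minimal sketch of the prune-finetune loop above, assuming PyTorch, a user-supplied finetune() routine, and a geometric sparsity schedule (all names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def iterative_prune(model, finetune, target_sparsity=0.9, steps=4, rewind_state=None):
    params = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]
    for step in range(1, steps + 1):
        # Cumulative geometric schedule: the final round lands on target_sparsity.
        amount = 1 - (1 - target_sparsity) ** (step / steps)
        prune.global_unstructured(params, pruning_method=prune.L1Unstructured,
                                  amount=amount)

        if rewind_state is not None:
            # Lottery-ticket rewind: copy early-training weights into the
            # reparameterized weight_orig tensors; the current mask stays in place.
            with torch.no_grad():
                for name, p in model.named_parameters():
                    key = name.replace("_orig", "")
                    if key in rewind_state:
                        p.copy_(rewind_state[key])

        finetune(model)   # recover accuracy with the mask applied
    for m, name in params:
        prune.remove(m, name)
    return model
```

Here rewind_state would be a state_dict snapshot saved early in dense training, e.g. a deep copy of model.state_dict() after a few epochs.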
Tradeoffs vs other compression
- Pruning vs quantization: orthogonal. Combine both (a sketch follows this list). INT8 + 50 percent structured sparsity is common in production.
- Pruning vs distillation: distillation trains a smaller model from scratch with a teacher’s soft targets. Pruning starts dense and shrinks. Distillation often produces better small models but needs the teacher’s training data.
- Pruning at training time (e.g. RigL, Sparse Transfer): grow-and-prune during training. Avoids the prune-finetune cycle but harder to tune.
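A minimal sketch of combining the two techniques, assuming PyTorch; for brevity the pruning step here is unstructured magnitude pruning, and post-training dynamic INT8 quantization is used as one simple (CPU-oriented) option:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

# Prune 50% of each Linear layer's weights by magnitude, then make it permanent.
for m in model.modules():
    if isinstance(m, nn.Linear):
        prune.l1_unstructured(m, name="weight", amount=0.5)
        prune.remove(m, "weight")

# ... fine-tune here to recover accuracy ...

# Quantize the remaining weights to INT8.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```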
Common pitfalls
- Reporting unstructured-sparsity speedups on standard GPUs. Almost always misleading. Sparse storage is not sparse compute.
- Pruning before training is finished. Pruning based on a mid-training snapshot, rather than the final trained model, often hurts.
- Treating “90 percent sparse” as a quality measure. What matters is task performance at the achieved compute or memory cost.
- Forgetting batchnorm / layernorm. Structured pruning must also slice the normalization parameters and running statistics to the kept channels (see the sketch below).
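A minimal sketch of that last point for a conv + batchnorm pair, assuming PyTorch; the L1 importance criterion and layer sizes are illustrative:

```python
import torch
import torch.nn as nn

def prune_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d, keep: torch.Tensor):
    """Rebuild conv + batchnorm keeping only the output channels listed in `keep`."""
    new_conv = nn.Conv2d(conv.in_channels, len(keep), conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    new_bn = nn.BatchNorm2d(len(keep))
    with torch.no_grad():
        new_conv.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias[keep])
        # Carry over the affine parameters *and* the running statistics.
        new_bn.weight.copy_(bn.weight[keep])
        new_bn.bias.copy_(bn.bias[keep])
        new_bn.running_mean.copy_(bn.running_mean[keep])
        new_bn.running_var.copy_(bn.running_var[keep])
    return new_conv, new_bn

conv, bn = nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128)
# Keep the 64 output channels with the largest L1 weight norm.
keep = conv.weight.abs().sum(dim=(1, 2, 3)).topk(64).indices.sort().values
conv, bn = prune_conv_bn(conv, bn, keep)
```

In a real network the next layer's input channels must be sliced with the same indices, otherwise shapes no longer match.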