Pruning: structured vs unstructured sparsity

Set unimportant weights to zero, recover most of the accuracy. Unstructured pruning shrinks model size; structured pruning shrinks inference time. They solve different problems.

Reviewed · 3 min read

One-line definition

Pruning removes weights from a trained network and fine-tunes to recover accuracy. Unstructured pruning zeros individual weights. Structured pruning removes entire neurons, channels, heads, or layers.

Why it matters

Modern networks are massively over-parameterized. The lottery ticket hypothesis (Frankle & Carbin, 2019) suggests that within a trained dense network there is a small subnetwork (often 10 to 20 percent of the weights) that, retrained from its original initialization, matches the full network's accuracy. Pruning is the practical exploitation of that observation.

Two distinct goals:

  • Smaller model on disk and in memory: unstructured pruning + sparse storage. Useful for distribution and memory-bound deployment.
  • Faster inference on real hardware: structured pruning. Removes whole tensors so the remaining computation is dense and matches GEMM kernels.

The two regimes

Unstructured pruning

Zero individual weights below some magnitude threshold. Common criterion: magnitude pruning (zero every weight with |w| below a threshold τ), often combined with weight decay during fine-tuning.

  • Sparsity achievable: 90 to 95 percent on over-parameterized models
  • Storage benefit: real (CSR / CSC sparse formats)
  • Speed benefit: none on standard GPUs
  • Why no speedup: sparse matmul kernels are rarely faster than dense matmul until past roughly 90 percent sparsity, and only with specialized hardware support (NVIDIA 2:4 sparsity, custom accelerators)
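
A minimal sketch of global magnitude pruning using PyTorch's built-in `torch.nn.utils.prune`; the toy model and the 90 percent amount are placeholder choices:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Global magnitude pruning: zero the 90 percent of weights with the
# smallest |w| across all listed tensors, not 90 percent per layer.
params = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]
prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=0.9)

# Masks are applied on the fly during forward; bake them in before export.
for module, name in params:
    prune.remove(module, name)

sparsity = (model[0].weight == 0).float().mean().item()
print(f"layer-0 sparsity: {sparsity:.1%}")  # ~90% globally; varies per layer
```

Note the zeros are still stored densely after this; the storage win only materializes once the tensor is converted to a sparse format (e.g. `Tensor.to_sparse_csr()`).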

Structured pruning

Remove entire structures: convolutional channels, transformer heads, FFN neurons, sometimes whole layers.

  • Sparsity achievable: 30 to 70 percent typical
  • Storage benefit: real
  • Speed benefit: real, proportional to sparsity
  • Why it works on hardware: the output is a smaller dense tensor that runs through standard GEMM kernels
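
To get that speedup, the pruned structures have to be physically removed from the tensors, not just masked. A minimal sketch for a linear layer, scoring output neurons by L2 norm (`prune_linear_out_features` is a hypothetical helper, not a library API):

```python
import torch
import torch.nn as nn

def prune_linear_out_features(layer: nn.Linear, keep_ratio: float) -> nn.Linear:
    """Drop the lowest-importance output neurons; return a smaller dense layer."""
    importance = layer.weight.norm(p=2, dim=1)        # L2 norm per output neuron
    n_keep = max(1, int(layer.out_features * keep_ratio))
    keep = importance.topk(n_keep).indices.sort().values

    smaller = nn.Linear(layer.in_features, n_keep, bias=layer.bias is not None)
    with torch.no_grad():
        smaller.weight.copy_(layer.weight[keep])
        if layer.bias is not None:
            smaller.bias.copy_(layer.bias[keep])
    return smaller

layer = nn.Linear(512, 512)
print(prune_linear_out_features(layer, keep_ratio=0.5))
# Linear(in_features=512, out_features=256, bias=True)
```

Whatever consumes this layer's output must have its input dimension sliced with the same `keep` indices, which is why structured pruning tooling has to trace the whole graph.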

The 2:4 semi-structured compromise

NVIDIA Ampere and later support 2:4 sparsity: in every group of 4 consecutive weights, exactly 2 are zero. Sparse Tensor Cores exploit this fixed pattern for up to a 2x speedup on the affected matmuls. It is a compromise between unstructured pruning (more flexible, no speedup) and structured pruning (less flexible, full speedup).
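
A sketch of imposing the 2:4 pattern by masking (keep the 2 largest magnitudes in each group of 4). Masking alone gives no speedup; that requires kernels that understand the format, e.g. cuSPARSELt-backed sparse matmuls:

```python
import torch

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude weights in every contiguous group of 4."""
    w = weight.reshape(-1, 4)
    topk = w.abs().topk(2, dim=1).indices              # 2 survivors per group
    mask = torch.zeros_like(w, dtype=torch.bool).scatter_(1, topk, True)
    return (w * mask).reshape(weight.shape)

w = torch.randn(4, 8)
print(apply_2_4_sparsity(w))  # exactly 2 nonzeros in every group of 4
```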

Pipeline

  1. Train the dense model normally.
  2. Score weights or structures by importance (magnitude, gradient-magnitude, Hessian-based, Fisher).
  3. Prune below a target sparsity.
  4. Fine-tune to recover accuracy. Often iterative: prune-finetune-prune-finetune (see the sketch after this list).
  5. Optional: lottery-ticket rewind. Reset weights to an early-training checkpoint, train the sparse mask from there.
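
The prune-finetune loop of steps 2 to 4 might look like this PyTorch sketch. Repeated calls to `l1_unstructured` compose: each round prunes 50 percent of the still-unpruned weights, so sparsity ramps 50, 75, 87.5 percent. The toy model and random data stand in for the real task:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
linears = [m for m in model.modules() if isinstance(m, nn.Linear)]

for _round in range(3):
    # Steps 2-3: score by magnitude, prune 50% of the remaining weights.
    for m in linears:
        prune.l1_unstructured(m, name="weight", amount=0.5)

    # Step 4: fine-tune to recover accuracy (placeholder data and loss).
    for _ in range(100):
        x, y = torch.randn(16, 32), torch.randint(0, 2, (16,))
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Bake the final masks into the weight tensors.
for m in linears:
    prune.remove(m, "weight")
```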

Tradeoffs vs other compression

  • Pruning vs quantization: orthogonal. Combine both. INT8 + 50 percent structured sparsity is common in production.
  • Pruning vs distillation: distillation trains a smaller model from scratch with a teacher’s soft targets. Pruning starts dense and shrinks. Distillation often produces better small models but needs the teacher’s training data.
  • Pruning at training time (e.g. RigL, Sparse Transfer): prune and regrow connections during training. Avoids the separate prune-finetune cycle but is harder to tune.

Common pitfalls

  • Reporting unstructured-sparsity speedups on standard GPUs. Almost always misleading. Sparse storage is not sparse compute.
  • Pruning before training is done. Importance scores taken mid-training reflect the optimization trajectory, not the final model, and pruning on them often hurts.
  • Treating “90 percent sparse” as a quality measure. What matters is task performance at the achieved compute or memory cost.
  • Forgetting batchnorm / layernorm. Structured pruning must also slice the normalization parameters and running statistics to match the kept channels (see the sketch below).
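
A sketch of that normalization fix for batchnorm: rebuild the layer with per-channel parameters and running statistics sliced to the kept channels (`slice_batchnorm` and the `keep` indices are illustrative):

```python
import torch
import torch.nn as nn

def slice_batchnorm(bn: nn.BatchNorm2d, keep: torch.Tensor) -> nn.BatchNorm2d:
    """Rebuild a BatchNorm2d so its per-channel state matches the kept channels."""
    smaller = nn.BatchNorm2d(len(keep), eps=bn.eps, momentum=bn.momentum)
    with torch.no_grad():
        smaller.weight.copy_(bn.weight[keep])
        smaller.bias.copy_(bn.bias[keep])
        smaller.running_mean.copy_(bn.running_mean[keep])
        smaller.running_var.copy_(bn.running_var[keep])
    return smaller

bn = nn.BatchNorm2d(64)
keep = torch.tensor([0, 3, 17, 42])   # channels that survived the prune
print(slice_batchnorm(bn, keep))      # BatchNorm2d(4, ...)
```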