One-line definition
Pipeline parallelism (PP) splits a model along its depth: GPU 0 holds layers 1–8, GPU 1 holds layers 9–16, etc. A mini-batch is divided into smaller micro-batches that flow through the stages so that GPU 0 starts processing micro-batch 2 while GPU 1 processes micro-batch 1, achieving parallel utilization despite the sequential layer dependency.
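A minimal sketch of that data flow, with hypothetical layer sizes and stage modules (`stage0`, `stage1` are my placeholders, not any framework's API). Real frameworks place the stages on different GPUs and overlap them in time; this only shows how a mini-batch becomes micro-batches and how only activations cross the stage boundary:

```python
import torch
import torch.nn as nn

# Hypothetical two-stage split of a 16-layer MLP-ish model.
stage0 = nn.Sequential(*[nn.Linear(512, 512) for _ in range(8)])  # "layers 1-8"  (GPU 0)
stage1 = nn.Sequential(*[nn.Linear(512, 512) for _ in range(8)])  # "layers 9-16" (GPU 1)

mini_batch = torch.randn(32, 512)
micro_batches = torch.chunk(mini_batch, chunks=4, dim=0)  # 4 micro-batches of 8

outputs = []
for mb in micro_batches:
    act = stage0(mb)       # runs on GPU 0 in a real setup
    out = stage1(act)      # only this activation tensor crosses the stage boundary
    outputs.append(out)
result = torch.cat(outputs, dim=0)
```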
Why it matters
For large models that don’t fit on a single GPU, TP is the natural choice within a node (where NVLink is fast). But TP doesn’t extend well across nodes: its heavy all-reduce traffic kills throughput on slower inter-node links. PP scales across nodes via much smaller cross-stage messages (just the activations between consecutive stages, not weights), enabling multi-node scaling of frontier models.
The basic idea
Without micro-batches, naive pipeline:
GPU 0: forward layer 1-8 ──────────── → backward layer 1-8 ────
GPU 1: ─────────── forward 9-16 ──── → backward 9-16 ───────
GPU 2: ───────────────── forward 17-24 → backward 17-24 ─
Most GPUs sit idle most of the time; this idle time is the pipeline bubble.
With micro-batches, the bubble shrinks:
GPU 0: f1 f2 f3 f4 ───────────────────── b4 b3 b2 b1
GPU 1: ─── f1 f2 f3 f4 ───────── b4 b3 b2 b1 ───────
GPU 2: ─────── f1 f2 f3 f4 b4 b3 b2 b1 ─────────────
Bubble fraction ≈ (p − 1) / (m + p − 1), where p is the number of pipeline stages and m the number of micro-batches per step.
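A small helper to plug numbers into that formula (function name is mine, purely illustrative):

```python
def bubble_fraction(p: int, m: int) -> float:
    """Idle fraction of an ideal GPipe/1F1B-style schedule:
    p pipeline stages, m micro-batches per step."""
    return (p - 1) / (m + p - 1)

print(bubble_fraction(p=4, m=4))    # ~0.43 -- bubble dominates
print(bubble_fraction(p=4, m=32))   # ~0.09 -- many micro-batches shrink it
```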
GPipe vs. 1F1B vs. interleaved
- GPipe (Huang et al., 2018): all forwards, then all backwards. Bubble fraction (p − 1) / (m + p − 1); activation memory high (must hold activations for all m micro-batches until the backwards start).
- 1F1B (one forward, one backward; PipeDream-style): start backwards as soon as the first micro-batch reaches the last stage. The bubble stays roughly the same as GPipe, but activation memory drops to at most p micro-batches in flight instead of m.
- Interleaved 1F1B (Megatron): each GPU holds non-contiguous chunks of layers (e.g., layers 1-2 and 9-10) so the bubble shrinks further.
- Zero Bubble Pipeline (recent): split backward into weight-grad and input-grad parts to fill almost all bubbles.
Modern frontier training uses interleaved 1F1B or Zero Bubble.
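To make the ordering concrete, here is a sketch of the per-stage operation sequences implied by the two simpler schedules (no interleaving, function names are mine; "F"/"B" are forward/backward of a micro-batch index):

```python
def gpipe_schedule(m):
    """GPipe: all m forwards, then all m backwards (reverse order)."""
    return [("F", i) for i in range(m)] + [("B", i) for i in reversed(range(m))]

def one_f_one_b_schedule(stage, p, m):
    """Synchronous 1F1B for 0-indexed `stage` of p stages, m micro-batches."""
    warmup = min(p - stage - 1, m)           # forwards before the first backward
    ops = [("F", i) for i in range(warmup)]
    f, b = warmup, 0
    for _ in range(m - warmup):              # steady state: alternate 1F / 1B
        ops.append(("F", f)); f += 1
        ops.append(("B", b)); b += 1
    ops += [("B", i) for i in range(b, m)]   # cooldown: drain remaining backwards
    return ops

# Stage 0 of a 4-stage pipeline with 8 micro-batches:
# F0 F1 F2, then F3/B0 F4/B1 ... F7/B4, then B5 B6 B7.
print(one_f_one_b_schedule(stage=0, p=4, m=8))
```

Note how the earliest stage keeps the most (p − 1 = 3) forwards outstanding before its first backward; that is exactly the activation-memory story in the cost model below.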
Cost model
For a model whose layers are split across p stages, trained with m micro-batches per step:
- Bubble: (p − 1) / (m + p − 1) of total time. Minimize by increasing m.
- Communication per micro-batch: send the activations of one micro-batch between adjacent stages. Cost ≈ activation size ≈ micro-batch × seq × hidden elements, far smaller than a full-weight all-reduce.
- Activation memory per stage: in 1F1B, at most p micro-batches’ worth of activations (the first stage holds the most, later stages fewer).
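A back-of-the-envelope version of these two costs, assuming bf16 activations and counting only one hidden-state tensor per layer (no attention intermediates); all names and sizes here are illustrative assumptions:

```python
def pp_message_mb(micro_batch, seq_len, hidden, bytes_per_elem=2):
    """Activation tensor sent between adjacent stages per micro-batch, in MiB."""
    return micro_batch * seq_len * hidden * bytes_per_elem / 2**20

def one_f_one_b_activation_gb(stage, p, micro_batch, seq_len, hidden,
                              layers_per_stage, bytes_per_elem=2):
    """Rough 1F1B peak: stage s keeps about (p - s) micro-batches in flight."""
    per_mb = micro_batch * seq_len * hidden * layers_per_stage * bytes_per_elem
    return (p - stage) * per_mb / 2**30

print(pp_message_mb(micro_batch=1, seq_len=8192, hidden=8192))          # 128 MiB per hop
print(one_f_one_b_activation_gb(stage=0, p=16, micro_batch=1,
                                seq_len=8192, hidden=8192,
                                layers_per_stage=8))                     # ~16 GiB at stage 0
```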
When PP wins
- Cross-node scaling with slow interconnect.
- Very deep models where one stage easily fits on a node.
- Frontier training combining 3D parallelism (DP + TP + PP).
When PP loses
- Small models that fit on one node. TP within node + DP across nodes is simpler.
- Few micro-batches in a step. Bubble dominates.
- Workloads with very different per-layer compute. Load imbalance creates idle GPUs.
3D parallelism
The standard frontier training stack:
- Tensor parallel within a node (4–8 GPUs, NVLink).
- Pipeline parallel across small groups of nodes.
- Data parallel / FSDP across remaining nodes (sharded for memory).
For a 405B-parameter model on 1024 GPUs: TP=8, PP=16, DP=8 is a typical configuration.
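Under that assumed TP=8 × PP=16 × DP=8 layout (TP varying fastest so a TP group stays inside one NVLink-connected node), a sketch of how a global rank maps to its parallelism coordinates; the mapping convention is an assumption, not any specific framework's:

```python
TP, PP, DP = 8, 16, 8   # 8 * 16 * 8 = 1024 GPUs

def coords(rank, tp=TP, pp=PP, dp=DP):
    """Map a global rank to (dp_rank, pp_rank, tp_rank)."""
    assert 0 <= rank < tp * pp * dp
    tp_rank = rank % tp                 # within-node tensor-parallel group
    pp_rank = (rank // tp) % pp         # pipeline stage
    dp_rank = rank // (tp * pp)         # data-parallel replica
    return dp_rank, pp_rank, tp_rank

print(coords(0))      # (0, 0, 0)
print(coords(1023))   # (7, 15, 7)
```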
Common pitfalls
- Few micro-batches → big bubble. Use m ≫ p (e.g., m ≥ 4p) as a rule of thumb.
- Imbalanced stage compute. The first stage (embedding) and the last stage (LM head + loss) can be much heavier or lighter than the middle transformer blocks. Manual stage partitioning helps.
- Forgetting that activation memory grows with pipeline depth p in 1F1B (the first stage keeps ~p micro-batches in flight). Combine with activation checkpointing.
- Treating PP as the same as TP. Different sharding axes, different communication patterns. PP is bandwidth-light; TP is bandwidth-heavy.
- Skipping interleaving on >4 stages. Sequential 1F1B has a noticeable bubble at large p; interleaving cuts it.
Related
- Tensor parallelism. Orthogonal sharding within nodes.
- FSDP and ZeRO. Orthogonal sharding for memory.
- Activation checkpointing. Reduce per-stage activation memory.