Pipeline parallelism

Split the model across GPUs by layer and pipeline micro-batches through the stages. The way to scale across slow interconnects when TP isn't viable.

Reviewed · 3 min read

One-line definition

Pipeline parallelism (PP) splits a model along its depth: GPU 0 holds layers 1–8, GPU 1 holds layers 9–16, etc. A mini-batch is divided into smaller micro-batches that flow through the stages so that GPU 0 starts processing micro-batch 2 while GPU 1 processes micro-batch 1, achieving parallel utilization despite the sequential layer dependency.
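
A minimal sketch of the mechanics in plain Python (the two-stage toy model and shapes are illustrative, not any particular library's API):

import torch

def split_into_microbatches(batch: torch.Tensor, m: int):
    # e.g. a (32, seq, hidden) mini-batch with m=4 -> four (8, seq, hidden) micro-batches
    return torch.chunk(batch, m, dim=0)

# Each stage owns a contiguous slice of layers; in a real system each stage
# lives on its own GPU and receives activations from the previous stage.
stage0 = torch.nn.Sequential(*[torch.nn.Linear(64, 64) for _ in range(8)])  # "layers 1-8"
stage1 = torch.nn.Sequential(*[torch.nn.Linear(64, 64) for _ in range(8)])  # "layers 9-16"

for mb in split_into_microbatches(torch.randn(32, 16, 64), m=4):
    # while stage1 works on this micro-batch, stage0 could already start the next one
    out = stage1(stage0(mb))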

Why it matters

For large models that don’t fit on a single GPU, TP is the natural choice within a node (where NVLink is fast). But TP doesn’t extend across nodes: its frequent all-reduces kill throughput over slower inter-node links. PP scales across nodes via much smaller cross-stage messages (just the activations between consecutive stages, not weights), enabling multi-node scaling of frontier models.

The basic idea

Without micro-batches, naive pipeline:

GPU 0: forward layer 1-8 ────────────  →  backward layer 1-8 ──── 
GPU 1: ───────────  forward 9-16 ────  →  backward 9-16 ───────
GPU 2: ─────────────────  forward 17-24 → backward 17-24 ─

Most GPUs are idle most of the time; this idle time is the pipeline bubble.

With micro-batches, the bubble shrinks:

GPU 0: f1  f2  f3  f4 ─────────────────────  b4  b3  b2  b1
GPU 1: ─── f1  f2  f3  f4 ─────────  b4  b3  b2  b1 ───────
GPU 2: ─────── f1  f2  f3  f4  b4  b3  b2  b1 ─────────────

Bubble fraction ≈ (p − 1) / (m + p − 1), where p is the number of pipeline stages and m the number of micro-batches.
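
Plugging in illustrative numbers:

p, m = 4, 8                                  # 4 stages, 8 micro-batches
bubble = (p - 1) / (m + p - 1)               # 3/11 ≈ 0.27: ~27% of step time is idle
bubble_doubled = (p - 1) / (2 * m + p - 1)   # 3/19 ≈ 0.16 after doubling micro-batches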

GPipe vs. 1F1B vs. interleaved

  • GPipe (Huang et al., 2018): all forwards, then all backwards. Bubble fraction high; activation memory high (every stage stores activations for all m micro-batches until its backward pass begins).
  • 1F1B (one forward, one backward; PipeDream-style): start a micro-batch's backward as soon as it clears the last stage, then alternate. Same bubble as GPipe in the synchronous form, but activation memory drops to at most p micro-batches per stage.
  • Interleaved 1F1B (Megatron): each GPU holds non-contiguous chunks of layers (e.g., layers 1-2 and 9-10), so the bubble shrinks further at the cost of more communication.
  • Zero Bubble Pipeline (recent): split the backward pass into weight-gradient and input-gradient parts and schedule them to fill almost all remaining bubbles.

Modern frontier training uses interleaved 1F1B or Zero Bubble.
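
To make the schedules concrete, here is a sketch of the op order one stage executes under non-interleaved 1F1B (scheduling only; communication, timing, and interleaving omitted):

def one_f1b_schedule(stage: int, p: int, m: int) -> list[str]:
    # Warmup forwards, steady-state one-forward-one-backward, cooldown backwards.
    warmup = min(p - stage - 1, m)   # later stages warm up less; the last stage not at all
    ops = ['F'] * warmup
    for _ in range(m - warmup):
        ops += ['F', 'B']            # steady state
    ops += ['B'] * warmup
    return ops

# p=4, m=8:
# stage 0: F F F F B F B F B F B F B B B B   (holds up to p in-flight micro-batches)
# stage 3: F B F B F B F B F B F B F B F B   (no warmup needed)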

Cost model

For a model with L layers split across p stages and m micro-batches:

  • Bubble: (p − 1) / (m + p − 1) of total time. Minimize by increasing m.
  • Communication per micro-batch: send the activations of one micro-batch between adjacent stages. Cost scales with activation size ≈ micro-batch × seq × hidden, much smaller than a full weight all-reduce.
  • Activation memory per stage: in 1F1B, at most p micro-batches’ worth of activations.
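
A back-of-envelope estimator for these three quantities (a sketch assuming bf16 activations; real systems also overlap communication with compute):

def pp_costs(p: int, m: int, micro_batch: int, seq: int, hidden: int, bytes_per_elem: int = 2):
    bubble_fraction = (p - 1) / (m + p - 1)
    # activations handed between adjacent stages, per micro-batch
    comm_bytes = micro_batch * seq * hidden * bytes_per_elem
    # 1F1B: the first stage keeps up to p micro-batches of activations alive
    peak_activation_copies = p
    return bubble_fraction, comm_bytes, peak_activation_copies

# e.g. p=8, m=32, micro_batch=1, seq=8192, hidden=16384:
# bubble ≈ 0.18, comm = 256 MiB per micro-batch per stage boundary
print(pp_costs(8, 32, 1, 8192, 16384))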

When PP wins

  • Cross-node scaling with slow interconnect.
  • Very deep models where one stage easily fits on a node.
  • Frontier training combining 3D parallelism (DP + TP + PP).

When PP loses

  • Small models that fit on one node. TP within node + DP across nodes is simpler.
  • Few micro-batches in a step. Bubble dominates.
  • Workloads with very different per-layer compute. Load imbalance creates idle GPUs.

3D parallelism

The standard frontier training stack:

  • Tensor parallel within a node (4–8 GPUs, NVLink).
  • Pipeline parallel across small groups of nodes.
  • Data parallel / FSDP across remaining nodes (sharded for memory).

For a 405B-parameter model on 1024 GPUs: TP=8, PP=16, DP=8 is a typical configuration.
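
The three degrees must factor the GPU count; a quick check of the example configuration (a node size of 8 GPUs is assumed here):

world_size = 1024
tp, pp, dp = 8, 16, 8              # the configuration above
assert tp * pp * dp == world_size  # 8 * 16 * 8 = 1024
# With 8 GPUs per node, each pipeline stage is one TP group on one node,
# and 16 stages x 8 data-parallel replicas span the 128 nodes.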

Common pitfalls

  • Few micro-batches → big bubble. Keep m several times larger than p (m ≥ 4p is a common rule of thumb); a checker sketch follows this list.
  • Imbalanced stage compute. The first stage (embedding) and the last stage (LM head) may be much heavier or lighter than the middle stages. Manual partitioning helps.
  • Forgetting that activation memory grows with p in 1F1B. Combine with activation checkpointing.
  • Treating PP as the same as TP. Different sharding axes, different communication patterns: PP is bandwidth-light, TP is bandwidth-heavy.
  • Skipping interleaving on more than ~4 stages. Non-interleaved 1F1B has a noticeable bubble at large p; interleaving cuts it.
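
These pitfalls compress into a couple of mechanical checks. A minimal sketch (the thresholds are the rules of thumb above, not hard limits):

def pp_config_warnings(p: int, m: int, interleaved: bool) -> list[str]:
    # Flag configurations that the pitfalls above warn about.
    warnings = []
    if m < 4 * p:
        warnings.append(f"bubble risk: m={m} < 4*p={4 * p}; add micro-batches")
    if p > 4 and not interleaved:
        warnings.append("deep pipeline without interleaving; consider interleaved 1F1B")
    return warnings

print(pp_config_warnings(p=16, m=32, interleaved=False))  # both warnings fire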