One-line definition
Sequence packing concatenates multiple training examples back-to-back into a single fixed-length sequence and uses a block-diagonal attention mask so each example only attends within itself, eliminating the FLOPs and memory wasted on padding tokens.
Why it matters
Most NLP corpora have heavily skewed length distributions: many short examples, few long ones. With naive padding to the longest example in the batch, the wasted-token ratio is

$$\text{waste} = 1 - \frac{\sum_{i=1}^{B} \ell_i}{B \cdot \max_i \ell_i},$$

where $B$ is the batch size and $\ell_i$ is the token length of example $i$.
For C4-like web text this is often 50–80%. Padded positions cost full FLOPs and memory but contribute nothing to the loss. Sequence packing recovers nearly all of that throughput.
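A quick numeric check of that ratio (a minimal sketch; the example lengths below are invented for illustration):

```python
# Wasted-token ratio under naive padding: pad every example in the
# batch to the longest one, then compare real tokens to padded tokens.
lengths = [512, 37, 128, 2048, 64, 91, 300, 1500]  # made-up tokens per example

padded_tokens = len(lengths) * max(lengths)  # every row padded to the max
real_tokens = sum(lengths)
waste = 1 - real_tokens / padded_tokens
print(f"padded: {padded_tokens}, real: {real_tokens}, waste: {waste:.0%}")
# -> padded: 16384, real: 4680, waste: 71% — inside the 50–80% range above
```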
The mechanism
- Pick a fixed packed length $L$ (e.g., 8192).
- Concatenate examples until adding the next one would exceed $L$, and record the boundaries as cumulative sequence lengths, often called `cu_seqlens` (sketched below).
- Build a block-diagonal attention mask: a token in example $a$ cannot attend to any token in example $b \neq a$.
- Compute attention with a kernel that respects `cu_seqlens` (FlashAttention-2 supports this natively via its `varlen` API).
- Apply the loss only on response tokens within each example (mask the boundaries and, for SFT, any prompt tokens).
- Reset position IDs at each example boundary so position 0 is the start of each packed example.
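A minimal packing sketch in PyTorch, under some assumptions: examples are already tokenized, one packed sequence is built at a time, and the `pack_examples` name is hypothetical (a real packer fills many packed sequences and carries overflow forward rather than stopping):

```python
import torch

def pack_examples(examples: list[list[int]], L: int = 8192, pad_id: int = 0):
    """Greedily concatenate tokenized examples into one packed sequence.

    Hypothetical helper for illustration. Returns input_ids, cu_seqlens
    (cumulative example boundaries), and position_ids that reset to 0 at
    each boundary.
    """
    input_ids: list[int] = []
    position_ids: list[int] = []
    cu_seqlens = [0]
    for ex in examples:
        if len(input_ids) + len(ex) > L:
            break  # a real packer starts the next packed sequence here
        input_ids.extend(ex)
        position_ids.extend(range(len(ex)))  # positions restart per example
        cu_seqlens.append(len(input_ids))    # cumulative boundary
    slack = L - len(input_ids)               # end-of-sequence slack (typically small)
    input_ids += [pad_id] * slack
    position_ids += [0] * slack
    return (torch.tensor(input_ids),
            torch.tensor(cu_seqlens, dtype=torch.int32),
            torch.tensor(position_ids))

# Toy usage: three "tokenized" examples packed into L=16.
ids, cu, pos = pack_examples([[5, 6, 7], [8, 9], [10, 11, 12, 13]], L=16)
print(cu)   # tensor([0, 3, 5, 9], dtype=torch.int32)
print(pos)  # tensor([0, 1, 2, 0, 1, 0, 1, 2, 3, 0, ..., 0])
```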
Numbers
For pretraining on web text packed at $L = 8192$:
- Wasted tokens drop from ~50% (naive batching) to <2% (just the slack at the end of the packed sequence).
- Throughput per GPU roughly doubles.
- Quality is unchanged when the mask is correct.
Most modern training stacks pack by default. Llama, Mistral, Qwen, and major SFT toolkits (axolotl, TRL) all support it.
Common implementation pitfalls
- Forgetting to reset positions. If position IDs run on across boundaries, the model sees spurious long-range positional structure between unrelated examples and quality drops.
- Wrong loss masking. Loss must not flow from one example to another; mask boundaries explicitly.
- Mixing prompt and response in SFT without masking. For SFT, mask out prompt tokens from the loss within each packed example.
- Using a kernel that doesn’t support varlen. Without FlashAttention-2 varlen (or equivalent), the block-diagonal mask forces you to materialize the full $L \times L$ attention matrix and you lose the speedup (see the sketch after this list).
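For concreteness, a sketch of the FlashAttention-2 varlen call using the boundaries from the packer above; here the slack padding is stripped before the call so only real tokens are attended. Shapes follow what `flash_attn` expects: `(total_tokens, n_heads, head_dim)`, fp16/bf16, on GPU:

```python
import torch
from flash_attn import flash_attn_varlen_func  # pip install flash-attn

# 9 real tokens from the packer above (end-of-sequence slack already dropped).
total_tokens, n_heads, head_dim = 9, 8, 64
q = torch.randn(total_tokens, n_heads, head_dim,
                dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)
cu_seqlens = torch.tensor([0, 3, 5, 9], dtype=torch.int32, device="cuda")
max_seqlen = 4  # longest example in this packed sequence

# causal=True gives causal attention *within* each example; the cu_seqlens
# boundaries prevent any cross-example attention, so the block-diagonal
# mask is enforced by the kernel without materializing an L x L matrix.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
    causal=True,
)
print(out.shape)  # torch.Size([9, 8, 64])
```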
When not to pack
- Very long single examples that fill or exceed $L$ on their own (no concatenation is possible, and padding is already minimal).
- When examples have inter-document context that should attend across boundaries (rare).
- During inference, where examples come one at a time.
Related
- FlashAttention. The underlying kernel that makes varlen attention efficient.
- Gradient accumulation. Orthogonal way to grow effective batch size.