Walk me through how you'd train a 100B parameter model

The question is about parallelism and memory, not about modeling. The L6 answer combines data, tensor, pipeline, and FSDP/ZeRO sharding into a coherent strategy.

Reviewed · 4 min read

Asked in: senior LLM-team and infra-leaning ML interview loops.

The question tests systems thinking. The L4 answer says “use distributed training.” The L6 answer names the four parallelism axes, explains what each one shards, and describes the right combination for the model size and hardware.

What an L4 answer sounds like

“I’d use multiple GPUs and split the data across them. Maybe use Horovod or PyTorch DDP.”

This works for a 1B model, not for a 100B model. Pure data parallelism requires the full model to fit on each GPU; a 100B model in BF16 is 200 GB of weights alone, more than any single GPU can hold. You’ve trained at scale only in the textbook sense.
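
For contrast, here is roughly what that answer looks like in code: plain PyTorch DDP, where every rank holds a full replica of the model. A minimal sketch, assuming a hypothetical MyModel and a torchrun launch; it is exactly this full-replica assumption that breaks at 100B.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Plain data parallelism: every rank materializes the full model.
    # At 100B parameters (200 GB in BF16) this fails before the first step.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = MyModel().cuda()                      # hypothetical module
    model = DDP(model, device_ids=[local_rank])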

What an L5 answer sounds like

“100B parameters in BF16 is 200 GB of weights, plus gradients (200 GB), plus Adam optimizer state in FP32 (800 GB), plus activations. None of this fits on a single GPU. The training stack has to combine four axes of parallelism:

  1. Data parallelism (DP): replicate the model across GPUs, split the batch. Each GPU gets a different mini-batch slice. Gradients are all-reduced.
  2. Tensor parallelism (TP): split individual matrix multiplications across GPUs. Useful for the large MLPs and attention layers. Requires fast interconnect (NVLink, not just PCIe).
  3. Pipeline parallelism (PP): split the model by layer across GPUs. Each stage processes micro-batches in pipelined fashion. Has bubble overhead.
  4. Sharded data parallelism (FSDP / ZeRO Stage 3): shard model weights, gradients, and optimizer state across data-parallel ranks. Recovers memory while keeping the data-parallel programming model.

A typical recipe for 100B: TP within a node (for the big matmuls), FSDP across nodes (for memory savings), maybe PP if you have many nodes. Plus mixed precision (BF16), activation checkpointing, and gradient accumulation to hit the effective batch size.”

This is L5. You’ve named the parallelism axes, sized the memory problem, and given a defensible recipe.
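
The memory budget in that answer is easy to reproduce. A back-of-the-envelope sketch using decimal gigabytes, ignoring activations and FP32 master weights:

    params = 100e9                        # 100B parameters
    weights_bf16 = params * 2             # 200 GB
    grads_bf16   = params * 2             # 200 GB
    adam_mv_fp32 = params * 4 * 2         # 800 GB (two FP32 moments, m and v)
    total_gb = (weights_bf16 + grads_bf16 + adam_mv_fp32) / 1e9
    print(total_gb)                       # ~1200 GB of state before any activations
    # On 80 GB GPUs that is roughly 15 devices' worth of memory for state alone,
    # which is why the state has to be sharded rather than replicated.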

What an L6 answer sounds like

“…a few more practical considerations:

Hardware topology drives the strategy. TP requires very fast intra-node interconnect (NVLink, 900 GB/s per GPU on H100). PP and FSDP can tolerate slower inter-node bandwidth (InfiniBand, ~100 GB/s). The decomposition follows the bandwidth hierarchy: TP within a node, FSDP/PP across nodes.
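
One way to express that bandwidth hierarchy directly is a 2D device mesh: the TP dimension inside a node, the data-parallel/FSDP dimension across nodes. A sketch using PyTorch’s init_device_mesh (available in recent releases); the 16x8 cluster shape and dimension names are hypothetical:

    from torch.distributed.device_mesh import init_device_mesh

    # Hypothetical cluster: 16 nodes x 8 GPUs per node (128 ranks total).
    # "tp" stays inside a node (NVLink); "dp" spans nodes (InfiniBand).
    mesh = init_device_mesh("cuda", mesh_shape=(16, 8), mesh_dim_names=("dp", "tp"))
    dp_mesh = mesh["dp"]   # hand this to FSDP / sharded data parallelism
    tp_mesh = mesh["tp"]   # hand this to the tensor-parallel layers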

Optimizer state dominates memory. Adam’s m and v in FP32 are 4x the size of the BF16 weights. ZeRO Stage 1 (shard optimizer state) is the largest single memory win. Stage 2 adds gradient sharding, Stage 3 adds parameter sharding. FSDP is the PyTorch-native version of ZeRO Stage 3.
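
In PyTorch terms, the ZeRO stages map onto FSDP’s sharding strategies. A minimal wrap, assuming the model and process group are already initialized; the flags shown are common defaults, not a tuned configuration:

    import torch
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp import ShardingStrategy, MixedPrecision

    bf16 = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    model = FSDP(
        model,                                          # assumed already built
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # params + grads + optimizer state (ZeRO-3)
        # ShardingStrategy.SHARD_GRAD_OP shards grads + optimizer state only (ZeRO-2)
        mixed_precision=bf16,
        device_id=torch.cuda.current_device(),
        use_orig_params=True,
    )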

Activation memory at long context dominates everything. For long-sequence training, activations exceed weights. Mitigations: activation checkpointing (recompute during backward), sequence parallelism (split activations along the sequence axis across TP ranks), context parallelism (Ring Attention, etc.).
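
Activation checkpointing itself is a one-line change per block. A sketch with torch.utils.checkpoint, assuming a hypothetical stack of transformer blocks:

    from torch.utils.checkpoint import checkpoint

    def forward_with_checkpointing(blocks, x):
        # Activations inside each block are dropped in the forward pass and
        # recomputed during backward: extra compute in exchange for much less memory.
        for block in blocks:            # `blocks` is a hypothetical nn.ModuleList
            x = checkpoint(block, x, use_reentrant=False)
        return x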

Throughput tuning matters more than the choice of axes. Overlapping communication with compute, careful kernel scheduling, and the choice of micro-batch size for PP often deliver 30-50% throughput swings on the same hardware. Tools: the PyTorch profiler, NVIDIA Nsight, Megatron’s tuning playbook.
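
The starting point for that tuning is a profile of a few training steps. A minimal sketch with the PyTorch profiler; train_step and loader are hypothetical stand-ins for the real loop:

    import torch
    from torch.profiler import profile, schedule, ProfilerActivity

    prof_schedule = schedule(wait=1, warmup=2, active=5)
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=prof_schedule,
        on_trace_ready=torch.profiler.tensorboard_trace_handler("./profiler_traces"),
    ) as prof:
        for step, batch in enumerate(loader):
            train_step(batch)           # hypothetical training step
            prof.step()
            if step >= 8:
                break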

Stability is the real challenge at this scale. Loss spikes are common; the standard mitigations are gradient clipping, careful warmup, sometimes embedding norm or QK norm. Numerical issues that don’t appear at 1B can be fatal at 100B.”
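
The standard stability mitigations are small in code. A sketch of gradient clipping plus linear warmup; the hyperparameters shown are common defaults, not a prescription, and under FSDP you would call model.clip_grad_norm_ instead:

    import torch

    # hypothetical model, batch, and loss function
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                                  betas=(0.9, 0.95), weight_decay=0.1)
    warmup_steps = 2000
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

    loss = compute_loss(model, batch)   # hypothetical forward + loss
    loss.backward()
    # Clip the global gradient norm before the optimizer step; 1.0 is a common default.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()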

Tells that get you a strong-hire vote

  • You compute the memory budget explicitly (weights, gradients, optimizer state, activations).
  • You name all four parallelism axes and what each shards.
  • You map them to hardware topology (TP within node, FSDP/PP across nodes).
  • You mention activation memory as a separate concern at long context.
  • You discuss stability (loss spikes, gradient clipping) as a first-order concern at scale.

Tells that get you down-leveled

  • “Just use DDP” (won’t fit).
  • Confusing TP, PP, FSDP.
  • No memory budget calculation.
  • Suggesting CPU offloading as the primary strategy (works, but slow; usually a last resort).
  • No mention of activation checkpointing.

Common follow-up

“What’s the difference between FSDP and tensor parallelism?”

The L6 answer:

“TP shards an individual matrix multiplication across GPUs. The matrix is logically one operation; physically, each GPU computes part. Communication happens during the operation (all-reduce after each TP layer).

FSDP shards the weights across data-parallel ranks. Each GPU stores a partition; before a layer’s forward pass, the full weight is gathered from the partitions. Communication happens between layers (all-gather before each layer, reduce-scatter for gradients).

They’re complementary: TP gives you per-op parallelism with high bandwidth requirements; FSDP gives you per-layer memory savings with lower bandwidth requirements. Real systems combine both.”
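
The “communication during the operation” point is easiest to see in a Megatron-style row-parallel linear layer: each TP rank holds a slice of the weight, and the partial outputs are summed with an all-reduce inside the layer. An illustrative sketch, not any particular library’s implementation; names and shapes are hypothetical:

    import torch
    import torch.distributed as dist

    def row_parallel_linear(x_shard, w_shard, tp_group):
        # x_shard: [batch, in_features // tp_size]
        # w_shard: [out_features, in_features // tp_size]
        partial = x_shard @ w_shard.t()      # each rank's partial [batch, out_features]
        # Summing over the sharded input dimension is the in-layer communication.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM, group=tp_group)
        return partial

FSDP’s communication, by contrast, sits between layers: an all-gather of a layer’s parameters before its forward and backward passes, and a reduce-scatter of its gradients afterwards.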


Related: Mixed precision training, Adam, AdamW, and the modern optimizer landscape, Transformer architecture.