One-line definition
A Mixture-of-Experts (MoE) layer replaces a single dense feed-forward network with E parallel “expert” FFNs and a router that sends each token to the top-k experts (typically k = 1 or k = 2). Total parameters scale with E; per-token compute scales with k.
Why it matters
The defining tradeoff: a k-of-E MoE has roughly the same per-token FLOPs as a dense model with k/E of the parameters, but the capacity of all E experts. Mixtral 8×7B (Jiang et al., 2023) has 47B total parameters and uses ~13B per token (top-2 of 8 experts), giving quality close to a 70B dense model at roughly 5× lower inference compute.
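The 47B-total / 13B-active split can be sanity-checked with back-of-envelope arithmetic. This sketch assumes a simplified model where all non-expert parameters (attention, embeddings, norms) are shared and every expert is the same size; the exact Mixtral breakdown differs slightly.

```python
# Rough model: active = shared + (k/E) * (total - shared)
total, active, k, E = 47.0, 13.0, 2, 8   # Mixtral 8x7B headline figures, in billions

# Solve for the shared (non-expert) parameter count implied by these numbers.
shared = (active - (k / E) * total) / (1 - k / E)
print(round(shared, 2))  # ≈ 1.67B shared params under this simplified model
```

The point of the exercise: almost all of the parameter count sits in experts, so the active/total ratio is close to (but not exactly) k/E.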
MoE is the dominant strategy for scaling parameter count beyond what dense training and inference can afford. GPT-4 (reportedly), Mixtral, DeepSeek-V3, Grok-1, and many other 2024–2026 frontier models are MoE.
The mechanism
For each transformer block, replace the single FFN with:
- Router: a small linear layer W_r ∈ ℝ^(d×E). For each token x ∈ ℝ^d, compute logits g = x·W_r ∈ ℝ^E, take the top-k experts, and softmax-normalize over those k logits.
- Experts: E independent FFN blocks f_1, …, f_E, each the same shape as the dense FFN it replaces.
- Combine: the output is y = Σ_{i ∈ top-k} w_i · f_i(x), where w_i is the router weight for expert i.
Attention layers are typically not MoE (shared across all tokens).
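The router/experts/combine steps above fit in a few lines. A minimal single-token sketch with numpy (tiny dimensions chosen for illustration, ReLU FFNs standing in for whatever activation a real model uses):

```python
import numpy as np

rng = np.random.default_rng(0)
d, E, k, d_ff = 16, 4, 2, 32            # model dim, experts, top-k, FFN hidden dim

W_r = rng.normal(size=(d, E)) * 0.02    # router: one linear layer, d -> E
# Each expert is an independent 2-layer FFN, same shape as the dense FFN it replaces.
experts = [(rng.normal(size=(d, d_ff)) * 0.02,
            rng.normal(size=(d_ff, d)) * 0.02) for _ in range(E)]

def expert_ffn(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2          # ReLU FFN

def moe_forward(x):
    """x: (d,) one token. Route to top-k experts, combine with router weights."""
    logits = x @ W_r                              # (E,) router logits
    top = np.argsort(logits)[-k:]                 # indices of the top-k experts
    z = np.exp(logits[top] - logits[top].max())
    w = z / z.sum()                               # softmax over only the kept k logits
    return sum(w_i * expert_ffn(x, *experts[i]) for w_i, i in zip(w, top))

x = rng.normal(size=(d,))
y = moe_forward(x)
print(y.shape)  # (16,)
```

Note that only k of the E expert matmuls run per token; that is the entire source of the FLOPs savings.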
Load balancing
The router will collapse to a few favorite experts unless penalized. Standard fix: an auxiliary load-balancing loss that penalizes uneven expert usage within a batch. Alternatives include expert-choice routing (each expert picks its top tokens, Zhou et al., 2022) and noise injection.
If experts go unused for many steps, their parameters drift; a few production systems “reset” dead experts.
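One common form of the auxiliary loss (the Switch Transformer variant; other formulations exist) multiplies, per expert, the fraction of tokens assigned to it by its mean router probability, and scales by E so the minimum is 1 at perfectly uniform usage:

```python
import numpy as np

rng = np.random.default_rng(1)
T, E = 256, 8                            # tokens in the batch, number of experts

logits = rng.normal(size=(T, E))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)        # full softmax over all E experts

assign = probs.argmax(axis=1)                    # top-1 assignment, Switch-style
f = np.bincount(assign, minlength=E) / T         # fraction of tokens per expert
P = probs.mean(axis=0)                           # mean router probability per expert

aux_loss = E * np.sum(f * P)                     # = 1 when usage is perfectly uniform
print(round(float(aux_loss), 3))
```

Because f is non-differentiable, the gradient flows only through P; pushing P toward uniform drags the hard assignments with it.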
Capacity and expert parallelism
With E experts, an intuitive serving setup is expert parallelism: each GPU holds one expert (or a few). Tokens are routed between GPUs via all-to-all communication. This works but introduces:
- Communication overhead: all-to-all is bandwidth-bound and stalls when the routing is imbalanced.
- Capacity factor: each expert has a maximum number of tokens it can process per batch, typically capacity = cf · T · k / E for T tokens; overflow tokens are dropped (or sent to a fallback path). A capacity factor of 1.25 is common.
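A small sketch of the capacity computation and of how skewed routing produces dropped tokens; the routing distribution below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
T, E, k, cf = 1024, 8, 2, 1.25           # tokens, experts, top-k, capacity factor

capacity = int(np.ceil(cf * T * k / E))  # max tokens each expert may process
print(capacity)                          # 320

# Skewed routing: a few "favorite" experts attract most of the traffic.
p = np.array([0.25, 0.20, 0.15, 0.10, 0.10, 0.10, 0.05, 0.05])
assign = rng.choice(E, size=(T, k), p=p)           # k expert picks per token (sketch)
counts = np.bincount(assign.ravel(), minlength=E)  # tokens landing on each expert
dropped = int(np.maximum(counts - capacity, 0).sum())
print(dropped > 0)                                 # True: overflow under this skew
```

Under uniform routing each expert would see T·k/E = 256 tokens, comfortably under capacity; the skew is what causes drops, which is exactly why the load-balancing loss matters for serving, not just training.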
Tradeoffs vs. dense
- Memory: MoE needs all E experts’ FFN parameters in HBM even though only k are used per token. Inference VRAM is dominated by loading all experts, not just the active ones.
- Throughput: MoE wins per FLOP, but throughput per byte of VRAM is worse than dense.
- Quality at fixed FLOPs: MoE generally beats dense at matched per-token FLOPs.
- Fine-tuning: MoE is harder to fine-tune cleanly; routing can drift, and small datasets exacerbate load imbalance.
Common pitfalls
- Quoting “total parameters” as if they were active. A 47B MoE with 13B active is a 13B-FLOPs model with a 47B-parameter VRAM cost.
- Ignoring the routing loss. Without it, training collapses to using a few experts.
- Assuming MoE always wins. At small scale or with limited compute for routing experimentation, dense is simpler and competitive.