Mixture of Experts (MoE)

Replace one large feed-forward block with N smaller experts and a router that activates only k of them per token. Decouples total parameter count from per-token compute.

Reviewed · 2 min read

One-line definition

A Mixture-of-Experts layer replaces a single dense feed-forward network with N parallel “expert” FFNs and a router that sends each token to the top-k experts (typically k = 1 or k = 2). Total parameters scale with N; per-token compute scales with k.

Why it matters

The defining tradeoff: a k-of-N MoE has roughly the same per-token FLOPs as a dense model with k/N of the parameters, but the capacity of all N experts. Mixtral 8×7B (Jiang et al., 2023) has 47B total parameters and uses ~13B per token: quality close to a 70B dense model at ~5× lower inference compute.
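
As a rough back-of-the-envelope (a Python sketch using the Mixtral figures above; the 70B dense comparison point is the one quoted in the Mixtral report):

    # Rough arithmetic for the k-of-N tradeoff, using the Mixtral 8x7B figures
    # quoted above: ~47B total parameters, ~13B active per token, compared to a
    # 70B dense baseline.
    total_params   = 47e9   # every expert must be resident in memory
    active_params  = 13e9   # parameters touched per token (2 of 8 experts + shared layers)
    dense_baseline = 70e9   # dense model of comparable quality

    print(f"active / total params: {active_params / total_params:.0%}")          # ~28%
    print(f"per-token compute:     {dense_baseline / active_params:.1f}x less")  # ~5.4x vs 70B dense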

MoE is the dominant strategy for scaling parameter count beyond what dense training and inference can afford. GPT-4 (reportedly), Mixtral, DeepSeek-V3, Grok-1, and many other recent frontier models are MoE.

The mechanism

For each transformer block, replace the single FFN with:

  1. Router: a small linear layer that maps each token's hidden state to N logits, one per expert. For each token x, compute the logits, take the top-k experts, and softmax-normalize over those k logits only.
  2. Experts: N independent FFN blocks E_1 … E_N, each the same shape as the dense FFN it replaces.
  3. Combine: the output is the weighted sum Σ p_i · E_i(x) over the k selected experts, where p_i is the normalized router weight for expert i.

Attention layers are typically not MoE (shared across all tokens).
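
A minimal sketch of such a layer in PyTorch, assuming top-2 routing; the class name and the sizes d_model=512, d_ff=2048, n_experts=8 are illustrative, not taken from any particular model:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            # 1. Router: one linear layer producing a logit per expert.
            self.router = nn.Linear(d_model, n_experts, bias=False)
            # 2. Experts: N independent FFNs, same shape as the dense FFN they replace.
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                          # x: (tokens, d_model)
            logits = self.router(x)                    # (tokens, n_experts)
            topk_logits, topk_idx = logits.topk(self.top_k, dim=-1)
            weights = F.softmax(topk_logits, dim=-1)   # normalize over the chosen k only
            out = torch.zeros_like(x)
            # 3. Combine: weighted sum of each token's k selected experts.
            # (Simple but slow loop; real implementations batch tokens per expert.)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = topk_idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    # Usage: MoELayer()(torch.randn(16, 512)) returns a (16, 512) tensor.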

Load balancing

The router will collapse to a few favorite experts unless penalized. Standard fix: an auxiliary load-balancing loss that penalizes uneven expert usage within a batch. Alternatives include expert-choice routing (each expert picks its top tokens, Zhou et al., 2022) and noise injection.
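
One common formulation is the Switch Transformer auxiliary loss: multiply the fraction of tokens routed to each expert by the mean router probability for that expert and sum over experts. A sketch, assuming PyTorch and top-1 routing (the function name is illustrative):

    import torch
    import torch.nn.functional as F

    def load_balancing_loss(router_logits, expert_idx, n_experts):
        # router_logits: (tokens, n_experts) raw router outputs for a batch
        # expert_idx:    (tokens,) expert chosen for each token (top-1 here;
        #                top-k variants count all k assignments per token)
        probs = F.softmax(router_logits, dim=-1)
        # f_i: fraction of tokens actually routed to expert i
        f = torch.bincount(expert_idx, minlength=n_experts).float() / expert_idx.numel()
        # P_i: mean router probability mass assigned to expert i
        P = probs.mean(dim=0)
        # Minimized when both are uniform; scaled by N so the balanced value is 1.
        return n_experts * torch.sum(f * P)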

If experts go unused for many steps, their parameters drift; a few production systems “reset” dead experts.

Capacity and expert parallelism

With N experts, an intuitive serving setup is expert parallelism: each GPU holds one expert. Tokens are routed between GPUs via all-to-all communication. This works but introduces:

  • Communication overhead: all-to-all is bandwidth-bound and stalls when the routing is imbalanced.
  • Capacity factor: each expert has a maximum number of tokens it can process per batch; overflow tokens are dropped (or sent to a fallback path). A capacity factor of 1.25 is common (see the sketch after this list).
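
A sketch of the capacity calculation and overflow handling, assuming PyTorch (function names are illustrative, and real systems use fused routing kernels rather than a Python loop):

    import torch

    def expert_capacity(tokens_per_batch, n_experts, capacity_factor=1.25):
        # Maximum tokens each expert may process per batch; the rest overflow.
        return int(capacity_factor * tokens_per_batch / n_experts)

    def within_capacity(expert_idx, n_experts, capacity):
        # Returns a mask of tokens that fit; overflow tokens would be dropped
        # or sent to a fallback path.
        keep = torch.zeros_like(expert_idx, dtype=torch.bool)
        for e in range(n_experts):
            positions = (expert_idx == e).nonzero(as_tuple=True)[0]
            keep[positions[:capacity]] = True
        return keep

    # Example: 4096 tokens, 8 experts, capacity factor 1.25 -> 640 slots per expert.
    print(expert_capacity(4096, 8))   # 640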

Tradeoffs vs. dense

  • Memory: MoE needs all N experts' FFN parameters in HBM even though only k are used per token. Inference VRAM is dominated by loading every expert, not just the active ones.
  • Throughput: MoE wins per FLOP, but throughput per byte of VRAM is worse than dense.
  • Quality at fixed FLOPs: MoE generally beats dense at matched per-token FLOPs.
  • Fine-tuning: MoE is harder to fine-tune cleanly; routing can drift, and small datasets exacerbate load imbalance.

Common pitfalls

  • Quoting “total parameters” as if they were active. A 47B MoE with 13B active has the per-token FLOPs of a 13B model but the VRAM footprint of a 47B model.
  • Ignoring the routing loss. Without it, training collapses to using a few experts.
  • Assuming MoE always wins. At small scale or with limited compute for routing experimentation, dense is simpler and competitive.