How to build a 671B-parameter model that only touches 37B parameters per token — decouple total capacity from per-token compute, derive the gating math, tame the routing chaos, and understand how DeepSeek-V3, Mixtral, and Llama 4 actually work.
Imagine you are running a hospital. You have three choices: hire one extremely knowledgeable generalist who sees every patient, hire three specialists and route each patient to exactly one of them, or hire ten specialists and still route each patient to just one. The third hospital has ten times the total expertise, but each patient still only sees one doctor. The cost per patient doesn't change — only the total talent pool does.
That is the Mixture of Experts (MoE) idea applied to transformers. A standard dense model grows both parameter count and FLOPs together — doubling parameters doubles both capacity and compute per token. MoE breaks that coupling. You can have 256 FFN networks ("experts") inside each transformer layer, but each token is only routed to 8 of them. Total parameters scale with N experts; per-token compute stays fixed at k experts.
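To make the decoupling concrete, here is a back-of-the-envelope sketch. It assumes a plain two-matrix FFN without biases, and the dimensions are illustrative values rather than those of any model discussed here; the helper names are hypothetical.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in one two-matrix FFN (up-projection + down-projection, no biases)."""
    return 2 * d_model * d_ff


def moe_layer_params(d_model: int, d_ff: int, n_experts: int, top_k: int) -> tuple[int, int]:
    """Return (total, active-per-token) expert weights for one MoE layer."""
    per_expert = ffn_params(d_model, d_ff)
    return n_experts * per_expert, top_k * per_expert


# Illustrative dimensions only (not taken from a real model)
d_model, d_ff = 1024, 4096
for n_experts in (8, 64, 256):
    total, active = moe_layer_params(d_model, d_ff, n_experts, top_k=8)
    print(f"N={n_experts:3d}  total={total / 1e6:8.1f}M  active={active / 1e6:6.1f}M")
```

The total grows linearly with the number of experts, while the active count depends only on `top_k`.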
The slide that launched a thousand MoE papers: Fedus et al. 2022 showed that for the same training FLOPs budget, a sparse MoE model consistently beat a dense model with the same compute. Same cost, better performance. That's not magic — it's specialization. Different experts learn to handle different kinds of tokens, allowing the network to maintain specialists without paying their full compute cost on every input.
Why is this suddenly everywhere? Four forces aligned around 2022–2025. First, empirical scaling: same FLOPs-per-token, MoE trains to lower loss. Second, speed: OlMoE showed MoEs reach the same quality milestone faster in wall-clock time than dense equivalents. Third, results: Mixtral 8×7B beats LLaMA 2 70B while matching a 13B-parameter dense compute budget. Fourth, hardware: as clusters grow to thousands of GPUs, the per-expert compute maps naturally onto expert parallelism — put each expert on a different device, route tokens across the network. The infrastructure was finally ready.
Before MoE, every transformer layer looked the same: a self-attention block followed by a feed-forward network (FFN). The FFN is a two-matrix computation: project up from $d_{\text{model}}$ to $d_{\text{ff}}$ (the "expansion"), apply an activation, project back down. Every token goes through this entire FFN at every layer. That's the dense baseline.
The MoE transformation is surgical: replace the FFN with multiple FFNs plus a routing network. The "experts" are ordinary FFNs — the same architecture, just N copies of them. The router is a small linear layer that takes the token's hidden state and outputs a probability distribution over experts. You pick the top-k experts according to those probabilities, run only those k FFNs, and combine their outputs as a weighted sum.
The key constraint: experts are only in the FFN layers, not the attention layers. Attention operates across positions: each token's output depends on the keys and values of the other tokens in the sequence. An FFN, by contrast, is a position-wise operation: the same function applied independently to each token. That independence is what makes expert routing tractable. You can send different tokens to different experts without any cross-token coordination.
MoE for attention heads also exists (ModuleFormer, JetMoE) but is far less common. The math of routing across attention heads gets messy and the gains don't clearly justify the complexity. The community consensus landed on: attention stays dense, FFN goes sparse.
MoE layers are not always placed at every transformer layer. Common patterns: alternate MoE and dense layers (GShard, Switch Transformer), use MoE in every layer (Mixtral), or keep the first few layers dense and use MoE everywhere else (DeepSeek-V3 keeps a dense FFN in its first three layers). The usual reason for keeping some layers dense is stability: the layers closest to the input tend to develop less specialized representations and benefit less from routing.
```python
# Minimal MoE FFN forward: a sketch showing the shape flow
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFFN(nn.Module):
    def __init__(self, d_model, d_ff, n_experts, top_k):
        super().__init__()
        self.n_experts = n_experts
        self.top_k = top_k
        # N separate FFNs, each is (d_model → d_ff → d_model)
        self.w1 = nn.Parameter(torch.randn(n_experts, d_model, d_ff))
        self.w2 = nn.Parameter(torch.randn(n_experts, d_ff, d_model))
        # Router: maps d_model → N logits
        self.router = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x):
        # x: [B, T, d_model]
        B, T, d = x.shape
        x_flat = x.view(B * T, d)            # [B*T, d_model]
        logits = self.router(x_flat)         # [B*T, N]
        probs = F.softmax(logits, dim=-1)    # [B*T, N]
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        # topk_probs: [B*T, k], topk_idx: [B*T, k]
        out = torch.zeros_like(x_flat)
        for i in range(self.top_k):
            ei = topk_idx[:, i]              # [B*T], which expert for slot i
            gi = topk_probs[:, i:i + 1]      # [B*T, 1], gate weight
            # Gather the expert weights for each token
            w1_i = self.w1[ei]               # [B*T, d_model, d_ff]
            w2_i = self.w2[ei]               # [B*T, d_ff, d_model]
            h = F.gelu(torch.bmm(x_flat.unsqueeze(1), w1_i).squeeze(1))
            e_out = torch.bmm(h.unsqueeze(1), w2_i).squeeze(1)
            out += gi * e_out
        return out.view(B, T, d)
```
The router is the decision-maker: given a token's hidden state, it assigns that token to exactly k of the N available experts. Getting this right is the entire game. The basic approach — used in most models including Switch Transformer, Mixtral, and DeepSeek — is embarrassingly simple: a linear layer + softmax.
Let $x \in \mathbb{R}^{d}$ be the token's hidden state (a vector with $d$ entries). The router has a weight matrix $W_r \in \mathbb{R}^{d \times N}$. Compute logits $l = x W_r$ (a vector of $N$ scores, one per expert). Apply softmax to get probabilities over all $N$ experts:

$$p_i = \frac{\exp(l_i)}{\sum_{j=1}^{N} \exp(l_j)}$$
Now pick the top-k probabilities. Two common variants differ in when the softmax is applied. The original Switch Transformer and most models compute softmax over all N experts first, then take the top k. Mixtral flips the order: take the top-k logits first, then softmax over just those k winners. DeepSeek-V3 follows the same select-then-normalize order, although it scores experts with a sigmoid rather than a softmax before normalizing over the selected k. The difference matters: if you softmax over all N and pick top k, the gate weights for the selected experts still reflect comparisons against the other N-k losers. If you normalize only over the k winners, the weights sum to 1 over just those k and are more interpretable as mixing coefficients.
The router itself is tiny — just one matrix of size d × N, no activation. For DeepSeek-V3 with d=7168 and N=256 experts, that's 7168 × 256 ≈ 1.8M parameters. Compared to the 671B total, the router is invisible. But its decisions determine everything.
Alternative routing strategies exist. Hash routing deterministically maps tokens to experts based on a hash of the token ID (no learned router, zero overhead, but no learned specialization either). Expert-choice routing (Zhou et al. 2022) flips the perspective: instead of tokens choosing experts, each expert chooses its top-C tokens from the batch. This guarantees perfect load balance by construction, but breaks causality (an expert needs to see all tokens before picking its favorites, which doesn't work autoregressively). RL routing (Bengio 2013, Clark 2020) treats routing as a policy to optimize, which is the "right" approach mathematically, but gradient variance makes it impractical at scale. Token-choice top-k won the practical competition.
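A minimal sketch of the expert-choice idea may help contrast it with token-choice routing. It assumes the softmax is taken over experts for each token and that each expert then keeps its `capacity` highest-scoring tokens; the function name and sizes are hypothetical.

```python
import torch
import torch.nn.functional as F


def expert_choice_route(logits: torch.Tensor, capacity: int):
    """Each expert picks its top-`capacity` tokens.

    logits: [T, N] router scores (tokens x experts).
    Returns (gates, token_idx), both of shape [N, capacity].
    """
    probs = F.softmax(logits, dim=-1)                      # [T, N], softmax over experts
    # Transpose so each row is one expert's view of all tokens, then top-k over tokens
    gates, token_idx = probs.t().topk(capacity, dim=-1)    # [N, capacity]
    return gates, token_idx


torch.manual_seed(0)
T, N, capacity = 16, 4, 4
gates, token_idx = expert_choice_route(torch.randn(T, N), capacity)

# Every expert processes exactly `capacity` tokens, so load is balanced by construction.
# A given token, however, may be picked by several experts or by none.
picks_per_token = torch.bincount(token_idx.flatten(), minlength=T)
```

Because each expert ranks all tokens in the batch, the selection for one token depends on tokens that come after it, which is the causality problem described above.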
```python
# Top-k router: two variants
import torch.nn.functional as F


def route_softmax_then_topk(x, W_r, k):
    # Variant A (Switch style): softmax first, then select top-k
    logits = x @ W_r                      # [T, N]
    probs = F.softmax(logits, dim=-1)     # [T, N], sums to 1 over all N
    topk_g, topk_idx = probs.topk(k, dim=-1)
    return topk_g, topk_idx               # gates don't sum to 1 over k


def route_topk_then_softmax(x, W_r, k):
    # Variant B (Mixtral/DeepSeek-V3): topk first, softmax over winners
    logits = x @ W_r                      # [T, N]
    topk_l, topk_idx = logits.topk(k, dim=-1)
    topk_g = F.softmax(topk_l, dim=-1)    # [T, k], sums to 1 over k winners
    return topk_g, topk_idx
```
Here is a disaster that will absolutely happen without intervention: collapse to one expert. The router is a learned function. If one expert happens to produce slightly better outputs early in training, the router learns to send more tokens to it. With more tokens, that expert gets more gradient updates. With more updates, it becomes even better. The other experts atrophy from disuse. Within a few thousand steps, one expert receives 99% of the tokens and the rest are ghost experts that nobody calls.
This is not a theoretical risk — it happened in early MoE papers. Shazeer et al. 2017 called it expert collapse. The model degenerates into an expensive dense model: one expert is computing for almost every token, so the whole sparsity advantage evaporates. You're paying for N experts but only training one.
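The solution everyone uses is a load-balancing auxiliary loss: a differentiable penalty added to the training objective that discourages routing imbalance. The Switch Transformer (Fedus et al. 2022) introduced the canonical form. Define two quantities per expert $i$, computed over a batch of $T$ tokens:

$$f_i = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\!\left[\arg\max_j \, p_j(x_t) = i\right], \qquad P_i = \frac{1}{T} \sum_{t=1}^{T} p_i(x_t)$$

Here $f_i$ is the fraction of tokens whose top choice is expert $i$, and $P_i$ is the average router probability assigned to expert $i$. The auxiliary loss is their scaled dot product:

$$L_{\text{balance}} = \alpha \, N \sum_{i=1}^{N} f_i \, P_i$$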
Two points about this formula. First, $f_i$ is computed from an argmax: it is a hard count and not differentiable, whereas $P_i$ is a soft probability and fully differentiable. Second, the product $f_i \cdot P_i$ therefore still yields useful gradients: $\partial L / \partial P_i = \alpha N f_i$. Experts that receive more tokens (larger $f_i$) get a stronger downward push on their router probabilities, discouraging the router from continuing to send them tokens. Experts receiving few tokens get a weaker push, allowing the router to send more their way.
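The gradient claim can be checked numerically. The sketch below assumes the Switch-style loss α · N · Σ f_i · P_i, with f_i the hard argmax fraction and P_i the mean router probability, and treats P as a free variable so that autograd reports ∂L/∂P_i directly. Batch size and expert count are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, N, alpha = 64, 4, 1e-2

logits = torch.randn(T, N)
probs = F.softmax(logits, dim=-1)

# f_i: hard fraction of tokens whose argmax is expert i (carries no gradient)
f = torch.bincount(probs.argmax(dim=-1), minlength=N).float() / T

# Treat P as a leaf variable so dL/dP_i can be read off directly
P = probs.mean(dim=0).detach().requires_grad_(True)

loss = alpha * N * (f * P).sum()
loss.backward()

# Closed form from the text: dL/dP_i = alpha * N * f_i
expected = alpha * N * f
```

`P.grad` matches `expected`, so the most heavily used expert receives the largest push on its average probability.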
The hyperparameter α controls how strongly to enforce balance. Too small: collapse happens anyway. Too large: the balance loss overwhelms the main language modeling loss and the experts can't specialize — they all learn the same average function. In practice, α is set to values between 0.001 and 0.01 and requires tuning per model size and training duration.
DeepSeek-V3 introduced a clever alternative called auxiliary-loss-free balancing: instead of adding a penalty to the loss, they maintain a per-expert bias term $b_i$ that is adjusted online after each training step. Experts that receive too many tokens have their bias decreased (making them less likely to be selected); underloaded experts have it increased. The bias affects which experts are selected but not the gate weights used in the weighted sum. This avoids the sensitivity to the α hyperparameter and doesn't pollute the main loss signal.
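The following is a toy sketch of bias-based balancing, not DeepSeek's implementation. It assumes sigmoid scores, a fixed step size `gamma`, and a sign-based update that nudges each bias against its expert's load error; the skewed synthetic router and all names are hypothetical.

```python
import torch


def biased_topk(scores, bias, k):
    """Select experts using bias-adjusted scores; gate with the unbiased scores."""
    _, idx = (scores + bias).topk(k, dim=-1)         # bias only influences selection
    gates = scores.gather(-1, idx)                   # gate values ignore the bias
    gates = gates / gates.sum(dim=-1, keepdim=True)  # normalize over selected experts
    return gates, idx


def update_bias(bias, idx, n_experts, gamma=0.01):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    load = torch.bincount(idx.flatten(), minlength=n_experts).float()
    return bias - gamma * torch.sign(load - load.mean())


torch.manual_seed(0)
T, N, k = 2048, 8, 2
skew = torch.linspace(0.0, 2.0, N)   # synthetic router that favours later experts


def max_load_fraction(bias):
    """Share of all token-to-expert assignments taken by the busiest expert."""
    scores = torch.sigmoid(torch.randn(T, N) + skew)
    _, idx = biased_topk(scores, bias, k)
    load = torch.bincount(idx.flatten(), minlength=N).float()
    return (load.max() / load.sum()).item()


bias = torch.zeros(N)
before = max_load_fraction(bias)
for _ in range(200):                 # one bias update per simulated training step
    scores = torch.sigmoid(torch.randn(T, N) + skew)
    _, idx = biased_topk(scores, bias, k)
    bias = update_bias(bias, idx, N)
after = max_load_fraction(bias)
```

After the updates, the busiest expert's share moves toward the uniform value 1/N, and no term was added to the loss to get there.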
Even with a load-balancing loss, routing is imperfect. In any given batch, some experts will receive slightly more tokens than the average. This creates a hardware problem: in real distributed training, each expert sits on a different device. If expert 3 needs to process 500 tokens but expert 7 only gets 50, every device must wait for expert 3 to finish before the batch can continue. The slowest expert determines the batch time.
The solution is expert capacity: set a hard maximum on how many tokens any expert can receive per batch. If more tokens are routed to an expert than its capacity allows, the overflow tokens are dropped — they pass through without being processed by any expert (their residual stream contribution for that layer is just zero, or sometimes a copy of their input).
The capacity is expressed as a multiple of the "fair share." If you have $T$ tokens in the batch and $N$ experts, the average load is $T/N$ tokens per expert. The capacity factor $C$ sets the maximum as:

$$\text{capacity} = C \cdot \frac{T}{N}$$
C=1.0 means each expert can handle exactly its fair share — any routing imbalance drops tokens. C=1.25 gives 25% headroom above the average — minor imbalances are absorbed without dropping. C=2.0 is generous but wastes memory (you must pre-allocate buffer space for all possible tokens). C<1.0 is possible if you deliberately want to prune computation.
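A small sketch of capacity-based dropping for top-1 routing follows. Rounding the capacity up and dropping tokens in arrival order are assumptions of this example; real implementations differ in both details. The function name is hypothetical.

```python
import math

import torch
import torch.nn.functional as F


def apply_capacity(expert_idx: torch.Tensor, n_experts: int, capacity_factor: float):
    """Top-1 routing with a hard per-expert cap. Returns (keep_mask, capacity)."""
    T = expert_idx.numel()
    capacity = math.ceil(capacity_factor * T / n_experts)
    one_hot = F.one_hot(expert_idx, n_experts)                  # [T, N]
    # 1-based position of each token in its chosen expert's queue (arrival order)
    position = (one_hot.cumsum(dim=0) * one_hot).sum(dim=-1)    # [T]
    keep = position <= capacity
    return keep, capacity


expert_idx = torch.tensor([0, 0, 0, 1, 0, 2, 0, 3])   # expert 0 is oversubscribed
for C in (1.0, 1.25, 2.0):
    keep, cap = apply_capacity(expert_idx, n_experts=4, capacity_factor=C)
    print(f"C={C}: capacity={cap}, dropped={int((~keep).sum())}")
```

With 8 tokens and 4 experts the fair share is 2 tokens per expert, so raising `C` directly reduces how many of expert 0's five tokens are dropped.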
An interesting and often-overlooked consequence of token dropping: MoE models can be non-deterministic at inference time. This has been offered as an explanation for non-deterministic outputs by people who assume GPT-4 uses MoE. Token dropping is decided at the batch level: whether your token exceeds capacity depends on which other tokens arrived in the same batch. So the same prompt can produce different results from one request to the next if the other sequences batched with it differ, because those tokens compete for the same expert slots.
Fine-tuning MoE models introduces its own capacity challenge. Fine-tuning typically uses small batches (often batch size 1 or 4). With tiny batches and 256 experts, most experts receive zero tokens per step and get no gradient signal. Zoph et al. 2022 solved this for ST-MoE by fine-tuning only the dense (non-MoE) parts of the network. DeepSeek's solution was to use massive fine-tuning data (1.4M SFT examples) so even with 256 experts, each gets enough exposure.
The original MoE recipe is clean: N experts of equal size, pick top k, done. But as the field iterated, two refinements emerged that are now standard in the best-performing models: fine-grained experts and shared experts.
Let B be the number of weights in a single "standard-size" FFN. You could use 8 experts, each with B parameters (the same size as a dense FFN), and route to top-2, for 2B active expert parameters per token. Or you could split each of those 8 experts into 4 smaller ones (32 experts, each with B/4 parameters) and route to top-8, which gives the same 8 × B/4 = 2B active parameters per token. The second configuration has the same per-token FLOPs but allows much finer expert specialization.
Why does this help? With more, smaller experts, the routing becomes more fine-grained. Instead of one expert handling "all syntax-related tokens" broadly, you might have four experts that each handle different syntactic phenomena. The total parameter budget is the same, but the granularity of specialization increases. The fine-grained ratio is the factor by which you split the experts (e.g., ratio 1/4 means each "slot" that would be one big expert is split into 4 smaller ones).
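One way to quantify "finer-grained" is to count how many distinct expert subsets a token can be routed to. The sketch below compares the two configurations from the example above; measuring granularity by the number of combinations is an illustration added here, and sizes are expressed in units of B.

```python
from math import comb


def config(n_experts: int, top_k: int, expert_size: float):
    """Return (active expert weights per token in units of B, number of expert subsets)."""
    active = top_k * expert_size
    combos = comb(n_experts, top_k)
    return active, combos


coarse = config(n_experts=8, top_k=2, expert_size=1.0)     # experts of size B
fine = config(n_experts=32, top_k=8, expert_size=0.25)     # experts of size B/4

print("coarse:", coarse)
print("fine:  ", fine)
```

Both configurations activate 2B weights per token, but the fine-grained one offers several orders of magnitude more possible expert combinations.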
Some information is so universal that every token needs it — basic syntax, common word meanings, punctuation handling. Routing these tokens to specialized experts wastes expert slots that could be used for more domain-specific processing. The shared expert (or "always-on" expert) design addresses this: designate S experts that receive every token regardless of routing decisions, plus the standard top-k routed experts.
DeepSeek V1 and V2 used S=2 shared experts. DeepSeek V3 uses S=1. The shared experts handle universal knowledge; the routed experts handle specialization. Total active parameters per token = shared expert params + top-k routed expert params.
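The shared-plus-routed structure can be sketched as follows. This is a simplified illustration rather than any model's actual layer: the experts are plain GELU FFNs, the gates are a softmax over the top-k logits, and the class name and sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_ffn(d_model, d_ff):
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))


class SharedExpertMoE(nn.Module):
    def __init__(self, d_model, d_ff, n_routed, n_shared, top_k):
        super().__init__()
        self.top_k = top_k
        self.shared = nn.ModuleList([make_ffn(d_model, d_ff) for _ in range(n_shared)])
        self.routed = nn.ModuleList([make_ffn(d_model, d_ff) for _ in range(n_routed)])
        self.router = nn.Linear(d_model, n_routed, bias=False)

    def forward(self, x):                                   # x: [T, d_model]
        # Shared experts see every token and are not gated
        out = torch.zeros_like(x)
        for expert in self.shared:
            out = out + expert(x)
        # Routed experts: top-k per token, softmax over the winners
        topk_logits, topk_idx = self.router(x).topk(self.top_k, dim=-1)
        gates = F.softmax(topk_logits, dim=-1)              # [T, k]
        for e, expert in enumerate(self.routed):
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() > 0:
                weighted = gates[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
                out = out.index_add(0, token_ids, weighted)
        return out


torch.manual_seed(0)
layer = SharedExpertMoE(d_model=16, d_ff=32, n_routed=8, n_shared=1, top_k=2)
x = torch.randn(10, 16)
y = layer(x)                                                # [10, 16]
```

Each token's output is the ungated shared-expert output plus a gate-weighted sum of its top-k routed experts, which matches the active-parameter accounting above.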
| Model | Routed experts | Active (top-k) | Shared | Fine-grained ratio |
|---|---|---|---|---|
| Switch Transformer | 64 | 1 | 0 | 1 |
| Mixtral 8×7B | 8 | 2 | 0 | 1 |
| DBRX | 16 | 4 | 0 | 1 |
| Grok | 8 | 2 | 0 | 1 |
| DeepSeek V1 | 64 | 6 | 2 | 1/4 |
| DeepSeek V2 | 160 | 6 | 2 | 1/10 |
| DeepSeek V3 | 256 | 8 | 1 | 1/14 |
| OlMoE | 64 | 8 | 0 | 1/8 |
| Llama 4 Maverick | 128 | 1 | 1 | 1/2 |
MoE training is notoriously unstable. Even with load-balancing losses and capacity factors tuned carefully, models can exhibit loss spikes, gradient explosions, and catastrophic expert collapse mid-training. Zoph et al. 2022 (ST-MoE) did the most systematic study of these failures and found the root cause: the router's logit magnitudes can grow without bound.
Here's what happens. The router computes logits $l = x W_r$. If the logits become very large in magnitude, say $l_{\max} = 50$, then softmax(l) is dominated by $\exp(50) \approx 5 \times 10^{21}$ compared to $\exp(0) = 1$. The softmax saturates: one expert gets probability ≈ 1, all others get ≈ 0. Gradients through the saturated softmax vanish for all non-selected experts. The router "hardens" to deterministic decisions with no gradient signal to escape.
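The saturation effect is easy to observe by comparing the softmax Jacobian at moderate and large logit scales. The logit values below are illustrative; the only assumption is that gradients reach the router through the softmax output.

```python
import torch
from torch.autograd.functional import jacobian

base = torch.tensor([1.0, 0.5, 0.0, -0.5])   # illustrative router logits for 4 experts


def softmax_jacobian_norm(scale: float) -> float:
    """Frobenius norm of d softmax(l) / d l at logits = base * scale."""
    logits = base * scale
    J = jacobian(lambda l: torch.softmax(l, dim=-1), logits)   # [4, 4]
    return J.norm().item()


moderate = softmax_jacobian_norm(1.0)     # probabilities spread across experts
saturated = softmax_jacobian_norm(50.0)   # top logit is 50, softmax is nearly one-hot

print(f"moderate:  {moderate:.3e}")
print(f"saturated: {saturated:.3e}")
```

At the large scale the Jacobian is numerically close to zero, so almost no gradient flows back to the router logits.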
The fix comes in two parts. First, run the router in float32 rather than bfloat16. The router performs softmax, which is numerically sensitive. bfloat16's limited precision (only 7 mantissa bits) causes rounding errors in the exponentials that can flip routing decisions. Using float32 just for the router (with the rest of training in bf16 for memory efficiency) eliminates a major source of instability.
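A sketch of the float32-router idea is shown below, together with a tiny demonstration of how bf16 rounding can erase the difference between two close logits. The module name is hypothetical, and the sketch assumes the caller does not cast this module to bf16 along with the rest of the model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# bf16 keeps 7 mantissa bits, so two nearby float32 logits can collapse into a tie
close_logits = torch.tensor([1.000, 1.003])
rounded = close_logits.bfloat16().float()
tie_in_bf16 = bool(rounded[0] == rounded[1])


class Float32Router(nn.Module):
    """Router whose projection and softmax run in float32."""

    def __init__(self, d_model, n_experts):
        super().__init__()
        # Assumption: this layer is left in float32 when the model is cast to bf16
        self.proj = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x, top_k):
        logits = self.proj(x.float())            # upcast activations before the matmul
        probs = F.softmax(logits, dim=-1)        # softmax computed in float32
        gates, idx = probs.topk(top_k, dim=-1)
        return gates.to(x.dtype), idx            # cast gates back for the bf16 experts


torch.manual_seed(0)
router = Float32Router(d_model=16, n_experts=8)
x = torch.randn(4, 16, dtype=torch.bfloat16)
gates, idx = router(x, top_k=2)
```

The selection itself is made on float32 probabilities; only the final gate values are cast back to the activation dtype.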
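Second, add a router z-loss. The z-loss penalizes large logit magnitudes directly, regardless of routing decisions:

$$L_z = \beta \cdot \frac{1}{T} \sum_{t=1}^{T} \left( \log \sum_{i=1}^{N} \exp\!\left(l_i^{(t)}\right) \right)^{2}$$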
The inner term $\log \sum_i \exp(l_i)$ is the log-sum-exp of all router logits for token $t$. This is large when any logit is large. Squaring it makes it even more punishing for outliers. The gradient with respect to each logit $l_i$ is proportional to the softmax probability $p_i$, so overconfident routing (large logits) is directly penalized. β is typically set to 0.001, much smaller than the language modeling loss coefficient.
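The gradient statement can be verified with autograd. The sketch assumes the z-loss is β times the batch mean of the squared log-sum-exp, in which case the gradient for each logit is (2β/T) · logsumexp · p_i. Sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
T, N, beta = 8, 4, 1e-3
logits = torch.randn(T, N, requires_grad=True)

log_z = torch.logsumexp(logits, dim=-1)          # [T], log-sum-exp per token
z_loss = beta * (log_z ** 2).mean()
z_loss.backward()

# Closed form: d z_loss / d l_i = (2 * beta / T) * log_z * softmax(l)_i
probs = torch.softmax(logits.detach(), dim=-1)
expected = (2 * beta / T) * log_z.detach().unsqueeze(-1) * probs
```

`logits.grad` matches `expected`: the push on each logit scales with both its softmax probability and the size of the log-sum-exp.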
Expert dropout during fine-tuning is another stability trick. Sparse MoEs can overfit severely on small fine-tuning datasets — the model has many parameters but they're highly specialized, making them prone to memorizing the fine-tuning examples rather than generalizing. Zoph et al. 2022 found that fine-tuning only the non-MoE parts (attention, layer norms, embeddings) worked better than fine-tuning everything. More recent work (DeepSeek, Qwen MoE) uses large SFT datasets to avoid this problem entirely.
Upcycling is an alternative to training MoE from scratch: initialize a MoE model from a pre-trained dense model. Each expert starts as a copy of the dense FFN; the router starts from random. You then continue training. MiniCPM and Qwen MoE both demonstrated that upcycling can reach higher quality than training the same MoE from scratch with the same compute budget — you get the pre-trained dense model's knowledge as a starting point for free.
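A minimal sketch of the upcycling initialization follows, assuming the dense FFN is an ordinary `nn.Module` that can be deep-copied. The function name, layer sizes, and router init scale are hypothetical choices for illustration.

```python
import copy

import torch
import torch.nn as nn


def upcycle_ffn(dense_ffn: nn.Module, d_model: int, n_experts: int):
    """Build MoE parts from a trained dense FFN: N expert copies plus a fresh router."""
    experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(n_experts)])
    router = nn.Linear(d_model, n_experts, bias=False)
    nn.init.normal_(router.weight, std=0.02)     # random init; the std is arbitrary
    return experts, router


torch.manual_seed(0)
d_model, d_ff, n_experts = 16, 64, 4
dense = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
experts, router = upcycle_ffn(dense, d_model, n_experts)

x = torch.randn(5, d_model)
# All experts start identical, so any gates that sum to 1 reproduce the dense FFN
gates = torch.softmax(router(x), dim=-1)                           # [5, 4]
moe_out = sum(gates[:, i:i + 1] * experts[i](x) for i in range(n_experts))
```

At initialization the upcycled layer computes the same function as the dense FFN; the experts only diverge once continued training sends them different tokens.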
```python
import torch


# Router z-loss: penalizes large logit magnitudes
def router_z_loss(logits, beta=1e-3):
    # logits: [B*T, N], router output before softmax
    log_z = torch.logsumexp(logits, dim=-1)  # [B*T], log-sum-exp per token
    z_loss = (log_z ** 2).mean()
    return beta * z_loss


# Load-balance auxiliary loss (Switch Transformer style)
def balance_loss(router_probs, expert_indices, N, alpha=1e-2):
    # router_probs: [T, N], expert_indices: [T] (argmax of router)
    T = router_probs.shape[0]
    # f_i: fraction of tokens going to expert i (hard, non-differentiable).
    # bincount keeps f on the same device as the routing indices.
    f = torch.bincount(expert_indices, minlength=N).to(router_probs.dtype) / T
    # P_i: average soft probability for expert i (differentiable)
    P = router_probs.mean(dim=0)  # [N]
    return alpha * N * (f * P).sum()


# Combined training loss (lm_loss, probs, idx, and logits come from the training step):
# loss = lm_loss + balance_loss(probs, idx, N) + router_z_loss(logits)
```
All the concepts from the preceding chapters crystallize in three real MoE architectures. Let's walk through each and then look at the parameter math.
Mixtral 8×7B was the first widely adopted open-weight MoE. 8 experts per layer, top-2 routing, no shared experts, no fine-grained splitting. 32 layers. Each expert is a full-size 7B-class FFN. Total parameters: approximately 46.7B (8 experts × ~5.6B per expert summed over all layers, plus shared non-MoE weights). Active parameters per token: roughly 13B (top-2 experts plus attention and embeddings). At inference, Mixtral matches LLaMA 2 70B quality while only computing 13B worth of weights, a 5× reduction in compute per token.
DeepSeek-V3 is the design statement of 2025-era MoE. 256 routed experts per MoE layer, 1 shared expert, top-8 routing, fine-grained ratio 1/14. This means each "expert" is much smaller than a standard FFN: 14 fine-grained experts together equal one standard expert. 58 MoE layers in a 61-layer model (the first three layers keep a dense FFN). Total parameters: 671B. Active parameters per token: 37B. The ratio is 671/37 ≈ 18, so each token touches only 1/18th of the model by parameter count.
DeepSeek-V3 also uses auxiliary-loss-free balancing (per-expert biases updated online) instead of the standard auxiliary loss — a deliberate choice to avoid the α hyperparameter sensitivity that plagued earlier models.
Llama 4 Maverick is Meta's 2025 MoE flagship. 128 routed experts, top-1 routing (!), 1 shared expert, fine-grained ratio 1/2. This is the return of k=1 routing, presumably combined with better training recipes and balance loss variants that make k=1 stable at this scale. Competitive with Gemini 1.5 Flash and GPT-4o-mini at a fraction of the active compute.
Note: parameter counts are sometimes reported per layer and sometimes summed over the whole model, so check which convention a figure uses. The key insight is that each MoE layer has its own independent set of N expert FFNs, so total expert parameters = N × expert_size × n_moe_layers.
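As a worked instance of that formula, the sketch below plugs in Mixtral 8×7B. The dimensions (`d_model = 4096`, `d_ff = 14336`, 32 layers, three weight matrices per gated FFN) are assumptions taken from the publicly released model configuration rather than from this article, and the helper name is hypothetical.

```python
def expert_params(n_experts: int, expert_size: int, n_moe_layers: int) -> int:
    """Total expert weights = N x expert_size x n_moe_layers."""
    return n_experts * expert_size * n_moe_layers


# Assumed Mixtral 8x7B dimensions (from the public config, not from this article)
d_model, d_ff, n_layers = 4096, 14336, 32
expert_size = 3 * d_model * d_ff          # gated FFN: three weight matrices per expert

total_expert = expert_params(8, expert_size, n_layers)    # all 8 experts in every layer
active_expert = expert_params(2, expert_size, n_layers)   # top-2 experts in every layer

print(f"per expert per layer: {expert_size / 1e6:.0f}M")
print(f"total expert params:  {total_expert / 1e9:.1f}B")
print(f"active expert params: {active_expert / 1e9:.1f}B")
```

Under these assumptions the experts account for about 45B of the roughly 46.7B total, and about 11B of the roughly 13B active parameters; attention, embeddings, and norms make up the remainder.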
| Decision | Common Choice | Why | Exception |
|---|---|---|---|
| Where to put experts | Replace FFN layers | FFNs are position-wise → token independence | JetMoE also routes attention heads |
| Routing type | Token-choice top-k | Simple, fast, differentiable via soft gates | Expert-choice (Zhou et al.) for balanced training |
| Top-k | 2 (Mixtral, Grok), 6–8 (DeepSeek, OlMoE) | k=2 robust; higher k improves quality, costs more FLOPs | k=1 (Switch, Llama 4) for compute efficiency |
| Softmax order | Mixtral/DSV3: topk then softmax | Gates sum to 1 over winners — cleaner mixing | Switch: softmax then topk |
| Load balancing | Auxiliary loss (Fedus 2022) | Prevents expert collapse, GPU efficiency | DSV3: bias-based aux-loss-free |
| Capacity factor | 1.0–1.25 | 25% headroom absorbs minor imbalance | Fine-tuning may need C>1.5 for safety |
| Fine-grained | 1/4 to 1/14 (DeepSeek) | More experts → finer specialization, same FLOPs | Simple models (Mixtral): ratio 1 |
| Shared experts | 0–2 (model-dependent) | Universal patterns; saves routing entropy | OlMoE: no benefit found in ablations |
| Router precision | Float32 | Softmax numerics; prevents instability | Never use bf16 for routing |
| Z-loss | β = 0.001 | Bounds logit magnitudes, stabilizes training | Optional but strongly recommended |