Mixture of Experts — Language Models from Scratch (CS336 L4)

Chapter 0: The Core Insight

Imagine you are running a hospital. You have three choices: hire one extremely knowledgeable generalist who sees every patient, hire three specialists and route each patient to exactly one of them, or hire ten specialists and still route each patient to just one. The third hospital has ten times the total expertise, but each patient still only sees one doctor. The cost per patient doesn't change — only the total talent pool does.

That is the Mixture of Experts (MoE) idea applied to transformers. A standard dense model grows both parameter count and FLOPs together — doubling parameters doubles both capacity and compute per token. MoE breaks that coupling. You can have 256 FFN networks ("experts") inside each transformer layer, but each token is only routed to 8 of them. Total parameters scale with N experts; per-token compute stays fixed at k experts.

The slide that launched a thousand MoE papers: Fedus et al. 2022 showed that for the same training FLOPs budget, a sparse MoE model consistently beat a dense model with the same compute. Same cost, better performance. That's not magic — it's specialization. Different experts learn to handle different kinds of tokens, allowing the network to maintain specialists without paying their full compute cost on every input.

The decoupling principle. In a dense FFN layer with d_model=4096 and d_ff=16384, every token touches all 16384 hidden units. In a MoE layer with 64 experts each of size d_ff=16384, but top-k=2 routing, each token still only touches 2×16384 hidden units. Total parameters grew 64×; per-token FLOPs grew only 2×. You can't get this leverage from a dense model at any scale.

Why is this suddenly everywhere? Four forces aligned around 2022–2025. First, empirical scaling: same FLOPs-per-token, MoE trains to lower loss. Second, speed: OlMoE showed MoEs reach the same quality milestone faster in wall-clock time than dense equivalents. Third, results: Mixtral 8×7B beats LLaMA 2 70B while matching a 13B-parameter dense compute budget. Fourth, hardware: as clusters grow to thousands of GPUs, the per-expert compute maps naturally onto expert parallelism — put each expert on a different device, route tokens across the network. The infrastructure was finally ready.

MoE is not new. The idea goes back to Jacobs, Jordan, Nowlan & Hinton 1991 — "Adaptive Mixtures of Local Experts." Shazeer et al. 2017 scaled it to language models with the "Outrageously Large Neural Networks" paper. What changed is infrastructure: sparse tensor operations, expert parallelism libraries like MegaBlocks, and enough GPUs to actually run 256-expert models.

[Interactive figure: Dense vs Sparse, compute per token. Sliders set total experts N and active experts (top-k) to show how total parameters and per-token FLOPs diverge.]

A MoE model has 256 experts per layer, each with the same size as a standard FFN. If top-k=8, how do per-token FLOPs compare to a single-expert (dense) model of the same expert size?

256× more FLOPs: every expert runs.
8× more FLOPs: only the top 8 experts run.
Same FLOPs: the router adds no compute.
8/256 fewer FLOPs: sparsity reduces compute below baseline.

Chapter 1: Sparse FFN Architecture

Before MoE, every transformer layer looked the same: a self-attention block followed by a feed-forward network (FFN). The FFN is a two-matrix computation — project up from d_model to d_ff (the "expansion"), apply an activation, project back down. Every token goes through this entire FFN at every layer. That's the dense baseline.

The MoE transformation is surgical: replace the FFN with multiple FFNs plus a routing network. The "experts" are ordinary FFNs — the same architecture, just N copies of them. The router is a small linear layer that takes the token's hidden state and outputs a probability distribution over experts. You pick the top-k experts according to those probabilities, run only those k FFNs, and combine their outputs as a weighted sum.

Token hidden state x ∈ ℝ^d

Output of attention for this token position

↓ Router (linear layer + softmax)

Gate weights g₁…g_N ∈ [0,1]

Probability distribution over N experts; pick top-k

↓ Run top-k experts

Expert outputs e_i(x) for i in top-k

Each is the full FFN forward pass through expert i

↓ Weighted sum

Output = ∑_{i ∈ top-k} g_i · e_i(x)

Back to d_model, continues through the next block

The key constraint: experts are only in the FFN layers, not the attention layers. Attention operates across positions: each token's output depends on the keys and values of other tokens in the sequence. An FFN, by contrast, is a position-wise operation: the same function applied independently to each token. That independence is what makes expert routing tractable. You can send different tokens to different experts without any cross-token coordination.

MoE for attention heads also exists (ModuleFormer, JetMoE) but is far less common. The math of routing across attention heads gets messy and the gains don't clearly justify the complexity. The community consensus landed on: attention stays dense, FFN goes sparse.

FLOPs accounting for a MoE layer. Dense FFN with d_model=4096, d_ff=16384: the two matrices hold 2 × d_model × d_ff = 2 × 4096 × 16384 ≈ 134M weights, so the forward pass costs about 134M multiply-accumulates per token (≈ 268M FLOPs if each multiply and add is counted separately). MoE with N=64 experts, same expert size, top-k=2: 2 × 134M ≈ 268M multiply-accumulates per token, only 2× the dense baseline. But total parameters jumped 64×. That's the trade.
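The accounting above can be reproduced in a few lines. This is a minimal sketch, assuming a two-matrix FFN without biases and one multiply-accumulate per weight per token; the function names are illustrative and not taken from any library.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in one two-matrix FFN (up- and down-projection, no biases)."""
    return 2 * d_model * d_ff


def moe_ffn_cost(d_model: int, d_ff: int, n_experts: int, top_k: int):
    """Return (total expert weights, weights touched per token) for one MoE layer.

    Each weight touched costs one multiply-accumulate per token,
    so the second number is also the per-token MAC count.
    """
    per_expert = ffn_params(d_model, d_ff)
    return n_experts * per_expert, top_k * per_expert


dense = ffn_params(4096, 16384)
total, active = moe_ffn_cost(4096, 16384, n_experts=64, top_k=2)

print(f"dense FFN weights:     {dense / 1e6:.0f}M")
print(f"MoE total weights:     {total / 1e9:.2f}B ({total // dense}x dense)")
print(f"MoE per-token weights: {active / 1e6:.0f}M ({active // dense}x dense)")
```

Changing `n_experts` moves only the first number; changing `top_k` moves only the second. That is the decoupling in one line each.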

MoE layers are not always placed at every transformer layer. Common patterns: alternate MoE and dense layers (GShard, Switch Transformer), use MoE in every layer (Mixtral), or use MoE in almost every layer but keep the first few layers dense (DeepSeek-V3 keeps its first three layers dense). One motivation for keeping some layers dense is stability: the earliest layers tend to develop less specialized representations and appear to benefit less from routing.

python
# Minimal MoE FFN forward — sketch showing the shape flow
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    def __init__(self, d_model, d_ff, n_experts, top_k):
        super().__init__()
        self.n_experts = n_experts
        self.top_k = top_k
        # N separate FFNs, each (d_model → d_ff → d_model).
        # Scaled-down init keeps activations in a sane range for this sketch.
        self.w1 = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.w2 = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)
        # Router: maps d_model → N logits
        self.router = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x):
        # x: [B, T, d_model]
        B, T, d = x.shape
        x_flat = x.reshape(B*T, d)           # [B*T, d_model]; reshape also handles non-contiguous x

        logits = self.router(x_flat)          # [B*T, N]
        probs  = F.softmax(logits, dim=-1)   # [B*T, N]

        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        # topk_probs: [B*T, k],  topk_idx: [B*T, k]

        out = torch.zeros_like(x_flat)
        for i in range(self.top_k):
            ei = topk_idx[:, i]                # [B*T] — which expert for slot i
            gi = topk_probs[:, i:i+1]         # [B*T, 1] — gate weight
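            # Gather the expert weights for each token. Simple but memory-hungry:
            # this copies a full expert per token, so real implementations
            # group tokens by expert instead.
            w1_i = self.w1[ei]                # [B*T, d_model, d_ff]
            w2_i = self.w2[ei]                # [B*T, d_ff, d_model]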
            h = F.gelu(torch.bmm(x_flat.unsqueeze(1), w1_i).squeeze(1))
            e_out = torch.bmm(h.unsqueeze(1), w2_i).squeeze(1)
            out += gi * e_out

        return out.view(B, T, d)

Why are MoE experts placed in FFN layers rather than attention layers?

Attention is already sparse due to masking, so MoE would be redundant.
Expert weights are too large to fit in GPU memory alongside attention matrices.
FFNs are position-wise (each token is independent), so tokens can be routed to different experts without cross-token coordination.
MoE was only tested on FFNs; attention experts have never been tried.

Chapter 2: Router & Top-K Gating

The router is the decision-maker: given a token's hidden state, it assigns that token to exactly k of the N available experts. Getting this right is the entire game. The basic approach — used in most models including Switch Transformer, Mixtral, and DeepSeek — is embarrassingly simple: a linear layer + softmax.

Let x ∈ ℝ^d be the token's hidden state (a vector with d entries). The router has a weight matrix W_r ∈ ℝ^{d × N}. Compute logits l = xW_r (a vector of N scores, one per expert). Apply softmax to get probabilities over all N experts:

p_i(x) = softmax(xW_r)_i = exp(l_i) / ∑_{j=1}^{N} exp(l_j)

Now pick the top-k probabilities. Two common variants exist for when you normalize. The original Switch Transformer and most models compute softmax over all N experts first, then take the top k. Mixtral and DeepSeek-V3 flip the order: select the top-k experts first, then normalize over just those k winners (Mixtral applies a softmax to the top-k logits; DeepSeek-V3 scores experts with a sigmoid and divides by the sum over the selected k). The difference matters: if you softmax over all N and pick top k, the gate weights for the selected experts still reflect comparisons against the other N-k losers. If you normalize only over the k winners, the weights sum to 1 over just those k and are more interpretable as mixing coefficients.

Output = ∑_{i ∈ top-k} g_i(x) · FFN_i(x)

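where g_i(x) is the gate weight of selected expert i: p_i(x) itself in the softmax-then-top-k variant, or the softmax over only the top-k logits in the top-k-then-softmax variant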

The router itself is tiny: just one matrix of size d × N, no activation. For DeepSeek-V3 with d=7168 and N=256 experts, that's 7168 × 256 ≈ 1.8M parameters per MoE layer. Compared to the 671B total, the router is invisible. But its decisions determine everything.

[Interactive figure: Gating network (token → softmax → top-k selection). A single token is scored against N experts; sliders set N and k, and the top-k selected experts are highlighted.]

Why not just argmax (k=1)? Switch Transformer used k=1 — the simplest possible router. One expert, zero weighted combination needed. But k=1 is brittle: a small change in logits can flip a token to a completely different expert, causing discontinuous behavior. k=2 creates a soft mixture: even if one expert is only marginally preferred, the second expert still contributes. Most modern models use k=2 to k=8 as a result.

Alternative routing strategies exist. Hash routing deterministically maps tokens to experts based on a hash of the token ID (no learned router, zero overhead, but no learned specialization either). Expert-choice routing (Zhou et al. 2022) flips the perspective: instead of tokens choosing experts, each expert chooses its top-C tokens from the batch. This guarantees perfect load balance by construction, but breaks causality (an expert needs to see all tokens before picking its favorites, which doesn't work autoregressively). RL routing (Bengio 2013, Clark 2022) treats routing as a policy to optimize, which is the "right" approach mathematically, but gradient variance makes it impractical at scale. Token-choice top-k won the practical competition.
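To make the expert-choice idea concrete, here is a toy sketch in plain Python. The `expert_choice` function and the score table are invented for illustration; real implementations work on batched tensors.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the router affinity of token t for expert e.

    Each expert keeps its `capacity` highest-scoring tokens.
    Returns one sorted list of token indices per expert.
    """
    n_tokens = len(scores)
    n_experts = len(scores[0])
    chosen = []
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        chosen.append(sorted(ranked[:capacity]))
    return chosen


scores = [
    [0.9, 0.1],   # token 0 strongly prefers expert 0
    [0.8, 0.2],   # token 1 also prefers expert 0
    [0.7, 0.3],   # token 2 also prefers expert 0
    [0.4, 0.6],   # token 3 prefers expert 1
]
assignment = expert_choice(scores, capacity=2)
print(assignment)  # [[0, 1], [2, 3]]
```

Every expert ends up with exactly two tokens, even though three tokens preferred expert 0. The price is that a token's fate depends on the whole batch: token 2 lands on its second-choice expert, and in general a token can be picked by several experts or by none.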

python
# Top-k router: two variants
import torch.nn.functional as F
def route_softmax_then_topk(x, W_r, k):
    # Variant A (Switch style): softmax first, then select top-k
    logits = x @ W_r                        # [T, N]
    probs  = F.softmax(logits, dim=-1)    # [T, N], sums to 1 over all N
    topk_g, topk_idx = probs.topk(k, dim=-1)
    return topk_g, topk_idx                 # gates don't sum to 1 over k

def route_topk_then_softmax(x, W_r, k):
    # Variant B (Mixtral/DeepSeek-V3): topk first, softmax over winners
    logits = x @ W_r                        # [T, N]
    topk_l, topk_idx = logits.topk(k, dim=-1)
    topk_g = F.softmax(topk_l, dim=-1)   # [T, k], sums to 1 over k winners
    return topk_g, topk_idx
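A small worked example may help compare the two variants on the same logits. This is a plain-Python sketch with made-up logits; the helper names are illustrative.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(values, k):
    """Indices of the k largest entries, largest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


logits = [2.0, 1.0, 0.0, -1.0]   # made-up router logits for one token
k = 2

# Variant A: softmax over all experts, then keep the top-k probabilities.
probs = softmax(logits)
idx_a = top_k_indices(probs, k)
gates_a = [probs[i] for i in idx_a]

# Variant B: keep the top-k logits, then softmax over just those.
idx_b = top_k_indices(logits, k)
gates_b = softmax([logits[i] for i in idx_b])

print("A:", idx_a, [round(g, 3) for g in gates_a], "sum =", round(sum(gates_a), 3))
print("B:", idx_b, [round(g, 3) for g in gates_b], "sum =", round(sum(gates_b), 3))
```

Both variants select the same experts and give them the same relative weights, because softmax preserves the ordering and ratios of the exponentiated logits. The only difference is the normalization: Variant A's gates sum to less than 1, so the layer output is scaled down by however much probability mass went to the unselected experts.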

In Variant B (top-k then softmax), the gate weights g_i for the selected experts sum to:

1 over all N experts (since softmax is used).
Some value less than 1, since only the top k are included.
Exactly 1 over the k selected experts (softmax renormalizes over just the k winners).
k, since each selected expert gets equal weight.

Chapter 3: Load Balancing

Here is a disaster that will absolutely happen without intervention: collapse to one expert. The router is a learned function. If one expert happens to produce slightly better outputs early in training, the router learns to send more tokens to it. With more tokens, that expert gets more gradient updates. With more updates, it becomes even better. The other experts atrophy from disuse. Left unchecked, one expert can end up receiving nearly all of the tokens while the rest become ghost experts that nobody calls.

This is not a theoretical risk: it showed up in early MoE papers. Shazeer et al. 2017 described exactly this self-reinforcing imbalance. The model degenerates into an expensive dense model: one expert is computing for almost every token, so the whole sparsity advantage evaporates. You're paying for N experts but only training one.

The solution everyone uses is a load-balancing auxiliary loss — a differentiable penalty added to the training objective that discourages routing imbalance. The Switch Transformer (Fedus et al. 2022) introduced the canonical form. Define two quantities per expert i, computed over a batch of T tokens:

f_i = fraction of tokens routed to expert i = (1/T) ∑_{t=1}^{T} 1[argmax_j p_j(x_t) = i]

P_i = average router probability assigned to expert i = (1/T) ∑_{t=1}^{T} p_i(x_t)

L_balance = α · N · ∑_{i=1}^{N} f_i · P_i

Two points about this formula. First, f_i is computed from argmax: it's a hard count, not differentiable, while P_i is the soft probability and fully differentiable. Second, the product f_i · P_i still yields useful gradients, with f_i acting as a constant weight: ∂L/∂P_i = αN · f_i. Experts that receive more tokens (larger f_i) get a stronger downward push on their router probabilities, discouraging the router from continuing to send them tokens. Experts receiving few tokens get a weaker push, allowing the router to send more their way.

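Why multiply f_i × P_i? Because tokens are routed according to their probabilities, f_i and P_i rise and fall together. In the idealized case f_i = P_i, the sum becomes ∑ P_i², and the Cauchy-Schwarz inequality gives ∑ P_i² ≥ (∑ P_i)² / N = 1/N, with equality only when P_i = 1/N for all i, the perfectly uniform distribution. Any imbalance increases the sum, so minimizing L_balance pushes the router toward uniformity. The N scaling factor makes the balanced value of the loss equal to α regardless of how many experts you have.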
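A quick numeric check of the loss at the two extremes. This is a plain-Python sketch; the collapsed distribution is a made-up example, and `switch_balance_loss` is an illustrative name.

```python
def switch_balance_loss(f, P, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i f_i * P_i."""
    n = len(f)
    return alpha * n * sum(fi * pi for fi, pi in zip(f, P))


N = 4
uniform = [1 / N] * N

# Hypothetical collapsed router: expert 0 gets nearly everything.
collapsed_f = [0.97, 0.01, 0.01, 0.01]   # hard token fractions
collapsed_P = [0.91, 0.03, 0.03, 0.03]   # mean router probabilities

balanced = switch_balance_loss(uniform, uniform)
collapsed = switch_balance_loss(collapsed_f, collapsed_P)
print(f"balanced:  {balanced:.4f}")
print(f"collapsed: {collapsed:.4f}")
```

With perfectly uniform routing the loss equals `alpha`; as routing collapses onto one expert it approaches `alpha * N`.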

[Interactive figure: Load balancing, balanced vs collapsed routing. Bars show the fraction of tokens each expert receives with the balance loss toggled on and off.]

The hyperparameter α controls how strongly to enforce balance. Too small: collapse happens anyway. Too large: the balance loss overwhelms the main language modeling loss and the experts can't specialize — they all learn the same average function. In practice, α is set to values between 0.001 and 0.01 and requires tuning per model size and training duration.

DeepSeek-V3 introduced a clever alternative called auxiliary-loss-free balancing: instead of adding a penalty to the loss, they maintain a per-expert bias term b_i that is added to the routing scores and adjusted online, outside of gradient descent. After each training step, experts that received too many tokens have their bias decreased by a small fixed step (making them less likely to be selected), and underloaded experts have it increased. The bias affects which experts are selected but not the gate weights used in the weighted sum. This avoids the sensitivity to the α hyperparameter and doesn't pollute the main loss signal. (V3 still keeps a very small sequence-wise balance loss as a safeguard, so it is not entirely free of auxiliary losses.)
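The bias mechanism can be sketched in plain Python. This is a toy illustration under stated assumptions, not DeepSeek's implementation: the step size `gamma`, the compare-to-mean rule, and the scores are all invented, and every token carries identical scores so that the worst case (total collapse) is the starting point.

```python
def select_experts(scores, bias, k):
    """Pick top-k experts by (score + bias). The bias is used for selection only;
    gate weights would still be computed from the raw scores."""
    order = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    return order[:k]


def update_bias(bias, load, gamma):
    """Nudge each bias down if its expert was overloaded, up if underloaded."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)
        elif n < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias


scores = [0.50, 0.45, 0.40, 0.35]   # same scores for every token in this toy batch
bias = [0.0, 0.0, 0.0, 0.0]
n_tokens, k, gamma = 8, 1, 0.02     # gamma is an illustrative step size

history = []
for step in range(12):
    load = [0] * len(scores)
    for _ in range(n_tokens):
        for e in select_experts(scores, bias, k):
            load[e] += 1
    history.append(load)
    bias = update_bias(bias, load, gamma)

print("first step load:", history[0])
print("experts used over 12 steps:", [sum(h[e] > 0 for h in history) for e in range(4)])
```

Without the bias, expert 0 would win every step. With it, the selection rotates and every expert gets used, with no term added to the loss.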

In the Switch Transformer load-balance loss L = αN ∑ f_i P_i, why is P_i (soft probability) used instead of just f_i (hard count) for both terms?

P_i is easier to compute than f_i during training.
f_i uses argmax which is not differentiable; P_i provides the gradient signal needed to update the router.
f_i is always equal to P_i when routing is balanced.
The product ensures the loss is bounded between 0 and 1.

Chapter 4: Expert Capacity & Token Dropping

Even with a load-balancing loss, routing is imperfect. In any given batch, some experts will receive slightly more tokens than the average. This creates a hardware problem: in real distributed training, each expert sits on a different device. If expert 3 needs to process 500 tokens but expert 7 only gets 50, every device must wait for expert 3 to finish before the batch can continue. The slowest expert determines the batch time.

The solution is expert capacity: set a hard maximum on how many tokens any expert can receive per batch. If more tokens are routed to an expert than its capacity allows, the overflow tokens are dropped — they pass through without being processed by any expert (their residual stream contribution for that layer is just zero, or sometimes a copy of their input).

The capacity is expressed as a multiple of the "fair share." If you have T tokens in the batch and N experts, the average load under top-1 routing is T/N tokens per expert (with top-k routing there are k·T assignments, so the average is k·T/N). The capacity factor C sets the maximum as:

Capacity per expert = C × T / N

C=1.0 means each expert can handle exactly its fair share — any routing imbalance drops tokens. C=1.25 gives 25% headroom above the average — minor imbalances are absorbed without dropping. C=2.0 is generous but wastes memory (you must pre-allocate buffer space for all possible tokens). C<1.0 is possible if you deliberately want to prune computation.
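The capacity rule and its effect on an unbalanced batch can be sketched as follows. This assumes top-1 routing, first-come-first-served admission, and rounding the capacity up; all three are illustrative choices, and real systems differ in the details.

```python
import math


def expert_capacity(n_tokens, n_experts, capacity_factor):
    """Maximum tokens one expert may process: C * T / N, rounded up."""
    return math.ceil(capacity_factor * n_tokens / n_experts)


def apply_capacity(assignments, n_experts, capacity):
    """assignments[t] is the expert chosen for token t (top-1 routing).

    Tokens are admitted in order; once an expert is full, later tokens
    routed to it are dropped. Returns (kept token ids, dropped token ids).
    """
    used = [0] * n_experts
    kept, dropped = [], []
    for t, e in enumerate(assignments):
        if used[e] < capacity:
            used[e] += 1
            kept.append(t)
        else:
            dropped.append(t)
    return kept, dropped


assignments = [0, 0, 0, 1, 0, 2, 3, 0]   # expert 0 is oversubscribed
cap = expert_capacity(n_tokens=8, n_experts=4, capacity_factor=1.25)
kept, dropped = apply_capacity(assignments, n_experts=4, capacity=cap)
print("capacity:", cap)       # 3
print("kept:", kept)          # [0, 1, 2, 3, 5, 6]
print("dropped:", dropped)    # [4, 7]
```

Five tokens want expert 0 but only three fit, so the last two to arrive are dropped even though experts 1 to 3 have spare room.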

[Interactive figure: Expert capacity and token dropping. A batch of tokens is routed to experts; the dashed line is the capacity limit and tokens above it are dropped. Controls: capacity factor C and number of experts N.]

Dropped tokens are a real quality risk. A dropped token is not updated by that MoE layer. If a critical word, say "not" in "the model is not helpful", has its expert slot dropped, its representation passes through the layer unchanged and misses whatever processing the expert would have applied. How much this hurts depends on the setting: ST-MoE (Zoph et al. 2022) studied token dropping during fine-tuning and reported that quality was fairly robust to moderate drop rates, so treat dropping as a cost to monitor rather than an automatic failure.

An interesting and often-overlooked consequence of token dropping: MoE models can be non-deterministic at inference time. Observed nondeterminism in GPT-4's outputs is one reason people speculated that it uses MoE. Token dropping is decided at the batch level: whether your token exceeds capacity depends on which other tokens arrived in the same batch. So the same prompt can produce different results in two different batches, because the other tokens in each batch compete for the same expert slots.

Fine-tuning MoE models introduces its own capacity challenge. Fine-tuning typically uses small batches (often batch size 1 or 4). With tiny batches and 256 experts, many experts receive few or no tokens in a given step and get little gradient signal. Zoph et al. 2022 addressed this for ST-MoE by fine-tuning only the dense (non-MoE) parts of the network. DeepSeek's solution was to use massive fine-tuning data (1.4M SFT examples) so even with 256 experts, each gets enough exposure.

With T=512 tokens, N=8 experts, and capacity factor C=1.5, what is the maximum tokens each expert can process?

64 tokens (T/N, no headroom).
512 tokens (full batch, no dropping).
96 tokens (C × T/N = 1.5 × 512/8).
128 tokens (C × T/N, rounded up).

Chapter 5: Fine-Grained & Shared Experts

The original MoE recipe is clean: N experts of equal size, pick top k, done. But as the field iterated, two refinements emerged that are now standard in the best-performing models: fine-grained experts and shared experts.

Fine-Grained Experts

Suppose you have a total expert parameter budget of B (the number of weights in a single "standard-size" FFN). You could use 8 experts, each with B parameters (same size as a dense FFN), and route to top-2. Or you could split each of those 8 experts into 4 smaller ones, giving 32 experts with B/4 parameters each, and route to top-8 (which gives the same total active parameters per token, 2B). The second configuration has the same per-token FLOPs but allows far more distinct expert combinations and therefore finer specialization.

Why does this help? With more, smaller experts, the routing becomes more fine-grained. Instead of one expert handling "all syntax-related tokens" broadly, you might have four experts that each handle different syntactic phenomena. The total parameter budget is the same, but the granularity of specialization increases. The fine-grained ratio is the factor by which you split the experts (e.g., ratio 1/4 means each "slot" that would be one big expert is split into 4 smaller ones).
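The trade can be checked with simple arithmetic. The sketch below compares the two configurations from the example above, using arbitrary units for the expert size B; counting distinct top-k subsets is one illustrative way to quantify "granularity", not a metric from the source.

```python
import math


def summarize(n_experts, top_k, expert_size):
    """Return (total params, active params per token, distinct top-k expert subsets)."""
    total = n_experts * expert_size
    active = top_k * expert_size
    subsets = math.comb(n_experts, top_k)
    return total, active, subsets


B = 1.0  # size of one standard expert, in arbitrary units

coarse = summarize(n_experts=8, top_k=2, expert_size=B)
fine = summarize(n_experts=32, top_k=8, expert_size=B / 4)

print("coarse (8 experts, top-2): ", coarse)   # (8.0, 2.0, 28)
print("fine (32 experts, top-8):  ", fine)     # (8.0, 2.0, 10518300)
```

Total and active parameters are identical, but the router can compose over ten million distinct expert subsets instead of 28.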

OlMoE vs DeepSeek on fine-grained experts. OlMoE (Allen AI) found gains from fine-grained experts in their ablations. DeepSeek also uses them (ratio 1/14 in V3 — 256 experts where a "standard" model might have 18). Both confirm the pattern. The marginal cost of finer granularity is just router complexity; the benefit is richer specialization.

Shared Experts

Some information is so universal that every token needs it — basic syntax, common word meanings, punctuation handling. Routing these tokens to specialized experts wastes expert slots that could be used for more domain-specific processing. The shared expert (or "always-on" expert) design addresses this: designate S experts that receive every token regardless of routing decisions, plus the standard top-k routed experts.

DeepSeek V1 and V2 used S=2 shared experts. DeepSeek V3 uses S=1. The shared experts handle universal knowledge; the routed experts handle specialization. Total active parameters per token = shared expert params + top-k routed expert params.

Model	Routed experts	Active (top-k)	Shared	Fine-grained ratio
Switch Transformer	64	1	0	1
Mixtral 8×7B	8	2	0	1
DBRX	16	4	0	1
Grok	8	2	0	1
DeepSeek V1	64	6	2	1/4
DeepSeek V2	160	6	2	1/10
DeepSeek V3	256	8	1	1/14
OlMoE	64	8	0	1/8
Llama 4 Maverick	128	1	1	1/2

The shared expert controversy. DeepSeek's ablations showed gains from shared experts. OlMoE's ablations showed no gains. The discrepancy is likely scale- and data-dependent. At DeepSeek's scale (hundreds of billions of parameters), shared experts may be picking up truly universal patterns that benefit every token. At OlMoE's smaller scale, the routing may already handle this naturally. Do not treat shared experts as universally beneficial.

A model uses fine-grained ratio 1/4 with 64 routed experts and top-8 routing. How does this compare to a baseline with 16 experts and top-2 routing, assuming the same per-token FLOPs budget?

They are identical: both compute the same number of expert FFNs per token.
Same total active params and FLOPs, but 4× more experts with 4× finer specialization granularity.
4× more FLOPs: the fine-grained routing is more expensive.
Fine-grained experts reduce total parameters by 4×.

Chapter 6: Training Instability & Fixes

MoE training is notoriously unstable. Even with load-balancing losses and capacity factors tuned carefully, models can exhibit loss spikes, gradient explosions, and catastrophic expert collapse mid-training. Zoph et al. 2022 (ST-MoE) did the most systematic study of these failures and found the root cause: the router's logit magnitudes can grow without bound.

Here's what happens. The router computes logits l = xW_r. If the logits become very large in magnitude — say, l_max = 50 — then softmax(l) is dominated by exp(50) ≈ 5 × 10²¹ compared to exp(0) = 1. The softmax saturates: one expert gets probability ≈ 1, all others get ≈ 0. Gradients through the saturated softmax vanish for all non-selected experts. The router "hardens" to deterministic decisions with no gradient signal to escape.
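The saturation effect is easy to see numerically. This plain-Python sketch scales a made-up logit vector and prints the top probability along with p(1 − p), the derivative of the top probability with respect to its own logit.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    s = sum(exps)
    return [e / s for e in exps]


base = [1.0, 0.5, 0.0, -0.5]   # made-up router logits
self_grads = {}
for scale in (1, 10, 50):
    p = softmax([scale * l for l in base])
    top = max(p)
    # d p_top / d l_top = p_top * (1 - p_top)
    self_grads[scale] = top * (1 - top)
    print(f"scale={scale:>2}  p_max={top:.6f}  p_max*(1-p_max)={self_grads[scale]:.2e}")
```

As the logits grow, the winning probability approaches 1 and the gradient through the softmax shrinks toward zero, which is the "hardening" described above.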

The fix comes in two parts. First, run the router in float32 rather than bfloat16. The router performs softmax, which is numerically sensitive. bfloat16's limited precision (only 7 mantissa bits) causes rounding errors in the exponentials that can flip routing decisions. Using float32 just for the router (with the rest of training in bf16 for memory efficiency) eliminates a major source of instability.
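To see how coarse bfloat16 is at typical logit magnitudes, the sketch below emulates bfloat16 rounding in plain Python by keeping the top 16 bits of the float32 bit pattern. `to_bf16` is an illustrative helper, not a library function, and it ignores NaN and infinity.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a Python float to the nearest bfloat16 value (returned as a float).

    bfloat16 is the top 16 bits of an IEEE float32, so we round the
    float32 bit pattern to 16 bits using round-to-nearest-even.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


logit_a, logit_b = 10.03, 10.00   # expert A is slightly ahead in full precision
print(to_bf16(logit_a), to_bf16(logit_b))   # 10.0 10.0
print(to_bf16(10.05))                        # 10.0625
```

Near 10, adjacent bfloat16 values are 0.0625 apart, so a 0.03 lead for expert A vanishes and the two logits tie. Which expert wins is then decided by tie-breaking rather than by the router's preference.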

Second, add a router z-loss. The z-loss penalizes large logit magnitudes directly, regardless of routing decisions:

L_z = β · (1/T) ∑_{t=1}^{T} (log ∑_{i=1}^{N} exp(l_i(x_t)))²

The inner term log ∑ exp(l_i) is the log-sum-exp of all router logits for token t. This is large when any logit is large. Squaring it makes it even more punishing for outliers. The gradient with respect to each logit l_i is proportional to the softmax probability p_i — so overconfident routing (large logits) is directly penalized. β is typically set to 0.001, much smaller than the language modeling loss coefficient.
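The gradient claim can be verified numerically. For a single token, differentiating β · (log ∑ exp(l_i))² gives ∂L_z/∂l_i = 2β · logsumexp(l) · p_i. The plain-Python sketch below compares that expression against central finite differences on made-up logits.

```python
import math


def logsumexp(logits):
    """Numerically stable log(sum(exp(l_i)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits))


def z_loss(logits, beta=1e-3):
    """Router z-loss for a single token: beta * logsumexp(l)^2."""
    return beta * logsumexp(logits) ** 2


def z_loss_grad(logits, beta=1e-3):
    """Analytic gradient: 2 * beta * logsumexp(l) * softmax(l)_i."""
    lse = logsumexp(logits)
    return [2 * beta * lse * math.exp(l - lse) for l in logits]


logits = [3.0, 1.0, -2.0]   # made-up router logits for one token
analytic = z_loss_grad(logits)

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for i in range(len(logits)):
    up = logits[:i] + [logits[i] + eps] + logits[i + 1:]
    down = logits[:i] + [logits[i] - eps] + logits[i + 1:]
    numeric.append((z_loss(up) - z_loss(down)) / (2 * eps))

print([round(g, 6) for g in analytic])
print([round(g, 6) for g in numeric])
```

The largest logit receives the largest gradient, so the penalty pulls hardest on exactly the logit that is driving saturation.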

Float32 for the router is not optional. Switch Transformer (Fedus et al. 2022) introduced this selective precision specifically to stabilize training, and ST-MoE (Zoph et al. 2022) added the z-loss on top. The combination of float32 router + z-loss makes MoE training stable enough to scale to hundreds of billions of parameters. Both are standard in production MoE code today.

Fine-tuning raises a related problem: overfitting. Sparse MoEs can overfit severely on small fine-tuning datasets; the model has many parameters but they're highly specialized, making them prone to memorizing the fine-tuning examples rather than generalizing. Switch Transformer countered this with expert dropout, a higher dropout rate inside the expert FFNs than in the rest of the network. Zoph et al. 2022 found that fine-tuning only the non-MoE parts (attention, layer norms, embeddings) worked about as well as fine-tuning everything. More recent work (DeepSeek, Qwen MoE) uses large SFT datasets to sidestep the problem.

Upcycling is an alternative to training MoE from scratch: initialize a MoE model from a pre-trained dense model. Each expert starts as a copy of the dense FFN; the router is initialized randomly. You then continue training. MiniCPM and Qwen MoE both reported strong results from upcycled models at a small fraction of the cost of training a MoE from scratch, because the pre-trained dense model's knowledge serves as the starting point.
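The initialization step of upcycling can be sketched in a few lines. This toy version uses nested Python lists in place of tensors; the `upcycle` function, the weight names, and the router init scale are assumptions for illustration only.

```python
import copy
import random


def upcycle(dense_ffn, n_experts, d_model, seed=0, router_std=0.02):
    """Build MoE-layer weights from one trained dense FFN.

    dense_ffn: dict of weight matrices (nested lists) for the dense FFN.
    Every expert starts as an independent copy of the dense FFN; the
    router (d_model x n_experts) is drawn at random because the dense
    model has no router to copy from.
    """
    rng = random.Random(seed)
    experts = [copy.deepcopy(dense_ffn) for _ in range(n_experts)]
    router = [[rng.gauss(0.0, router_std) for _ in range(n_experts)]
              for _ in range(d_model)]
    return {"experts": experts, "router": router}


# Tiny stand-in for a trained dense FFN (d_model=2, d_ff=3).
dense_ffn = {
    "w_up": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "w_down": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
}
moe = upcycle(dense_ffn, n_experts=4, d_model=2)
print(len(moe["experts"]), "experts, router shape",
      (len(moe["router"]), len(moe["router"][0])))
```

The copies must be independent (hence `deepcopy`): the experts start identical and only diverge because continued training sends them different tokens.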

python
# Router z-loss — penalizes large logit magnitudes
def router_z_loss(logits, beta=1e-3):
    # logits: [B*T, N] — router output before softmax
    log_z = torch.logsumexp(logits, dim=-1)  # [B*T] — log sum exp per token
    z_loss = (log_z ** 2).mean()
    return beta * z_loss

# Load-balance auxiliary loss (Switch Transformer style)
def balance_loss(router_probs, expert_indices, N, alpha=1e-2):
    # router_probs: [T, N], expert_indices: [T] (argmax of router, dtype long)
    T = router_probs.shape[0]
    # f_i: fraction of tokens going to expert i (hard count, non-differentiable).
    # bincount keeps f on the same device as the inputs.
    f = torch.bincount(expert_indices, minlength=N).to(router_probs.dtype) / T
    # P_i: average soft probability for expert i (differentiable)
    P = router_probs.mean(dim=0)  # [N]
    return alpha * N * (f * P).sum()

# Combined training loss, computed inside the training step
# (lm_loss, probs, idx, and logits come from the forward pass):
# loss = lm_loss + balance_loss(probs, idx, N) + router_z_loss(logits)

What is the primary purpose of the router z-loss in MoE training?

To ensure all experts receive equal numbers of tokens (same as the balance loss).
To speed up training by reducing the number of softmax operations needed.
To keep router logit magnitudes small, preventing softmax saturation and numerical instability.
To regularize the expert weights, preventing experts from overfitting.

Chapter 7: MoE Showcase — Live System

Put it all together. In this simulator, a batch of tokens flows through a full MoE layer. Watch the routing decisions, observe how load imbalance emerges, and see what the load-balance loss and capacity factor actually do to the flow.

[Interactive figure: Full MoE layer simulator. Tokens are routed to experts; experts over capacity are highlighted and their overflow tokens are dropped. Controls: tokens T, experts N, top-k, capacity factor C, and a balance-loss toggle.]

What to explore. (1) Turn balance loss OFF and click "New batch" a few times — notice how one or two experts tend to attract far more tokens. (2) Turn balance loss ON — the distribution spreads out. (3) Reduce capacity factor below 1.0 — watch tokens get dropped (shown in red). (4) Increase top-k — notice each token touches more experts, FLOPs rise, but load distributes more naturally. (5) Try N=12, k=1 — this is Switch Transformer; one expert gets everything when balance fails.

Chapter 8: Real Models & Param Explorer

All the concepts from the preceding chapters crystallize in three real MoE architectures. Let's walk through each and then explore the parameter math interactively.

Mixtral 8×7B

The first widely adopted open-weight MoE. 8 experts per layer, top-2 routing, no shared experts, no fine-grained splitting. 32 layers, each with a MoE FFN. Each expert is a full-size FFN of the kind used in a 7B-class dense model. Total parameters: approximately 46.7B (8 experts × ~5.6B per expert summed over all 32 layers, plus the shared non-MoE weights). Active parameters per token: roughly 13B (top-2 experts plus attention and embeddings). At inference, Mixtral matches LLaMA 2 70B quality while only computing 13B worth of weights, roughly a 5× reduction in compute per token.
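The Mixtral numbers can be roughly reconstructed from its layer dimensions. The dimensions below (d_model=4096, d_ff=14336, a three-matrix gated FFN) are taken from the publicly released model configuration as best I recall and should be treated as assumptions; the non-expert share is simply whatever remains of the quoted 46.7B total.

```python
d_model, d_ff, n_layers = 4096, 14336, 32   # assumed Mixtral 8x7B dimensions
n_experts, top_k = 8, 2
matrices_per_expert = 3                      # gated FFN: gate, up, and down projections

per_expert_per_layer = matrices_per_expert * d_model * d_ff
per_expert_all_layers = per_expert_per_layer * n_layers
all_experts = n_experts * per_expert_all_layers
active_experts = top_k * per_expert_all_layers

reported_total = 46.7e9                      # total quoted in the text
non_expert = reported_total - all_experts    # attention, embeddings, norms, routers
active_total = active_experts + non_expert

print(f"one expert, all layers: {per_expert_all_layers / 1e9:.2f}B")
print(f"all experts:            {all_experts / 1e9:.1f}B")
print(f"implied non-expert:     {non_expert / 1e9:.1f}B")
print(f"active per token:       {active_total / 1e9:.1f}B")
```

Under these assumptions one expert comes to about 5.6B parameters across all layers and the active count lands near 13B, matching the figures quoted above.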

DeepSeek-V3 (671B total / 37B active)

The design statement of 2025-era MoE. 256 routed experts per MoE layer, 1 shared expert, top-8 routing, fine-grained ratio 1/14. This means each "expert" is much smaller than a standard FFN: 14 fine-grained experts together equal one standard expert. 58 of the model's 61 layers are MoE layers; the first three use a dense FFN. Total parameters: 671B. Active parameters per token: 37B. The ratio 671/37 ≈ 18 means each token touches only about 1/18th of the model by parameter count.

DeepSeek-V3 also uses auxiliary-loss-free balancing (per-expert biases updated online) instead of the standard auxiliary loss — a deliberate choice to avoid the α hyperparameter sensitivity that plagued earlier models.

Llama 4 Maverick

Meta's 2025 MoE flagship. 128 routed experts, top-1 routing (!), 1 shared expert, fine-grained ratio 1/2. The return of k=1 routing — presumably combined with better training recipes and balance loss variants that make k=1 stable at this scale. Competitive with Gemini 1.5 Flash and GPT-4o-mini at a fraction of the active compute.

[Interactive figure: Effective vs total params explorer. Controls: routed experts N, top-k active, shared experts S, and expert size; shows total model parameters vs active parameters per token.]

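The 37B number, derived. DeepSeek-V3 has 256 routed experts plus 1 shared expert in each of its 58 MoE layers (the first 3 of the 61 layers use a dense FFN). Each expert is a gated FFN with three matrices of size 7168 × 2048, about 44M parameters per expert per layer. Active expert parameters: (8 routed + 1 shared) × 44M ≈ 0.4B per MoE layer, or about 23B across 58 layers. Total expert parameters: 257 × 44M × 58 ≈ 656B. That leaves roughly 671B − 656B ≈ 15B for everything else (attention, embeddings, norms, routers, and the three dense FFN layers), all of which is active for every token. Total active: ~23B + ~15B ≈ 37B to 38B, in line with the reported 37B.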

Mixtral 8×7B routes to top-2 experts per token. If the model has 32 layers (each with a MoE FFN), and each of the 8 experts has 5.6B parameters summed over all 32 layers, how many total expert parameters are in the model?

5.6B (one expert, activated per layer).
44.8B (8 experts × 5.6B, where the 5.6B already sums one expert's weights over all layers).
44.8B (8 experts × 5.6B per expert, with one set of expert weights shared by all layers).
1,433.6B (8 × 5.6B × 32 layers, treating 5.6B as a per-layer size).

Note: answers 2 and 3 give the same number for different reasons, and only answer 2's reasoning is right. Experts are not shared across layers: each MoE layer has its own independent set of N expert FFNs, so total expert parameters = N × per-layer expert size × n_moe_layers. The 5.6B figure here is already one expert's weights summed over all 32 layers (about 176M per expert per layer), which is why the total is 8 × 5.6B ≈ 44.8B and not 1,433.6B.

Chapter 9: Connections & Cheat Sheet

MoE Design Space — Quick Reference

Decision	Common Choice	Why	Exception
Where to put experts	Replace FFN layers	FFNs are position-wise → token independence	JetMoE also routes attention heads
Routing type	Token-choice top-k	Simple, fast, differentiable via soft gates	Expert-choice (Zhou et al.) for balanced training
Top-k	2 (Mixtral, Grok), 4 (DBRX), 6-8 (DeepSeek, OlMoE)	k=2 robust; higher k improves quality, costs more FLOPs	k=1 (Switch, Llama 4) for compute efficiency
Softmax order	Mixtral/DSV3: topk then softmax	Gates sum to 1 over winners — cleaner mixing	Switch: softmax then topk
Load balancing	Auxiliary loss (Fedus 2022)	Prevents expert collapse, GPU efficiency	DSV3: bias-based aux-loss-free
Capacity factor	1.0–1.25	25% headroom absorbs minor imbalance	Fine-tuning may need C>1.5 for safety
Fine-grained	1/4 to 1/14 (DeepSeek)	More experts → finer specialization, same FLOPs	Simple models (Mixtral): ratio 1
Shared experts	0–2 (model-dependent)	Universal patterns; saves routing entropy	OlMoE: no benefit found in ablations
Router precision	Float32	Softmax numerics; prevents instability	Never use bf16 for routing
Z-loss	β = 0.001	Bounds logit magnitudes, stabilizes training	Optional but strongly recommended

Parameter Math Cheat Sheet

Total expert params = (N_experts + S) × d_model × d_{ff_expert} × 2 × n_{moe_layers}

Active params per token = (top-k + S) × d_model × d_{ff_expert} × 2 × n_{moe_layers} + dense_params

(The factor 2 counts the up- and down-projection matrices; use 3 for gated FFNs such as SwiGLU.)

Sparsity ratio = active_params / total_params ≈ (k + S) / (N_experts + S) when expert weights dominate

MoE FFN FLOPs per token ≈ (k + S) × expert_FFN_FLOPs + router_FLOPs

≈ (k + S) × expert_FFN_FLOPs (router cost is negligible)
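The parameter formulas can be wrapped in a small calculator. The sketch below counts shared experts in both totals and takes the number of matrices per expert as a parameter. The DeepSeek-V3 values used in the example (expert width 2048, 58 MoE layers, three-matrix gated FFN) come from the published model configuration as best I recall and should be treated as assumptions; `moe_param_counts` is an illustrative name.

```python
def moe_param_counts(n_routed, top_k, n_shared, d_model, d_ff_expert,
                     n_moe_layers, matrices_per_expert=2):
    """Return (total expert params, active expert params per token)."""
    per_expert = matrices_per_expert * d_model * d_ff_expert
    total = (n_routed + n_shared) * per_expert * n_moe_layers
    active = (top_k + n_shared) * per_expert * n_moe_layers
    return total, active


# Assumed DeepSeek-V3 configuration (see the note above).
total, active = moe_param_counts(
    n_routed=256, top_k=8, n_shared=1,
    d_model=7168, d_ff_expert=2048,
    n_moe_layers=58, matrices_per_expert=3,
)

reported_total = 671e9                  # total quoted in the text
dense_params = reported_total - total   # everything that is not an expert FFN
active_total = active + dense_params

print(f"total expert params:  {total / 1e9:.1f}B")
print(f"active expert params: {active / 1e9:.1f}B")
print(f"implied dense params: {dense_params / 1e9:.1f}B")
print(f"active per token:     {active_total / 1e9:.1f}B")
print(f"expert sparsity:      {active / total:.3f}  vs (k+S)/(N+S) = {9 / 257:.3f}")
```

Under these assumptions the active count comes out between 37B and 38B, consistent with the reported 37B.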

Training Recipe Checklist

Router in float32 — not bf16
Load-balance auxiliary loss with α = 0.001–0.01 (tune per model)
Router z-loss with β = 0.001 for logit magnitude control
Expert capacity factor C = 1.0–1.25 (raise if too many tokens dropped)
Monitor expert utilization histograms during training — collapse is silent otherwise
Consider upcycling from dense if you have a trained dense checkpoint
Fine-tuning: either use large SFT dataset or freeze MoE FFN weights

Go Deeper

CS336 Lec 3 — Architectures & Hyperparameters — The dense FFN baseline; SwiGLU activation that MoE experts use; GQA attention that pairs with MoE layers.
CS336 Lec 2 — PyTorch & Resource Accounting — The FLOPs counting framework (6ND rule, matmul cost) that lets you compare dense vs MoE compute budgets.
Transformer Architecture — The full self-attention block that wraps around these MoE FFN layers.

Key Papers

Jacobs et al. 1991. "Adaptive Mixtures of Local Experts." Neural Computation.
Shazeer et al. 2017. "Outrageously Large Neural Networks." ICLR 2017.
Fedus et al. 2022. "Switch Transformers." JMLR 2022.
Zoph et al. 2022. "ST-MoE: Designing Stable and Transferable Sparse Expert Models." arXiv 2202.08906.
Jiang et al. 2024. "Mixtral of Experts." arXiv 2401.04088.
Dai et al. 2024. "DeepSeekMoE." arXiv 2401.06066.
DeepSeek-AI 2024. "DeepSeek-V3 Technical Report." arXiv 2412.19437.
Muennighoff et al. 2024. "OlMoE: Open Mixture-of-Experts Language Models." arXiv 2409.02060.

Feynman test for this lecture. Can you explain — without notes — why MoE decouples parameter count from FLOPs per token, how the top-k softmax router works (both variants), why expert collapse happens and what the balance loss does about it, and what a capacity factor of 1.25 means concretely? If yes, you've internalized CS336 Lecture 4. If not, the MoE Showcase in Chapter 7 is the most efficient 10 minutes you can spend on this material.

DeepSeek-V3 uses "auxiliary-loss-free balancing." What is the mechanism, and what problem does it solve compared to standard auxiliary loss balancing?

It removes all load balancing, relying on random initialization to be uniform.
It uses RL instead of gradient descent to learn routing policies.
It maintains a per-expert bias updated online (not gradient-based), avoiding sensitivity to the α hyperparameter and not polluting the main loss signal.
It eliminates the router entirely, replacing it with a fixed hash function for stability.