Mixture of Experts (MoE) has become the dominant approach for scaling language models beyond dense compute limits. The principle: not every token needs every parameter. By activating only a subset of the network for each input, MoE models can have trillions of parameters while using compute comparable to a much smaller dense model.
The Architecture
In a standard transformer, each layer has a feed-forward network (FFN) applied to every token. In MoE, this FFN is replaced by N expert FFNs and a gating network (router) that selects which experts to use:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    """Standard two-layer feed-forward block."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    def __init__(self, d_model, d_ff, n_experts, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            FFN(d_model, d_ff) for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        gate_logits = self.gate(x)                     # (tokens, n_experts)
        weights, indices = torch.topk(gate_logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # renormalize over the top-k
        output = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (indices == i).any(dim=-1)          # tokens that selected expert i
            if mask.any():
                expert_out = expert(x[mask])
                w = weights[indices == i]              # matching gate weights, same order
                output[mask] += w.unsqueeze(-1) * expert_out
        return output
```

Mixtral 8x7B, for example, has 8 expert FFNs per layer with top-2 routing. Total parameters: ~47B. Active parameters per token: ~13B. You get 47B of capacity for roughly 13B of compute cost.
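The parameter accounting behind those numbers is easy to reproduce. A back-of-envelope sketch, using Mixtral-like dimensions (d_model=4096, d_ff=14336, a SwiGLU-style FFN with three projections; these are illustrative assumptions, not the exact released config):

```python
# Rough parameter count for the expert FFNs of an 8x7B-style MoE.
# All dimensions here are assumed for illustration.
d_model, d_ff, n_experts, top_k, n_layers = 4096, 14336, 8, 2, 32

params_per_expert = 3 * d_model * d_ff          # gate, up, and down projections
total_per_layer = n_experts * params_per_expert
active_per_layer = top_k * params_per_expert

print(f"expert params, all layers:  {n_layers * total_per_layer / 1e9:.1f}B")
print(f"active expert params/token: {n_layers * active_per_layer / 1e9:.1f}B")
```

This yields roughly 45B total and 11B active expert parameters; shared attention and embedding weights account for the rest of the ~47B / ~13B figures.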
The Routing Problem
The gating function determines everything. Naive top-k routing has a critical failure mode: expert collapse. Early in training, some experts randomly receive more useful gradients, become better, attract more tokens, and dominate. Other experts wither from disuse.
Load Balancing
To prevent collapse, MoE models add an auxiliary loss:
L_balance = α * N * Σ(f_i * P_i)
where:
f_i = fraction of tokens routed to expert i
P_i = average routing probability for expert i
N = number of experts
α = balance coefficient (typically 0.01)

This loss is minimized when all experts receive equal traffic. Switch Transformer (Fedus et al., 2022) simplified routing to top-1 selection with a capacity factor: if an expert's buffer is full, overflow tokens skip expert processing via the residual connection.
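The balance loss can be computed directly from the router outputs. A minimal sketch (the function name and tensor layout are my own, not from a particular library):

```python
import torch
import torch.nn.functional as F

def load_balance_loss(gate_logits, expert_indices, n_experts, alpha=0.01):
    """Auxiliary balance loss: alpha * N * sum_i(f_i * P_i).

    gate_logits:    (n_tokens, n_experts) raw router scores
    expert_indices: (n_tokens, top_k) experts selected per token
    """
    probs = F.softmax(gate_logits, dim=-1)
    P = probs.mean(dim=0)                        # avg routing probability per expert
    one_hot = F.one_hot(expert_indices, n_experts).float()
    f = one_hot.sum(dim=1).mean(dim=0)           # fraction of tokens hitting each expert
    return alpha * n_experts * torch.sum(f * P)
```

With top-1 routing and perfectly uniform traffic, f_i = P_i = 1/N, and the loss reduces to its minimum value of α.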
Expert Specialization
Do experts actually specialize? Research shows mixed results:
- In vision MoE models, experts often specialize by spatial frequency or object type
- In language models, specialization is less clear-cut — experts may handle different token positions or syntactic structures
- Some studies find specialization is mostly about load distribution rather than semantic meaning
Architectures like Soft MoE replace hard top-k routing with soft weighted combinations of all experts, while expert-choice routing flips the paradigm: each expert selects its top-k tokens, rather than each token selecting its experts.
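Expert-choice routing can be sketched in a few lines. The function below is a simplified illustration under my own naming and tensor layout, with a fixed per-expert capacity:

```python
import torch

def expert_choice_route(x, gate_weight, capacity):
    """Each expert picks its `capacity` highest-scoring tokens.

    x:           (n_tokens, d_model)
    gate_weight: (d_model, n_experts)
    Returns combine weights and chosen token indices,
    both of shape (capacity, n_experts).
    """
    scores = torch.softmax(x @ gate_weight, dim=-1)           # (n_tokens, n_experts)
    weights, token_idx = torch.topk(scores, capacity, dim=0)  # experts choose tokens
    return weights, token_idx
```

Note the trade-off this makes explicit: load balance is perfect by construction (every expert processes exactly `capacity` tokens, so no auxiliary loss is needed), but some tokens may be selected by no expert at all.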
Practical Challenges
Memory: All expert parameters must reside in memory even though only a fraction is active. An 8x7B MoE model needs memory for all 47B parameters.
Communication: In distributed training, tokens must be dispatched across devices via all-to-all communication. This can become the bottleneck at scale.
Inference: Different tokens in a batch may activate different experts, creating irregular compute patterns. Expert parallelism helps but serving remains more complex than dense models.
The Scaling Argument
Dense scaling hits diminishing returns. MoE breaks this by decoupling total knowledge capacity (parameter count) from per-token compute cost (active parameters).
DeepSeek-V3 demonstrated this: a 671B-parameter MoE model with 37B active parameters that competes with dense models spending several times more compute per token. The trend is clear: future frontier models will likely all use some form of sparse architecture.
The open question is whether MoE is the optimal form of sparsity, or if architectures with more dynamic computation paths (early-exit, adaptive compute) will prove superior.