Qwen 3's small MoE. 30B stored, only 3B active per token - runs at Qwen 2.5 3B latency but ships Qwen 3 14B quality. 128 experts, top-8, no shared expert.
Fine-grained Mixture of Experts (MoE): 128 experts, top-8 routing, NO shared expert. Only ~3B active per token. Grouped-Query Attention (GQA) 8:1 + QK-Norm. Head dim 128 with 32 Q heads means Q-projection is larger than hidden size.
Qwen 3 30B-A3B is the Qwen model that makes fine-grained Mixture of Experts (MoE) cheap. 30B stored, only ~3B active per token - 14B-class quality at the inference speed of a 3B dense model. Runs on a single 40GB GPU in 4-bit. Copies the DeepSeek V3 MoE playbook with one deliberate deviation: no shared expert.
30.5B total parameters, 3B active. 48 layers, hidden dim 2048, 32 Q heads sharing 4 key/value (KV) heads (Grouped-Query Attention (GQA) 8:1), QK-Norm on Q/K, 128 routed experts top-8 with no shared expert, tiny 768-dim Feed-Forward Network (FFN) per expert, 151,936 vocab, 32K native context. Apache 2.0.
Still needs all ~30B parameters resident in VRAM even though only ~3B are used for compute per token. Fewer inference engines handle fine-grained MoE efficiently than handle dense Llama-style models. No shared expert means a bad top-8 routing pick has nowhere to fall back. Fine-tuning a 128-expert MoE with LoRA is still open research in 2026.
Pick over Qwen 3 14B dense unless your inference stack cannot run MoE. Same or better quality, lower latency, only marginally more VRAM. Pick DeepSeek V3 if you have 8x H100. For single-GPU MoE in 2026, this is the sweet spot.
| Released | April 2025 |
|---|---|
| Organization | Alibaba Cloud / Qwen Team |
| License | Apache 2.0 |
| Parameters | 30.5B total · 8 of 128 experts active |
| Layers | 48 |
| Hidden dim (d_model) | 2,048 |
| Attention heads | 32 Q heads / 4 KV heads |
| Head dim | 128 |
| Block type | GQA attention + MoE FFN |
| QK-Norm | Yes |
| FFN intermediate | 768 |
| FFN activation | SwiGLU |
| Normalization | RMSNorm (pre) |
| Position encoding | RoPE |
| Max context | 32,768 tokens |
| Vocabulary | 151,936 |
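Those numbers are enough to reconstruct the headline parameter counts. A back-of-the-envelope check (assumes untied input/output embeddings and ignores the small RMSNorm weights):

```python
# Rough parameter count for Qwen 3 30B-A3B, using only the spec table above.
d_model, n_layers = 2048, 48
n_q_heads, n_kv_heads, head_dim = 32, 4, 128
n_experts, top_k, expert_ffn = 128, 8, 768
vocab = 151_936

embed = vocab * d_model                       # input embedding
lm_head = vocab * d_model                     # output projection (untied, assumed)
attn_per_layer = (
    d_model * n_q_heads * head_dim            # Q: 2048 -> 4096
    + 2 * d_model * n_kv_heads * head_dim     # K and V: 2048 -> 512 each
    + n_q_heads * head_dim * d_model          # O: 4096 -> 2048
)
router_per_layer = d_model * n_experts        # routing logits
expert_params = 3 * d_model * expert_ffn      # SwiGLU expert: gate, up, down

total = embed + lm_head + n_layers * (
    attn_per_layer + router_per_layer + n_experts * expert_params
)
active = lm_head + n_layers * (               # output head runs for every token
    attn_per_layer + router_per_layer + top_k * expert_params
)

print(f"total  ~{total / 1e9:.1f}B")          # ~30.5B
print(f"active ~{active / 1e9:.1f}B")         # ~3.0B
```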
128 routed experts, top-8 per token, tiny 768-dim FFN per expert. Same granularity idea as DeepSeek V3 (256 experts) at a fraction of the total parameters. ~3B active per token despite 30B stored.
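A minimal sketch of what top-8-of-128 routing does to a single token. Shapes follow the spec above; everything else (variable names, the exact router normalization) is illustrative, not the actual Qwen implementation:

```python
import torch
import torch.nn.functional as F

d_model, n_experts, top_k, expert_ffn = 2048, 128, 8, 768

router = torch.nn.Linear(d_model, n_experts, bias=False)
# One decoder layer's worth of tiny SwiGLU experts (instantiating all 128
# at fp32 allocates roughly 2.4 GB of RAM).
experts = torch.nn.ModuleList(
    torch.nn.ModuleDict({
        "gate": torch.nn.Linear(d_model, expert_ffn, bias=False),
        "up":   torch.nn.Linear(d_model, expert_ffn, bias=False),
        "down": torch.nn.Linear(expert_ffn, d_model, bias=False),
    })
    for _ in range(n_experts)
)

x = torch.randn(d_model)                           # one token's hidden state
weights, idx = torch.topk(F.softmax(router(x), dim=-1), top_k)
weights = weights / weights.sum()                  # renormalize over the top-8

out = torch.zeros(d_model)                         # no shared expert: this is the whole FFN output
for w, i in zip(weights, idx):
    e = experts[int(i)]
    out += w * e["down"](F.silu(e["gate"](x)) * e["up"](x))
```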
Every token is routed to 8 of 128 experts with nothing always-on. The Qwen team's reported ablations suggested the shared expert did not justify its cost at this scale. At Qwen 3 30B-A3B's expert width (768 FFN, 2048 hidden, SwiGLU, 48 layers), one shared expert would cost ~0.23B always-active parameters per token - the routing budget is spent on additional fine-grained experts instead.
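Where the ~0.23B figure comes from:

```python
# Cost of one hypothetical always-on shared expert in every layer,
# at 30B-A3B widths (SwiGLU = gate + up + down matrices).
d_model, expert_ffn, n_layers = 2048, 768, 48

per_expert = 3 * d_model * expert_ffn                 # ~4.7M params per expert
shared = per_expert * n_layers                        # one shared expert, all 48 layers
print(f"{shared / 1e9:.2f}B always-active")           # ~0.23B
print(f"{shared / 3e9:.0%} of the ~3B active budget") # ~8%
```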
First major open MoE to ship QK-Norm (Root Mean Square Normalization (RMSNorm) on Q and K before attention). Stabilizes attention at long context and enables higher learning rates. Pairs well with fine-grained routing, which can otherwise make training jittery.
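A minimal sketch of where QK-Norm sits: per-head RMSNorm over the 128-dim head, applied before RoPE. Class and variable names are illustrative, not Qwen's code:

```python
import torch

class RMSNorm(torch.nn.Module):
    """Plain RMSNorm: scale each vector by its root-mean-square."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

d_model, n_q, n_kv, head_dim = 2048, 32, 4, 128
q_proj = torch.nn.Linear(d_model, n_q * head_dim, bias=False)   # 2048 -> 4096
k_proj = torch.nn.Linear(d_model, n_kv * head_dim, bias=False)  # 2048 -> 512
q_norm, k_norm = RMSNorm(head_dim), RMSNorm(head_dim)           # QK-Norm: one norm per 128-dim head

x = torch.randn(1, 16, d_model)                                  # [batch, seq, hidden]
q = q_norm(q_proj(x).view(1, 16, n_q, head_dim))                 # normalize every Q head
k = k_norm(k_proj(x).view(1, 16, n_kv, head_dim))                # normalize every K head
# ... then RoPE and scaled dot-product attention, with each KV head serving 8 Q heads
```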
Hidden dim is 2048 but Q projection outputs 32 * 128 = 4096. Attention runs in a wider space than the residual stream. Unusual but lets the model keep the full 128-dim head size that Rotary Position Embeddings (RoPE) kernels are optimized for.
Why: DeepSeek V3 uses 256 experts + 1 shared. Qwen 3 30B-A3B uses 128 + 0. At this total-parameter budget, the shared expert's always-on compute was worth more spent on extra routed experts. Different scale, different sweet spot.
Why: Same aggressive ratio as Llama 3.1 70B. Small KV cache matters more here because MoE makes the compute side already cheap - keeping memory lean was the next lever (the KV-cache numbers are sketched below).
Why: Deep and narrow. Depth helps MoE models because each layer has its own 128-expert mixture, so more layers means more routing opportunities. Narrow hidden keeps per-expert FFN matrices small.
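To put numbers on the KV-cache point above (a quick BF16 estimate, not a measurement):

```python
# KV-cache size per token and at full context, assuming a BF16 cache (2 bytes/value).
n_layers, n_kv_heads, head_dim, ctx = 48, 4, 128, 32_768

per_token = 2 * n_kv_heads * head_dim * n_layers * 2       # K + V, every layer
print(per_token // 1024, "KiB per token")                  # 96 KiB
print(round(per_token * ctx / 2**30), "GiB at 32K context")        # ~3 GiB
print(round(per_token * 8 * ctx / 2**30), "GiB if all 32 heads kept K/V (no GQA)")  # ~24 GiB
```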
The MoE, unless your stack cannot run it. 30B-A3B matches or beats 14B dense on benchmarks, generates at roughly the speed of a 3B model, and costs only ~9GB more VRAM in 4-bit (~17GB vs ~9GB at Q4_K_M). The single case where 14B wins is inference engines without efficient MoE kernels - if you are running llama.cpp or an older TGI, the dense model is simpler.
Different scale budget. DeepSeek V3 is 671B total - a shared expert is rounding error. Qwen 3 30B-A3B has only 30B total, so a shared expert costs a meaningful slice of active compute. Qwen's ablations reportedly found that slice was better spent on additional routed experts. Neither approach is universally right.
Head dim 128 * 32 Q heads = 4096. Hidden dim is 2048, so the Q projection is a 2048 to 4096 linear layer. Attention runs in a wider space than the residual stream, then projects back to 2048 via the output layer. Unusual but it lets the model keep the standard 128-dim head that RoPE and FlashAttention kernels are optimized for.
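The shapes, spelled out (a sketch; layer names are illustrative):

```python
import torch

d_model, n_q, n_kv, head_dim = 2048, 32, 4, 128

q_proj = torch.nn.Linear(d_model, n_q * head_dim, bias=False)   # 2048 -> 4096, wider than hidden
k_proj = torch.nn.Linear(d_model, n_kv * head_dim, bias=False)  # 2048 -> 512
v_proj = torch.nn.Linear(d_model, n_kv * head_dim, bias=False)  # 2048 -> 512
o_proj = torch.nn.Linear(n_q * head_dim, d_model, bias=False)   # 4096 -> 2048, back to the residual

x = torch.randn(1, 8, d_model)                   # [batch, seq, hidden]
attn_out = torch.randn(1, 8, n_q * head_dim)     # stand-in for the concatenated head outputs
assert q_proj(x).shape == (1, 8, 4096)           # attention runs in this wider space
assert o_proj(attn_out).shape == (1, 8, d_model) # output projection returns to d_model
```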
BF16: ~61GB - single H100 80GB or 2x A100 40GB. 4-bit GGUF: ~17GB, fits on RTX 4090 24GB with room for context. This is the sweet spot model for single-GPU MoE in 2026. Latency is closer to a 3B dense model than a 30B one.
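Rough footprint math (the ~4.5 effective bits/weight for 4-bit GGUF is an assumption; real Q4_K_M files can run a GB or two larger, and KV cache comes on top):

```python
# Weight-memory footprint at different precisions (weights only).
params = 30.5e9

bf16 = params * 2 / 1e9          # 2 bytes per weight            -> ~61 GB
q4 = params * 4.5 / 8 / 1e9      # ~4.5 bits per weight (assumed) -> ~17 GB
print(f"BF16  ~{bf16:.0f} GB")
print(f"4-bit ~{q4:.0f} GB")
```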
Different weight classes. DeepSeek V3 is 671B total / 37B active - flagship. Qwen 3 30B-A3B is 30B / 3B - laptop-class MoE. Quality goes to DeepSeek V3 by a clear margin. Deployability goes to Qwen 3 30B-A3B by a much larger margin. Pick by hardware: if you have 8x H100 go DeepSeek V3; otherwise Qwen 3 30B-A3B.
Yes. The entire Qwen 3 family shares the thinking-mode toggle - special tokens switch the same weights between fast-response and extended-reasoning behavior. Works identically on the MoE variants and the dense ones.
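A sketch of driving the toggle through the Hugging Face tokenizer, assuming the `enable_thinking` chat-template flag documented in the Qwen 3 model cards (check the model card for the exact interface):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]

# Extended-reasoning mode: the model emits a <think>...</think> trace first.
thinking_prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Fast-response mode: same weights, no reasoning trace.
fast_prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```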