Qwen 3's small MoE. 30B stored, only 3B active per token - runs at Qwen 2.5 3B latency but ships Qwen 3 14B quality. 128 experts, top-8, no shared expert.
Fine-grained Mixture of Experts (MoE): 128 experts, top-8 routing, NO shared expert. Only ~3B active per token. Grouped-Query Attention (GQA) 8:1 + QK-Norm. Head dim 128 with 32 Q heads means Q-projection is larger than hidden size.
Qwen 3 30B-A3B is the Qwen model that makes fine-grained Mixture of Experts (MoE) cheap. 30B stored, only ~3B active per token - 14B-class quality at the inference speed of a 3B dense model. Runs on a single 40GB GPU in 4-bit. Copies the DeepSeek V3 MoE playbook with one deliberate deviation: no shared expert.
30.5B total parameters, 3B active. 48 layers, hidden dim 2048, 32 Q heads sharing 4 key/value (KV) heads (Grouped-Query Attention (GQA) 8:1), QK-Norm on Q/K, 128 routed experts top-8 with no shared expert, tiny 768-dim Feed-Forward Network (FFN) per expert, 151,936 vocab, 32K native context. Apache 2.0.
Still needs all ~30B parameters resident in VRAM even though only ~3B are used for compute per token. Fewer inference engines handle fine-grained MoE efficiently than handle dense Llama-style models. No shared expert means a bad top-8 routing pick has nowhere to fall back. Fine-tuning a 128-expert MoE with LoRA is still open research in 2026.
Pick over Qwen 3 14B dense unless your inference stack cannot run MoE. Same or better quality, lower latency, only marginally more VRAM. Pick DeepSeek V3 if you have 8x H100. For single-GPU MoE in 2026, this is the sweet spot.
| Released | April 2025 |
|---|---|
| Organization | Alibaba Cloud / Qwen Team |
| License | Apache 2.0 |
| Parameters | 30.5B total · 8 of 128 experts active |
| Layers | 48 |
| Hidden dim (d_model) | 2,048 |
| Attention heads | 32 Q heads / 4 KV heads |
| Head dim | 128 |
| Block type | GQA attention + MoE FFN |
| QK-Norm | Yes |
| FFN intermediate | 768 |
| FFN activation | SwiGLU |
| Normalization | RMSNorm (pre) |
| Position encoding | RoPE |
| Max context | 32,768 tokens |
| Vocabulary | 151,936 |
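Those numbers are enough to reconstruct the headline parameter counts. A back-of-the-envelope check (assumes untied input/output embeddings and ignores the small RMSNorm weights):

```python
# Rough parameter count for Qwen 3 30B-A3B, using only the spec table above.
d_model, n_layers = 2048, 48
n_q_heads, n_kv_heads, head_dim = 32, 4, 128
n_experts, top_k, expert_ffn = 128, 8, 768
vocab = 151_936

embed = vocab * d_model                       # input embedding
lm_head = vocab * d_model                     # output projection (untied, assumed)
attn_per_layer = (
    d_model * n_q_heads * head_dim            # Q: 2048 -> 4096
    + 2 * d_model * n_kv_heads * head_dim     # K and V: 2048 -> 512 each
    + n_q_heads * head_dim * d_model          # O: 4096 -> 2048
)
router_per_layer = d_model * n_experts        # routing logits
expert_params = 3 * d_model * expert_ffn      # SwiGLU expert: gate, up, down

total = embed + lm_head + n_layers * (
    attn_per_layer + router_per_layer + n_experts * expert_params
)
active = lm_head + n_layers * (               # output head runs for every token
    attn_per_layer + router_per_layer + top_k * expert_params
)

print(f"total  ~{total / 1e9:.1f}B")          # ~30.5B
print(f"active ~{active / 1e9:.1f}B")         # ~3.0B
```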
128 routed experts, top-8 per token, tiny 768-dim FFN per expert. Same granularity idea as DeepSeek V3 (256 experts) at a fraction of the total parameters. ~3B active per token despite 30B stored.
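A minimal sketch of what top-8-of-128 routing does to a single token. Shapes follow the spec above; everything else (variable names, the exact router normalization) is illustrative, not the actual Qwen implementation:

```python
import torch
import torch.nn.functional as F

d_model, n_experts, top_k, expert_ffn = 2048, 128, 8, 768

router = torch.nn.Linear(d_model, n_experts, bias=False)
# One decoder layer's worth of tiny SwiGLU experts (instantiating all 128
# at fp32 allocates roughly 2.4 GB of RAM).
experts = torch.nn.ModuleList(
    torch.nn.ModuleDict({
        "gate": torch.nn.Linear(d_model, expert_ffn, bias=False),
        "up":   torch.nn.Linear(d_model, expert_ffn, bias=False),
        "down": torch.nn.Linear(expert_ffn, d_model, bias=False),
    })
    for _ in range(n_experts)
)

x = torch.randn(d_model)                           # one token's hidden state
weights, idx = torch.topk(F.softmax(router(x), dim=-1), top_k)
weights = weights / weights.sum()                  # renormalize over the top-8

out = torch.zeros(d_model)                         # no shared expert: this is the whole FFN output
for w, i in zip(weights, idx):
    e = experts[int(i)]
    out += w * e["down"](F.silu(e["gate"](x)) * e["up"](x))
```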
Every token is routed to 8 of 128 experts with nothing always-on. The Qwen team's reported ablations suggested the shared expert did not justify its cost at this scale. At Qwen 3 30B-A3B's expert width (768 FFN, 2048 hidden, SwiGLU, 48 layers), one shared expert would cost ~0.23B always-active parameters per token - the routing budget is spent on additional fine-grained experts instead.
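Where the ~0.23B figure comes from:

```python
# Cost of one hypothetical always-on shared expert in every layer,
# at 30B-A3B widths (SwiGLU = gate + up + down matrices).
d_model, expert_ffn, n_layers = 2048, 768, 48

per_expert = 3 * d_model * expert_ffn                 # ~4.7M params per expert
shared = per_expert * n_layers                        # one shared expert, all 48 layers
print(f"{shared / 1e9:.2f}B always-active")           # ~0.23B
print(f"{shared / 3e9:.0%} of the ~3B active budget") # ~8%
```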
First major open MoE to ship QK-Norm (Root Mean Square Normalization (RMSNorm) on Q and K before attention). Stabilizes attention at long context and enables higher learning rates. Pairs well with fine-grained routing, which can otherwise make training jittery.
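A minimal sketch of where QK-Norm sits: per-head RMSNorm over the 128-dim head, applied before RoPE. Class and variable names are illustrative, not Qwen's code:

```python
import torch

class RMSNorm(torch.nn.Module):
    """Plain RMSNorm: scale each vector by its root-mean-square."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

d_model, n_q, n_kv, head_dim = 2048, 32, 4, 128
q_proj = torch.nn.Linear(d_model, n_q * head_dim, bias=False)   # 2048 -> 4096
k_proj = torch.nn.Linear(d_model, n_kv * head_dim, bias=False)  # 2048 -> 512
q_norm, k_norm = RMSNorm(head_dim), RMSNorm(head_dim)           # QK-Norm: one norm per 128-dim head

x = torch.randn(1, 16, d_model)                                  # [batch, seq, hidden]
q = q_norm(q_proj(x).view(1, 16, n_q, head_dim))                 # normalize every Q head
k = k_norm(k_proj(x).view(1, 16, n_kv, head_dim))                # normalize every K head
# ... then RoPE and scaled dot-product attention, with each KV head serving 8 Q heads
```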
Hidden dim is 2048 but Q projection outputs 32 * 128 = 4096. Attention runs in a wider space than the residual stream. Unusual but lets the model keep the full 128-dim head size that Rotary Position Embeddings (RoPE) kernels are optimized for.
Why: DeepSeek V3 uses 256 experts + 1 shared. Qwen 3 30B-A3B uses 128 + 0. At this total-parameter budget, the shared expert's always-on compute was worth more spent on extra routed experts. Different scale, different sweet spot.
Why: Same aggressive ratio as Llama 3.1 70B. Small KV cache matters more here because MoE makes the compute side already cheap - keeping memory lean was the next lever (the KV-cache numbers are sketched below).
Why: Deep and narrow. Depth helps MoE models because each layer has its own 128-expert mixture, so more layers means more routing opportunities. Narrow hidden keeps per-expert FFN matrices small.
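To put numbers on the KV-cache point above (a quick BF16 estimate, not a measurement):

```python
# KV-cache size per token and at full context, assuming a BF16 cache (2 bytes/value).
n_layers, n_kv_heads, head_dim, ctx = 48, 4, 128, 32_768

per_token = 2 * n_kv_heads * head_dim * n_layers * 2       # K + V, every layer
print(per_token // 1024, "KiB per token")                  # 96 KiB
print(round(per_token * ctx / 2**30), "GiB at 32K context")        # ~3 GiB
print(round(per_token * 8 * ctx / 2**30), "GiB if all 32 heads kept K/V (no GQA)")  # ~24 GiB
```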
The MoE, unless your stack cannot run it. 30B-A3B matches or beats 14B dense on benchmarks, generates at roughly the speed of a 3B model, and costs only ~9GB more VRAM in 4-bit (~17GB vs ~9GB at Q4_K_M). The single case where 14B wins is inference engines without efficient MoE kernels - if you are running llama.cpp or an older TGI, the dense model is simpler.
Different scale budget. DeepSeek V3 is 671B total - a shared expert is rounding error. Qwen 3 30B-A3B has only 30B total, so a shared expert costs a meaningful slice of active compute. Qwen's ablations reportedly found that slice was better spent on additional routed experts. Neither approach is universally right.
Head dim 128 * 32 Q heads = 4096. Hidden dim is 2048, so the Q projection is a 2048 to 4096 linear layer. Attention runs in a wider space than the residual stream, then projects back to 2048 via the output layer. Unusual but it lets the model keep the standard 128-dim head that RoPE and FlashAttention kernels are optimized for.
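The shapes, spelled out (a sketch; layer names are illustrative):

```python
import torch

d_model, n_q, n_kv, head_dim = 2048, 32, 4, 128

q_proj = torch.nn.Linear(d_model, n_q * head_dim, bias=False)   # 2048 -> 4096, wider than hidden
k_proj = torch.nn.Linear(d_model, n_kv * head_dim, bias=False)  # 2048 -> 512
v_proj = torch.nn.Linear(d_model, n_kv * head_dim, bias=False)  # 2048 -> 512
o_proj = torch.nn.Linear(n_q * head_dim, d_model, bias=False)   # 4096 -> 2048, back to the residual

x = torch.randn(1, 8, d_model)                   # [batch, seq, hidden]
attn_out = torch.randn(1, 8, n_q * head_dim)     # stand-in for the concatenated head outputs
assert q_proj(x).shape == (1, 8, 4096)           # attention runs in this wider space
assert o_proj(attn_out).shape == (1, 8, d_model) # output projection returns to d_model
```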
BF16: ~61GB - single H100 80GB or 2x A100 40GB. 4-bit GGUF: ~17GB, fits on RTX 4090 24GB with room for context. This is the sweet spot model for single-GPU MoE in 2026. Latency is closer to a 3B dense model than a 30B one.
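Rough footprint math (the ~4.5 effective bits/weight for 4-bit GGUF is an assumption; real Q4_K_M files can run a GB or two larger, and KV cache comes on top):

```python
# Weight-memory footprint at different precisions (weights only).
params = 30.5e9

bf16 = params * 2 / 1e9          # 2 bytes per weight            -> ~61 GB
q4 = params * 4.5 / 8 / 1e9      # ~4.5 bits per weight (assumed) -> ~17 GB
print(f"BF16  ~{bf16:.0f} GB")
print(f"4-bit ~{q4:.0f} GB")
```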
Different weight classes. DeepSeek V3 is 671B total / 37B active - flagship. Qwen 3 30B-A3B is 30B / 3B - laptop-class MoE. Quality goes to DeepSeek V3 by a clear margin. Deployability goes to Qwen 3 30B-A3B by a much larger margin. Pick by hardware: if you have 8x H100 go DeepSeek V3; otherwise Qwen 3 30B-A3B.
Yes. The entire Qwen 3 family shares the thinking-mode toggle - special tokens switch the same weights between fast-response and extended-reasoning behavior. Works identically on the MoE variants and the dense ones.
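A sketch of driving the toggle through the Hugging Face tokenizer, assuming the `enable_thinking` chat-template flag documented in the Qwen 3 model cards (check the model card for the exact interface):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]

# Extended-reasoning mode: the model emits a <think>...</think> trace first.
thinking_prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Fast-response mode: same weights, no reasoning trace.
fast_prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```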