Qwen 3's flagship. 235B stored, ~22B active per token. The Apache-licensed answer to DeepSeek V3 at roughly one-third the total parameters.
Qwen 3 flagship Mixture of Experts (MoE). 235B stored, 22B active. 94 layers, Grouped-Query Attention (GQA) 16:1 (64 Q / 4 key/value (KV) heads), 128 experts top-8, no shared expert, QK-Norm. Apache 2.0.
Alibaba's answer to DeepSeek V3: same fine-grained Mixture of Experts (MoE) recipe, one-third the total parameters, Apache 2.0 license instead of a custom agreement. Trades blows with DeepSeek V3 on benchmarks. Qwen edges ahead on code and multilingual; DeepSeek edges ahead on reasoning.
235B total parameters, 22B active. 94 layers (deeper than any other model in this catalog), hidden dim 4096, 64 Q heads sharing 4 key/value (KV) heads (Grouped-Query Attention (GQA) 16:1, most aggressive ratio shipped), QK-Norm on Q/K, 128 routed experts + 0 shared, top-8 routing, expert Feed-Forward Network (FFN) 1536. Apache 2.0.
Needs ~470GB in BF16 - a multi-H100 node or heavy quantization. 94 layers means high sequential depth, so latency is closer to a dense 70B than to 22B. Native 32K context is shorter than DeepSeek V3's 128K. Fine-grained MoE fine-tuning still requires specialized frameworks.
The default pick at this quality tier when Apache 2.0 matters - the most permissive license at the 200B+ scale. Pick DeepSeek V3 for pure quality on reasoning and longer native context. If licensing is neutral, pick by fine-tune ecosystem fit.
| Released | April 2025 |
|---|---|
| Organization | Alibaba Cloud / Qwen Team |
| License | Apache 2.0 |
| Parameters | 235B total · 22B active (8 of 128 experts per token) |
| Layers | 94 |
| Hidden dim (d_model) | 4,096 |
| Attention heads | 64 Q heads / 4 KV heads |
| Head dim | 128 |
| Attention type | GQA + MoE FFN |
| QK-Norm | Yes |
| FFN intermediate | 1,536 (per expert) |
| FFN activation | SwiGLU |
| Normalization | RMSNorm (pre) |
| Position encoding | RoPE |
| Max context | 32,768 tokens |
| Vocabulary | 151,936 |
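As a sanity check on the 235B/22B headline numbers, here is a rough parameter count rebuilt from the dimensions in the table. It is a back-of-the-envelope sketch: it ignores norm weights and biases and assumes untied input/output embeddings, so it only approximates the released checkpoint.

```python
# Approximate parameter count from the spec table (norms/biases ignored).
hidden     = 4096      # d_model
head_dim   = 128
n_q_heads  = 64
n_kv_heads = 4
n_layers   = 94
n_experts  = 128
top_k      = 8
ffn_dim    = 1536      # per-expert intermediate size
vocab      = 151_936

# Attention projections per layer (Q, K, V, O); GQA means only 4 KV heads.
attn  = hidden * (n_q_heads * head_dim)        # Q
attn += 2 * hidden * (n_kv_heads * head_dim)   # K and V
attn += (n_q_heads * head_dim) * hidden        # O
router = hidden * n_experts                    # routing linear layer
expert = 3 * hidden * ffn_dim                  # gate/up/down (SwiGLU) per expert

per_layer_total  = attn + router + n_experts * expert
per_layer_active = attn + router + top_k * expert
embeddings = 2 * vocab * hidden                # input + output embeddings (untied)

total  = n_layers * per_layer_total  + embeddings
active = n_layers * per_layer_active + embeddings
print(f"total  ~ {total/1e9:.0f}B")   # ~235B
print(f"active ~ {active/1e9:.0f}B")  # ~22B
```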
Deeper than any other model in this catalog. At fixed total parameters, deeper-narrower MoE lets more layers participate in expert routing. Each token gets 94 separate top-8 routing decisions.
64 query heads share 4 KV heads. Halves per-layer KV cache vs Llama 3.1 70B's 8:1; ~40% smaller total per token even with 14 more layers. Makes batch-serving cheap enough that MoE compute becomes the bottleneck, not memory bandwidth.
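A quick way to see the ~40% figure: per-token KV cache scales with layers × KV heads × head dim. A minimal BF16 sketch, taking Llama 3.1 70B's public dimensions (80 layers, 8 KV heads, head dim 128):

```python
# Per-token KV cache: 2 tensors (K and V) per layer, each kv_heads * head_dim wide.
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_val=2):  # BF16 = 2 bytes
    return 2 * layers * kv_heads * head_dim * bytes_per_val

qwen3_235b  = kv_bytes_per_token(layers=94, kv_heads=4, head_dim=128)
llama31_70b = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

print(f"Qwen3-235B-A22B: {qwen3_235b/1024:.0f} KiB/token")   # ~188 KiB
print(f"Llama 3.1 70B:   {llama31_70b/1024:.0f} KiB/token")  # ~320 KiB
print(f"reduction: {1 - qwen3_235b/llama31_70b:.0%}")         # ~41% smaller
```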
Qwen 3 kept the expert count constant at 128 between the 30B and 235B variants. Scaling went into depth (48 to 94 layers) and expert FFN width (768 to 1536), not more experts. Fewer experts than DeepSeek V3's 256.
The no-fine-print option among large-scale frontier-quality MoE models. DeepSeek V3 uses a custom license with commercial clauses. Llama 3.1 405B uses the Community License (700M MAU cap). Grok-1 (314B) was Apache 2.0 in 2024 but is base-only and abandoned, so for production use Qwen 3 235B-A22B is the cleanest 200B+ Apache-licensed model in 2026.
Why: Hidden 4096 is modest for a 235B model (DeepSeek V3 uses 7168). Qwen bet depth over width: more sequential transformer blocks means more routing decisions and more residual-stream refinement per token.
Why: Coarser routing, but 22B active per token (DeepSeek V3 uses 37B active). Qwen prioritized inference cost per token over peak routing granularity. Fewer experts = smaller router, simpler load balancing.
Why: At 94 layers and 235B, the absolute KV cache would be huge with normal ratios. 16:1 brings it back to serving-friendly. Qwen validated empirically that quality holds at this ratio when QK-Norm is present.
License, not architecture. Both are great. Qwen 3 is Apache 2.0, DeepSeek uses a custom license with commercial clauses. For enterprises that need clean legal, Qwen 3 235B-A22B wins automatically. For pure quality, DeepSeek V3 is slightly ahead on reasoning and has 128K native context. On code and multilingual, Qwen often edges ahead.
Qwen 3 spent parameters on depth instead of width. Hidden dim is 4096 (modest for 235B), but 94 layers gives 94 separate 128-expert routing decisions per token. More opportunities to refine the residual stream and pick specialized experts. The tradeoff is per-token latency.
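For concreteness, a minimal sketch of one such top-8 routing decision over 128 experts, in the standard top-k MoE style. The expert MLPs here are simplified two-layer stand-ins rather than the gated SwiGLU blocks in the real model, and the renormalization of the top-k weights is an assumption about implementation detail, not a claim about Qwen's exact code.

```python
import torch
import torch.nn.functional as F

hidden, n_experts, top_k, ffn_dim = 4096, 128, 8, 1536

router = torch.nn.Linear(hidden, n_experts, bias=False)   # scores all 128 experts
experts = torch.nn.ModuleList(
    torch.nn.Sequential(torch.nn.Linear(hidden, ffn_dim), torch.nn.SiLU(),
                        torch.nn.Linear(ffn_dim, hidden))
    for _ in range(n_experts)
)

def moe_layer(x):                                  # x: (hidden,) for one token
    probs = F.softmax(router(x), dim=-1)           # score every expert
    weights, idx = torch.topk(probs, top_k)        # keep the best 8
    weights = weights / weights.sum()              # renormalize over the top-8
    return sum(w * experts[i](x) for w, i in zip(weights, idx.tolist()))

token = torch.randn(hidden)
out = moe_layer(token)    # one of 94 such routing decisions per token
```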
BF16: ~470GB - 8x H100 80GB minimum. FP8: ~235GB, fits on 4x H100. INT4: ~120GB, runs on 2x H100 or 4x A100 40GB. Per-token compute is only 22B active, so latency is surprisingly reasonable - batch throughput is where this model shines.
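Those footprints follow directly from 235B parameters times bytes per parameter; the sketch below counts weights only, so KV cache, activations, and runtime overhead come on top.

```python
# Weight-only memory footprint per precision (no KV cache or activations).
params = 235e9
for name, bytes_per_param in [("BF16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB weights")
# BF16: ~470 GB -> 8x H100 80GB (640 GB), leaving headroom for KV cache
# FP8:  ~235 GB -> 4x H100 (320 GB)
# INT4: ~118 GB -> 2x H100 (160 GB) or 4x A100 40GB (160 GB), tight
```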
Qwen wins on quality per compute dollar. 405B is fully dense so every token activates 405B. Qwen 3 235B-A22B activates only 22B per token - roughly 18x cheaper inference at similar or better quality. The 405B still has an ecosystem advantage (more fine-tunes, more tutorials) but the raw economics favor the MoE.
Fewer experts = coarser routing but lower active compute. Qwen 3 235B-A22B activates 22B per token; DeepSeek V3 activates 37B. Qwen optimized for cheaper-per-token serving; DeepSeek optimized for peak routing granularity. Two valid approaches.
Yes, and arguably this is where thinking mode pays off most. The 235B model has enough capacity to run genuinely useful chains of thought. On math and competition reasoning, thinking-mode Qwen 3 235B-A22B is competitive with OpenAI o1-mini and DeepSeek R1.