Qwen 3's flagship. 235B stored, ~22B active per token. The Apache-licensed answer to DeepSeek V3 at roughly one-third the total parameters.
Qwen 3 flagship Mixture of Experts (MoE). 235B stored, 22B active. 94 layers, Grouped-Query Attention (GQA) 16:1 (64 Q / 4 key/value (KV) heads), 128 experts top-8, no shared expert, QK-Norm. Apache 2.0.
Alibaba's answer to DeepSeek V3: same fine-grained Mixture of Experts (MoE) recipe, one-third the total parameters, Apache 2.0 license instead of a custom agreement. Trades blows with DeepSeek V3 on benchmarks. Qwen edges ahead on code and multilingual; DeepSeek edges ahead on reasoning.
235B total parameters, 22B active. 94 layers (deeper than any other model in this catalog), hidden dim 4096, 64 Q heads sharing 4 key/value (KV) heads (Grouped-Query Attention (GQA) 16:1, most aggressive ratio shipped), QK-Norm on Q/K, 128 routed experts + 0 shared, top-8 routing, expert Feed-Forward Network (FFN) 1536. Apache 2.0.
Needs ~470GB in BF16 - a multi-H100 node or heavy quantization. 94 layers means high sequential depth, so latency is closer to a dense 70B than to 22B. Native 32K context is shorter than DeepSeek V3's 128K. Fine-grained MoE fine-tuning still requires specialized frameworks.
The default pick at this quality tier when Apache 2.0 matters - the most permissive license at the 200B+ scale. Pick DeepSeek V3 for pure quality on reasoning and longer native context. If licensing is neutral, pick by fine-tune ecosystem fit.
| Released | April 2025 |
|---|---|
| Organization | Alibaba Cloud / Qwen Team |
| License | Apache 2.0 |
| Parameters | 235B total · 22B active (8 of 128 experts per token) |
| Layers | 94 |
| Hidden dim (d_model) | 4,096 |
| Attention heads | 64 Q heads / 4 KV heads |
| Head dim | 128 |
| Attention type | GQA + MoE FFN |
| QK-Norm | Yes |
| FFN intermediate | 1,536 (per expert) |
| FFN activation | SwiGLU |
| Normalization | RMSNorm (pre) |
| Position encoding | RoPE |
| Max context | 32,768 tokens |
| Vocabulary | 151,936 |
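As a sanity check on the 235B/22B headline numbers, here is a rough parameter count rebuilt from the dimensions in the table. It is a back-of-the-envelope sketch: it ignores norm weights and biases and assumes untied input/output embeddings, so it only approximates the released checkpoint.

```python
# Approximate parameter count from the spec table (norms/biases ignored).
hidden     = 4096      # d_model
head_dim   = 128
n_q_heads  = 64
n_kv_heads = 4
n_layers   = 94
n_experts  = 128
top_k      = 8
ffn_dim    = 1536      # per-expert intermediate size
vocab      = 151_936

# Attention projections per layer (Q, K, V, O); GQA means only 4 KV heads.
attn  = hidden * (n_q_heads * head_dim)        # Q
attn += 2 * hidden * (n_kv_heads * head_dim)   # K and V
attn += (n_q_heads * head_dim) * hidden        # O
router = hidden * n_experts                    # routing linear layer
expert = 3 * hidden * ffn_dim                  # gate/up/down (SwiGLU) per expert

per_layer_total  = attn + router + n_experts * expert
per_layer_active = attn + router + top_k * expert
embeddings = 2 * vocab * hidden                # input + output embeddings (untied)

total  = n_layers * per_layer_total  + embeddings
active = n_layers * per_layer_active + embeddings
print(f"total  ~ {total/1e9:.0f}B")   # ~235B
print(f"active ~ {active/1e9:.0f}B")  # ~22B
```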
Deeper than any other model in this catalog. At fixed total parameters, deeper-narrower MoE lets more layers participate in expert routing. Each token gets 94 separate top-8 routing decisions.
64 query heads share 4 KV heads. Halves per-layer KV cache vs Llama 3.1 70B's 8:1; ~40% smaller total per token even with 14 more layers. Makes batch-serving cheap enough that MoE compute becomes the bottleneck, not memory bandwidth.
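A quick way to see the ~40% figure: per-token KV cache scales with layers × KV heads × head dim. A minimal BF16 sketch, taking Llama 3.1 70B's public dimensions (80 layers, 8 KV heads, head dim 128):

```python
# Per-token KV cache: 2 tensors (K and V) per layer, each kv_heads * head_dim wide.
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_val=2):  # BF16 = 2 bytes
    return 2 * layers * kv_heads * head_dim * bytes_per_val

qwen3_235b  = kv_bytes_per_token(layers=94, kv_heads=4, head_dim=128)
llama31_70b = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

print(f"Qwen3-235B-A22B: {qwen3_235b/1024:.0f} KiB/token")   # ~188 KiB
print(f"Llama 3.1 70B:   {llama31_70b/1024:.0f} KiB/token")  # ~320 KiB
print(f"reduction: {1 - qwen3_235b/llama31_70b:.0%}")         # ~41% smaller
```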
Qwen 3 kept the expert count constant at 128 between the 30B and 235B variants. Scaling went into depth (48 to 94 layers) and expert FFN width (768 to 1536), not more experts. Fewer experts than DeepSeek V3's 256.
The no-fine-print option among large-scale frontier-quality MoE models. DeepSeek V3 uses a custom license with commercial clauses. Llama 3.1 405B uses the Community License (700M MAU cap). Grok-1 (314B) was Apache 2.0 in 2024 but is base-only and abandoned, so for production use Qwen 3 235B-A22B is the cleanest 200B+ Apache-licensed model in 2026.
Why: Hidden 4096 is modest for a 235B model (DeepSeek V3 uses 7168). Qwen bet depth over width: more sequential transformer blocks means more routing decisions and more residual-stream refinement per token.
Why: Coarser routing, but 22B active per token (DeepSeek V3 uses 37B active). Qwen prioritized inference cost per token over peak routing granularity. Fewer experts = smaller router, simpler load balancing.
Why: At 94 layers and 235B, the absolute KV cache would be huge with normal ratios. 16:1 brings it back to serving-friendly. Qwen validated empirically that quality holds at this ratio when QK-Norm is present.
License, not architecture. Both are great. Qwen 3 is Apache 2.0, DeepSeek uses a custom license with commercial clauses. For enterprises that need clean legal, Qwen 3 235B-A22B wins automatically. For pure quality, DeepSeek V3 is slightly ahead on reasoning and has 128K native context. On code and multilingual, Qwen often edges ahead.
Qwen 3 spent parameters on depth instead of width. Hidden dim is 4096 (modest for 235B), but 94 layers gives 94 separate 128-expert routing decisions per token. More opportunities to refine the residual stream and pick specialized experts. The tradeoff is per-token latency.
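For concreteness, a minimal sketch of one such top-8 routing decision over 128 experts, in the standard top-k MoE style. The expert MLPs here are simplified two-layer stand-ins rather than the gated SwiGLU blocks in the real model, and the renormalization of the top-k weights is an assumption about implementation detail, not a claim about Qwen's exact code.

```python
import torch
import torch.nn.functional as F

hidden, n_experts, top_k, ffn_dim = 4096, 128, 8, 1536

router = torch.nn.Linear(hidden, n_experts, bias=False)   # scores all 128 experts
experts = torch.nn.ModuleList(
    torch.nn.Sequential(torch.nn.Linear(hidden, ffn_dim), torch.nn.SiLU(),
                        torch.nn.Linear(ffn_dim, hidden))
    for _ in range(n_experts)
)

def moe_layer(x):                                  # x: (hidden,) for one token
    probs = F.softmax(router(x), dim=-1)           # score every expert
    weights, idx = torch.topk(probs, top_k)        # keep the best 8
    weights = weights / weights.sum()              # renormalize over the top-8
    return sum(w * experts[i](x) for w, i in zip(weights, idx.tolist()))

token = torch.randn(hidden)
out = moe_layer(token)    # one of 94 such routing decisions per token
```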
BF16: ~470GB - 8x H100 80GB minimum. FP8: ~235GB, fits on 4x H100. INT4: ~120GB, runs on 2x H100 or 4x A100 40GB. Per-token compute is only 22B active, so latency is surprisingly reasonable - batch throughput is where this model shines.
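Those footprints follow directly from 235B parameters times bytes per parameter; the sketch below counts weights only, so KV cache, activations, and runtime overhead come on top.

```python
# Weight-only memory footprint per precision (no KV cache or activations).
params = 235e9
for name, bytes_per_param in [("BF16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB weights")
# BF16: ~470 GB -> 8x H100 80GB (640 GB), leaving headroom for KV cache
# FP8:  ~235 GB -> 4x H100 (320 GB)
# INT4: ~118 GB -> 2x H100 (160 GB) or 4x A100 40GB (160 GB), tight
```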
Qwen wins on quality per compute dollar. 405B is fully dense so every token activates 405B. Qwen 3 235B-A22B activates only 22B per token - roughly 18x cheaper inference at similar or better quality. The 405B still has an ecosystem advantage (more fine-tunes, more tutorials) but the raw economics favor the MoE.
Fewer experts = coarser routing but lower active compute. Qwen 3 235B-A22B activates 22B per token; DeepSeek V3 activates 37B. Qwen optimized for cheaper-per-token serving; DeepSeek optimized for peak routing granularity. Two valid approaches.
Yes, and arguably this is where thinking mode pays off most. The 235B model has enough capacity to run genuinely useful chains of thought. On math and competition reasoning, thinking-mode Qwen 3 235B-A22B is competitive with OpenAI o1-mini and DeepSeek R1.