Alibaba's 2025 flagship dense model. Adds QK-Norm for stability, GQA 5:1, 32k native context. Competitive with 20B+ models.
Qwen 3 adds QK-Norm (Root Mean Square Normalization (RMSNorm) applied to Q and K) for training stability. Grouped-Query Attention (GQA) 5:1 with 40 query heads and 8 key/value (KV) heads. 151k vocabulary.
Qwen 2.5 with QK-Norm and a thinking mode toggle. QK-Norm had been sitting in research papers since the 2023 ViT-22B paper; OLMo 2 (Nov 2024) and Gemma 3 (March 2025) shipped it first, but Qwen 3 is the version most production teams pick up. It delivers small but consistent stability and quality gains across the whole Qwen 3 family.
14.8B parameters across 40 layers, hidden dim 5120, 40 Q heads sharing 8 key/value (KV) heads (Grouped-Query Attention (GQA) 5:1), QK-Norm on Q and K, Swish-Gated Linear Unit (SwiGLU), Rotary Position Embeddings (RoPE). 151,936-token vocab, 32K native context extendable via YaRN. Apache 2.0. Ships alongside dense 0.6B-32B and 30B/235B Mixture of Experts (MoE) siblings.
Smaller fine-tune ecosystem than Llama. QK-Norm adds minor per-layer overhead. Native 32K context shorter than Llama 3.1's 128K. Qwen 3 30B-A3B (MoE sibling) often beats 14B dense at similar serving cost, which makes the 14B a harder sell unless MoE serving is off the table.
The best 14B dense pick in 2026 for reasoning, code, and multilingual work. Pick the 14B dense over the 30B-A3B MoE when your inference stack does not handle MoE well. Upgrade from Qwen 2.5 14B on sight - QK-Norm alone justifies it.
| Spec | Value |
|---|---|
| Released | April 2025 |
| Organization | Alibaba Cloud / Qwen Team |
| License | Apache 2.0 |
| Parameters | 14.8B (dense) |
| Layers | 40 |
| Hidden dim (d_model) | 5,120 |
| Attention heads | 40 Q heads / 8 KV heads |
| Head dim | 128 |
| Attention type | Grouped-query (GQA) |
| QK-Norm | Yes |
| FFN intermediate | 17,408 |
| FFN activation | SwiGLU |
| Normalization | RMSNorm (pre) |
| Position encoding | RoPE |
| Max context | 32,768 tokens |
| Vocabulary | 151,936 |
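A quick way to cross-check these numbers is to pull the model config from the Hugging Face Hub (assuming the repo id `Qwen/Qwen3-14B`); only the config JSON is fetched, not the weights:

```python
from transformers import AutoConfig

# Downloads only config.json, not the 14B weights.
cfg = AutoConfig.from_pretrained("Qwen/Qwen3-14B")

print(cfg.num_hidden_layers)    # 40 layers
print(cfg.hidden_size)          # 5120
print(cfg.num_attention_heads)  # 40 Q heads
print(cfg.num_key_value_heads)  # 8 KV heads -> GQA 5:1
print(cfg.intermediate_size)    # 17408 FFN intermediate
print(cfg.vocab_size)           # 151936
```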
Root Mean Square Normalization (RMSNorm) applied to Q and K vectors independently before attention. Stabilizes score distributions, enables higher learning rates, reduces need for attention softmax tricks like logit soft-capping. Originally from the 2023 ViT-22B paper; OLMo 2 13B (Nov 2024) and Gemma 3 27B (March 2025) shipped it earlier, but Qwen 3 14B is where most production teams encounter it.
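A minimal sketch of the mechanism (illustrative only, not Qwen's actual implementation): RMSNorm is applied per head over the head dimension of Q and K before the dot product, so no single head can blow up the softmax logits.

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    # Normalize over the last (head_dim) axis: x / sqrt(mean(x^2)) * learned scale
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def attention_scores_with_qk_norm(q, k, q_scale, k_scale, head_dim=128):
    # q, k: (batch, heads, seq, head_dim); q_scale, k_scale: (head_dim,)
    q = rms_norm(q, q_scale)   # QK-Norm: bound Q and K magnitudes before scoring
    k = rms_norm(k, k_scale)
    return (q @ k.transpose(-2, -1)) / head_dim ** 0.5
```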
Same weights can operate in fast-response mode or multi-step reasoning mode. Switched by special tokens in the input prompt. Emulates test-time compute patterns popularized by OpenAI o1/DeepSeek R1 without needing a separate model.
Qwen 3 shipped both dense (0.6B to 32B) and MoE variants (30B-A3B, 235B-A22B) simultaneously. 14B is positioned as the production dense workhorse; 30B-A3B offers similar quality at lower inference cost via MoE.
40 query heads share 8 KV heads. Between Llama 3 8B's 4:1 and Qwen 2.5 7B's 7:1. Chosen based on empirical quality-memory tradeoff at the 14B scale.
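A sketch of what 5:1 sharing means at inference time (toy shapes, not Qwen code): the KV cache stores only 8 heads, and each is broadcast to its group of 5 query heads.

```python
import torch

n_q_heads, n_kv_heads, head_dim, seq = 40, 8, 128, 16
groups = n_q_heads // n_kv_heads          # 5 query heads per KV head

q = torch.randn(1, n_q_heads, seq, head_dim)
k = torch.randn(1, n_kv_heads, seq, head_dim)   # KV cache holds 8 heads, not 40
v = torch.randn(1, n_kv_heads, seq, head_dim)

# Broadcast each KV head to the 5 query heads that share it
k = k.repeat_interleave(groups, dim=1)    # (1, 40, seq, head_dim)
v = v.repeat_interleave(groups, dim=1)

out = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1) @ v
```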
Why: Attention score distributions in deep networks can become unstable - some heads produce huge values, others near-zero. RMSNorm on Q and K keeps magnitudes bounded. Enables higher learning rates and more stable long-context behavior.
Why: Qwen 3 14B keeps the same hidden 5120 and GQA 5:1 as Qwen 2.5 14B but trades depth for FFN width. Fewer sequential layers means lower per-token latency; the wider FFN preserves overall capacity. Total parameter budget stays in the same 14B class.
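A back-of-envelope check that the budget stays around 14.8B, using only numbers from the spec table (norms and biases ignored; treating the LM head as untied is an assumption):

```python
hidden, layers, ffn = 5120, 40, 17408
q_heads, kv_heads, head_dim = 40, 8, 128
vocab = 151_936

attn = hidden * q_heads * head_dim          # Q projection
attn += 2 * hidden * kv_heads * head_dim    # K and V projections (8 heads each)
attn += q_heads * head_dim * hidden         # output projection
mlp = 3 * hidden * ffn                      # SwiGLU: gate, up, down projections
per_layer = attn + mlp                      # ~330M per layer

embed = 2 * vocab * hidden                  # input embeddings + LM head (assumed untied)
total = layers * per_layer + embed
print(f"{total / 1e9:.1f}B")                # ~14.8B
```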
Why: Most applications do not need 128k; training at 32k is cheaper and more stable. YaRN extension still works at inference time for applications that do.
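An illustrative YaRN setup via `rope_scaling` in the Hugging Face config; the field names follow the common `transformers` convention and the factor of 4.0 (32K x 4 ~ 128K) is an example value, so check Qwen's own docs for recommended settings.

```python
from transformers import AutoConfig, AutoModelForCausalLM

cfg = AutoConfig.from_pretrained("Qwen/Qwen3-14B")
cfg.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                              # example: 32K native * 4 ~ 128K
    "original_max_position_embeddings": 32768,
}
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", config=cfg)
```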
RMSNorm applied to Q and K before attention. Keeps attention score magnitudes bounded. Previously, some attention heads would produce huge values while others collapsed - QK-Norm prevents this. Enables higher learning rates and more stable long-context training. Almost every new model in 2025+ is adopting it.
Yes. QK-Norm alone gives small but consistent quality gains. Thinking mode is a nice extra if your use case benefits from test-time reasoning. Training data was also larger and higher quality. If you have infrastructure running Qwen 2.5, the upgrade path is straightforward.
Qwen 3 14B wins on quality but uses ~2x the memory. For applications where 14B memory is fine, Qwen 3 is the better choice. For memory-constrained edge deployments, Llama 3.1 8B still makes sense. Ecosystem-wise, Llama has more fine-tunes and tutorials; Qwen has better underlying architecture.
Special tokens in the prompt toggle direct-response vs extended-reasoning behavior. In reasoning mode, the model generates a chain-of-thought before the answer. Same weights, different inference pattern. Useful for math and logic - skip it for casual chat because it slows responses down significantly.
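A rough usage sketch: the `enable_thinking` flag follows Qwen's published chat template, but treat the exact knob as something to verify against the model card.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
messages = [{"role": "user", "content": "What is 17 * 23?"}]

# enable_thinking controls whether the template opens a reasoning block
# before the final answer; set False for fast, direct responses.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
```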
Different weight classes. DeepSeek V3 is 671B total (37B active MoE), Qwen 3 14B is dense. For apples-to-apples, Qwen 3 235B-A22B (their own MoE) vs DeepSeek V3 - close match with DeepSeek usually slightly ahead on reasoning.
FP16 needs ~30GB - A100 40GB, H100 80GB, or RTX 6000 Ada 48GB. 4-bit GGUF: ~9GB, fits on RTX 4090 or any 16GB consumer card. For production, the 40GB tier is where most 14B-class models live.
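The arithmetic behind those numbers (weights only; KV cache and activations come on top):

```python
params = 14.8e9

fp16_gb = params * 2 / 1e9    # 2 bytes/weight -> ~29.6 GB, hence the ~30GB figure
int4_gb = params * 0.5 / 1e9  # ~7.4 GB raw; 4-bit GGUF lands near 9GB once
                              # higher-precision embeddings and metadata are added
print(round(fp16_gb, 1), round(int4_gb, 1))
```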