Moonshot's trillion-parameter MoE. Same MLA + fine-grained MoE DNA as DeepSeek, scaled to 384 experts and 1T total. 32B active per token, modified MIT.
Moonshot's 1T Mixture of Experts (MoE). Borrows DeepSeek V3's Multi-head Latent Attention (MLA) + fine-grained MoE recipe but pushes to 384 routed experts (50% more) with only 64 attention heads. 1T total, 32B active.
Kimi K2 is what happens when Moonshot takes the DeepSeek V3 recipe and pushes every knob further. Same Multi-head Latent Attention (MLA), same fine-grained Mixture of Experts (MoE) with one shared expert, same 61 layers, same 7168 hidden dimension, same block-FP8 weight release. Moonshot did not try to reinvent the architecture - they took the recipe that worked and scaled it.
1T total parameters, ~32B active per token. 384 routed experts (50% more than V3's 256) with top-8 routing unchanged. 64 attention heads (half V3's 128) to cut MLA decompression cost. 61 layers, hidden 7168, expert Feed-Forward Network (FFN) 2048, 163,840-token vocab, 128K context. Modified MIT license.
1T weights need ~1TB in FP8 - one H200 node minimum, two for long-context concurrency. 384 experts make fine-tuning and LoRA harder than on dense models or smaller MoEs. Not a reasoning model out of the box - use K2 for agentic workloads and DeepSeek R1 for chain-of-thought math. Fewer community fine-tunes exist than for older DeepSeek releases.
Alongside DeepSeek R1, the absolute ceiling of open-weights quality in 2026 - if you can afford 8-16x H100s. Aimed at agentic workloads: it cleared DeepSeek V3 by 14 points on SWE-Bench Verified single-attempt (65.8% vs 51.8%) and closed within ~7 points of Claude Sonnet 4. For reasoning-heavy math and logic, pick DeepSeek R1 instead.
| Released | July 2025 |
|---|---|
| Organization | Moonshot AI |
| License | Modified MIT |
| Parameters | 1.0T total · ~32B active (top-8 of 384 experts + 1 shared) |
| Layers | 61 |
| Hidden dim (d_model) | 7,168 |
| Attention heads | 64 heads |
| Head dim | 128 |
| Attention type | MLA |
| FFN type | Fine-grained MoE (shared + routed experts) |
| FFN intermediate | 2,048 per expert |
| FFN activation | SwiGLU |
| Normalization | RMSNorm (pre) |
| Position encoding | RoPE |
| Max context | 131,072 tokens |
| Vocabulary | 163,840 |
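A back-of-envelope check on those numbers. The sketch below treats all 61 layers as MoE and ignores MLA projection details (an assumption - the attention side adds a few billion more active parameters); the expert FFNs alone nearly reproduce the quoted ~1T total and ~32B active figures.

```python
# Rough parameter count for Kimi K2 from the spec table above.
# Expert FFNs dominate; MLA attention params are ignored here.

d_model   = 7168
n_layers  = 61
d_expert  = 2048          # per-expert SwiGLU intermediate width
n_experts = 384           # routed experts
top_k     = 8             # routed experts active per token
n_shared  = 1             # shared expert, always active
vocab     = 163_840

# SwiGLU uses three projections: gate, up (d_model -> d_expert), down.
params_per_expert = 3 * d_model * d_expert          # ~44.0M

total_ffn  = n_layers * (n_experts + n_shared) * params_per_expert
active_ffn = n_layers * (top_k + n_shared) * params_per_expert
embeddings = 2 * vocab * d_model                    # embed + unembed

print(f"total expert FFN params:  {total_ffn / 1e12:.2f}T")   # ~1.03T
print(f"active expert FFN params: {active_ffn / 1e9:.1f}B")   # ~24.2B
print(f"embeddings:               {embeddings / 1e9:.1f}B")   # ~2.3B
# MLA attention adds a few billion more active params per token,
# landing near the quoted ~32B active / ~1T total.
```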
50% more experts than DeepSeek V3. Each expert is the same 2048 FFN width, so finer specialization per token. Router picks top-8 of 384 instead of top-8 of 256, pushing expert granularity further than any prior open model.
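A minimal sketch of what top-8-of-384 routing looks like per token, assuming a plain softmax gate renormalized over the selected experts - the production router's scoring and load-balancing details may differ:

```python
import torch

def route(hidden, gate_weight, top_k=8):
    # hidden: [tokens, d_model], gate_weight: [n_experts, d_model]
    logits = hidden @ gate_weight.T                  # [tokens, 384]
    scores, expert_ids = logits.topk(top_k, dim=-1)  # pick 8 of 384
    weights = torch.softmax(scores, dim=-1)          # renormalize over the 8
    return expert_ids, weights                       # which experts fire, how much

tokens = torch.randn(4, 7168)
gate = torch.randn(384, 7168)
ids, w = route(tokens, gate)
print(ids.shape, w.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```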
Moonshot cut heads from 128 to 64 while keeping the 512-dim MLA key/value (KV) latent. Fewer heads means cheaper attention decompression at inference. Ablations showed quality did not suffer at this scale - a useful empirical point for MLA head-count tradeoffs.
Kimi K2 Instruct was explicitly trained for tool use, long-horizon planning, and multi-turn agent loops. On SWE-Bench Verified single-attempt it scored 65.8% (vs DeepSeek V3 at 51.8%) and closed within ~7 points of Claude Sonnet 4 (72.7%). On Tau2-Bench telecom it actually beat Claude Opus 4 (65.8 vs 57.0).
Permissive commercial license at the same quality tier as DeepSeek R1. No field-of-use restrictions on most deployments. The first trillion-parameter open model with clean enterprise licensing.
Why: Scaling total parameters primarily through expert count keeps active compute low. Each of the 384 specialists can learn a narrower slice of the data distribution, improving routing precision without raising per-token FLOPs much.
Why: MLA attention's latent projection dominates compute at 128 heads. Cutting to 64 halves the decompression cost while the KV latent stays compact. Empirical quality held in Moonshot's ablations at this scale.
Why: Moonshot kept everything structural identical to V3 to minimize training risk. Scaling went entirely into expert count. This also made it trivial for inference engines like vLLM and SGLang to support K2 quickly - they just extend their V3 code path.
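A hedged serving sketch through vLLM's Python API - the moonshotai/Kimi-K2-Instruct repo id and 16-way tensor parallelism are assumptions to check against your hardware and vLLM version:

```python
from vllm import LLM, SamplingParams

# Assumed Hugging Face repo id and a 16-GPU node; adjust to your setup.
llm = LLM(
    model="moonshotai/Kimi-K2-Instruct",
    tensor_parallel_size=16,    # spread the ~1TB of FP8 weights across 16 GPUs
    trust_remote_code=True,     # K2 ships custom model code
)
out = llm.generate(
    ["Write a function that retries a flaky HTTP call."],
    SamplingParams(temperature=0.6, max_tokens=512),
)
print(out[0].outputs[0].text)
```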
Depends on workload. For agentic tasks (tool use, SWE-Bench-style code edits, multi-turn planning) Kimi K2 Instruct leads. For general chat and reasoning, DeepSeek V3 or R1 are usually ahead. Architecturally near-identical; the post-training recipes diverge.
Structurally very close. Same MLA attention design, same fine-grained MoE with shared expert, same 61 layers, same 7168 hidden, same 2048 expert FFN, same block-FP8 weight release. The non-trivial deltas are 384 experts (vs 256), 64 heads (vs 128), a much larger tokenizer (160K vs 129K vocab), and the training recipe (BF16 in compute with the MuonClip optimizer instead of V3's FP8 mixed precision). Moonshot was explicit about building on the V3 recipe.
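The deltas in one place - a quick comparison sketch, with V3's figures quoted from the DeepSeek V3 release:

```python
# Identical keys are the shared recipe; the loop prints only the deltas.
deepseek_v3 = dict(layers=61, d_model=7168, expert_ffn=2048,
                   routed_experts=256, top_k=8, heads=128, vocab=129_280)
kimi_k2     = dict(layers=61, d_model=7168, expert_ffn=2048,
                   routed_experts=384, top_k=8, heads=64,  vocab=163_840)

for key in deepseek_v3:
    if deepseek_v3[key] != kimi_k2[key]:
        print(f"{key}: {deepseek_v3[key]} -> {kimi_k2[key]}")
# routed_experts: 256 -> 384
# heads: 128 -> 64
# vocab: 129280 -> 163840
```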
MLA attention pays a decompression cost proportional to head count. Halving from 128 to 64 cuts that cost meaningfully without shrinking the 512-dim KV latent that stores the actual information. The K2 tech report shows doubling back to 128 heads at 128K context inflates inference FLOPs by 83% for only 0.5-1.2% loss improvement.
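To make the linear-in-heads scaling concrete, here is an illustrative decode-time cost model - not the tech report's methodology; the constants are rough and projection absorption is ignored:

```python
# Per-token attention cost at decode time, assuming the 512-dim latent
# KV cache is up-projected per head for every cached token.

def mla_decode_flops(n_heads, head_dim=128, kv_rank=512, ctx=131_072):
    decompress = 2 * n_heads * head_dim * kv_rank * ctx  # latent -> per-head K/V
    attend     = 4 * n_heads * head_dim * ctx            # QK^T scores + V mix
    return decompress + attend

for heads in (64, 128):
    print(f"{heads} heads: {mla_decode_flops(heads) / 1e12:.1f} TFLOPs/token")
# 64 heads:  1.1 TFLOPs/token
# 128 heads: 2.2 TFLOPs/token
# Doubling heads doubles this cost model; the K2 report measured +83%
# end-to-end at 128K context for 0.5-1.2% better loss.
```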
FP8: ~1TB for weights alone - one H200 node (8x H200 141GB = 1,128GB) is the floor; 16x H100 80GB (1,280GB) also works. Add a second H200 node for long-context concurrent serving where KV cache pressure matters. INT4: ~500GB, fits on 8x H100 80GB or 4x H200. Only ~32B parameters are active per token, so per-token latency is reasonable - throughput scales with how much memory you can dedicate to expert weights and KV cache.
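The same arithmetic as a quick script (using GB = 10^9 bytes; GiB rounding shifts results by a few percent):

```python
# Does a given GPU pool hold the K2 weights at a given precision?
total_params = 1.0e12
fp8_weights  = total_params * 1.0   # 1 byte/param   -> ~1000 GB
int4_weights = total_params * 0.5   # 0.5 byte/param -> ~500 GB

pools = {
    "8x H200 141GB": 8 * 141,
    "16x H100 80GB": 16 * 80,
    "8x H100 80GB":  8 * 80,
}
for name, gb in pools.items():
    print(f"{name}: {gb} GB -> FP8 fits: {gb >= fp8_weights / 1e9}, "
          f"INT4 fits: {gb >= int4_weights / 1e9}")
# 8x H200 141GB: 1128 GB -> FP8 fits: True,  INT4 fits: True
# 16x H100 80GB: 1280 GB -> FP8 fits: True,  INT4 fits: True
# 8x H100 80GB:  640 GB  -> FP8 fits: False, INT4 fits: True
# Note: weights only - KV cache and activations need headroom on top.
```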
Heavy RL and Supervised Fine-Tuning (SFT) on tool-use traces, code execution loops, and multi-turn task completion. The Instruct variant learned to emit well-formed function calls, retry after tool errors, and plan multi-step workflows. This is a post-training property - the base K2 weights are a general MoE LLM.
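A sketch of the tight tool-call/respond loop described above, against an OpenAI-compatible endpoint. The base_url, model id, and get_weather tool are placeholders for illustration, not an official K2 API:

```python
import json
from openai import OpenAI

# Placeholder endpoint (e.g., a local vLLM/SGLang server).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",   # hypothetical tool for the example
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Beijing?"}]
while True:
    resp = client.chat.completions.create(
        model="moonshotai/Kimi-K2-Instruct", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:          # model answered directly: loop ends
        print(msg.content)
        break
    messages.append(msg)            # keep the tool-call turn in context
    for call in msg.tool_calls:     # execute each requested tool
        args = json.loads(call.function.arguments)
        result = {"city": args["city"], "temp_c": 21}   # stub executor
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": json.dumps(result)})
```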
K2 Instruct. R1 is a reasoning model - it generates long internal chains of thought that are great for math and logic but slow and verbose for agent loops where you want tight tool-call-respond cycles. K2 Instruct was built specifically for the latter pattern.