Moonshot's trillion-parameter MoE. Same MLA + fine-grained MoE DNA as DeepSeek, scaled to 384 experts and 1T total. 32B active per token, modified MIT.
Moonshot's 1T Mixture of Experts (MoE). Borrows DeepSeek V3's Multi-head Latent Attention (MLA) + fine-grained MoE recipe but pushes to 384 routed experts (50% more) with only 64 attention heads. 1T total, 32B active.
Kimi K2 is what happens when Moonshot takes the DeepSeek V3 recipe and pushes every knob further. Same Multi-head Latent Attention (MLA), same fine-grained Mixture of Experts (MoE) with one shared expert, same 61 layers, same 7168 hidden dimension, same block-FP8 weight release. Moonshot did not try to reinvent the architecture - they took the recipe that worked and scaled it.
1T total parameters, ~32B active per token. 384 routed experts (50% more than V3's 256) with top-8 routing unchanged. 64 attention heads (half V3's 128) to cut MLA decompression cost. 61 layers, hidden 7168, expert Feed-Forward Network (FFN) 2048, 163,840-token vocab, 128K context. Modified MIT license.
1T weights need ~1TB in FP8 - one H200 node minimum, two for long-context concurrency. 384 experts make fine-tuning and LoRA harder than on dense models or smaller MoEs. Not a reasoning model out of the box - use K2 for agentic workloads and DeepSeek R1 for chain-of-thought math. Fewer community fine-tunes exist than for older DeepSeek releases.
Alongside DeepSeek R1, the absolute ceiling of open-weights quality in 2026 - if you can afford 8-16x H100s. Aimed at agentic workloads: it cleared DeepSeek V3 by 14 points on SWE-Bench Verified single-attempt (65.8% vs 51.8%) and closed within ~7 points of Claude Sonnet 4. For reasoning-heavy math and logic, pick DeepSeek R1 instead.
| Released | July 2025 |
|---|---|
| Organization | Moonshot AI |
| License | Modified MIT |
| Parameters | 1.0T total · ~32B active (top-8 of 384 experts + 1 shared) |
| Layers | 61 |
| Hidden dim (d_model) | 7,168 |
| Attention heads | 64 heads |
| Head dim | 128 |
| Attention type | MLA |
| FFN type | Fine-grained MoE (shared + routed experts) |
| FFN intermediate | 2,048 per expert |
| FFN activation | SwiGLU |
| Normalization | RMSNorm (pre) |
| Position encoding | RoPE |
| Max context | 131,072 tokens |
| Vocabulary | 163,840 |
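A back-of-envelope check on those numbers. The sketch below treats all 61 layers as MoE and ignores MLA projection details (an assumption - the attention side adds a few billion more active parameters); the expert FFNs alone nearly reproduce the quoted ~1T total and ~32B active figures.

```python
# Rough parameter count for Kimi K2 from the spec table above.
# Expert FFNs dominate; MLA attention params are ignored here.

d_model   = 7168
n_layers  = 61
d_expert  = 2048          # per-expert SwiGLU intermediate width
n_experts = 384           # routed experts
top_k     = 8             # routed experts active per token
n_shared  = 1             # shared expert, always active
vocab     = 163_840

# SwiGLU uses three projections: gate, up (d_model -> d_expert), down.
params_per_expert = 3 * d_model * d_expert          # ~44.0M

total_ffn  = n_layers * (n_experts + n_shared) * params_per_expert
active_ffn = n_layers * (top_k + n_shared) * params_per_expert
embeddings = 2 * vocab * d_model                    # embed + unembed

print(f"total expert FFN params:  {total_ffn / 1e12:.2f}T")   # ~1.03T
print(f"active expert FFN params: {active_ffn / 1e9:.1f}B")   # ~24.2B
print(f"embeddings:               {embeddings / 1e9:.1f}B")   # ~2.3B
# MLA attention adds a few billion more active params per token,
# landing near the quoted ~32B active / ~1T total.
```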
50% more experts than DeepSeek V3. Each expert is the same 2048 FFN width, so finer specialization per token. Router picks top-8 of 384 instead of top-8 of 256, pushing expert granularity further than any prior open model.
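A minimal sketch of what top-8-of-384 routing looks like per token, assuming a plain softmax gate renormalized over the selected experts - the production router's scoring and load-balancing details may differ:

```python
import torch

def route(hidden, gate_weight, top_k=8):
    # hidden: [tokens, d_model], gate_weight: [n_experts, d_model]
    logits = hidden @ gate_weight.T                  # [tokens, 384]
    scores, expert_ids = logits.topk(top_k, dim=-1)  # pick 8 of 384
    weights = torch.softmax(scores, dim=-1)          # renormalize over the 8
    return expert_ids, weights                       # which experts fire, how much

tokens = torch.randn(4, 7168)
gate = torch.randn(384, 7168)
ids, w = route(tokens, gate)
print(ids.shape, w.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```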
Moonshot cut heads from 128 to 64 while keeping the 512-dim MLA key/value (KV) latent. Fewer heads means cheaper attention decompression at inference. Ablations showed quality did not suffer at this scale - a useful empirical point for MLA head-count tradeoffs.
Kimi K2 Instruct was explicitly trained for tool use, long-horizon planning, and multi-turn agent loops. On SWE-Bench Verified single-attempt it scored 65.8% (vs DeepSeek V3 at 51.8%) and closed within ~7 points of Claude Sonnet 4 (72.7%). On Tau2-Bench telecom it actually beat Claude Opus 4 (65.8 vs 57.0).
Permissive commercial license at the same quality tier as DeepSeek R1. No field-of-use restrictions on most deployments. The first trillion-parameter open model with clean enterprise licensing.
Why: Scaling total parameters primarily through expert count keeps active compute low. Each of the 384 specialists can learn a narrower slice of the data distribution, improving routing precision without raising per-token FLOPs much.
Why: MLA attention's latent projection dominates compute at 128 heads. Cutting to 64 halves the decompression cost while the KV latent stays compact. Empirical quality held in Moonshot's ablations at this scale.
Why: Moonshot kept everything structural identical to V3 to minimize training risk. Scaling went entirely into expert count. This also made it trivial for inference engines like vLLM and SGLang to support K2 quickly - they just extend their V3 code path.
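A hedged serving sketch through vLLM's Python API - the moonshotai/Kimi-K2-Instruct repo id and 16-way tensor parallelism are assumptions to check against your hardware and vLLM version:

```python
from vllm import LLM, SamplingParams

# Assumed Hugging Face repo id and a 16-GPU node; adjust to your setup.
llm = LLM(
    model="moonshotai/Kimi-K2-Instruct",
    tensor_parallel_size=16,    # spread the ~1TB of FP8 weights across 16 GPUs
    trust_remote_code=True,     # K2 ships custom model code
)
out = llm.generate(
    ["Write a function that retries a flaky HTTP call."],
    SamplingParams(temperature=0.6, max_tokens=512),
)
print(out[0].outputs[0].text)
```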
Depends on workload. For agentic tasks (tool use, SWE-Bench-style code edits, multi-turn planning) Kimi K2 Instruct leads. For general chat and reasoning, DeepSeek V3 or R1 are usually ahead. Architecturally near-identical; the post-training recipes diverge.
Structurally very close. Same MLA attention design, same fine-grained MoE with shared expert, same 61 layers, same 7168 hidden, same 2048 expert FFN, same block-FP8 weight release. The non-trivial deltas are 384 experts (vs 256), 64 heads (vs 128), a much larger tokenizer (160K vs 129K vocab), and the training recipe (BF16 in compute with the MuonClip optimizer instead of V3's FP8 mixed precision). Moonshot was explicit about building on the V3 recipe.
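The deltas in one place - a quick comparison sketch, with V3's figures quoted from the DeepSeek V3 release:

```python
# Identical keys are the shared recipe; the loop prints only the deltas.
deepseek_v3 = dict(layers=61, d_model=7168, expert_ffn=2048,
                   routed_experts=256, top_k=8, heads=128, vocab=129_280)
kimi_k2     = dict(layers=61, d_model=7168, expert_ffn=2048,
                   routed_experts=384, top_k=8, heads=64,  vocab=163_840)

for key in deepseek_v3:
    if deepseek_v3[key] != kimi_k2[key]:
        print(f"{key}: {deepseek_v3[key]} -> {kimi_k2[key]}")
# routed_experts: 256 -> 384
# heads: 128 -> 64
# vocab: 129280 -> 163840
```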
MLA attention pays a decompression cost proportional to head count. Halving from 128 to 64 cuts that cost meaningfully without shrinking the 512-dim KV latent that stores the actual information. The K2 tech report shows doubling back to 128 heads at 128K context inflates inference FLOPs by 83% for only 0.5-1.2% loss improvement.
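To make the linear-in-heads scaling concrete, here is an illustrative decode-time cost model - not the tech report's methodology; the constants are rough and projection absorption is ignored:

```python
# Per-token attention cost at decode time, assuming the 512-dim latent
# KV cache is up-projected per head for every cached token.

def mla_decode_flops(n_heads, head_dim=128, kv_rank=512, ctx=131_072):
    decompress = 2 * n_heads * head_dim * kv_rank * ctx  # latent -> per-head K/V
    attend     = 4 * n_heads * head_dim * ctx            # QK^T scores + V mix
    return decompress + attend

for heads in (64, 128):
    print(f"{heads} heads: {mla_decode_flops(heads) / 1e12:.1f} TFLOPs/token")
# 64 heads:  1.1 TFLOPs/token
# 128 heads: 2.2 TFLOPs/token
# Doubling heads doubles this cost model; the K2 report measured +83%
# end-to-end at 128K context for 0.5-1.2% better loss.
```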
FP8: ~1TB for weights alone - one H200 node (8x H200 141GB = 1,128GB) is the floor; 16x H100 80GB (1,280GB) also works. Add a second H200 node for long-context concurrent serving where KV cache pressure matters. INT4: ~500GB, fits on 8x H100 80GB or 4x H200. Only ~32B parameters are active per token, so per-token latency is reasonable - throughput scales with how much memory you can dedicate to expert weights and KV cache.
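The same arithmetic as a quick script (using GB = 10^9 bytes; GiB rounding shifts results by a few percent):

```python
# Does a given GPU pool hold the K2 weights at a given precision?
total_params = 1.0e12
fp8_weights  = total_params * 1.0   # 1 byte/param   -> ~1000 GB
int4_weights = total_params * 0.5   # 0.5 byte/param -> ~500 GB

pools = {
    "8x H200 141GB": 8 * 141,
    "16x H100 80GB": 16 * 80,
    "8x H100 80GB":  8 * 80,
}
for name, gb in pools.items():
    print(f"{name}: {gb} GB -> FP8 fits: {gb >= fp8_weights / 1e9}, "
          f"INT4 fits: {gb >= int4_weights / 1e9}")
# 8x H200 141GB: 1128 GB -> FP8 fits: True,  INT4 fits: True
# 16x H100 80GB: 1280 GB -> FP8 fits: True,  INT4 fits: True
# 8x H100 80GB:  640 GB  -> FP8 fits: False, INT4 fits: True
# Note: weights only - KV cache and activations need headroom on top.
```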
Heavy RL and Supervised Fine-Tuning (SFT) on tool-use traces, code execution loops, and multi-turn task completion. The Instruct variant learned to emit well-formed function calls, retry after tool errors, and plan multi-step workflows. This is a post-training property - the base K2 weights are a general MoE LLM.
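A sketch of the tight tool-call/respond loop described above, against an OpenAI-compatible endpoint. The base_url, model id, and get_weather tool are placeholders for illustration, not an official K2 API:

```python
import json
from openai import OpenAI

# Placeholder endpoint (e.g., a local vLLM/SGLang server).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",   # hypothetical tool for the example
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Beijing?"}]
while True:
    resp = client.chat.completions.create(
        model="moonshotai/Kimi-K2-Instruct", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:          # model answered directly: loop ends
        print(msg.content)
        break
    messages.append(msg)            # keep the tool-call turn in context
    for call in msg.tool_calls:     # execute each requested tool
        args = json.loads(call.function.arguments)
        result = {"city": args["city"], "temp_c": 21}   # stub executor
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": json.dumps(result)})
```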
K2 Instruct. R1 is a reasoning model - it generates long internal chains of thought that are great for math and logic but slow and verbose for agent loops where you want tight tool-call-respond cycles. K2 Instruct was built specifically for the latter pattern.