Every modern open-weights LLM sits somewhere on the same spectrum: shrink the KV cache, shrink the compute cost, or both. The playground below has four modes - see the raw mask shape, type your own prompt and watch attention weights redistribute, race all six variants as tokens stream in, and inspect which tokens individual heads specialize on.
32 query heads share 8 KV heads (4:1 ratio). KV cache is 25% of MHA while attention compute is unchanged.
Cache compression and quality are the two axes people fight over. Each variant trades some quality or ecosystem support for compression or compute wins.
Red dashed lines mark common GPU VRAM ceilings. Where a variant's curve crosses a line is where you run out of memory.
The KV cache became the binding constraint on LLM serving sometime around the GPT-4 era. Every attention variant since is a different answer to one question.
For Llama 3.1 70B in FP16: 80 layers × 8 KV heads × 128 head-dim × 131,072 tokens × 2 bytes × 2 (K and V) = ~43 GB per sequence, before you count the weights or any activations. That is larger than the entire weights of Llama 3.2 3B at FP16.
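If you want to sanity-check that arithmetic yourself, the whole calculation fits in a few lines of Python - the Llama 3.1 70B figures are the ones from the paragraph above; anything else you plug in is your own assumption:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache: K and V stored for every layer, KV head, and token."""
    return layers * kv_heads * head_dim * seq_len * bytes_per_elem * 2  # x2 for K and V

# Llama 3.1 70B at FP16 with 128K context (figures from the text above)
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=131_072)
print(f"{size / 1e9:.1f} GB per sequence")  # ~42.9 GB
```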
Before GPT-4, context was 2K-4K and nobody cared. Once 128K became table stakes, the KV cache started dwarfing the weights and the attention recipe had to change.
Keep full causal attention, shrink what each token stores. GQA shares KV heads across query groups (Llama 2 70B 8:1, Llama 3.1 70B 8:1, Mistral Large 2 12:1, Llama 3.1 405B 16:1). MQA collapses to a single KV head; DeepSeek's MLA compresses K and V into a ~512-dim latent plus a small RoPE slice.
Trade-off: more compression, more quality risk. GQA won the 2024-2025 defaults because it stays simple and every inference engine already supports it.
Attack the O(L²) compute cost instead of the per-token memory. Sliding Window Attention (Mistral 7B, Gemma 2/3) caps each query at a fixed recent window. DeepSeek Sparse Attention (V3.2) uses a lightning indexer to pick the top-k most relevant past tokens per query.
Trade-off: long-range recall vs compute. SWA loses distant tokens entirely; DSA preserves them via content-based selection. Both drop attention FLOPs from quadratic to linear once they saturate.
Gemma 2/3 stacks GQA 2:1 with alternating local/global layers (Gemma 3 runs five local windows for every full layer). DeepSeek V3 pairs MLA with fine-grained MoE so it compresses BOTH the KV footprint and the active parameter count. Llama 4 Scout adds iRoPE on top of GQA + MoE to push context to 10M tokens.
The modern frontier isn't picking one trick - it is layering three or four, each attacking a different cost axis.
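To make the layering concrete, here is a rough sketch of how a Gemma-3-style 5:1 local/global schedule changes the KV cache bill: local layers only cache the window, global layers cache the whole context. The layer count, head counts, and window below are illustrative, not any shipped model's config:

```python
def hybrid_kv_cache_bytes(n_layers, kv_heads, head_dim, seq_len, window,
                          locals_per_global, bytes_per_elem=2):
    """KV cache for a stack that alternates sliding-window and full-attention layers.

    Local layers only keep `window` tokens of K/V; global layers keep the full
    sequence. Illustrative arithmetic only - real configs differ per model.
    """
    per_token = kv_heads * head_dim * bytes_per_elem * 2      # K and V
    n_global = n_layers // (locals_per_global + 1)
    n_local = n_layers - n_global
    return (n_local * min(window, seq_len) + n_global * seq_len) * per_token

all_global = hybrid_kv_cache_bytes(48, 8, 128, 131_072, window=131_072, locals_per_global=0)
hybrid     = hybrid_kv_cache_bytes(48, 8, 128, 131_072, window=1024,    locals_per_global=5)
print(f"{all_global / 1e9:.1f} GB all-global vs {hybrid / 1e9:.1f} GB at 5:1")  # ~25.8 vs ~4.5
```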
The playground above lets you feel the trade-offs. Set context to 128K, switch between variants, and watch the KV cache column swing by an order of magnitude.
Every query head has its own K and V. The original.
Introduced in 'Attention Is All You Need'. Every attention head maintains its own K and V, giving maximum representational capacity and maximum memory footprint. GPT-2, GPT-3, Llama 2 7B, and Phi-3 Mini all use MHA. Past 7B scale, the KV cache becomes the binding serving constraint - which is why 2024+ dense models all moved to GQA.
Share KV heads across groups of query heads.
Llama 2 70B shipped with GQA 8:1 (64 Q heads, 8 KV heads). Today this is the default: Llama 3.1 70B and Qwen 2.5 72B both use 8:1, Mistral Large 2 goes to 12:1, Llama 3.1 405B to 16:1. KV cache shrinks by the group ratio while attention compute is unchanged. Ecosystem compatibility is the biggest reason it won out over MLA in 2024-2025.
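If the sharing pattern feels abstract, here is a minimal PyTorch sketch of the idea (not any engine's actual kernel): the small set of cached KV heads is repeated so each group of query heads attends against the same K and V. Set the KV head count equal to the query head count and you have MHA; set it to 1 and you have MQA.

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """q: (seq, n_q_heads, d), k/v: (seq, n_kv_heads, d). Causal grouped-query attention."""
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    assert n_q_heads % n_kv_heads == 0
    group = n_q_heads // n_kv_heads
    # Each cached KV head serves `group` query heads - this repeat is the whole trick.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    # Move heads to the batch dim for scaled_dot_product_attention: (heads, seq, d).
    q, k, v = (t.transpose(0, 1) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(0, 1)  # back to (seq, n_q_heads, d)

seq, d = 16, 128
q = torch.randn(seq, 32, d)   # 32 query heads
k = torch.randn(seq, 8, d)    # only 8 KV heads are cached -> 4:1, cache is 25% of MHA
v = torch.randn(seq, 8, d)
out = gqa_attention(q, k, v)  # (16, 32, 128)
```

The cache only ever holds the 8 KV heads; the repeat happens at attention time, which is why compute is unchanged while the cache shrinks.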
All query heads share a single KV head.
The extreme limit of GQA - one KV head, shared by every Q head. PaLM and Falcon used it. At frontier scale (70B+) the quality loss is measurable on reasoning benchmarks, which is why the field moved back toward GQA with tuneable ratios rather than forcing 1 KV head. Still seen in some small on-device models where cache size dominates.
Compress K and V into a low-rank latent, decompress per head.
DeepSeek V2's signature move. K and V are projected into a ~512-dim latent per token (plus a small decoupled RoPE component for positional info), cached as the latent, then reconstructed to full-head K and V at attention time. Roughly 10x smaller KV cache than MHA at comparable quality. Requires custom kernels - which is why it took a year for mainstream inference engines to fully support it. DeepSeek V3 and Kimi K2 both use MLA.
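A stripped-down sketch of the cache-the-latent idea, leaving out the decoupled RoPE path and every kernel-level trick - the dimensions loosely follow the ~512-dim latent described above but are otherwise illustrative:

```python
import torch
import torch.nn as nn

class MLACache(nn.Module):
    """Toy MLA: cache one low-rank latent per token, expand to per-head K/V on demand."""
    def __init__(self, d_model=4096, n_heads=32, head_dim=128, latent_dim=512):
        super().__init__()
        self.down = nn.Linear(d_model, latent_dim, bias=False)             # compress
        self.up_k = nn.Linear(latent_dim, n_heads * head_dim, bias=False)  # decompress K
        self.up_v = nn.Linear(latent_dim, n_heads * head_dim, bias=False)  # decompress V
        self.n_heads, self.head_dim = n_heads, head_dim

    def forward(self, h):
        # h: (seq, d_model). Only `latent` goes into the KV cache: 512 values per token
        # instead of 2 * 32 * 128 = 8192 for full per-head K and V.
        latent = self.down(h)
        k = self.up_k(latent).view(-1, self.n_heads, self.head_dim)
        v = self.up_v(latent).view(-1, self.n_heads, self.head_dim)
        return latent, k, v

mla = MLACache()
latent, k, v = mla(torch.randn(16, 4096))   # latent: (16, 512), k/v: (16, 32, 128)
```

The real implementation also absorbs the up-projections into the query and output projections so full-size K and V never have to be materialized for the whole sequence - that trick is a big part of why MLA needs custom kernels.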
Each token only attends within a fixed window of recent tokens.
Mistral 7B shipped with a 4096-token sliding window, dropping attention compute from O(L²) to O(L·window). Gemma 2 and 3 extend the idea with alternating local/global layers - most layers use a 1024- or 4096-token window, occasional full-attention layers preserve long-range retrieval. Pure SWA is rare because stacked sliding windows have attenuation effects; the hybrid pattern is what works.
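The mask itself is the simplest part - a small sketch, with the window size and sequence length made up for illustration:

```python
import torch

def sliding_window_mask(seq_len, window):
    """True where attention is allowed: causal AND within the last `window` positions."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=8, window=3)
# Row i has at most 3 True entries: token i can see i, i-1, i-2 and nothing older.
```

Each query touches at most `window` keys, so attention work grows as O(L·window) instead of O(L²), which is the compute win described above.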
A lightning indexer selects the top-k tokens per query.
The most recent attention variant at time of writing. A cheap indexer module scores all past tokens; the main attention runs only over the top-k selected per query. Brings complexity to O(L·k) - similar to sliding window but with content-aware selection instead of positional windowing. Early results suggest it preserves more long-range quality than pure SWA at the same compute budget. Not yet widespread; V3.2 is the reference implementation.
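Here is a toy, single-head sketch of the select-then-attend pattern for one query position - the dot-product scorer below merely stands in for the trained lightning indexer, and the shapes and top-k value are invented for illustration:

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, idx_q, idx_k, top_k=64):
    """Toy DSA-style step for a single query position (single head).

    q: (d,)               current query
    idx_q: (d_idx,)       its cheap indexer projection
    k, v: (past, d)       cached keys and values
    idx_k: (past, d_idx)  cached indexer projections of past tokens
    """
    scores = idx_k @ idx_q                         # cheap relevance score for every past token
    keep = scores.topk(min(top_k, k.shape[0])).indices
    attn = F.softmax((k[keep] @ q) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v[keep]                          # full attention only over the selected tokens

past, d, d_idx = 4096, 128, 32
out = topk_sparse_attention(torch.randn(d), torch.randn(past, d), torch.randn(past, d),
                            torch.randn(d_idx), torch.randn(past, d_idx))
```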
Vanilla multi-head attention (MHA) stores a K and V pair for every head, every layer, every token. At 128k context and 70B+ params the KV cache exceeds the weights in memory. GQA, MQA, MLA, and sliding-window variants are all attempts to shrink that cache or shrink the O(L²) compute cost, each with different quality trade-offs.
MLA wins on compression (roughly 10x smaller KV cache than MHA at comparable quality, per DeepSeek V2's ablations) but requires custom inference kernels. GQA wins on ecosystem support - every inference engine handles it natively. For production in 2026, pick GQA 8:1 unless you are on DeepSeek or Kimi infrastructure, where MLA is the default.
By itself, yes - information older than the window cannot be directly attended to. Modern SWA models (Gemma 2/3, Mistral 7B) mitigate this by interleaving sliding-window layers with global-attention layers, or stacking layers so information flows through the residual stream across windows. Pure SWA is rare in frontier models for this reason.
Introduced in DeepSeek V3.2 (Sep 2025). A lightning indexer scores every past token cheaply, the main attention runs only over the top-k selected tokens per query. Drops attention compute from O(L²) to O(Lk) while preserving long-range access, unlike sliding window. Requires the indexer to be trained end-to-end with the rest of the model.
Rarely. PaLM and Falcon used it, but at 70B+ scale quality degradation showed up clearly. GQA is the modern replacement - pick your compression via the KV-head ratio instead of forcing a single KV head. Some smaller models still use MQA where compression matters more than absolute quality.
GQA shares KV heads across query heads; MLA compresses K and V into a shared low-rank latent, then reconstructs per-head K and V at attention time. At the same compression ratio, MLA preserves more quality because each head still computes attention with its own full-dimension K and V - the latent is decoded back into distinct per-head keys and values rather than groups of heads being forced to share one K and V.