Every modern open-weights LLM sits somewhere on the same spectrum: shrink the KV cache, shrink the compute cost, or both. The playground below has four modes - see the raw mask shape, type your own prompt and watch attention weights redistribute, race all six variants as tokens stream in, and inspect which tokens individual heads specialize on.
32 query heads share 8 KV heads (4:1 ratio). KV cache is 25% of MHA while attention compute is unchanged.
Cache compression and quality are the two axes people fight over. Each variant trades some quality or ecosystem support for compression or compute wins.
Red dashed lines mark common GPU VRAM ceilings. Where a variant's curve crosses a line is where you run out of memory.
The KV cache became the binding constraint on LLM serving sometime around the GPT-4 era. Every attention variant since is a different answer to one question.
For Llama 3.1 70B in FP16: 80 layers × 8 KV heads × 128 head-dim × 131,072 tokens × 2 bytes × 2 (K and V) = ~43 GB per sequence, before you count the weights or any activations. That is larger than the entire weights of Llama 3.2 3B at FP16.
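If you want to sanity-check that arithmetic yourself, the whole calculation fits in a few lines of Python - the Llama 3.1 70B figures are the ones from the paragraph above; anything else you plug in is your own assumption:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache: K and V stored for every layer, KV head, and token."""
    return layers * kv_heads * head_dim * seq_len * bytes_per_elem * 2  # x2 for K and V

# Llama 3.1 70B at FP16 with 128K context (figures from the text above)
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=131_072)
print(f"{size / 1e9:.1f} GB per sequence")  # ~42.9 GB
```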
Before GPT-4, context was 2K-4K and nobody cared. Once 128K became table stakes, the KV cache started dwarfing the weights and the attention recipe had to change.
Keep full causal attention, shrink what each token stores. GQA shares KV heads across query groups (Llama 2 70B 8:1, Llama 3.1 70B 8:1, Mistral Large 2 12:1, Llama 3.1 405B 16:1). MQA collapses to a single KV head; DeepSeek's MLA compresses K and V into a ~512-dim latent plus a small RoPE slice.
Trade-off: more compression, more quality risk. GQA won the 2024-2025 defaults because it stays simple and every inference engine already supports it.
Attack the O(L²) compute cost instead of the per-token memory. Sliding Window Attention (Mistral 7B, Gemma 2/3) caps each query at a fixed recent window. DeepSeek Sparse Attention (V3.2) uses a lightning indexer to pick the top-k most relevant past tokens per query.
Trade-off: long-range recall vs compute. SWA loses distant tokens entirely; DSA preserves them via content-based selection. Both drop attention FLOPs from quadratic to linear once they saturate.
Gemma 2/3 stacks GQA 2:1 with alternating local/global layers (Gemma 3 runs five local windows for every full layer). DeepSeek V3 pairs MLA with fine-grained MoE so it compresses BOTH the KV footprint and the active parameter count. Llama 4 Scout adds iRoPE on top of GQA + MoE to push context to 10M tokens.
The modern frontier isn't picking one trick - it is layering three or four, each attacking a different cost axis.
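To make the layering concrete, here is a rough sketch of how a Gemma-3-style 5:1 local/global schedule changes the KV cache bill: local layers only cache the window, global layers cache the whole context. The layer count, head counts, and window below are illustrative, not any shipped model's config:

```python
def hybrid_kv_cache_bytes(n_layers, kv_heads, head_dim, seq_len, window,
                          locals_per_global, bytes_per_elem=2):
    """KV cache for a stack that alternates sliding-window and full-attention layers.

    Local layers only keep `window` tokens of K/V; global layers keep the full
    sequence. Illustrative arithmetic only - real configs differ per model.
    """
    per_token = kv_heads * head_dim * bytes_per_elem * 2      # K and V
    n_global = n_layers // (locals_per_global + 1)
    n_local = n_layers - n_global
    return (n_local * min(window, seq_len) + n_global * seq_len) * per_token

all_global = hybrid_kv_cache_bytes(48, 8, 128, 131_072, window=131_072, locals_per_global=0)
hybrid     = hybrid_kv_cache_bytes(48, 8, 128, 131_072, window=1024,    locals_per_global=5)
print(f"{all_global / 1e9:.1f} GB all-global vs {hybrid / 1e9:.1f} GB at 5:1")  # ~25.8 vs ~4.5
```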
The playground above lets you feel the trade-offs. Set context to 128K, switch between variants, and watch the KV cache column swing by an order of magnitude.
Every query head has its own K and V. The original.
Introduced in 'Attention Is All You Need'. Every attention head maintains its own K and V, giving maximum representational capacity and maximum memory footprint. GPT-2, GPT-3, Llama 2 7B, and Phi-3 Mini all use MHA. Past 7B scale, the KV cache becomes the binding serving constraint - which is why 2024+ dense models all moved to GQA.
Share KV heads across groups of query heads.
Llama 2 70B shipped with GQA 8:1 (64 Q heads, 8 KV heads). Today this is the default: Llama 3.1 70B and Qwen 2.5 72B both use 8:1, Mistral Large 2 goes to 12:1, Llama 3.1 405B to 16:1. KV cache shrinks by the group ratio while attention compute is unchanged. Ecosystem compatibility is the biggest reason it won out over MLA in 2024-2025.
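If the sharing pattern feels abstract, here is a minimal PyTorch sketch of the idea (not any engine's actual kernel): the small set of cached KV heads is repeated so each group of query heads attends against the same K and V. Set the KV head count equal to the query head count and you have MHA; set it to 1 and you have MQA.

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """q: (seq, n_q_heads, d), k/v: (seq, n_kv_heads, d). Causal grouped-query attention."""
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    assert n_q_heads % n_kv_heads == 0
    group = n_q_heads // n_kv_heads
    # Each cached KV head serves `group` query heads - this repeat is the whole trick.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    # Move heads to the batch dim for scaled_dot_product_attention: (heads, seq, d).
    q, k, v = (t.transpose(0, 1) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(0, 1)  # back to (seq, n_q_heads, d)

seq, d = 16, 128
q = torch.randn(seq, 32, d)   # 32 query heads
k = torch.randn(seq, 8, d)    # only 8 KV heads are cached -> 4:1, cache is 25% of MHA
v = torch.randn(seq, 8, d)
out = gqa_attention(q, k, v)  # (16, 32, 128)
```

The cache only ever holds the 8 KV heads; the repeat happens at attention time, which is why compute is unchanged while the cache shrinks.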
All query heads share a single KV head.
The extreme limit of GQA - one KV head, shared by every Q head. PaLM and Falcon used it. At frontier scale (70B+) the quality loss is measurable on reasoning benchmarks, which is why the field moved back toward GQA with tuneable ratios rather than forcing 1 KV head. Still seen in some small on-device models where cache size dominates.
Compress K and V into a low-rank latent, decompress per head.
DeepSeek V2's signature move. K and V are projected into a ~512-dim latent per token (plus a small decoupled RoPE component for positional info), cached as the latent, then reconstructed to full-head K and V at attention time. Roughly 10x smaller KV cache than MHA at comparable quality. Requires custom kernels - which is why it took a year for mainstream inference engines to fully support it. DeepSeek V3 and Kimi K2 both use MLA.
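A stripped-down sketch of the cache-the-latent idea, leaving out the decoupled RoPE path and every kernel-level trick - the dimensions loosely follow the ~512-dim latent described above but are otherwise illustrative:

```python
import torch
import torch.nn as nn

class MLACache(nn.Module):
    """Toy MLA: cache one low-rank latent per token, expand to per-head K/V on demand."""
    def __init__(self, d_model=4096, n_heads=32, head_dim=128, latent_dim=512):
        super().__init__()
        self.down = nn.Linear(d_model, latent_dim, bias=False)             # compress
        self.up_k = nn.Linear(latent_dim, n_heads * head_dim, bias=False)  # decompress K
        self.up_v = nn.Linear(latent_dim, n_heads * head_dim, bias=False)  # decompress V
        self.n_heads, self.head_dim = n_heads, head_dim

    def forward(self, h):
        # h: (seq, d_model). Only `latent` goes into the KV cache: 512 values per token
        # instead of 2 * 32 * 128 = 8192 for full per-head K and V.
        latent = self.down(h)
        k = self.up_k(latent).view(-1, self.n_heads, self.head_dim)
        v = self.up_v(latent).view(-1, self.n_heads, self.head_dim)
        return latent, k, v

mla = MLACache()
latent, k, v = mla(torch.randn(16, 4096))   # latent: (16, 512), k/v: (16, 32, 128)
```

The real implementation also absorbs the up-projections into the query and output projections so full-size K and V never have to be materialized for the whole sequence - that trick is a big part of why MLA needs custom kernels.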
Each token only attends within a fixed window of recent tokens.
Mistral 7B shipped with a 4096-token sliding window, dropping attention compute from O(L²) to O(L·window). Gemma 2 and 3 extend the idea with alternating local/global layers - most layers use a 1024- or 4096-token window, occasional full-attention layers preserve long-range retrieval. Pure SWA is rare because stacked sliding windows have attenuation effects; the hybrid pattern is what works.
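The mask itself is the simplest part - a small sketch, with the window size and sequence length made up for illustration:

```python
import torch

def sliding_window_mask(seq_len, window):
    """True where attention is allowed: causal AND within the last `window` positions."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=8, window=3)
# Row i has at most 3 True entries: token i can see i, i-1, i-2 and nothing older.
```

Each query touches at most `window` keys, so attention work grows as O(L·window) instead of O(L²), which is the compute win described above.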
A lightning indexer selects the top-k tokens per query.
The most recent attention variant at time of writing. A cheap indexer module scores all past tokens; the main attention runs only over the top-k selected per query. Brings complexity to O(L·k) - similar to sliding window but with content-aware selection instead of positional windowing. Early results suggest it preserves more long-range quality than pure SWA at the same compute budget. Not yet widespread; V3.2 is the reference implementation.
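Here is a toy, single-head sketch of the select-then-attend pattern for one query position - the dot-product scorer below merely stands in for the trained lightning indexer, and the shapes and top-k value are invented for illustration:

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, idx_q, idx_k, top_k=64):
    """Toy DSA-style step for a single query position (single head).

    q: (d,)               current query
    idx_q: (d_idx,)       its cheap indexer projection
    k, v: (past, d)       cached keys and values
    idx_k: (past, d_idx)  cached indexer projections of past tokens
    """
    scores = idx_k @ idx_q                         # cheap relevance score for every past token
    keep = scores.topk(min(top_k, k.shape[0])).indices
    attn = F.softmax((k[keep] @ q) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v[keep]                          # full attention only over the selected tokens

past, d, d_idx = 4096, 128, 32
out = topk_sparse_attention(torch.randn(d), torch.randn(past, d), torch.randn(past, d),
                            torch.randn(d_idx), torch.randn(past, d_idx))
```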
Vanilla multi-head attention (MHA) stores a K and V pair for every head, every layer, every token. At 128k context and 70B+ params the KV cache exceeds the weights in memory. GQA, MQA, MLA, and sliding-window variants are all attempts to shrink that cache or shrink the O(L²) compute cost, each with different quality trade-offs.
MLA wins on compression (roughly 10x smaller KV cache than MHA at comparable quality, per DeepSeek V2's ablations) but requires custom inference kernels. GQA wins on ecosystem support - every inference engine handles it natively. For production in 2026, pick GQA 8:1 unless you are on DeepSeek or Kimi infrastructure, where MLA is the default.
By itself, yes - information older than the window cannot be directly attended to. Modern SWA models (Gemma 2/3, Mistral 7B) mitigate this by interleaving sliding-window layers with global-attention layers, or stacking layers so information flows through the residual stream across windows. Pure SWA is rare in frontier models for this reason.
Introduced in DeepSeek V3.2 (Sep 2025). A lightning indexer scores every past token cheaply, the main attention runs only over the top-k selected tokens per query. Drops attention compute from O(L²) to O(Lk) while preserving long-range access, unlike sliding window. Requires the indexer to be trained end-to-end with the rest of the model.
Rarely. PaLM and Falcon used it, but at 70B+ scale quality degradation showed up clearly. GQA is the modern replacement - pick your compression via the KV-head ratio instead of forcing a single KV head. Some smaller models still use MQA where compression matters more than absolute quality.
GQA shares KV heads across query heads; MLA compresses K and V into a shared low-rank latent, then reconstructs per-head K and V at attention time. At the same compression ratio, MLA preserves more quality because each head still computes attention with its own full-dimension K and V - the latent is decoded back into distinct per-head keys and values rather than groups of heads being forced to share one K and V.