Interactive reference for open-weights language model architectures. Every model has a full transformer diagram, live tensor shape tracking, and a KV cache memory calculator that responds to context length in real time. Architecture specs verified against the original paper, HuggingFace config.json, and a secondary reference for every model.
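The calculator's inputs map directly onto fields in a model's config.json. As a rough sketch of the underlying formula (parameter names follow the usual HuggingFace conventions; the example values are illustrative, not taken from the site), the KV cache grows linearly with context length:

```python
# Minimal sketch of the KV cache memory formula a calculator like this is based on.
# num_layers, num_kv_heads, head_dim mirror common config.json fields; the example
# config below is an assumption for illustration only.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Keys and values are each [batch, kv_heads, context, head_dim] per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * batch_size * bytes_per_elem

# Example: a Llama-3-8B-like config (32 layers, 8 KV heads, head_dim 128) at 8k context
print(kv_cache_bytes(32, 8, 128, 8192) / 2**30, "GiB")  # ~1.0 GiB in fp16
```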
Every modern open-weights LLM (GPT, Llama, Mistral, Qwen, Gemma, DeepSeek) shares the same decoder-only transformer skeleton: stacked blocks of normalization + attention + feedforward, wired with residual connections. Models differ in the choices inside each slot: the attention variant (MHA vs GQA vs MLA), the positional encoding (learned vs RoPE), the normalization (LayerNorm vs RMSNorm), the activation (GELU vs SwiGLU), the feedforward (dense vs MoE), and the scale of each dimension. Pick a model below to see its exact wiring.
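As a minimal sketch of that shared skeleton (the stand-in modules below are placeholders, not any specific model's layers), the pre-norm residual wiring looks like this in PyTorch:

```python
# Shared pre-norm decoder block skeleton. The norm, attention, and feedforward
# modules are the "slots" each model fills differently; the wiring stays the same.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, norm1: nn.Module, attention: nn.Module,
                 norm2: nn.Module, feedforward: nn.Module):
        super().__init__()
        self.norm1, self.attention = norm1, attention
        self.norm2, self.feedforward = norm2, feedforward

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection around the normalized attention slot ...
        x = x + self.attention(self.norm1(x))
        # ... and around the normalized feedforward slot.
        x = x + self.feedforward(self.norm2(x))
        return x

# Stacking identical blocks gives the whole decoder trunk. Linear layers stand in
# for attention and the MLP here just to show that shapes are preserved.
d_model = 64
block = DecoderBlock(nn.LayerNorm(d_model), nn.Linear(d_model, d_model),
                     nn.LayerNorm(d_model), nn.Linear(d_model, d_model))
h = torch.randn(1, 16, d_model)   # [batch, seq, d_model]
print(block(h).shape)             # torch.Size([1, 16, 64]), same as the input
```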
Auto-generated diff, radar chart, KV cache scaling, and “when to pick which” decision matrix.
MHA, GQA, MQA, MLA, sliding window, DeepSeek sparse attention. Live KV cache and FLOPs math with a toggle and sliders.
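As a hedged illustration of why the attention variant dominates KV cache size (the head counts below are hypothetical, chosen only to contrast the variants), the per-token cache scales with the number of KV heads; MLA is left out because it caches a compressed latent rather than full K/V heads:

```python
# Per-token KV cache size under different attention variants, assuming the same
# number of layers and head_dim. Values are illustrative, not taken from the site.
def kv_per_token_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

layers, heads, head_dim = 32, 32, 128
variants = {
    "MHA (kv_heads = heads)": heads,  # every query head has its own K/V
    "GQA (kv_heads = 8)":     8,      # query heads share K/V within groups
    "MQA (kv_heads = 1)":     1,      # a single K/V head serves all queries
}
for name, kv_heads in variants.items():
    kib = kv_per_token_bytes(layers, kv_heads, head_dim) / 1024
    print(f"{name:26s} {kib:6.0f} KiB per token")  # 512, 128, 16 KiB respectively
```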