Explore how different tokenization methods break text into tokens. Essential for understanding NLP and LLMs.
Subword tokenization, as used by GPT models. Merges frequent character sequences into single tokens.
LLMs don't see text; they see sequences of numbers (token IDs). Tokenization converts text into these numerical representations.
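A minimal sketch of this conversion, assuming the open-source tiktoken library (which implements the BPE encodings used by OpenAI models):

```python
# pip install tiktoken
import tiktoken

# o200k_base is the encoding used by GPT-4o; swap in any encoding tiktoken supports.
enc = tiktoken.get_encoding("o200k_base")

text = "Tokenization converts text into numbers."
token_ids = enc.encode(text)

print(token_ids)                             # the model only ever sees these integer IDs
print([enc.decode([t]) for t in token_ids])  # the text piece behind each ID
```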
Subword tokenization (like BPE) balances vocabulary size with the ability to represent any word, even rare ones.
Token count matters for API costs and context limits. GPT-4o supports 128K tokens; GPT-4.1 supports over 1M tokens.
BPE tokenization has surprising pitfalls when dealing with numbers and code. Learn about the BPE tokenization trap.
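A quick way to see the number pitfall is to tokenize a few numeric strings and compare token counts. A sketch, again assuming tiktoken; the exact splits depend on the encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

# Numbers of similar length can split into very different token counts,
# one reason LLMs struggle with digit-level arithmetic.
for number in ["100", "1000", "12345", "3.14159"]:
    ids = enc.encode(number)
    pieces = [enc.decode([t]) for t in ids]
    print(f"{number!r} -> {len(ids)} token(s): {pieces}")
```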
Tokenization is the process of splitting text into smaller units called tokens. These tokens are what language models like GPT and BERT process. Common methods include whitespace splitting, word-level splitting, and subword tokenization like BPE (Byte Pair Encoding).
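A minimal sketch comparing the three approaches in Python. The subword step assumes the tiktoken library; the regex is just one illustrative word-level rule:

```python
import re
import tiktoken

text = "Tokenization isn't magic!"

# Whitespace splitting: break on spaces only.
print(text.split())                      # ["Tokenization", "isn't", "magic!"]

# Word-level splitting: separate word characters from punctuation.
print(re.findall(r"\w+|[^\w\s]", text))  # ['Tokenization', 'isn', "'", 't', 'magic', '!']

# Subword (BPE) tokenization via tiktoken.
enc = tiktoken.get_encoding("o200k_base")
print([enc.decode([t]) for t in enc.encode(text)])
```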
BPE is a subword tokenization algorithm used by GPT models. It starts with individual characters and iteratively merges the most frequent pairs into new tokens. This allows it to handle any word, including rare or misspelled ones, by breaking them into known subword pieces like "un" + "happi" + "ness".
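The merge loop itself is short. Below is a toy training sketch in the style of the original BPE-for-NLP algorithm, using a hand-made corpus; real tokenizers add byte-level handling, special tokens, and far larger vocabularies:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into one symbol, respecting symbol boundaries."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy corpus: each word pre-split into characters, mapped to its frequency.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair wins
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best[0]} + {best[1]}")
```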
LLMs have fixed context windows measured in tokens (e.g., GPT-4o supports 128K tokens). More tokens mean higher API costs since providers charge per token. Understanding tokenization helps you estimate costs and optimize prompt length.
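A rough cost estimate is then just token count times price. A sketch, with a purely hypothetical rate that you should replace with your provider's current pricing:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

prompt = "Summarize the following meeting notes in three bullet points: ..."
n_tokens = len(enc.encode(prompt))

# Hypothetical rate for illustration only; check your provider's price sheet.
USD_PER_1M_INPUT_TOKENS = 2.50

cost = n_tokens * USD_PER_1M_INPUT_TOKENS / 1_000_000
print(f"{n_tokens} input tokens -> ~${cost:.6f}")
```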
Both are subword tokenization methods. BPE (used by GPT) merges the most frequent byte pairs, while WordPiece (used by BERT) selects merges that maximize the likelihood of the training data. In practice they often produce similar segmentations, though the subword boundaries can differ.
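To see the two side by side, you can tokenize the same word with GPT-2's BPE tokenizer and BERT's WordPiece tokenizer via Hugging Face transformers (assumed installed; the exact splits depend on each model's learned vocabulary):

```python
# pip install transformers
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("gpt2")                     # BPE, as used by GPT-2
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece, as used by BERT

for word in ["unhappiness", "tokenization"]:
    print(f"{word}:")
    print("  BPE      :", bpe.tokenize(word))
    print("  WordPiece:", wordpiece.tokenize(word))  # continuation pieces carry a '##' prefix
```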