Explore how different tokenization methods break text into tokens. Essential for understanding NLP and LLMs.
Subword tokenization, as used by GPT models. Merges frequent character sequences into single tokens.
LLMs don't see text; they see sequences of numbers (token IDs). Tokenization converts text into these numerical representations.
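A minimal sketch of this conversion, assuming the open-source tiktoken library (which implements the BPE encodings used by OpenAI models):

```python
# pip install tiktoken
import tiktoken

# o200k_base is the encoding used by GPT-4o; swap in any encoding tiktoken supports.
enc = tiktoken.get_encoding("o200k_base")

text = "Tokenization converts text into numbers."
token_ids = enc.encode(text)

print(token_ids)                             # the model only ever sees these integer IDs
print([enc.decode([t]) for t in token_ids])  # the text piece behind each ID
```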
Subword tokenization (like BPE) balances vocabulary size with the ability to represent any word, even rare ones.
Token count matters for API costs and context limits. GPT-4o supports 128K tokens; GPT-4.1 supports over 1M tokens.
BPE tokenization has surprising pitfalls when dealing with numbers and code. Learn about the BPE tokenization trap.
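A quick way to see the number pitfall is to tokenize a few numeric strings and compare token counts. A sketch, again assuming tiktoken; the exact splits depend on the encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

# Numbers of similar length can split into very different token counts,
# one reason LLMs struggle with digit-level arithmetic.
for number in ["100", "1000", "12345", "3.14159"]:
    ids = enc.encode(number)
    pieces = [enc.decode([t]) for t in ids]
    print(f"{number!r} -> {len(ids)} token(s): {pieces}")
```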
Tokenization is the process of splitting text into smaller units called tokens. These tokens are what language models like GPT and BERT process. Common methods include whitespace splitting, word-level splitting, and subword tokenization like BPE (Byte Pair Encoding).
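A minimal sketch comparing the three approaches in Python. The subword step assumes the tiktoken library; the regex is just one illustrative word-level rule:

```python
import re
import tiktoken

text = "Tokenization isn't magic!"

# Whitespace splitting: break on spaces only.
print(text.split())                      # ["Tokenization", "isn't", "magic!"]

# Word-level splitting: separate word characters from punctuation.
print(re.findall(r"\w+|[^\w\s]", text))  # ['Tokenization', 'isn', "'", 't', 'magic', '!']

# Subword (BPE) tokenization via tiktoken.
enc = tiktoken.get_encoding("o200k_base")
print([enc.decode([t]) for t in enc.encode(text)])
```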
BPE is a subword tokenization algorithm used by GPT models. It starts with individual characters and iteratively merges the most frequent pairs into new tokens. This allows it to handle any word, including rare or misspelled ones, by breaking them into known subword pieces like "un" + "happi" + "ness".
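The merge loop itself is short. Below is a toy training sketch in the style of the original BPE-for-NLP algorithm, using a hand-made corpus; real tokenizers add byte-level handling, special tokens, and far larger vocabularies:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into one symbol, respecting symbol boundaries."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy corpus: each word pre-split into characters, mapped to its frequency.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair wins
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best[0]} + {best[1]}")
```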
LLMs have fixed context windows measured in tokens (e.g., GPT-4o supports 128K tokens). More tokens mean higher API costs since providers charge per token. Understanding tokenization helps you estimate costs and optimize prompt length.
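A rough cost estimate is then just token count times price. A sketch, with a purely hypothetical rate that you should replace with your provider's current pricing:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

prompt = "Summarize the following meeting notes in three bullet points: ..."
n_tokens = len(enc.encode(prompt))

# Hypothetical rate for illustration only; check your provider's price sheet.
USD_PER_1M_INPUT_TOKENS = 2.50

cost = n_tokens * USD_PER_1M_INPUT_TOKENS / 1_000_000
print(f"{n_tokens} input tokens -> ~${cost:.6f}")
```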
Both are subword tokenization methods. BPE (used by GPT) merges the most frequent byte pairs, while WordPiece (used by BERT) selects merges that maximize the likelihood of the training data. In practice they often produce similar segmentations, though the subword boundaries can differ.
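To see the two side by side, you can tokenize the same word with GPT-2's BPE tokenizer and BERT's WordPiece tokenizer via Hugging Face transformers (assumed installed; the exact splits depend on each model's learned vocabulary):

```python
# pip install transformers
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("gpt2")                     # BPE, as used by GPT-2
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece, as used by BERT

for word in ["unhappiness", "tokenization"]:
    print(f"{word}:")
    print("  BPE      :", bpe.tokenize(word))
    print("  WordPiece:", wordpiece.tokenize(word))  # continuation pieces carry a '##' prefix
```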