```mermaid
graph BT
    l["l (byte)"] --> lo["lo (merge 1)"]
    o["o (byte)"] --> lo
    lo --> low["low (merge 2)"]
    w["w (byte)"] --> low
    low --> lowe["lowe (merge 3)"]
    e["e (byte)"] --> lowe
    style l fill:#f0f0f0,stroke:#999
    style o fill:#f0f0f0,stroke:#999
    style w fill:#f0f0f0,stroke:#999
    style e fill:#f0f0f0,stroke:#999
    style lo fill:#d4e6f1,stroke:#2980b9
    style low fill:#aed6f1,stroke:#2471a3
    style lowe fill:#85c1e9,stroke:#1a5276
```
## Why Tokenization Matters
When you type “unhappiness” into ChatGPT, the model doesn’t see the word “unhappiness.” It sees something like ["un", "happ", "iness"] — three tokens that were chosen by an algorithm months before the model was even trained. That algorithm decided, based on statistics from a massive training corpus, that these three pieces are the right granularity. Not individual characters (too many tokens, too little meaning per token). Not whole words (too many unique words, no way to handle words never seen in training). Subwords — the sweet spot.
This isn’t a minor preprocessing detail. Tokenization defines what the model can see. Consider three strategies on the same sentence:
| Strategy | “The cat sat unhappily” becomes | Tokens | Vocab Size |
|---|---|---|---|
| Character-level | ["T","h","e"," ","c","a","t"," ","s","a","t"," ","u","n","h","a","p","p","i","l","y"] | 21 | ~256 |
| Word-level | ["The", "cat", "sat", "unhappily"] | 4 | 100,000+ |
| Subword (BPE) | ["The", " cat", " sat", " un", "happ", "ily"] | 6 | ~50,000 |
With characters, a fixed context window of 2048 tokens covers ~400 words. With subwords, the same window covers ~1500 words — nearly 4× more context for the model to reason over. Word-level is compact but brittle: “unhappily” might never appear in training data, making it an <UNK> token the model is completely blind to. But “un”, “happ”, and “ily” almost certainly do appear — the model can compose meaning from pieces it knows.
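The character-level and word-level rows are easy to verify in plain Python; the subword split requires a trained tokenizer, which is exactly what we build in the rest of this post:

```python
sentence = "The cat sat unhappily"

# Character-level: one token per character, spaces included
print(len(list(sentence)))    # 21

# Word-level: split on whitespace; any word missing from the vocabulary becomes <UNK>
print(len(sentence.split()))  # 4
```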
The algorithm that learns these splits is Byte Pair Encoding (BPE) — originally a data compression technique (Gage, 1994), adapted for NLP by Sennrich et al. (2016), and now used in GPT-2, GPT-3/4, LLaMA, and most modern language models. In this post, we’ll understand how it works, implement it from scratch, and see how GPT-2 refined the basic algorithm for production.
## How BPE Works
The core insight is simple: if two symbols frequently appear next to each other, they probably belong together. Merge them into a single token, then look for the next most frequent pair, and repeat. It’s exactly how you’d compress a text file — find repeated patterns and replace them with shorter symbols. Frequent patterns get absorbed into single tokens; rare patterns stay as smaller pieces.
Think of it like learning abbreviations. If you keep writing “machine learning” in your notes, you’d eventually start writing “ML.” Then if “ML model” keeps appearing, maybe you’d abbreviate that too. BPE does the same thing, but systematically and bottom-up — starting from the smallest units (bytes) and building up to subwords.
## Seeing It in Action
Before formalizing the algorithm, let’s watch it work on a real example. Consider a tiny training corpus containing the words "low lower lowest":
| Step | Token Sequence | Most Frequent Pair | New Token | Count |
|---|---|---|---|---|
| Start | l o w _ l o w e r _ l o w e s t | — | — | — |
| Merge 1 | lo w _ lo w e r _ lo w e s t | (l, o) | lo | 3× |
| Merge 2 | low _ low e r _ low e s t | (lo, w) | low | 3× |
| Merge 3 | low _ lowe r _ lowe s t | (low, e) | lowe | 2× |
BPE discovered that l and o always appear together, then that lo and w always appear together, building up low as a token — effectively learning the word stem. Then it found lowe as a shared prefix of “lower” and “lowest.” Without any linguistic rules, purely from frequency, BPE learned morphological structure.
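Merge 1 from the table is easy to reproduce by hand. The sketch below counts adjacent pairs over the raw characters and applies the single most frequent merge; it is only an illustration, and the full byte-level implementation comes later in the post.

```python
from collections import Counter

tokens = list("low lower lowest")

# Count every adjacent pair of tokens
pair_counts = Counter(zip(tokens, tokens[1:]))
best_pair = max(pair_counts, key=pair_counts.get)
print(best_pair, pair_counts[best_pair])  # ('l', 'o') 3

# Replace every occurrence of the best pair with a single merged token
merged, i = [], 0
while i < len(tokens):
    if tokens[i : i + 2] == list(best_pair):
        merged.append("".join(best_pair))
        i += 2
    else:
        merged.append(tokens[i])
        i += 1
print(merged)  # ['lo', 'w', ' ', 'lo', 'w', 'e', 'r', ' ', 'lo', 'w', 'e', 's', 't']
```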
Notice what happened in merge 2: the algorithm merged lo with w, where lo itself was created in merge 1. BPE builds tokens hierarchically, with later merges composing earlier ones; the merge tree diagram at the top of this post visualizes exactly that buildup.
Each level in the tree depends on the levels below it. This is why merge order matters during encoding — you can’t build low until lo exists.
## The Algorithm
With the intuition in place, here’s the formal procedure:
1. Initialize: Start with a base vocabulary of all 256 byte values (0–255). Every string can be represented as bytes, so this guarantees full coverage — no `<UNK>` tokens, ever.
2. Count pairs: Scan the corpus and count every adjacent pair of tokens.
3. Merge the most frequent pair: Create a new token for it and replace all occurrences in the corpus.
4. Repeat steps 2–3 until you’ve done `vocab_size - 256` merges.
The output is a merge table: an ordered list of pair → token mappings. This table is the tokenizer.
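For the three merges in the table above, the learned merge table looks like this (in the byte-level implementation later in this post, pairs are integer token IDs and newly created IDs start at 256):

```python
# Ordered mapping: (pair of token IDs) -> new token ID.
# Order matters: merges 2 and 3 reference tokens created by earlier merges.
merges = {
    (108, 111): 256,  # merge 1: b"l" + b"o" -> "lo"
    (256, 119): 257,  # merge 2: "lo" + b"w" -> "low"
    (257, 101): 258,  # merge 3: "low" + b"e" -> "lowe"
}
```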
## Training vs. Encoding: A Subtle Difference
There’s an important asymmetry between how BPE learns merges (training) and how it applies them to new text (encoding).
During training, we always merge the globally most frequent pair — that’s how we decide which merges to learn. But during encoding, we apply merges in the order they were learned, not by their frequency in the new text.
A common misconception is that encoding finds the most frequent pair in the new text and merges it. It doesn’t. Encoding applies the training-time merges in their original order. The token low only exists after lo has been created (merge 1). If we tried to merge (lo, w) before creating lo, we’d never find the pair. In the implementation, this shows up as min(stats, key=lambda p: self.merges.get(p, float("inf"))) — picking the pair with the lowest merge index, not the highest frequency.
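Here is a toy illustration of that line, reusing the merge table sketched above: among the pairs present in the text being encoded, we pick the one with the smallest merge index.

```python
merges = {(108, 111): 256, (256, 119): 257, (257, 101): 258}

# Adjacent pairs currently present in the sequence being encoded
stats = {(256, 119): 1, (101, 115): 1, (108, 111): 1}

# Pick the pair that was learned EARLIEST, not the one that is most frequent here.
# Pairs that were never learned map to infinity and are never selected.
pair = min(stats, key=lambda p: merges.get(p, float("inf")))
print(pair)  # (108, 111) -- its merge index, 256, is the smallest
```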
## Why Bytes, Not Characters?
Starting from bytes (0–255) rather than Unicode code points is a practical decision. Unicode has over 150,000 code points — that’s an impractically large base vocabulary. By working at the byte level, we start with just 256 symbols and can represent any string in any language or script. BPE merges then learn to compose bytes into characters, characters into subwords, and subwords into common words — all driven by frequency in the training data.
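For example, UTF-8 already turns any string, in any script, into a sequence of values between 0 and 255, so 256 base tokens are guaranteed to cover everything:

```python
# Every string decomposes into bytes in the range 0-255
print(list("hello".encode("utf-8")))  # [104, 101, 108, 108, 111]
print(list("héllo".encode("utf-8")))  # [104, 195, 169, 108, 108, 111]  ('é' takes two bytes)
print(list("你好".encode("utf-8")))    # [228, 189, 160, 229, 165, 189]  (three bytes per character)
```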
Languages underrepresented in training data pay a tokenization tax. English “hello” might be one token, but the same greeting in a low-resource language could take 3–4 tokens because the byte sequences were never frequent enough to merge. This means the model burns more of its context window on the same content — a well-documented source of multilingual inefficiency (Petrov et al., 2023). It also means the model takes more compute per word for these languages, making inference more expensive.
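You can observe this tax directly with OpenAI's tiktoken library (not used elsewhere in this post; the snippet below is just a quick way to compare token counts yourself, since exact numbers depend on the tokenizer and the text):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")

for text in ["hello, how are you?", "你好，你好吗？"]:
    tokens = enc.encode(text)
    print(f"{text!r}: {len(tokens)} tokens for {len(text)} characters")
```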
## Vocabulary Size: A Key Hyperparameter
How many merges should we do? This determines the vocabulary size (256 base bytes plus one new token per merge), and it’s a meaningful trade-off:
| Vocab Size | Tokens per Text | Embedding Table | Trade-off |
|---|---|---|---|
| Small (~1k) | Many — close to character-level | Tiny | Better generalization on rare words, but sequences are long and training is slow |
| Medium (~32k–50k) | Moderate — good compression | Manageable | The sweet spot for most models (GPT-2: 50k, LLaMA: 32k) |
| Large (~100k+) | Few — common phrases become single tokens | Very large | Risk of overfitting to training distribution; rare tokens get poorly trained embeddings |
Larger vocabularies mean each token carries more information, so sequences are shorter and the model sees more context per forward pass. But each token also needs an embedding vector, so the embedding table grows linearly. And tokens that appear rarely in training will have poorly learned embeddings — they’ve simply not been seen enough times.
Most modern models settle in the 32k–100k range. GPT-2 uses ~50k tokens. LLaMA uses 32k. GPT-4 reportedly uses ~100k. The right size depends on the training data, the target languages, and the compute budget.
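To put the embedding-table cost in numbers: the table has one row of size `d_model` per token, so it grows linearly with the vocabulary. GPT-2's 50,257-token vocabulary with a hidden size of 768 already costs roughly 38.6M parameters before any transformer layers.

```python
# Token-embedding parameter count grows linearly with vocabulary size
def embedding_params(vocab_size: int, d_model: int) -> int:
    return vocab_size * d_model

print(f"{embedding_params(50_257, 768):,}")    # GPT-2 small:        38,597,376
print(f"{embedding_params(32_000, 4_096):,}")  # LLaMA-7B (approx.): 131,072,000
```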
## Implementation
Let’s turn the algorithm into code. The BPETokenizer class below has four core methods, each mapping directly to a step we’ve discussed:
- `train`: The learning loop — encode the corpus to bytes, then greedily merge the most frequent pair `vocab_size - 256` times. Each merge is recorded in `self.merges` as a `(pair) → index` mapping. This ordered dictionary is the tokenizer.
- `encode`: The encoding step — convert new text to bytes, then apply merges in learned order (earliest first, using the `min` trick we discussed). This is where training order matters: we pick the pair with the smallest merge index, not the most frequent.
- `decode`: The inverse — look up each token ID in the vocabulary to get its byte sequence, concatenate, and decode back to a string.
- `_get_stats` / `_merge`: Helpers that count adjacent pairs and replace a specific pair with its merged token throughout a sequence.
One implementation detail: _build_vocab relies on Python 3.7+ dictionary insertion order. Since merges are inserted chronologically, iterating self.merges replays them in order — each merged token is the byte-concatenation of its two parents, which must already exist in the vocabulary.
```python
from typing import Iterable

import requests


class BPETokenizer:
    """Byte-pair encoder."""

    def __init__(self, vocab_sz: int):
        """
        Args:
            vocab_sz (int): Vocabulary size.
        """
        self.vocab_sz = vocab_sz
        self.vocab = {}
        self.merges = {}

    def train(self, text: str):
        """Train the byte-pair encoder on `text`."""
        ids = list(text.encode("utf-8"))
        for idx in range(256, self.vocab_sz):
            stats = self._get_stats(ids)
            pair = max(stats, key=stats.get)
            self.merges[pair] = idx
            ids = self._merge(ids, pair, idx)
        self.vocab = self._build_vocab()

    def encode(self, text: str):
        """Encode a string into token IDs using the merges learned during training."""
        ids = list(text.encode("utf-8"))
        # If the text is empty or a single byte, there is nothing to merge
        while len(ids) >= 2:
            # stats is used only to enumerate the adjacent pairs present in `ids`
            stats = self._get_stats(ids)
            # Because the vocabulary (and merges) were built bottom-up, apply merges
            # in the order they were learned: pick the pair with the smallest merge
            # index, since later pairs depend on tokens created by earlier merges.
            # Pairs never seen during training rank as infinity.
            pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
            if pair not in self.merges:
                break  # No learned pairs left in the sequence
            idx = self.merges[pair]
            ids = self._merge(ids, pair, idx)
        return ids

    def decode(self, tokens: Iterable[int]):
        """Decode token IDs into a string using the vocabulary built during training."""
        data = b"".join(self.vocab[idx] for idx in tokens)
        # Byte sequences that don't form valid UTF-8 are replaced with the
        # replacement character instead of raising an error
        return data.decode("utf-8", errors="replace")

    def _get_stats(self, ids: Iterable[int]):
        """Count adjacent pairs."""
        counts = {}
        for pair in zip(ids, ids[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        return counts

    def _merge(self, ids: Iterable[int], pair: Iterable[int], idx: int):
        """Replace every occurrence of `pair` in `ids` with the new token `idx`."""
        newids = []
        i = 0
        while i < len(ids):
            if i < len(ids) - 1 and pair[0] == ids[i] and pair[1] == ids[i + 1]:
                newids.append(idx)
                i += 2
            else:
                newids.append(ids[i])
                i += 1
        return newids

    def _build_vocab(self):
        """Build the vocabulary from the 256 base bytes and the learned merges."""
        vocab = {idx: bytes([idx]) for idx in range(256)}
        # Relies on dict insertion order (Python 3.7+): merges are replayed in the
        # order they were learned, so both parents already exist in the vocabulary.
        for (p0, p1), idx in self.merges.items():
            vocab[idx] = vocab[p0] + vocab[p1]  # byte concatenation of the parents
        return vocab


text = requests.get("https://docs.python.org/3/library/stdtypes.html#bytes.decode").text
tokenizer = BPETokenizer(300)
tokenizer.train(text)
print(tokenizer.decode(tokenizer.encode(text)) == text)  # True
```
## From Vanilla BPE to GPT-2’s Tokenizer
The implementation above is vanilla byte-level BPE — it works, but it has a practical problem. Because merges are purely frequency-driven, the algorithm doesn’t respect word boundaries. The word “play” might appear in the corpus as “play.”, “play!”, “play,”, and “play” — and BPE will learn separate tokens for each variant, wasting vocabulary slots on what is essentially the same word with different punctuation.
GPT-2 (Radford et al., 2019) introduced a key refinement: pre-tokenization with a regex pattern that splits text into chunks before BPE runs. The regex prevents merges from crossing certain boundaries — letters can’t merge with digits, punctuation stays separate from words, and spaces attach to the beginning of words rather than the end.
The GPT-2 regex pattern:
```
'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
```
This ensures that:
- Contractions are split cleanly: “don’t” → `["don", "'t"]`
- Spaces attach to the next word: “ hello” stays together, preserving word boundaries
- Punctuation stays isolated: “play!” → `["play", "!"]` instead of learning “play!” as one token
- Digits don’t merge with letters: “h3llo” → `["h", "3", "llo"]`
BPE then runs within each chunk independently. The result: a much cleaner vocabulary where tokens correspond to linguistically meaningful units rather than artifacts of adjacent punctuation.
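To see the chunking in action, here is a quick sketch using the third-party `regex` module (the standard-library `re` does not support the `\p{...}` classes used in the pattern):

```python
import regex  # pip install regex; stdlib `re` lacks \p{L} / \p{N} support

# The GPT-2 pre-tokenization pattern shown above
gpt2_pat = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

print(regex.findall(gpt2_pat, "don't play! h3llo   world"))
# ['don', "'t", ' play', '!', ' h', '3', 'llo', '  ', ' world']
```

BPE merges then happen only inside each of these chunks, never across them.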
Use Tiktokenizer to see how GPT-2 and GPT-4 tokenize arbitrary text. Try pasting the same sentence in English and another language — you’ll immediately see the multilingual tokenization tax in action: the non-English version will use significantly more tokens for the same meaning.
This pre-tokenization pattern has been refined in later models. GPT-4 uses a more sophisticated pattern that handles apostrophes, numbers, and whitespace more carefully, and also limits the length of digit sequences to avoid learning overly specific number tokens. The core idea remains the same: constrain where merges can happen to produce a more useful vocabulary.
## References & Resources
- Gage, P. (1994). A New Algorithm for Data Compression. The C Users Journal. The original BPE paper — a compression algorithm that found new life in NLP.
- Sennrich, R. et al. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016. The paper that adapted BPE for NLP tokenization.
- Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. The GPT-2 paper that introduced byte-level BPE with regex pre-tokenization.
- Karpathy, A. (2024). Let’s build the GPT Tokenizer. Excellent video walkthrough of building a BPE tokenizer from scratch.
- A Programmer’s Introduction to Unicode — why bytes vs. code points matters.
- UTF-8 Everywhere — the case for UTF-8 as the universal encoding.
- Tiktokenizer — interactive web app to visualize how different tokenizers split text.