from typing import Iterable
Imad Dabbura
April 10, 2024
January 13, 2025
What if I told you that the way a language model “sees” the world is entirely shaped by one invisible yet critical process? A process so fundamental that it can make or break the model’s ability to understand context, predict meaning, and even handle different languages. That process is tokenization—the art of breaking down text into smaller, digestible pieces called tokens. And it’s not just a technical detail; it’s the very lens through which language models perceive the world.
But here’s the twist: tokenization isn’t perfect. It’s a balancing act, full of trade-offs and challenges. For instance, languages like English, which dominate training datasets, are tokenized into fewer, denser tokens, allowing models to process more context efficiently. Meanwhile, less-represented languages like Korean are fragmented into far more tokens, forcing the model to work harder to understand the same amount of text. This imbalance can lead to inefficiencies, biases, and even a loss of meaning.
So, how do we strike the right balance? Enter Byte Pair Encoding (BPE)—a clever algorithm that merges frequent character pairs into subwords, creating a compact yet expressive vocabulary. Starting from raw bytes, BPE builds tokens step by step, like assembling a puzzle, ensuring flexibility while maintaining efficiency. It’s a method that not only compresses text but also shapes how much context a model can attend to, directly influencing its performance.
In this post, we’ll unravel the mysteries of Byte Pair Encoding. We’ll explore how it works, why it’s so effective, and the surprising ways it impacts everything from multilingual understanding to computational efficiency. By the end, you’ll see how this seemingly simple process holds the key to unlocking the true potential of language models. Ready to dive in? Let’s decode the magic of tokenization!
Byte Pair Encoding (BPE) is a subword tokenization algorithm originally developed for data compression, now widely used in NLP to handle unknown or rare words by breaking text into frequent subword units. BPE strikes a balance between character-level and word-level representations, enabling open-vocabulary modeling.
Algorithm Steps:
1. Initialize Vocabulary: Begin with a base vocabulary containing all 256 possible byte values (0–255), ensuring full coverage of any input text, and treat the training corpus as a sequence of these byte symbols.
2. Count Symbol Pairs: Scan the corpus and count every adjacent pair of symbols (bytes or subwords).
3. Merge Most Frequent Pair: Identify the most frequent pair of symbols and merge it into a new symbol (e.g., merging “l” and “o” -> “lo”).
4. Update Vocabulary: Add the new symbol to the vocabulary and replace all occurrences of the merged pair in the corpus with it.
5. Repeat steps 2–4 for a predefined number of merge operations (i.e., until the target vocabulary size is reached) or until no pairs remain.
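To make the loop concrete, here is a minimal sketch of a single training round on a toy byte sequence, using only the standard library; the corpus string is invented purely for illustration. The full tokenizer class that follows wraps these steps together.

from collections import Counter

text = "low lower lowest"          # toy corpus, purely illustrative
ids = list(text.encode("utf-8"))   # step 1: start from raw bytes

# Step 2: count adjacent pairs
pair_counts = Counter(zip(ids, ids[1:]))

# Step 3: pick the most frequent pair (here that is ("l", "o"))
best = max(pair_counts, key=pair_counts.get)
print(bytes([best[0]]), bytes([best[1]]), pair_counts[best])

# Step 4: replace every occurrence of `best` with a new id (256)
new_id = 256
merged = []
i = 0
while i < len(ids):
    if i < len(ids) - 1 and (ids[i], ids[i + 1]) == best:
        merged.append(new_id)
        i += 2
    else:
        merged.append(ids[i])
        i += 1
print(len(ids), "->", len(merged))  # the sequence gets shorter after the merge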
class BPETokenizer:
    """Byte-pair encoder."""

    def __init__(self, vocab_sz: int):
        """
        Args:
            vocab_sz (int): Vocabulary size (must be >= 256; the first 256 ids are raw bytes).
        """
        self.vocab_sz = vocab_sz
        self.vocab = {}
        self.merges = {}

    def train(self, text: str):
        """Train Byte-pair encoder on `text`."""
        ids = list(text.encode("utf-8"))
        for idx in range(256, self.vocab_sz):
            stats = self._get_stats(ids)
            if not stats:
                break  # Nothing left to merge
            pair = max(stats, key=stats.get)
            self.merges[pair] = idx
            ids = self._merge(ids, pair, idx)
        self.vocab = self._build_vocab()

    def encode(self, text: str):
        """Encode a string into token ids using the merges learned during training."""
        ids = list(text.encode("utf-8"))
        # If text is empty or a single byte, there is nothing to merge
        while len(ids) >= 2:
            # stats is used only to enumerate the pairs currently adjacent in `ids`
            stats = self._get_stats(ids)
            # Because the vocab (and merges) were built bottom-up, merges must be
            # applied in the order they were learned: later merges can depend on
            # the results of earlier ones. Pairs that were never learned get a
            # rank of infinity, so they are never selected.
            pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
            if pair not in self.merges:
                break  # No more learned pairs to merge
            idx = self.merges[pair]
            ids = self._merge(ids, pair, idx)
        return ids

    def decode(self, tokens: Iterable[int]):
        """Decode token ids into a string using the vocabulary built during training."""
        text_bytes = b"".join(self.vocab[idx] for idx in tokens)
        # Byte sequences that are not valid UTF-8 are replaced with the Unicode
        # replacement character instead of raising an error
        return text_bytes.decode("utf-8", errors="replace")

    def _get_stats(self, ids: list[int]):
        """Count occurrences of each adjacent pair in `ids`."""
        counts = {}
        for pair in zip(ids, ids[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        return counts

    def _merge(self, ids: list[int], pair: tuple[int, int], idx: int):
        """Replace every occurrence of `pair` in `ids` with the new index `idx`."""
        newids = []
        i = 0
        while i < len(ids):
            if i < len(ids) - 1 and pair[0] == ids[i] and pair[1] == ids[i + 1]:
                newids.append(idx)
                i += 2
            else:
                newids.append(ids[i])
                i += 1
        return newids

    def _build_vocab(self):
        """Build the id -> bytes vocabulary from the 256 base bytes and the learned merges."""
        vocab = {idx: bytes([idx]) for idx in range(256)}
        # Dicts preserve insertion order (Python 3.7+), so merges are replayed
        # in the order they were learned
        for (p0, p1), idx in self.merges.items():
            # The new token's bytes are the concatenation of its parts' bytes
            vocab[idx] = vocab[p0] + vocab[p1]
        return vocab
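A quick round-trip shows the tokenizer in action. The training string and vocabulary size below are made up for illustration; the exact merges learned depend on both.

tok = BPETokenizer(vocab_sz=256 + 20)                 # 20 merges on top of the 256 raw bytes
tok.train("low lower lowest newer newest " * 100)     # illustrative corpus
ids = tok.encode("lowest newer")
print(ids)                 # fewer ids than raw bytes, thanks to the learned merges
print(tok.decode(ids))     # round-trips back to "lowest newer"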
Vanilla byte-level BPE applies merges directly on raw byte sequences using a greedy frequency-based heuristic. While this reduces the initial vocabulary size to 256, it leads to inefficient vocabulary usage. For example, common words like “play” might appear with different punctuation (e.g., “play:”, “play?”, “play!”) and the BPE algorithm tends to learn separate tokens for each variant. This redundancy consumes vocabulary slots that could be better utilized.
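You can see this effect with the toy tokenizer above. The corpus and vocabulary size here are invented for illustration, and the exact merges you get may differ, but several slots typically end up as “play” fused with punctuation or whitespace:

tok = BPETokenizer(vocab_sz=256 + 12)
tok.train("play! play? play: played plays " * 200)    # punctuation-heavy toy corpus
for idx in range(256, tok.vocab_sz):
    print(idx, tok.vocab[idx])   # expect variants of "play" glued to punctuation/whitespace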
Additionally, standard BPE implementations operating on Unicode code points would require a base vocabulary of over 130,000 symbols to handle all Unicode strings. This is impractically large for most models, which typically use vocabularies in the 32k–64k range.
To address these issues, GPT-2 uses a modified byte-level BPE approach:
It starts with a 256-symbol byte vocabulary.
It prevents merges across character categories (e.g., letters and punctuation), reducing redundancy.
It allows merging with spaces to improve compression without overly fragmenting words.
This approach balances the open-vocabulary benefits of byte-level encoding with the compression and efficiency gains of BPE, while preserving the ability to assign probabilities to any Unicode string without requiring preprocessing or token normalization.
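In practice, the category rule is enforced by pre-splitting text with a regular expression before any pair counts are taken, so merges never cross chunk boundaries. Below is a minimal sketch using the pre-tokenization pattern published in OpenAI's GPT-2 code; it requires the third-party regex package for the \p{L}/\p{N} character classes.

import regex as re

# Pattern from OpenAI's GPT-2 encoder.py: common contractions, letters, numbers,
# other characters, and whitespace each form their own chunks, with an optional
# leading space attached to words and numbers
gpt2_pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(re.findall(gpt2_pat, "Hello world! It's 2024."))
# ['Hello', ' world', '!', ' It', "'s", ' 2024', '.']
# BPE merges are applied within each chunk, so "world" and "!" can never fuse
# into a single token, while the leading space still compresses well.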