```python
text = "I love NLP!"
list(text)
# ['I', ' ', 'l', 'o', 'v', 'e', ' ', 'N', 'L', 'P', '!']
```
From single characters to advanced subword splits — see how modern tokenizers like WordPiece and SentencePiece prepare language for AI.
Imad Dabbura
January 14, 2023
March 7, 2024
Tokenization sits at the foundation of every NLP system — and it’s where more bugs, performance failures, and cross-lingual headaches originate than most practitioners expect.
The core problem: neural networks can’t consume raw text. They need numbers. Tokenization is the bridge — converting a string into a sequence of integer IDs that the model can embed and process. But how you make that conversion has enormous downstream consequences: for vocabulary size, sequence length, out-of-vocabulary handling, and multilingual generalization.
There are three fundamental strategies, sitting on a spectrum from fine-grained to coarse: character tokenization, word tokenization, and subword tokenization.
We’ll work through each in turn with concrete code, then zoom in on the two subword algorithms that dominate modern NLP: WordPiece (BERT, DistilBERT) and BPE via SentencePiece (XLM-R, LLaMA, GPT-family models).
The tokenization pipeline has four stages, each with a distinct job:
Normalization: Clean the raw text before any splitting. Common operations include Unicode normalization (collapsing different byte representations of the same character), lowercasing, and accent stripping. Critically, what gets normalized here is permanent — the model never sees the original form.
Pretokenization: Split the normalized text into coarse units, typically words or word-like chunks. For English and German, splitting on whitespace and punctuation works well. For languages like Japanese or Chinese — which have no whitespace — language-specific rules or character-level splits are used instead.
Tokenizer model: Apply the learned subword splitting algorithm (WordPiece, BPE, Unigram, etc.) to each pretokenized chunk. This is the only trained stage — everything else is rule-based. The vocabulary and merge rules come from the pretraining corpus.
Postprocessing: Wrap the token sequence with any model-specific special tokens. BERT prepends [CLS] and inserts [SEP] between sequences. XLM-R uses <s> and </s>. These tokens have specific learned representations and must be consistent between pretraining and fine-tuning.
This four-stage structure underpins Hugging Face tokenizers, SentencePiece, and most production tokenizer implementations. Most unexpected token outputs trace back to either normalization (e.g., surprise lowercasing or accent stripping) or postprocessing (missing or double-added special tokens).
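To make the four stages concrete, here is a toy end-to-end sketch. The normalizer uses Python's `unicodedata`; the "tokenizer model" stage is a plain vocabulary lookup standing in for a trained WordPiece/BPE model, and the tiny vocabulary is invented for illustration:

```python
import unicodedata

def normalize(text):
    # Stage 1: Unicode NFD decomposition, accent stripping, lowercasing.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()

def pretokenize(text):
    # Stage 2: coarse whitespace split (fine for English, not for Japanese/Chinese).
    return text.split()

def model(chunks, vocab):
    # Stage 3: stand-in for the learned subword model -- a plain lookup with an
    # [UNK] fallback instead of real WordPiece/BPE splitting.
    return [w if w in vocab else "[UNK]" for w in chunks]

def postprocess(tokens):
    # Stage 4: wrap with BERT-style special tokens.
    return ["[CLS]"] + tokens + ["[SEP]"]

vocab = {"i", "love", "nlp"}
print(postprocess(model(pretokenize(normalize("I love NLP!")), vocab)))
# ['[CLS]', 'i', 'love', '[UNK]', '[SEP]']
```

Note how "NLP!" collapses to [UNK] here: the naive whitespace pretokenizer leaves punctuation glued to the word, which is exactly the kind of surprise the real pipeline's pretokenization rules exist to prevent.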
There are three core tokenization schemes. Before diving in, here’s a preview of the trade-offs that motivate the progression from characters to subwords:
| Strategy | Vocab size | Sequence length | OOV handling | Multilingual |
|---|---|---|---|---|
| Character | Tiny (~100s) | Very long | ✅ None | ✅ Natural |
| Word | Huge (millions) | Short | ❌ UNK collapse | ⚠️ Poor |
| Subword | Medium (10K–100K) | Medium | ✅ Decompose | ✅ Good |
The pattern is clear: characters and words are opposite extremes, each with a disqualifying flaw. Subword tokenization is the engineered middle ground — and why every modern LLM uses it.
Character tokenization is the simplest possible approach: split the input string into individual characters and treat each one as a token. No learned vocabulary, no language-specific rules — just list(text). It’s the floor of the granularity spectrum.
From here, it is easy to convert each character into an integer ID that can be fed to the model. This step is called numericalization. We can numericalize the text above by first building the vocabulary and then converting each character to its corresponding index:
{' ': 0, '!': 1, 'I': 2, 'L': 3, 'N': 4, 'P': 5, 'e': 6, 'l': 7, 'o': 8, 'v': 9}
Now we can simply map each token (character in this case) to its own corresponding index:
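Putting both steps together as a runnable sketch (pure Python, no libraries):

```python
text = "I love NLP!"

# Build the vocabulary: one entry per unique character, sorted for determinism.
vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}
print(vocab)
# {' ': 0, '!': 1, 'I': 2, 'L': 3, 'N': 4, 'P': 5, 'e': 6, 'l': 7, 'o': 8, 'v': 9}

# Map each character to its index.
input_ids = [vocab[ch] for ch in text]
print(input_ids)
# [2, 0, 7, 8, 9, 6, 0, 4, 3, 5, 1]
```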
Characters carry little meaning on their own; the model must learn from scratch that l, o, v, e together constitute a meaningful unit. Recovering word-level and phrase-level structure from raw characters requires far more data, compute, and model depth than most tasks justify.

Word tokenization takes the opposite approach: split on whitespace (and often punctuation) and treat each word as an atomic token. Sequences stay short and tokens carry recognizable meaning — but the vocabulary problem quickly becomes unmanageable at scale.
{'I': 0, 'NLP!': 1, 'love': 2}
Most production word tokenizers go beyond whitespace splitting and include language-specific heuristics — for example, separating contractions like “doesn’t” into “does” and “n’t”, or splitting punctuation from adjacent words. These rules improve coverage but don’t solve the fundamental vocabulary size and OOV problems.
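A naive whitespace version is a few lines of pure Python. Note that "NLP!" becomes its own vocabulary entry, distinct from a hypothetical "NLP" token, which is exactly the coverage problem the heuristics above try to patch:

```python
text = "I love NLP!"

# Naive whitespace split: punctuation stays glued to adjacent words.
tokens = text.split()
print(tokens)
# ['I', 'love', 'NLP!']

vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
print(vocab)
# {'I': 0, 'NLP!': 1, 'love': 2}

input_ids = [vocab[w] for w in tokens]
print(input_ids)
# [0, 2, 1]
```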
Worse, every word outside the vocabulary is mapped to a single catch-all token, [UNK], which destroys information silently — the model has no way to recover what word was there.

Subword tokenization is the engineered middle ground between the two extremes. The core insight: most words in any language are built from a small set of recurring morphemes — prefixes, roots, suffixes. “tokenization”, “tokenizer”, “tokenized” all share the root “token”. Word tokenization throws that structure away by treating each form as an unrelated atomic entry. Character tokenization preserves the raw signal but forces the model to discover linguistic structure from scratch, without any priors.
Subword algorithms exploit this structure directly. They learn a vocabulary of high-frequency subword units from a large pretraining corpus. Common words like “love” stay as single tokens. Rare or novel words get decomposed into familiar pieces: “tokenization” → ["token", "##ization"] in WordPiece, or ["▁token", "ization"] in SentencePiece. The model has seen “token” thousands of times and has a rich representation for it — that representation is now available even when encountering “detokenization” for the first time.
This also handles misspellings and out-of-domain terms gracefully. “GPT-4o” doesn’t need to be in the vocabulary — it gets decomposed into known subwords rather than collapsing to [UNK].
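The [UNK] collapse is easy to demonstrate with a toy word-level vocabulary (the entries here are invented for illustration):

```python
# A fixed word-level vocabulary with an [UNK] fallback.
word_vocab = {"[UNK]": 0, "i": 1, "love": 2, "nlp": 3}

def encode(text):
    return [word_vocab.get(w, word_vocab["[UNK]"]) for w in text.lower().split()]

print(encode("i love nlp"))            # [1, 2, 3]
print(encode("i love transformers"))   # [1, 2, 0]
print(encode("i love tranformers"))    # [1, 2, 0]  (a typo collapses the same way)
```

A novel word and a misspelled word both map to ID 0: the model receives identical input for two different strings, and the distinction is gone. Subword decomposition avoids this entirely.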
Two algorithms dominate modern NLP: WordPiece (BERT, DistilBERT) and BPE via SentencePiece (XLM-R, LLaMA, GPT-family models). Both learn subword vocabularies from corpus statistics, but they use different objectives and produce different tokenization behavior — differences that matter when debugging cross-lingual failures or unexpected token splits.
WordPiece is the subword algorithm behind BERT and DistilBERT. Like BPE, it starts with a character-level vocabulary and iteratively merges pairs — but the key difference is in how it chooses which pair to merge next.
BPE picks the most frequent pair. WordPiece picks the pair that maximizes the likelihood of the training corpus when merged. Concretely, for a candidate pair \((u, v)\), it evaluates:
\[\text{score}(u, v) = \frac{\text{count}(uv)}{\text{count}(u) \times \text{count}(v)}\]
This is a pointwise mutual information criterion: it rewards pairs that appear together more than their individual frequencies would predict. Merging “##iz” with “##ation” scores high not just because the bigram is frequent, but because seeing “##iz” almost always predicts “##ation” — the merge buys maximum information.
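The selection difference is easy to see with toy statistics (the counts below are made up for illustration, not real corpus numbers):

```python
from collections import Counter

# Hypothetical corpus statistics (invented counts, for illustration only).
unit_counts = Counter({"##iz": 500, "##ation": 600, "th": 9000, "e": 20000})
pair_counts = Counter({("##iz", "##ation"): 480, ("th", "e"): 5000})

def wordpiece_score(u, v):
    # score(u, v) = count(uv) / (count(u) * count(v))
    return pair_counts[(u, v)] / (unit_counts[u] * unit_counts[v])

# BPE would merge ("th", "e") first: it is the more frequent pair.
# WordPiece merges ("##iz", "##ation"): the pair is near-deterministic.
print(wordpiece_score("##iz", "##ation"))  # 480 / (500 * 600)   = 0.0016
print(wordpiece_score("th", "e"))          # 5000 / (9000 * 20000) ≈ 2.8e-05
```

Even though ("th", "e") occurs ten times more often, its components are so common individually that the co-occurrence is unremarkable; the WordPiece score ranks ("##iz", "##ation") far higher.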
The training process:

1. Initialize the vocabulary with every character in the corpus, prepending ## to all characters that don’t start a word.
2. Score every adjacent pair of vocabulary units using the formula above.
3. Merge the highest-scoring pair into a new vocabulary entry.
4. Repeat until the vocabulary reaches its target size (30,522 for BERT).

The ## prefix is the signature of WordPiece. It marks continuation subwords — pieces that are not at the start of a word boundary. So ["nl", "##p"] means: “nl” starts a word, “##p” continues it. Reconstructing the original word means stripping ## and concatenating.
```python
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
encoded_text = tokenizer(text)
encoded_text
# {'input_ids': [101, 1045, 2293, 17953, 2361, 999, 102],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
# ['[CLS]', 'i', 'love', 'nl', '##p', '!', '[SEP]']
```
Reading the DistilBERT output token by token:
- `[CLS]` — a special classification token prepended to every sequence. Its final hidden state is used as the aggregate sequence representation for classification tasks.
- `i` — “I” was lowercased (DistilBERT uses distilbert-base-**uncased**).
- `love` — a common English word; gets its own token.
- `nl` — the first subword of “NLP”. “NLP” is rare enough in BERT’s training corpus that it was never merged into a single token.
- `##p` — continues from “nl”. The ## prefix signals “this piece is not at a word boundary — attach it to the previous token”.
- `!` — punctuation gets its own token.
- `[SEP]` — marks the end of a sequence (or the boundary between two sequences in sentence-pair tasks).

**The ## Prefix**
When you see ## in WordPiece output, it means: strip the ## and concatenate directly to the previous token. ["nl", "##p"] → "nlp". ["un", "##believ", "##able"] → "unbelievable". The ## is how WordPiece encodes which subwords are word-internal vs. word-initial — critical for reconstructing the original string.
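At inference time, WordPiece splits each word by greedy longest-match-first lookup against the learned vocabulary. A minimal sketch of that procedure (the tiny vocabulary is assumed for illustration; the real implementation adds details like a maximum word length):

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first splitting, WordPiece-style."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry ##
            if candidate in vocab:
                piece = candidate  # longest known piece starting at `start`
                break
            end -= 1
        if piece is None:
            return [unk]  # no known prefix: the whole word collapses to [UNK]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"nl", "##p", "un", "##believ", "##able", "love"}
print(wordpiece_encode("nlp", vocab))           # ['nl', '##p']
print(wordpiece_encode("unbelievable", vocab))  # ['un', '##believ', '##able']
```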
SentencePiece is a language-agnostic tokenization library that implements both BPE and unigram language model algorithms. Two properties make it the dominant choice for multilingual models.
First: it treats the input as a raw Unicode character stream — no language-specific pretokenization required. It never assumes whitespace marks word boundaries, which means it works equally well on English, Chinese, Japanese, Arabic, and any language mixture. This is why XLM-R, mT5, and LLaMA all use SentencePiece.
Second: it uses ▁ (U+2581, lower one-eighth block) to encode the start of a new word. Rather than marking continuation pieces like WordPiece does with ##, SentencePiece marks word-starts. A ▁ at the beginning of a token means “there was a space before this character in the original text.” Absence of ▁ means “this token is a continuation.”
The BPE algorithm it implements:

1. Start with a base vocabulary of individual characters (or bytes).
2. Count every adjacent pair of symbols in the training corpus.
3. Merge the most frequent pair into a single new token and add it to the vocabulary.
4. Repeat until the target vocabulary size is reached.
Unlike WordPiece’s PMI-based selection, BPE uses raw frequency. It’s simpler but produces similar results in practice — both algorithms converge on vocabularies dominated by common morphemes.
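The frequency-based merge loop fits in a short sketch. The corpus below is a classic toy example (the words and counts are invented for illustration):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word as a tuple of symbols, with its frequency.
words = {tuple("lower"): 5, tuple("low"): 7, tuple("newest"): 6, tuple("widest"): 3}
merges = []
for _ in range(3):  # three merge steps
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)
print(merges)
# [('l', 'o'), ('lo', 'w'), ('e', 's')]
```

After two steps, "low" is already a single token; the learned merge list is exactly what a trained BPE tokenizer replays, in order, on new text.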
SentencePiece supports two algorithms. BPE builds the vocabulary bottom-up by merging. Unigram starts with a large candidate vocabulary and prunes it by removing tokens that minimally reduce the likelihood of the training corpus — a top-down approach. Unigram is used by XLNet and some multilingual models; BPE is more common. Both are interchangeable in the SentencePiece API.
```python
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
encoded_text = tokenizer(text)
encoded_text
# {'input_ids': [0, 87, 5161, 541, 37352, 38, 2],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
# ['<s>', '▁I', '▁love', '▁N', 'LP', '!', '</s>']
```
Reading the XLM-R output token by token:
- `<s>` — sequence start token (XLM-R’s equivalent of [CLS]).
- `▁I` — the ▁ prefix means “there was a space before this character.” Since “I” starts the sentence (treated as if preceded by whitespace), it gets ▁.
- `▁love` — common word, single token; ▁ marks it as word-initial.
- `▁N` — “NLP” is split; ▁N is the word-initial piece.
- `LP` — continues from ▁N, no ▁ prefix (it’s a word-internal continuation).
- `!` — punctuation token.
- `</s>` — sequence end token (XLM-R’s equivalent of [SEP]).

**WordPiece ## vs. SentencePiece ▁ — Two Sides of the Same Coin**
These two prefixes encode word boundary information in opposite ways:
| Tokenizer | Marker | Meaning |
|---|---|---|
| WordPiece (BERT) | `##token` | This piece continues the previous word |
| SentencePiece (XLM-R, LLaMA) | `▁token` | A space preceded this character — new word starts here |
Both fully encode the original whitespace and allow perfect string reconstruction. The difference is convention, not capability. But you need to know which convention a tokenizer uses when writing postprocessing code to detokenize outputs.
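Each convention can be inverted in a few lines. A sketch of naive detokenizers for both (note that the WordPiece version, joining on spaces, re-inserts a space before punctuation — real detokenizers add cleanup rules for that):

```python
def detokenize_wordpiece(tokens):
    # ## means: glue this piece to the previous one; otherwise a new word starts.
    words, current = [], ""
    for tok in tokens:
        if tok.startswith("##"):
            current += tok[2:]
        else:
            if current:
                words.append(current)
            current = tok
    if current:
        words.append(current)
    return " ".join(words)

def detokenize_sentencepiece(tokens):
    # ▁ (U+2581) means: a space preceded this piece in the original text.
    return "".join(tokens).replace("\u2581", " ").strip()

print(detokenize_wordpiece(["i", "love", "nl", "##p", "!"]))
# i love nlp !
print(detokenize_sentencepiece(["\u2581I", "\u2581love", "\u2581N", "LP", "!"]))
# I love NLP!
```

The SentencePiece round trip is lossless by construction, which is one reason generative models favor it: `detokenize(tokenize(text))` returns the original string exactly.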
The three tokenization strategies form a clear hierarchy in practice:
Character tokenization is essentially unused in production NLP. Sequence lengths become prohibitively long for Transformer attention, and the model must learn linguistic structure entirely from scratch. It survives in niche applications: character-level language models, certain byte-level models (GPT-2 uses byte-level BPE as a starting point), and as a fallback for extremely small vocabularies.
Word tokenization appears in legacy systems and simple bag-of-words pipelines, but fails at scale. Vocabulary explosion, [UNK] collapse, and multilingual brittleness make it unsuitable for anything pretrained on broad corpora.
Subword tokenization is the universal standard for pretrained language models. WordPiece and SentencePiece BPE both solve the core trade-offs: bounded vocabulary, graceful OOV handling, multilingual coverage, and sequences short enough for Transformer attention.
When fine-tuning a pretrained model, you must use the exact same tokenizer — not just the same algorithm, but the same vocabulary file. The model’s embedding matrix maps token ID 1045 to a learned vector for the word “i” (in DistilBERT). Swap in a different tokenizer and ID 1045 now refers to something else entirely. The embeddings become noise, the model is unrecoverable, and fine-tuning won’t fix it. This applies to vocabulary size, normalization rules, and special token placements — all of it must match pretraining exactly.
Most practical work doesn’t require building tokenizers from scratch — Hugging Face tokenizers and SentencePiece handle it. What matters operationally is understanding the output: recognizing ## vs ▁ markers, knowing which special tokens a model expects and in what order, and catching normalization surprises (casing, accent stripping) before they cause silent failures downstream.