Breaking Text Apart (The Smart Way)

From single characters to advanced subword splits — see how modern tokenizers like WordPiece and SentencePiece prepare language for AI.

NLP
Author

Imad Dabbura

Published

January 14, 2023

Modified

March 7, 2024

Introduction

Tokenization sits at the foundation of every NLP system — and it’s where more bugs, performance failures, and cross-lingual headaches originate than most practitioners expect.

The core problem: neural networks can’t consume raw text. They need numbers. Tokenization is the bridge — converting a string into a sequence of integer IDs that the model can embed and process. But how you make that conversion has enormous downstream consequences: for vocabulary size, sequence length, out-of-vocabulary handling, and multilingual generalization.

There are three fundamental strategies, sitting on a spectrum from fine-grained to coarse:

  • Character tokenization: split at every character — maximum granularity, minimum vocabulary
  • Word tokenization: split at word boundaries — minimum granularity, maximum vocabulary
  • Subword tokenization: split rules learned from corpus statistics — the practical sweet spot used by every modern LLM

We’ll work through each in turn with concrete code, then zoom in on the two subword algorithms that dominate modern NLP: WordPiece (BERT, DistilBERT) and BPE, used via SentencePiece by XLM-R and LLaMA and in byte-level form by GPT-family models.

Tokenization Process

The tokenization pipeline has four stages, each with a distinct job:

  • Normalization: Clean the raw text before any splitting. Common operations include Unicode normalization (collapsing different byte representations of the same character), lowercasing, and accent stripping. Critically, what gets normalized here is permanent — the model never sees the original form.

  • Pretokenization: Split the normalized text into coarse units, typically words or word-like chunks. For English and German, splitting on whitespace and punctuation works well. For languages like Japanese or Chinese — which have no whitespace — language-specific rules or character-level splits are used instead.

  • Tokenizer model: Apply the learned subword splitting algorithm (WordPiece, BPE, Unigram, etc.) to each pretokenized chunk. This is the only trained stage — everything else is rule-based. The vocabulary and merge rules come from the pretraining corpus.

  • Postprocessing: Wrap the token sequence with any model-specific special tokens. BERT prepends [CLS] and inserts [SEP] between sequences. XLM-R uses <s> and </s>. These tokens have specific learned representations and must be consistent between pretraining and fine-tuning.

The Pipeline Is Framework-Agnostic

This four-stage structure underpins Hugging Face tokenizers, SentencePiece, and most production tokenizer implementations. Most unexpected token outputs trace back to either normalization (e.g., surprise lowercasing or accent stripping) or postprocessing (missing or double-added special tokens).
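The first two stages are directly inspectable. A quick sketch, assuming a fast Hugging Face tokenizer (its backend_tokenizer attribute exposes the normalizer and pre-tokenizer):

from transformers import AutoTokenizer

# Peek at the normalization and pretokenization stages of a fast tokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
tok.backend_tokenizer.normalizer.normalize_str("Héllo WORLD!")
'hello world!'
tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str("hello world!")
[('hello', (0, 5)), ('world', (6, 11)), ('!', (11, 12))]

The lowercasing and accent stripping happen before the model ever sees the text, which is exactly the kind of "permanent" normalization the first stage warns about.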

Tokenization Strategies

There are three core tokenization schemes. Before diving in, here’s a preview of the trade-offs that motivate the progression from characters to subwords:

Strategy     Vocab size           Sequence length   OOV handling       Multilingual
Character    Tiny (~100s)         Very long         ✅ None             ✅ Natural
Word         Huge (millions)      Short             ❌ UNK collapse     ⚠️ Poor
Subword      Medium (10K–100K)    Medium            ✅ Decompose        ✅ Good

The pattern is clear: characters and words are opposite extremes, each with a disqualifying flaw. Subword tokenization is the engineered middle ground — and why every modern LLM uses it.

Character Tokenization

Character tokenization is the simplest possible approach: split the input string into individual characters and treat each one as a token. No learned vocabulary, no language-specific rules — just list(text). It’s the floor of the granularity spectrum.

text = "I love NLP!"
list(text)
['I', ' ', 'l', 'o', 'v', 'e', ' ', 'N', 'L', 'P', '!']

From here, it is easy to convert each character into an integer that can be fed to the model. This step is called numericalization. We can numericalize the text above by first building the vocabulary and then converting each character to its corresponding index:

vocab = {char: idx for idx, char in enumerate(sorted(set(text)))}
print(vocab)
{' ': 0, '!': 1, 'I': 2, 'L': 3, 'N': 4, 'P': 5, 'e': 6, 'l': 7, 'o': 8, 'v': 9}

Now we can simply map each token (character in this case) to its own corresponding index:

[vocab[char] for char in text]
[2, 0, 7, 8, 9, 6, 0, 4, 3, 5, 1]
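Decoding is just the inverse lookup: build the reverse mapping and join the characters back together.

inv_vocab = {idx: char for char, idx in vocab.items()}
"".join(inv_vocab[idx] for idx in [vocab[char] for char in text])
'I love NLP!'
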
Why Character Tokenization Is Appealing
  • No out-of-vocabulary problem: every possible input — misspellings, code, emojis, neologisms — is representable from the same small fixed alphabet
  • Tiny vocabulary: ~100 characters for English. The embedding matrix and output projection stay small, which reduces parameter count and memory
Why Character Tokenization Fails in Practice
  • Sequences become extremely long: “I love NLP!” becomes 11 tokens. A typical 512-word document becomes several thousand characters. For Transformers with quadratic attention cost, this is prohibitively expensive
  • No free linguistic priors: the model has no prior knowledge that l, o, v, e together constitute a meaningful unit. Recovering word-level and phrase-level structure from raw characters requires far more data, compute, and model depth than most tasks justify
  • Context window exhaustion: with fixed-length context windows, very long character sequences mean the model can attend to only a small slice of a document at a time, losing long-range dependencies that often carry the most important signal

Word Tokenization

Word tokenization takes the opposite approach: split on whitespace (and often punctuation) and treat each word as an atomic token. Sequences stay short and tokens carry recognizable meaning — but the vocabulary problem quickly becomes unmanageable at scale.

text.split()
['I', 'love', 'NLP!']
vocab = {word: idx for idx, word in enumerate(sorted(set(text.split())))}
print(vocab)
{'I': 0, 'NLP!': 1, 'love': 2}
[vocab[word] for word in text.split()]
[0, 2, 1]

Most production word tokenizers go beyond whitespace splitting and include language-specific heuristics — for example, separating contractions like “doesn’t” into “does” and “n’t”, or splitting punctuation from adjacent words. These rules improve coverage but don’t solve the fundamental vocabulary size and OOV problems.
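A rough illustration of what such heuristics look like, as a hypothetical regex-based splitter (far simpler than what spaCy or NLTK's Treebank tokenizer actually do):

import re

# Toy word tokenizer: peel off "n't" and common clitics such as 's / 're,
# and split punctuation from adjacent words. Illustrative only.
def simple_word_tokenize(s):
    pattern = r"\w+?(?=n't)|n't|'(?:s|re|ve|ll|d|m)|\w+|[^\w\s]"
    return re.findall(pattern, s)

simple_word_tokenize("She doesn't love NLP!")
['She', 'does', "n't", 'love', 'NLP', '!']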

Why Word Tokenization Seems Appealing
  • Short sequences: “I love NLP!” is 3 tokens. The model attends to far more context within the same fixed-length window
  • Tokens carry meaning directly: each token maps to a recognizable linguistic unit, giving the model useful priors without learning from scratch
Why Word Tokenization Breaks Down
  • Vocabulary explosion: a large corpus contains millions of distinct word forms — inflections, misspellings, punctuation variants, domain-specific terms. An embedding table with 1M entries at dimension 512 requires ~500M parameters for the embedding layer alone. Truncating to the top-N words forces everything else to [UNK], which silently destroys information — the model has no way to recover what word was there
  • Under-trained embeddings: rare words appear too infrequently to accumulate meaningful gradient signal. They occupy slots in the vocabulary without learning useful representations — wasted capacity
  • Language boundary failures: languages without clear word boundaries (Japanese, Chinese, Thai) have no natural whitespace to split on. Word tokenization either silently fails or requires expensive language-specific preprocessing at training and inference time

Subword Tokenization

Subword tokenization is the engineered middle ground between the two extremes. The core insight: most words in any language are built from a small set of recurring morphemes — prefixes, roots, suffixes. “tokenization”, “tokenizer”, “tokenized” all share the root “token”. Word tokenization throws that structure away by treating each form as an unrelated atomic entry. Character tokenization preserves the raw signal but forces the model to discover linguistic structure from scratch, without any priors.

Subword algorithms exploit this structure directly. They learn a vocabulary of high-frequency subword units from a large pretraining corpus. Common words like “love” stay as single tokens. Rare or novel words get decomposed into familiar pieces: “tokenization” → ["token", "##ization"] in WordPiece, or ["▁token", "ization"] in SentencePiece. The model has seen “token” thousands of times and has a rich representation for it — that representation is now available even when encountering “detokenization” for the first time.

This also handles misspellings and out-of-domain terms gracefully. “GPT-4o” doesn’t need to be in the vocabulary — it gets decomposed into known subwords rather than collapsing to [UNK].
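You can see this decomposition directly by tokenizing unseen words with a pretrained tokenizer (the checkpoint is just an example; the exact pieces depend on its learned vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tok.tokenize("detokenization")
# splits into known subword pieces rather than [UNK], e.g. ['det', '##oken', '##ization']
tok.tokenize("GPT-4o")
# likewise decomposed into pieces the vocabulary already contains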

Two algorithms dominate modern NLP: WordPiece (BERT, DistilBERT) and BPE (via SentencePiece in XLM-R and LLaMA; in byte-level form in GPT-family models). Both learn subword vocabularies from corpus statistics, but they use different objectives and produce different tokenization behavior — differences that matter when debugging cross-lingual failures or unexpected token splits.

WordPiece

WordPiece is the subword algorithm behind BERT and DistilBERT. Like BPE, it starts with a character-level vocabulary and iteratively merges pairs — but the key difference is in how it chooses which pair to merge next.

BPE picks the most frequent pair. WordPiece picks the pair that maximizes the likelihood of the training corpus when merged. Concretely, for a candidate pair \((u, v)\), it evaluates:

\[\text{score}(u, v) = \frac{\text{count}(uv)}{\text{count}(u) \times \text{count}(v)}\]

This is a pointwise mutual information criterion: it rewards pairs that appear together more than their individual frequencies would predict. Merging “##iz” with “##ation” scores high not just because the bigram is frequent, but because seeing “##iz” almost always predicts “##ation” — the merge buys maximum information.

The training process:

  1. Initialize the vocabulary with all characters in the corpus, prepending ## to all characters that don’t start a word
  2. Score every adjacent pair using the PMI formula above
  3. Merge the highest-scoring pair and add it to the vocabulary
  4. Repeat until the vocabulary reaches the target size (BERT uses 30,000)

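To make the scoring rule concrete, here is a toy sketch of a single merge-selection step (illustrative only; real implementations weight by word frequency and run the full merge loop):

from collections import Counter

# Each "word" is its current sequence of subword units
corpus = [
    ["t", "##o", "##k", "##e", "##n"],
    ["t", "##o", "##p"],
    ["t", "##o", "##k", "##e", "##n", "##s"],
]

unit_counts = Counter(u for word in corpus for u in word)
pair_counts = Counter(
    (word[i], word[i + 1]) for word in corpus for i in range(len(word) - 1)
)

# score(u, v) = count(uv) / (count(u) * count(v))
scores = {
    (u, v): count / (unit_counts[u] * unit_counts[v])
    for (u, v), count in pair_counts.items()
}
max(scores, key=scores.get)
('##k', '##e')

Plain BPE would pick the most frequent pair here, (t, ##o); the WordPiece score instead favors pairs such as (##k, ##e), whose parts almost never occur apart.
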
The ## prefix is the signature of WordPiece. It marks continuation subwords — pieces that are not at the start of a word boundary. So ["nl", "##p"] means: “nl” starts a word, “##p” continues it. Reconstructing the original word means stripping ## and concatenating.

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
encoded_text = tokenizer(text)
encoded_text
{'input_ids': [101, 1045, 2293, 17953, 2361, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
tokenizer.convert_ids_to_tokens(encoded_text["input_ids"])
['[CLS]', 'i', 'love', 'nl', '##p', '!', '[SEP]']

Reading the DistilBERT output token by token:

  • [CLS] — a special classification token prepended to every sequence. Its final hidden state is used as the aggregate sequence representation for classification tasks
  • i — “I” was lowercased (this is the uncased checkpoint, distilbert-base-uncased)
  • love — a common English word; gets its own token
  • nl — the first subword of “NLP”. “NLP” is rare enough in BERT’s training corpus that it was never merged into a single token
  • ##p — continues from “nl”. The ## prefix signals “this piece is not at a word boundary — attach it to the previous token”
  • ! — punctuation gets its own token
  • [SEP] — marks the end of a sequence (or the boundary between two sequences in sentence-pair tasks)
Decoding the ## Prefix

When you see ## in WordPiece output, it means: strip the ## and concatenate directly to the previous token. ["nl", "##p"] → "nlp". ["un", "##believ", "##able"] → "unbelievable". The ## is how WordPiece encodes which subwords are word-internal vs. word-initial — critical for reconstructing the original string.

tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(encoded_text["input_ids"])
)
'[CLS] i love nlp ! [SEP]'
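A hand-rolled version of the same ## handling makes the convention explicit (a minimal sketch that also drops the special tokens):

def wordpiece_detokenize(tokens):
    words = []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]", "[PAD]"):
            continue                      # drop special tokens
        if tok.startswith("##"):
            words[-1] += tok[2:]          # continuation: glue onto the previous piece
        else:
            words.append(tok)             # word-initial piece
    return " ".join(words)

wordpiece_detokenize(['[CLS]', 'i', 'love', 'nl', '##p', '!', '[SEP]'])
'i love nlp !'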

SentencePiece

SentencePiece is a language-agnostic tokenization library that implements both BPE and unigram language model algorithms. Two properties make it the dominant choice for multilingual models.

First: it treats the input as a raw Unicode character stream — no language-specific pretokenization required. It never assumes whitespace marks word boundaries, which means it works equally well on English, Chinese, Japanese, Arabic, and any language mixture. This is why XLM-R, mT5, and LLaMA all use SentencePiece.

Second: it uses ▁ (U+2581, LOWER ONE EIGHTH BLOCK) to encode the start of a new word. Rather than marking continuation pieces like WordPiece does with ##, SentencePiece marks word starts. A ▁ at the beginning of a token means “there was a space before this character in the original text.” Absence of ▁ means “this token is a continuation.”

The BPE algorithm it implements:

  1. Initialize the vocabulary with individual Unicode characters plus an end-of-word marker
  2. Count all adjacent character pairs across the corpus
  3. Merge the most frequent pair into a new subword unit
  4. Repeat until the vocabulary reaches the target size

Unlike WordPiece’s PMI-based selection, BPE uses raw frequency. It’s simpler but produces similar results in practice — both algorithms converge on vocabularies dominated by common morphemes.

BPE vs. Unigram in SentencePiece

SentencePiece supports two algorithms. BPE builds the vocabulary bottom-up by merging. Unigram starts with a large candidate vocabulary and prunes it by removing tokens that minimally reduce the likelihood of the training corpus — a top-down approach. Unigram is used by XLNet and some multilingual models; BPE is more common. Both are interchangeable in the SentencePiece API.
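Training a SentencePiece model is a few lines. A minimal sketch, assuming a plain-text file corpus.txt (one sentence per line) and placeholder settings:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",            # raw text, one sentence per line
    model_prefix="my_tokenizer",   # writes my_tokenizer.model / my_tokenizer.vocab
    vocab_size=8000,
    model_type="bpe",              # or "unigram"; same API, different algorithm
)

sp = spm.SentencePieceProcessor(model_file="my_tokenizer.model")
sp.encode("I love NLP!", out_type=str)
# e.g. ['▁I', '▁love', '▁N', 'LP', '!']; exact pieces depend on the training corpus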

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
encoded_text = tokenizer(text)
encoded_text
{'input_ids': [0, 87, 5161, 541, 37352, 38, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
tokenizer.convert_ids_to_tokens(encoded_text["input_ids"])
['<s>', '▁I', '▁love', '▁N', 'LP', '!', '</s>']

Reading the XLM-R output token by token:

  • <s> — sequence start token (XLM-R’s equivalent of [CLS])
  • ▁I — the ▁ prefix means “there was a space before this character.” Since “I” starts the sentence (treated as if preceded by whitespace), it still gets ▁
  • ▁love — common word, single token; ▁ marks it as word-initial
  • ▁N — “NLP” is split; ▁N is the word-initial piece
  • LP — continues from ▁N; no ▁ prefix (it’s a word-internal continuation)
  • ! — punctuation token
  • </s> — sequence end token (XLM-R’s equivalent of [SEP])
WordPiece ## vs. SentencePiece ▁ — Two Sides of the Same Coin

These two prefixes encode word boundary information in opposite ways:

Tokenizer                      Marker     Meaning
WordPiece (BERT)               ##token    This piece continues the previous word
SentencePiece (XLM-R, LLaMA)   ▁token     A space preceded this character — new word starts here

Both fully encode the original whitespace and allow perfect string reconstruction. The difference is convention, not capability. But you need to know which convention a tokenizer uses when writing postprocessing code to detokenize outputs.

tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(encoded_text["input_ids"])
)
'<s> I love NLP!</s>'
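The manual counterpart for the ▁ convention is just as short (a minimal sketch): concatenate the pieces, then turn each ▁ back into a space.

def sentencepiece_detokenize(pieces):
    text = "".join(p for p in pieces if p not in ("<s>", "</s>", "<pad>"))
    return text.replace("\u2581", " ").strip()   # ▁ (U+2581) marks word starts

sentencepiece_detokenize(['<s>', '▁I', '▁love', '▁N', 'LP', '!', '</s>'])
'I love NLP!'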

Conclusion

The three tokenization strategies form a clear hierarchy in practice:

  • Character tokenization is essentially unused in production NLP. Sequence lengths become prohibitively long for Transformer attention, and the model must learn linguistic structure entirely from scratch. It survives in niche applications: character-level language models, certain byte-level models (GPT-2 uses byte-level BPE as a starting point), and as a fallback for extremely small vocabularies.

  • Word tokenization appears in legacy systems and simple bag-of-words pipelines, but fails at scale. Vocabulary explosion, [UNK] collapse, and multilingual brittleness make it unsuitable for anything pretrained on broad corpora.

  • Subword tokenization is the universal standard for pretrained language models. WordPiece and SentencePiece BPE both solve the core trade-offs: bounded vocabulary, graceful OOV handling, multilingual coverage, and sequences short enough for Transformer attention.

Always Use the Tokenizer the Model Was Trained With

When fine-tuning a pretrained model, you must use the exact same tokenizer — not just the same algorithm, but the same vocabulary file. The model’s embedding matrix maps token ID 1045 to a learned vector for the word “i” (in DistilBERT). Swap in a different tokenizer and ID 1045 now refers to something else entirely. The embeddings become noise, the model is unrecoverable, and fine-tuning won’t fix it. This applies to vocabulary size, normalization rules, and special token placements — all of it must match pretraining exactly.
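In practice this simply means loading the tokenizer and the model from the same checkpoint identifier (a sketch; the checkpoint and label count are examples):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)   # same vocab, normalization, special tokens
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

batch = tokenizer(["I love NLP!"], padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)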

Most practical work doesn’t require building tokenizers from scratch — Hugging Face tokenizers and SentencePiece handle it. What matters operationally is understanding the output: recognizing ## vs. ▁ markers, knowing which special tokens a model expects and in what order, and catching normalization surprises (casing, accent stripping) before they cause silent failures downstream.
