Breaking Text Apart (The Smart Way)

From single characters to advanced subword splits — see how modern tokenizers like WordPiece and SentencePiece prepare language for AI.

NLP
Author

Imad Dabbura

Published

January 14, 2023

Modified

March 7, 2024

Introduction

Tokenization sits at the foundation of every NLP system — and it’s where more bugs, performance failures, and cross-lingual headaches originate than most practitioners expect.

The core problem: neural networks can’t consume raw text. They need numbers. Tokenization is the bridge — converting a string into a sequence of integer IDs that the model can embed and process. But how you make that conversion has enormous downstream consequences: for vocabulary size, sequence length, out-of-vocabulary handling, and multilingual generalization.

There are three fundamental strategies, sitting on a spectrum from fine-grained to coarse:

  • Character tokenization: split at every character — maximum granularity, minimum vocabulary
  • Word tokenization: split at word boundaries — minimum granularity, maximum vocabulary
  • Subword tokenization: split rules learned from corpus statistics — the practical sweet spot used by every modern LLM

We’ll work through each in turn with concrete code, then zoom in on the two subword algorithms that dominate modern NLP: WordPiece (BERT, DistilBERT) and BPE, used via SentencePiece by XLM-R and LLaMA and in byte-level form by GPT-family models.

Tokenization Process

The tokenization pipeline has four stages, each with a distinct job:

  • Normalization: Clean the raw text before any splitting. Common operations include Unicode normalization (collapsing different byte representations of the same character), lowercasing, and accent stripping. Critically, what gets normalized here is permanent — the model never sees the original form.

  • Pretokenization: Split the normalized text into coarse units, typically words or word-like chunks. For English and German, splitting on whitespace and punctuation works well. For languages like Japanese or Chinese — which have no whitespace — language-specific rules or character-level splits are used instead.

  • Tokenizer model: Apply the learned subword splitting algorithm (WordPiece, BPE, Unigram, etc.) to each pretokenized chunk. This is the only trained stage — everything else is rule-based. The vocabulary and merge rules come from the pretraining corpus.

  • Postprocessing: Wrap the token sequence with any model-specific special tokens. BERT prepends [CLS] and inserts [SEP] between sequences. XLM-R uses <s> and </s>. These tokens have specific learned representations and must be consistent between pretraining and fine-tuning.

The Pipeline Is Framework-Agnostic

This four-stage structure underpins Hugging Face tokenizers, SentencePiece, and most production tokenizer implementations. Most unexpected token outputs trace back to either normalization (e.g., surprise lowercasing or accent stripping) or postprocessing (missing or double-added special tokens).
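The first two stages are directly inspectable. A quick sketch, assuming a fast Hugging Face tokenizer (its backend_tokenizer attribute exposes the normalizer and pre-tokenizer):

from transformers import AutoTokenizer

# Peek at the normalization and pretokenization stages of a fast tokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
tok.backend_tokenizer.normalizer.normalize_str("Héllo WORLD!")
'hello world!'
tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str("hello world!")
[('hello', (0, 5)), ('world', (6, 11)), ('!', (11, 12))]

The lowercasing and accent stripping happen before the model ever sees the text, which is exactly the kind of "permanent" normalization the first stage warns about.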

Tokenization Strategies

There are three core tokenization schemes. Before diving in, here’s a preview of the trade-offs that motivate the progression from characters to subwords:

Strategy     Vocab size           Sequence length   OOV handling       Multilingual
Character    Tiny (~100s)         Very long         ✅ None             ✅ Natural
Word         Huge (millions)      Short             ❌ UNK collapse     ⚠️ Poor
Subword      Medium (10K–100K)    Medium            ✅ Decompose        ✅ Good

The pattern is clear: characters and words are opposite extremes, each with a disqualifying flaw. Subword tokenization is the engineered middle ground — and why every modern LLM uses it.

Character Tokenization

Character tokenization is the simplest possible approach: split the input string into individual characters and treat each one as a token. No learned vocabulary, no language-specific rules — just list(text). It’s the floor of the granularity spectrum.

text = "I love NLP!"
list(text)
['I', ' ', 'l', 'o', 'v', 'e', ' ', 'N', 'L', 'P', '!']

From here, it is easy to convert each character into an integer that can be fed to the model. This step is called numericalization. We can numericalize the text above by first building the vocabulary and then converting each character to its corresponding index:

vocab = {char: idx for idx, char in enumerate(sorted(set(text)))}
print(vocab)
{' ': 0, '!': 1, 'I': 2, 'L': 3, 'N': 4, 'P': 5, 'e': 6, 'l': 7, 'o': 8, 'v': 9}

Now we can simply map each token (character in this case) to its own corresponding index:

[vocab[char] for char in text]
[2, 0, 7, 8, 9, 6, 0, 4, 3, 5, 1]
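Decoding is just the inverse lookup: build the reverse mapping and join the characters back together.

inv_vocab = {idx: char for char, idx in vocab.items()}
"".join(inv_vocab[idx] for idx in [vocab[char] for char in text])
'I love NLP!'
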
Why Character Tokenization Is Appealing
  • No out-of-vocabulary problem: every possible input — misspellings, code, emojis, neologisms — is representable from the same small fixed alphabet
  • Tiny vocabulary: ~100 characters for English. The embedding matrix and output projection stay small, which reduces parameter count and memory
Why Character Tokenization Fails in Practice
  • Sequences become extremely long: “I love NLP!” becomes 11 tokens. A typical 512-word document becomes several thousand characters. For Transformers with quadratic attention cost, this is prohibitively expensive
  • No free linguistic priors: the model has no prior knowledge that l, o, v, e together constitute a meaningful unit. Recovering word-level and phrase-level structure from raw characters requires far more data, compute, and model depth than most tasks justify
  • Context window exhaustion: with fixed-length context windows, very long character sequences mean the model can attend to only a small slice of a document at a time, losing long-range dependencies that often carry the most important signal

Word Tokenization

Word tokenization takes the opposite approach: split on whitespace (and often punctuation) and treat each word as an atomic token. Sequences stay short and tokens carry recognizable meaning — but the vocabulary problem quickly becomes unmanageable at scale.

text.split()
['I', 'love', 'NLP!']
vocab = {word: idx for idx, word in enumerate(sorted(set(text.split())))}
print(vocab)
{'I': 0, 'NLP!': 1, 'love': 2}
[vocab[word] for word in text.split()]
[0, 2, 1]

Most production word tokenizers go beyond whitespace splitting and include language-specific heuristics — for example, separating contractions like “doesn’t” into “does” and “n’t”, or splitting punctuation from adjacent words. These rules improve coverage but don’t solve the fundamental vocabulary size and OOV problems.
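A rough illustration of what such heuristics look like, as a hypothetical regex-based splitter (far simpler than what spaCy or NLTK's Treebank tokenizer actually do):

import re

# Toy word tokenizer: peel off "n't" and common clitics such as 's / 're,
# and split punctuation from adjacent words. Illustrative only.
def simple_word_tokenize(s):
    pattern = r"\w+?(?=n't)|n't|'(?:s|re|ve|ll|d|m)|\w+|[^\w\s]"
    return re.findall(pattern, s)

simple_word_tokenize("She doesn't love NLP!")
['She', 'does', "n't", 'love', 'NLP', '!']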

Why Word Tokenization Seems Appealing
  • Short sequences: “I love NLP!” is 3 tokens. The model attends to far more context within the same fixed-length window
  • Tokens carry meaning directly: each token maps to a recognizable linguistic unit, giving the model useful priors without learning from scratch
Why Word Tokenization Breaks Down
  • Vocabulary explosion: a large corpus contains millions of distinct word forms — inflections, misspellings, punctuation variants, domain-specific terms. An embedding table with 1M entries at dimension 512 requires ~500M parameters for the embedding layer alone. Truncating to the top-N words forces everything else to [UNK], which silently destroys information — the model has no way to recover what word was there
  • Under-trained embeddings: rare words appear too infrequently to accumulate meaningful gradient signal. They occupy slots in the vocabulary without learning useful representations — wasted capacity
  • Language boundary failures: languages without clear word boundaries (Japanese, Chinese, Thai) have no natural whitespace to split on. Word tokenization either silently fails or requires expensive language-specific preprocessing at training and inference time

Subword Tokenization

Subword tokenization is the engineered middle ground between the two extremes. The core insight: most words in any language are built from a small set of recurring morphemes — prefixes, roots, suffixes. “tokenization”, “tokenizer”, “tokenized” all share the root “token”. Word tokenization throws that structure away by treating each form as an unrelated atomic entry. Character tokenization preserves the raw signal but forces the model to discover linguistic structure from scratch, without any priors.

Subword algorithms exploit this structure directly. They learn a vocabulary of high-frequency subword units from a large pretraining corpus. Common words like “love” stay as single tokens. Rare or novel words get decomposed into familiar pieces: “tokenization” → ["token", "##ization"] in WordPiece, or ["▁token", "ization"] in SentencePiece. The model has seen “token” thousands of times and has a rich representation for it — that representation is now available even when encountering “detokenization” for the first time.

This also handles misspellings and out-of-domain terms gracefully. “GPT-4o” doesn’t need to be in the vocabulary — it gets decomposed into known subwords rather than collapsing to [UNK].
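You can see this decomposition directly by tokenizing unseen words with a pretrained tokenizer (the checkpoint is just an example; the exact pieces depend on its learned vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tok.tokenize("detokenization")
# splits into known subword pieces rather than [UNK], e.g. ['det', '##oken', '##ization']
tok.tokenize("GPT-4o")
# likewise decomposed into pieces the vocabulary already contains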

Two algorithms dominate modern NLP: WordPiece (BERT, DistilBERT) and BPE (via SentencePiece in XLM-R and LLaMA; in byte-level form in GPT-family models). Both learn subword vocabularies from corpus statistics, but they use different objectives and produce different tokenization behavior — differences that matter when debugging cross-lingual failures or unexpected token splits.

WordPiece

WordPiece is the subword algorithm behind BERT and DistilBERT. Like BPE, it starts with a character-level vocabulary and iteratively merges pairs — but the key difference is in how it chooses which pair to merge next.

BPE picks the most frequent pair. WordPiece picks the pair that maximizes the likelihood of the training corpus when merged. Concretely, for a candidate pair \((u, v)\), it evaluates:

\[\text{score}(u, v) = \frac{\text{count}(uv)}{\text{count}(u) \times \text{count}(v)}\]

This is a pointwise mutual information criterion: it rewards pairs that appear together more than their individual frequencies would predict. Merging “##iz” with “##ation” scores high not just because the bigram is frequent, but because seeing “##iz” almost always predicts “##ation” — the merge buys maximum information.

The training process:

  1. Initialize the vocabulary with all characters in the corpus, prepending ## to all characters that don’t start a word
  2. Score every adjacent pair using the PMI formula above
  3. Merge the highest-scoring pair and add it to the vocabulary
  4. Repeat until the vocabulary reaches the target size (BERT uses 30,000)

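To make the scoring rule concrete, here is a toy sketch of a single merge-selection step (illustrative only; real implementations weight by word frequency and run the full merge loop):

from collections import Counter

# Each "word" is its current sequence of subword units
corpus = [
    ["t", "##o", "##k", "##e", "##n"],
    ["t", "##o", "##p"],
    ["t", "##o", "##k", "##e", "##n", "##s"],
]

unit_counts = Counter(u for word in corpus for u in word)
pair_counts = Counter(
    (word[i], word[i + 1]) for word in corpus for i in range(len(word) - 1)
)

# score(u, v) = count(uv) / (count(u) * count(v))
scores = {
    (u, v): count / (unit_counts[u] * unit_counts[v])
    for (u, v), count in pair_counts.items()
}
max(scores, key=scores.get)
('##k', '##e')

Plain BPE would pick the most frequent pair here, (t, ##o); the WordPiece score instead favors pairs such as (##k, ##e), whose parts almost never occur apart.
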
The ## prefix is the signature of WordPiece. It marks continuation subwords — pieces that are not at the start of a word boundary. So ["nl", "##p"] means: “nl” starts a word, “##p” continues it. Reconstructing the original word means stripping ## and concatenating.

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
encoded_text = tokenizer(text)
encoded_text
{'input_ids': [101, 1045, 2293, 17953, 2361, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
tokenizer.convert_ids_to_tokens(encoded_text["input_ids"])
['[CLS]', 'i', 'love', 'nl', '##p', '!', '[SEP]']

Reading the DistilBERT output token by token:

  • [CLS] — a special classification token prepended to every sequence. Its final hidden state is used as the aggregate sequence representation for classification tasks
  • i — “I” was lowercased (this is the uncased checkpoint, distilbert-base-uncased)
  • love — a common English word; gets its own token
  • nl — the first subword of “NLP”. “NLP” is rare enough in BERT’s training corpus that it was never merged into a single token
  • ##p — continues from “nl”. The ## prefix signals “this piece is not at a word boundary — attach it to the previous token”
  • ! — punctuation gets its own token
  • [SEP] — marks the end of a sequence (or the boundary between two sequences in sentence-pair tasks)
Decoding the ## Prefix

When you see ## in WordPiece output, it means: strip the ## and concatenate directly to the previous token. ["nl", "##p"] → "nlp". ["un", "##believ", "##able"] → "unbelievable". The ## is how WordPiece encodes which subwords are word-internal vs. word-initial — critical for reconstructing the original string.

tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(encoded_text["input_ids"])
)
'[CLS] i love nlp ! [SEP]'
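A hand-rolled version of the same ## handling makes the convention explicit (a minimal sketch that also drops the special tokens):

def wordpiece_detokenize(tokens):
    words = []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]", "[PAD]"):
            continue                      # drop special tokens
        if tok.startswith("##"):
            words[-1] += tok[2:]          # continuation: glue onto the previous piece
        else:
            words.append(tok)             # word-initial piece
    return " ".join(words)

wordpiece_detokenize(['[CLS]', 'i', 'love', 'nl', '##p', '!', '[SEP]'])
'i love nlp !'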

SentencePiece

SentencePiece is a language-agnostic tokenization library that implements both BPE and unigram language model algorithms. Two properties make it the dominant choice for multilingual models.

First: it treats the input as a raw Unicode character stream — no language-specific pretokenization required. It never assumes whitespace marks word boundaries, which means it works equally well on English, Chinese, Japanese, Arabic, and any language mixture. This is why XLM-R, mT5, and LLaMA all use SentencePiece.

Second: it uses ▁ (U+2581, LOWER ONE EIGHTH BLOCK) to encode the start of a new word. Rather than marking continuation pieces like WordPiece does with ##, SentencePiece marks word starts. A ▁ at the beginning of a token means “there was a space before this character in the original text.” Absence of ▁ means “this token is a continuation.”

The BPE algorithm it implements:

  1. Initialize the vocabulary with individual Unicode characters plus an end-of-word marker
  2. Count all adjacent character pairs across the corpus
  3. Merge the most frequent pair into a new subword unit
  4. Repeat until the vocabulary reaches the target size

Unlike WordPiece’s PMI-based selection, BPE uses raw frequency. It’s simpler but produces similar results in practice — both algorithms converge on vocabularies dominated by common morphemes.

BPE vs. Unigram in SentencePiece

SentencePiece supports two algorithms. BPE builds the vocabulary bottom-up by merging. Unigram starts with a large candidate vocabulary and prunes it by removing tokens that minimally reduce the likelihood of the training corpus — a top-down approach. Unigram is used by XLNet and some multilingual models; BPE is more common. Both are interchangeable in the SentencePiece API.
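Training a SentencePiece model is a few lines. A minimal sketch, assuming a plain-text file corpus.txt (one sentence per line) and placeholder settings:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",            # raw text, one sentence per line
    model_prefix="my_tokenizer",   # writes my_tokenizer.model / my_tokenizer.vocab
    vocab_size=8000,
    model_type="bpe",              # or "unigram"; same API, different algorithm
)

sp = spm.SentencePieceProcessor(model_file="my_tokenizer.model")
sp.encode("I love NLP!", out_type=str)
# e.g. ['▁I', '▁love', '▁N', 'LP', '!']; exact pieces depend on the training corpus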

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
encoded_text = tokenizer(text)
encoded_text
{'input_ids': [0, 87, 5161, 541, 37352, 38, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
tokenizer.convert_ids_to_tokens(encoded_text["input_ids"])
['<s>', '▁I', '▁love', '▁N', 'LP', '!', '</s>']

Reading the XLM-R output token by token:

  • <s> — sequence start token (XLM-R’s equivalent of [CLS])
  • ▁I — the ▁ prefix means “there was a space before this character.” Since “I” starts the sentence (treated as if preceded by whitespace), it still gets ▁
  • ▁love — common word, single token; ▁ marks it as word-initial
  • ▁N — “NLP” is split; ▁N is the word-initial piece
  • LP — continues from ▁N; no ▁ prefix (it’s a word-internal continuation)
  • ! — punctuation token
  • </s> — sequence end token (XLM-R’s equivalent of [SEP])
WordPiece ## vs. SentencePiece ▁ — Two Sides of the Same Coin

These two prefixes encode word boundary information in opposite ways:

Tokenizer                      Marker     Meaning
WordPiece (BERT)               ##token    This piece continues the previous word
SentencePiece (XLM-R, LLaMA)   ▁token     A space preceded this character — new word starts here

Both fully encode the original whitespace and allow perfect string reconstruction. The difference is convention, not capability. But you need to know which convention a tokenizer uses when writing postprocessing code to detokenize outputs.

tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(encoded_text["input_ids"])
)
'<s> I love NLP!</s>'
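The manual counterpart for the ▁ convention is just as short (a minimal sketch): concatenate the pieces, then turn each ▁ back into a space.

def sentencepiece_detokenize(pieces):
    text = "".join(p for p in pieces if p not in ("<s>", "</s>", "<pad>"))
    return text.replace("\u2581", " ").strip()   # ▁ (U+2581) marks word starts

sentencepiece_detokenize(['<s>', '▁I', '▁love', '▁N', 'LP', '!', '</s>'])
'I love NLP!'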

Conclusion

The three tokenization strategies form a clear hierarchy in practice:

  • Character tokenization is essentially unused in production NLP. Sequence lengths become prohibitively long for Transformer attention, and the model must learn linguistic structure entirely from scratch. It survives in niche applications: character-level language models, certain byte-level models (GPT-2 uses byte-level BPE as a starting point), and as a fallback for extremely small vocabularies.

  • Word tokenization appears in legacy systems and simple bag-of-words pipelines, but fails at scale. Vocabulary explosion, [UNK] collapse, and multilingual brittleness make it unsuitable for anything pretrained on broad corpora.

  • Subword tokenization is the universal standard for pretrained language models. WordPiece and SentencePiece BPE both solve the core trade-offs: bounded vocabulary, graceful OOV handling, multilingual coverage, and sequences short enough for Transformer attention.

Always Use the Tokenizer the Model Was Trained With

When fine-tuning a pretrained model, you must use the exact same tokenizer — not just the same algorithm, but the same vocabulary file. The model’s embedding matrix maps token ID 1045 to a learned vector for the word “i” (in DistilBERT). Swap in a different tokenizer and ID 1045 now refers to something else entirely. The embeddings become noise, the model is unrecoverable, and fine-tuning won’t fix it. This applies to vocabulary size, normalization rules, and special token placements — all of it must match pretraining exactly.
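In practice this simply means loading the tokenizer and the model from the same checkpoint identifier (a sketch; the checkpoint and label count are examples):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)   # same vocab, normalization, special tokens
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

batch = tokenizer(["I love NLP!"], padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)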

Most practical work doesn’t require building tokenizers from scratch — Hugging Face tokenizers and SentencePiece handle it. What matters operationally is understanding the output: recognizing ## vs. ▁ markers, knowing which special tokens a model expects and in what order, and catching normalization surprises (casing, accent stripping) before they cause silent failures downstream.
