= "I love NLP!"
text list(text)
['I', ' ', 'l', 'o', 'v', 'e', ' ', 'N', 'L', 'P', '!']
Imad Dabbura
January 14, 2023
March 7, 2024
Tokenization is the process of breaking down a string into smaller units of information that will be used by the model. This process sometimes involves preprocessing steps such as lowercasing, stemming, and lemmatization. There are many tokenization strategies, each with its own advantages and drawbacks. We will first give a brief introduction to the different steps involved in the tokenization process. Then we'll consider the two extreme and simplest tokenization strategies: character tokenization and word tokenization. Finally, we will discuss subword tokenization, where statistical methods and language heuristics are used to learn the optimal splitting of words, as in the WordPiece and SentencePiece tokenizers.
The tokenization process involves 4 steps:

1. Normalization: cleaning up the raw text, e.g., lowercasing, stripping accents, and removing extra whitespace.
2. Pretokenization: splitting the normalized text into coarse units, typically on whitespace and punctuation.
3. Tokenizer model: splitting each pretokenized unit into tokens according to the tokenizer's learned vocabulary.
4. Postprocessing: adding model-specific special tokens. For example, BERT adds [CLS] to the beginning of the sequence and [SEP] to separate two sequences. XLM-R, on the other hand, adds <s> to indicate the beginning of a sequence and </s> to indicate the end of the sequence.

There are many tokenizers, and each has its own rules for splitting raw text into individual tokens. Each tokenization strategy has its own advantages and drawbacks, and depending on the task, some tokenizers may suit your application better than others. However, when using a pretrained model, you must use the same tokenizer that the pretrained model used during training.
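To make the normalization and pretokenization steps concrete, here is a small sketch using a fast tokenizer from the 🤗 transformers library, whose Rust backend exposes these pipeline components (the checkpoint name and example strings are only for illustration):

from transformers import AutoTokenizer

# Load a fast (Rust-backed) tokenizer so we can inspect its pipeline components.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Step 1: normalization (this checkpoint lowercases and strips accents).
tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò, NLP!")
# -> 'hello, nlp!'

# Step 2: pretokenization (split on whitespace and punctuation, with character offsets).
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("I love NLP!")
# -> [('I', (0, 1)), ('love', (2, 6)), ('NLP', (7, 10)), ('!', (10, 11))]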
Below are the most common tokenization schemes.
This is the simplest tokenization strategy: we simply break the text down at the character level and feed the individual characters to the model. Consider the following example:
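Splitting a string into characters is a one-liner with Python's built-in list:

text = "I love NLP!"
list(text)
['I', ' ', 'l', 'o', 'v', 'e', ' ', 'N', 'L', 'P', '!']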
From here, it is easy to convert each character into an integer that will be fed to the model. This step is called numericalization. We can numericalize the above text by first building the vocabulary and then converting each character to its corresponding index, as follows:
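One way to build such a character-level vocabulary is sketched below (char2idx is an illustrative name; the indices reproduce the mapping shown next):

# One entry per unique character, indexed starting from 1.
char2idx = {ch: idx for idx, ch in enumerate(sorted(set(text)), start=1)}
char2idx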
{' ': 1, '!': 2, 'I': 3, 'L': 4, 'N': 5, 'P': 6, 'e': 7, 'l': 8, 'o': 9, 'v': 10}
Now we can simply map each token (character in this case) to its own corresponding index:
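Continuing the sketch with the char2idx mapping built above:

input_ids = [char2idx[ch] for ch in text]
input_ids
[3, 1, 8, 9, 10, 7, 1, 5, 4, 6, 2]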
The other extreme is word tokenization: split the text into words and then map each word to its corresponding index in the vocabulary. The simplest form is to split on whitespace (which works well for English but not for languages such as Japanese that don't have a well-defined notion of a word):
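A minimal sketch that reproduces the vocabulary shown below (word2idx is an illustrative name):

words = text.split()   # ['I', 'love', 'NLP!']
word2idx = {w: idx for idx, w in enumerate(sorted(set(words)))}
word2idx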
{'I': 0, 'NLP!': 1, 'love': 2}
Most tokenizers also include rules and heuristics that try to separate units of meaning even when there are no spaces, such as splitting "doesn't" into "does" and "n't".
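NLTK's Treebank-style tokenizer, for instance, applies exactly this kind of rule (shown purely as an illustration; the post itself does not use NLTK):

from nltk.tokenize import TreebankWordTokenizer

# Rule-based tokenizer that splits contractions and trailing punctuation.
TreebankWordTokenizer().tokenize("NLP doesn't have to be hard!")
# -> ['NLP', 'does', "n't", 'have', 'to', 'be', 'hard', '!']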
The main drawback of word tokenization is that the vocabulary can become very large, and any rare or misspelled word that is missing from it gets mapped to a catch-all UNK token; the model has no idea about the words associated with the UNK token, and we may lose some important information.

Subword tokenization sits between these two extremes: it splits words into smaller parts based on the most frequent sub-strings. The idea is to keep the most frequent words as unique entities but split the rare words into smaller units, which lets us deal with misspellings and complex words. This gives us the best of both worlds: 1) a manageable vocabulary size, 2) frequent words kept as their own entities, and 3) the ability to handle complex and misspelled words.
Subword tokenizers are typically learned from a pretraining corpus using statistical rules and algorithms. We will cover the most common ones: WordPiece and SentencePiece.
The WordPiece tokenizer is used by the DistilBERT model. The vocabulary is first initialized with the individual characters in the language, and then the most frequent combinations of symbols are iteratively added to it. The process is: 1) initialize the vocabulary with all characters present in the training corpus, 2) count how often each adjacent pair of symbols occurs, 3) merge the pair whose combination most increases the likelihood of the training data and add it to the vocabulary, and 4) repeat until the desired vocabulary size is reached.
Let's illustrate with an example using the 🤗 transformers library.
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
encoded_text = tokenizer(text)
encoded_text
{'input_ids': [101, 1045, 2293, 17953, 2361, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
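To see the actual tokens behind these ids, we can convert them back (using the tokenizer and encoded_text from the cell above):

tokenizer.convert_ids_to_tokens(encoded_text["input_ids"])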
['[CLS]', 'i', 'love', 'nl', '##p', '!', '[SEP]']
Let's explain the output of the DistilBERT tokenizer:

- [CLS] is a special token that is used to indicate the start of a sequence.
- [SEP] is also a special token, used to separate multiple sequences.
- The ## prefix indicates that the token is not preceded by whitespace; it continues the preceding token (here, nl and ##p together make up nlp).
- ! has its own token.

We can reconstruct the encoded text as follows:
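A minimal way to do that is with decode, which also merges the ## pieces back into words (assuming the tokenizer and encoded_text from above):

tokenizer.decode(encoded_text["input_ids"])
# -> '[CLS] i love nlp! [SEP]'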
SentencePiece implements byte-pair encoding (BPE) and unigram language modeling. It encodes the raw text as a sequence of Unicode characters, which is very useful for multilingual corpora because many languages, such as Japanese, don't have whitespace characters. It is also agnostic about accents and punctuation. That is why it is commonly used in multilingual model training.
Byte-pair encoding works as follows: 1) start with a base vocabulary of individual characters (or bytes), 2) count how often each adjacent pair of symbols occurs in the corpus, 3) merge the most frequent pair and add the new merged symbol to the vocabulary, and 4) repeat until the desired vocabulary size is reached.
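Here is a toy sketch of the BPE merge loop on a made-up corpus (purely illustrative; SentencePiece's actual implementation differs in many details, e.g., it weights words by frequency):

from collections import Counter

corpus = ["low", "lower", "newest", "widest"]
words = [tuple(w) for w in corpus]   # start from individual characters

def pair_counts(words):
    counts = Counter()
    for w in words:
        counts.update(zip(w, w[1:]))   # count adjacent symbol pairs
    return counts

def merge(words, pair):
    merged_words = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i < len(w) - 1 and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])   # fuse the pair into one symbol
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged_words.append(tuple(out))
    return merged_words

for _ in range(4):   # perform a few merges
    best = pair_counts(words).most_common(1)[0][0]
    words = merge(words, best)
    print("merged", best, "->", words)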
Let's again use the 🤗 transformers library to tokenize the same text.
from transformers import XLMRobertaTokenizer
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
encoded_text = tokenizer(text)
encoded_text
{'input_ids': [0, 87, 5161, 541, 37352, 38, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
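Again, converting the ids back to tokens (using the tokenizer and encoded_text from the cell above):

tokenizer.convert_ids_to_tokens(encoded_text["input_ids"])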
['<s>', '▁I', '▁love', '▁N', 'LP', '!', '</s>']
Let's explain the output of the XLM-RoBERTa tokenizer:

- <s> is a special token that is used to indicate the start of a sequence.
- </s> is also a special token, used to indicate the end of the sequence.
- The ▁ prefix indicates that the token is preceded by whitespace in the original text.
- ! has its own token.

We can reconstruct the encoded text as follows:
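For example, with decode (assuming the tokenizer and encoded_text from above; it rejoins the pieces and replaces the ▁ markers with spaces):

tokenizer.decode(encoded_text["input_ids"])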
In this post, we covered three tokenization strategies along with their advantages and challenges/limitations. In practice, we mostly use tokenizers from well-known libraries such as spaCy, because it is very hard to get tokenization right ourselves.
NB: When using a pretrained model such as DistilBERT, we must use the same tokenizer that the model used during training. Otherwise, the token the model assumes token_id = 1 represents will be completely different from the token the new tokenizer maps to token_id = 1. It has the same effect as shuffling the vocabulary.