T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Thesis: T5 proposes a unified framework for transfer learning that uses the transformer’s Encoder-Decoder architecture and casts every NLP task as a text-to-text task. As a result, the same model, objective, training procedure, and decoding method can be used for any downstream task
- Methods:
- Model:
- Encoder-Decoder transformer that follows the original transformer paper (Attention Is All You Need) with a few exceptions, sketched in code after this list:
- Uses relative position embeddings instead of sin/cosine positional embeddings: each head learns a scalar bias for each of 32 relative-position buckets, whose widths grow logarithmically up to an offset of 128; any offset at or beyond 128 is assigned to the same final bucket. These position embeddings are shared among all transformer layers
- Uses a simplified layer normalization (rescaling only, no additive bias term), applied before each sub-block (pre-layer norm)
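A minimal NumPy sketch of two of these tweaks, assuming the 32-bucket / 128-offset scheme described above. The function names are illustrative, and the released implementation additionally treats positive and negative offsets separately for bidirectional attention, which this sketch ignores.

```python
import numpy as np

def simplified_layer_norm(x, scale, eps=1e-6):
    """Scale-only layer norm: rescale by the RMS of the activations,
    with no mean subtraction and no additive bias; applied to the
    input of each sub-block (pre-layer norm)."""
    variance = np.mean(np.square(x), axis=-1, keepdims=True)
    return x / np.sqrt(variance + eps) * scale

def relative_position_bucket(relative_position, num_buckets=32, max_distance=128):
    """Map query-key offsets to buckets: exact buckets for small offsets,
    logarithmically growing buckets after that, and one shared bucket for
    everything at or beyond `max_distance`."""
    n = np.abs(np.asarray(relative_position))
    max_exact = num_buckets // 2
    ratio = np.maximum(n, max_exact) / max_exact        # >= 1
    log_bucket = max_exact + (
        np.log(ratio) / np.log(max_distance / max_exact) * (num_buckets - max_exact)
    ).astype(np.int64)
    log_bucket = np.minimum(log_bucket, num_buckets - 1)
    return np.where(n < max_exact, n, log_bucket)

# Each head learns one scalar bias per bucket (shared across layers) and adds
# it to its attention logits, e.g. logits += bias_table[head, bucket].
print(relative_position_bucket(np.arange(0, 300, 50)))  # offsets >= 128 share bucket 31
```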
- The authors didn’t consider Encoder-only transformers such as BERT because they produce a single prediction per token or per sequence, which suits classification tasks rather than text generation
- The authors didn’t use a Decoder-only transformer such as GPT as the baseline because causal self-attention only attends to previous tokens, so the model’s representation of the input cannot condition on the full input, which hurts tasks such as summarization where the entire input is available up front
- To reduce computational cost, predictions are made only for the masked/corrupted tokens, i.e. the target contains only the dropped-out spans rather than the full input
- The authors also found that sharing parameters between the encoder and decoder performs nearly as well as the baseline and halves the number of parameters
- Context window size is 512
- Batch size is 128
- All other transformer hyperparameters are similar to [[bert]]
- The model is pre-trained on far fewer tokens than BERT
- The model is then fine-tuned separately on each downstream task
- SentencePiece tokenizer
- Denoising objective is used where a span of tokens is masked/corrupted and the model has to predict these tokens
- Randomly selects 15% of tokens in the input sequence
- Replaces all consecutive spans of selected tokens with a single sentinel token
- Gives each sentinel token an ID that is unique to the current input sequence
- Constructs a target using all selected tokens, separated by the sentinel tokens
- Adds the sentinel tokens and their IDs to the vocabulary learned by the tokenizer
- Span corruption is used with an average span length of 3
- The authors tested multiple denoising strategies and all performed similarly, so they opted for the variant that shortens the target sequence, replacing each corrupted span with a single sentinel and predicting only the dropped-out tokens (at a \(15\%\) corruption rate), since it is computationally more efficient (see the toy example below)
- Different corruption rates performed similarly except for very large rates, for example \(50\%\)
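A toy sketch of this preprocessing, using the paper’s “Thank you for inviting me to your party last week .” example. The <extra_id_i> sentinel names follow the released T5 vocabulary, and the corrupted spans are chosen by hand here rather than sampled with a mean length of 3.

```python
def corrupt_spans(tokens, spans):
    """Build the (input, target) pair for span corruption.  `spans` is a
    list of (start, end) token-index pairs (end exclusive) to drop out."""
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"           # unique per input sequence
        inputs.extend(tokens[cursor:start])    # keep the uncorrupted tokens
        inputs.append(sentinel)                # one sentinel replaces the span
        targets.append(sentinel)               # target = sentinel + dropped tokens
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>") # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week .".split()
inp, tgt = corrupt_spans(tokens, [(2, 4), (8, 9)])
# inp: "Thank you <extra_id_0> me to your party <extra_id_1> week ."
# tgt: "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
```

Because the target contains only the dropped-out spans and sentinels, it is much shorter than the full input, which is where the computational saving mentioned above comes from.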
- Data:
- Input and output of the model are just text. To differentiate between tasks, each task has its own prefix that is prepended to the input text before it is fed to the model. The choice of prefix has some impact on the model’s performance. Below are some examples of different prefixes, followed by a small preprocessing sketch:
- For classification tasks, the predicted text is the class label. For natural language inference (MNLI), the input would be something like:
mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity.
with the target being the label, e.g. entailment
- For translation tasks, the input would be something like:
translate English to German: That is good.
with the translation as the target
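A small sketch of how this text-to-text casting might look as preprocessing code; the prefix strings follow the examples above, while the helper name and example fields are made up for illustration.

```python
def to_text_to_text(task, example):
    """Cast a raw example into a (source, target) text pair by
    prepending the task-specific prefix (illustrative sketch)."""
    if task == "mnli":
        source = (f"mnli premise: {example['premise']} "
                  f"hypothesis: {example['hypothesis']}")
        target = example["label"]                      # e.g. "entailment"
    elif task == "translate_en_de":
        source = f"translate English to German: {example['text']}"
        target = example["translation"]                # e.g. "Das ist gut."
    else:
        raise ValueError(f"unknown task: {task}")
    return source, target

src, tgt = to_text_to_text("mnli", {
    "premise": "I hate pigeons.",
    "hypothesis": "My feelings towards pigeons are filled with animosity.",
    "label": "entailment",
})
```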
- Contribution:
- The paper evaluated different model architectures and pre-training objectives to gain insight into what affects pre-training and fine-tuning performance. The following transformer variants were compared to T5’s Encoder-Decoder baseline (attention-mask sketches follow this list):
- Decoder-only
- Prefix LM, which uses fully-visible self-attention over the input prefix and causal (masked) attention when generating the output
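A sketch of the attention-mask patterns that distinguish these variants: fully-visible attention (the encoder), causal attention (a decoder-only LM), and causal attention with a fully-visible prefix (the prefix LM). The mask-building helpers are illustrative, not taken from the paper.

```python
import numpy as np

def fully_visible_mask(seq_len):
    """Encoder-style attention: every position sees every other position."""
    return np.ones((seq_len, seq_len), dtype=bool)

def causal_mask(seq_len):
    """Decoder-only LM: position i attends only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(seq_len, prefix_len):
    """Prefix LM: fully-visible attention over the first `prefix_len`
    positions (the input), causal attention over the rest (the output)."""
    mask = causal_mask(seq_len)
    mask[:, :prefix_len] = True   # every position may attend to the whole prefix
    return mask

print(prefix_lm_mask(seq_len=5, prefix_len=2).astype(int))
```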
- The T5 model, which uses a unified framework for transfer learning by casting all NLP tasks as text-to-text problems. The unified framework allows us to use the same model, objective, training procedure, and decoding method on any downstream task
- Takeaways: Having a unified framework, where every NLP task is cast to take text as input and produce text as output, with a denoising objective for pre-training and fine-tuning on downstream tasks, is all you need!
- Improvements:
- T5 is fine-tuned separately for each downstream task, which might not be feasible or yield good performance for low-resource tasks
- Only first-order effects were studied, but a lot of the hyperparameters and other architecture variations have second-order effects
- Notes:
- T5 stands for Text-to-Text Transfer Transformer
#llm #nlp