Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
- Thesis: Provide open-source models of different sizes, intermediate checkpoints, and training code to study how LLM capabilities evolve during training, e.g., biases and the correlation between term frequency and task accuracy.
- Methods:
- Models:
- 8 models from 70M to 12B parameters with a nearly identical training setup
- A checkpoint of each model is saved every ~2B tokens
- BPE tokenizer
- Datasets:
- The Pile (300B tokens)
- Near de-duplicated version of the Pile (207B tokens)
- Contribution: Released and made publicly available the following (a checkpoint-loading sketch follows this list):
- 154 checkpoints for each model
- Training/model code
- 8 model sizes that range from 70M to 12B
- Each model is trained twice:
- Once on the Pile dataset (300B tokens)
- Once on the near de-duplicated version of the Pile (207B tokens)
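As a quick illustration of using the released artifacts, here is a minimal sketch of loading an intermediate checkpoint, assuming the models are published on the Hugging Face Hub as `EleutherAI/pythia-<size>[-deduped]` with one `stepN` git revision per saved checkpoint (the specific size and step below are illustrative):

```python
# Sketch: load a Pythia model at an intermediate checkpoint by revision tag.
# The repo/revision naming scheme is assumed as described above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-410m-deduped"  # one of the 8 sizes, deduplicated run
revision = "step3000"                          # an early intermediate checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
model = AutoModelForCausalLM.from_pretrained(model_name, revision=revision)

inputs = tokenizer("The Pile is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```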
- Importance: Open-sourced models, checkpoints, and training code help researchers further study the learning dynamics of LLMs as training progresses.
- Takeaways:
- Models trained on near de-duplicated data didn’t outperform models trained on the full dataset
- Changing the corpus statistics for the last 7% or 21% of training by replacing masculine pronouns with feminine ones reduced gender bias while hardly affecting the model's perplexity (see the pronoun-swap sketch after this list). This may be because an LLM captures token correlations and distributions, so such a change makes the model more robust. The effect is bigger for larger models, as they tend to be more biased
- A sequence's position in the training data (e.g., appearing at the beginning of training) showed no evidence of affecting whether the model memorizes it
- The frequency of task-relevant terms in the pretraining corpus affects few-shot accuracy: the more frequent a term, the more accurately the model handles it (see the frequency-vs-accuracy sketch after this list)
- Smaller models do not show this gain from higher term frequency, suggesting it is an emergent ability of larger models
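A minimal sketch of the kind of pronoun-swap intervention described above; the pronoun mapping, casing rules, and where the swap is applied are simplifying assumptions, not the paper's exact procedure:

```python
import re

# Illustrative intervention: rewrite masculine pronouns to feminine ones in the
# final fraction of the training documents (e.g. the last 7% or 21%).
PRONOUN_MAP = {"he": "she", "him": "her", "his": "her", "himself": "herself"}
PATTERN = re.compile(r"\b(" + "|".join(PRONOUN_MAP) + r")\b", flags=re.IGNORECASE)

def swap_pronouns(text: str) -> str:
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = PRONOUN_MAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    return PATTERN.sub(repl, text)

def intervene(docs: list[str], fraction: float = 0.07) -> list[str]:
    """Swap pronouns in the last `fraction` of the training stream."""
    cutoff = int(len(docs) * (1 - fraction))
    return docs[:cutoff] + [swap_pronouns(d) for d in docs[cutoff:]]

print(swap_pronouns("He said his idea was entirely his own."))
# -> She said her idea was entirely her own.
```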
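And a sketch of the frequency-vs-accuracy analysis: correlate how often a task-relevant term occurs in the pretraining corpus with the model's per-term few-shot accuracy. The counts and accuracies below are made-up placeholders, not the paper's data:

```python
from scipy.stats import spearmanr

# Placeholder inputs: in the real analysis the counts would come from the Pile
# and the accuracies from evaluating a checkpoint; these numbers are illustrative.
term_counts = {"paris": 120_000, "ottawa": 18_000, "asmara": 900}
term_accuracy = {"paris": 0.92, "ottawa": 0.71, "asmara": 0.33}

terms = sorted(term_counts)
freqs = [term_counts[t] for t in terms]
accs = [term_accuracy[t] for t in terms]

# A positive rank correlation means more frequent terms are answered more
# accurately; repeating this per model size shows whether the effect only
# emerges at scale.
rho, p = spearmanr(freqs, accs)
print(f"Spearman correlation of term frequency vs. accuracy: rho={rho:.2f} (p={p:.2f})")
```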
#nlp #llm