Code Llama: Open Foundation Models for Code
- Thesis: SOTA code-generation/completion models among open LLMs with infilling capabilities and large context windows. The code models are all based on Llama2 pre-trained models.
- Methods:
- Models:
- All models were initialized from pre-trained Llama2 models [[llama2]]
- All models were fine-tuned on 500B tokens from a code-heavy dataset, except the 70B model, which was trained on 1T tokens; the infilling objective is used for the general-purpose 7B and 13B models
- Code-Llama is a foundational model for general code generation tasks
- Trained with the infilling objective, then long-context fine-tuned on 16k-token sequences
- Code-Llama Python, specialized for the Python language
- Fine-tuned on 100B tokens of Python-heavy code, w/o the infilling objective
- Then long-context fine-tuned on 16k-token sequences with 20B tokens
- Code-Llama Instruct for zero-shot instruction following and safer responses
- Long-context fine-tuned on 16k-token sequences with 20B tokens
- Then instruction fine-tuned (5B tokens) on human-generated data plus a self-instruct dataset: Llama2 70B is prompted to generate code problems, and Code Llama 7B generates the unit tests and solutions
- The Instruct model proved to be more helpful at a moderate cost in code generation performance
- Sizes: 7B, 13B, 34B, 70B
- Code completion/generation models should handle much larger contexts, at the scope of a full repository, which far exceeds the 4,096-token window of the Llama2 base model. Therefore, a long-context fine-tuning step is added: models are trained on 16k-token sequences and extrapolate to contexts of up to 100k tokens at inference. This is done by increasing the base period of the RoPE positional encodings (from 10,000 to 1,000,000); see the RoPE sketch below, after the Methods list
- BPE tokenizer
- Datasets:
- Trained mostly on a publicly available dataset of code
- 8% of the training data is sourced from natural-language datasets related to code, i.e., discussions about code and questions with code snippets
- A small proportion of batches is sampled from a general natural-language dataset to retain natural language understanding
- For Code-Llama Instruct, the instruction datasets from [[llama2]] are used. Additionally, a synthetic self-instruct dataset of programming questions, solutions, and unit tests was created: Llama2 70B generates the questions, while Code Llama 7B generates the solutions and unit tests (see the self-instruct sketch below, after the Methods list)
- Infilling objective: move a span of tokens from the training sequence to the end and predict the reordered sequence autoregressively. The model's context is the tokens before and after the span, and it learns to predict the span itself [[fim]] (see the FIM sketch below, after the Methods list)
- FIM rate is 90%
- 50% of transformations use the PSM (prefix-suffix-middle) format, while the other half use SPM (suffix-prefix-middle)
- Document splitting happens at the character level, with the span boundaries sampled from a uniform distribution
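Below is a minimal sketch of the infilling (FIM) transformation described in the Methods list. The sentinel strings are placeholders for the model's actual special tokens, and the SPM layout follows the joint-compatible variant from [[fim]]; treat it as an illustration of the idea rather than Code Llama's exact preprocessing code.

```python
import random

# Placeholder sentinel strings; the real tokenizer uses dedicated special tokens.
PRE, SUF, MID, EOT = "<PRE>", "<SUF>", "<MID>", "<EOT>"

def fim_transform(doc: str, fim_rate: float = 0.9, psm_rate: float = 0.5) -> str:
    """Reorder a document for the infilling objective.

    With probability `fim_rate`, split the document at the character level into
    (prefix, middle, suffix) using two uniformly sampled split points, then emit
    the pieces so the middle span comes last and is predicted from the
    surrounding context.
    """
    if len(doc) < 2 or random.random() >= fim_rate:
        return doc  # left unchanged and trained left-to-right as usual
    i, j = sorted(random.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    if random.random() < psm_rate:
        # PSM: prefix-suffix-middle
        return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}{EOT}"
    # SPM: suffix-prefix-middle (joint-compatible sentinel layout)
    return f"{PRE}{SUF}{suffix}{MID}{prefix}{middle}{EOT}"
```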
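The long-context fine-tuning relies on enlarging the base period of the rotary positional embeddings. A minimal sketch of the standard RoPE angle computation is below; the function name and shapes are illustrative, not taken from Code Llama's implementation.

```python
import torch

def rope_angles(seq_len: int, head_dim: int, base: float = 1_000_000.0) -> torch.Tensor:
    """Per-position rotation angles for RoPE.

    Llama2 uses base=10,000; Code Llama's long-context fine-tuning raises the
    base period to 1,000,000 so that low-frequency dimensions rotate more
    slowly and positions stay distinguishable over much longer sequences.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, inv_freq)  # shape: (seq_len, head_dim // 2)

# Example: angles for a 16k-token sequence with 128-dimensional attention heads.
angles = rope_angles(seq_len=16_384, head_dim=128)
cos, sin = angles.cos(), angles.sin()  # applied to rotate query/key pairs
```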
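Finally, a hedged sketch of the self-instruct loop used to build the Code-Llama Instruct data. The `ask_llama2_70b` and `sample_code_llama_7b` callables are hypothetical stand-ins for whatever inference stack serves the two models, and the generated tests are assumed to be plain `assert` statements so they can be executed directly; solutions are kept only if they pass their tests, mirroring the filtering described in the paper.

```python
from typing import Callable, Dict, List

def build_self_instruct_set(
    ask_llama2_70b: Callable[[str], str],              # hypothetical question generator
    sample_code_llama_7b: Callable[[str], List[str]],  # hypothetical code sampler
    n_questions: int,
) -> List[Dict[str, str]]:
    """Generate (question, tests, solution) triples and keep only verified ones."""
    dataset = []
    for _ in range(n_questions):
        question = ask_llama2_70b("Write an interview-style programming question.")
        tests = sample_code_llama_7b(f"Write assert-based tests for:\n{question}")[0]
        for solution in sample_code_llama_7b(f"Solve in Python:\n{question}"):
            try:
                # Run the candidate solution together with its tests
                # (sketch only; a real pipeline would sandbox this).
                exec(solution + "\n" + tests, {})
            except Exception:
                continue  # tests failed: discard this candidate
            dataset.append({"question": question, "tests": tests, "solution": solution})
            break  # keep the first passing solution
    return dataset
```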
- Contribution:
- Four different sizes of open models, ranging from 7B to 70B parameters
- Python-specific code models
- Models that are more aligned with user instructions through Code-Llama Instruct
- Takeaways:
- Model specialization improves performance: Code-Llama Python > Code-Llama > Llama2 for Python code generation
- Perplexity decreases as context length increases up to 100K
- Model size adds performance gains
- Infilling models suffer a slight drop in performance on downstream code generation tasks
- Starting from a base model pre-trained on both code and natural language and then fine-tuning it on code outperforms a model trained from scratch on code only, and it is also much more useful in real-world applications since it retains natural language understanding
#nlp #llm