InstructGPT: Training language models to follow instructions with human feedback
- Thesis: Use reinforcement learning from human feedback (RLHF) to train a language model that follows the user’s instructions: helpful, not harmful, and not prone to making up facts (hallucinating).
- Methods:
- Train the LLM in a supervised manner on a dataset created by ~40 labellers who write demonstrations (desired outputs) for prompts submitted to the OpenAI API as well as for hand-written prompts. The dataset therefore contains prompt/desired-output pairs, and training on it gives us the supervised LLM baseline
- Train a Reward Model (RM) on a dataset the labellers create by ranking the outputs that the supervised fine-tuned model above produces for prompts. For each prompt there are multiple outputs, ranked from best to worst
- Use reinforcement learning (RL), specifically proximal policy optimization (PPO), to further fine-tune the supervised LLM baseline so that it maximizes the reward model’s (RM) score (a reward-shaping sketch follows the Methods list)
- Models are evaluated on held-out prompts from API customers whose prompts were not present in the training data
- Model sizes: 1.3B, 6B, and 175B
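The PPO step above optimizes the RM score while penalizing divergence from the SFT model. Below is a minimal sketch of that reward shaping, assuming the common formulation r(x, y) = r_RM(x, y) − β · (log π_RL(y|x) − log π_SFT(y|x)); the function name `shaped_reward`, the `beta` value, and the per-sample KL estimate are illustrative assumptions, not the paper’s code.

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  logprob_rl: torch.Tensor,
                  logprob_sft: torch.Tensor,
                  beta: float = 0.02) -> torch.Tensor:
    """Per-sample reward fed to PPO.

    rm_score:    scalar score from the reward model for (prompt, response)
    logprob_rl:  log-probability of the response under the current RL policy
    logprob_sft: log-probability of the same response under the frozen SFT model
    beta:        KL-penalty coefficient (illustrative value)
    """
    kl_penalty = logprob_rl - logprob_sft  # per-sample estimate of the KL term
    return rm_score - beta * kl_penalty
```

The KL penalty keeps the RL policy close to the SFT baseline and mitigates over-optimization of the reward model.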
- Data:
- Split on user ID so that the same user can’t appear in both the training and valid/test datasets (a split sketch follows the Data list)
- Each user can have at most 200 prompts
- PPO dataset contains only prompts; it has no human-written labels
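A minimal sketch of the per-user split described above, assuming prompts arrive as dicts with a `user_id` field; the split fractions and the helper name are illustrative assumptions (only the 200-prompt cap and the split-by-user rule come from these notes).

```python
import random
from collections import defaultdict

def split_by_user(prompts, max_per_user=200, valid_frac=0.1, test_frac=0.1, seed=0):
    """Split prompts so a user ID never appears in more than one split,
    capping each user at max_per_user prompts."""
    by_user = defaultdict(list)
    for p in prompts:  # each p is a dict with "user_id" and "text"
        if len(by_user[p["user_id"]]) < max_per_user:
            by_user[p["user_id"]].append(p)

    users = sorted(by_user)
    random.Random(seed).shuffle(users)
    n_valid = int(len(users) * valid_frac)
    n_test = int(len(users) * test_frac)
    valid_users = set(users[:n_valid])
    test_users = set(users[n_valid:n_valid + n_test])

    splits = {"train": [], "valid": [], "test": []}
    for user, user_prompts in by_user.items():
        key = "valid" if user in valid_users else "test" if user in test_users else "train"
        splits[key].extend(user_prompts)
    return splits
```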
- Models:
- Supervised fine-tuning (SFT): Trained on prompt-response pairs.
- The model is initialized from the pre-trained model
- We don’t compute loss on prompt tokens, only on response tokens (see the loss sketch after the Models list)
- It uses the same language-modeling objective as pre-training, but here we align responses with high-quality demonstrations rather than just doing next-word prediction as in pre-training
- Reward Model:
- It is initialized from the SFT model, with the final layer replaced by a linear head that outputs a single scalar score for a (prompt, response) pair; it is trained on the labellers’ comparisons with a pairwise ranking loss (shown in the sketch after the Models list)
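Below is a minimal PyTorch sketch of the two training signals described under Models: the SFT loss that masks prompt tokens, and a scalar reward head trained with the pairwise ranking loss −log σ(r(x, y_w) − r(x, y_l)). Tensor shapes, the hidden size, and the class/function names are illustrative assumptions, not the paper’s code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Next-token cross-entropy computed on response tokens only.

    logits:     (batch, seq_len, vocab) from the language model
    input_ids:  (batch, seq_len) prompt tokens followed by response tokens
    prompt_len: number of prompt tokens (masked out of the loss)
    """
    shift_logits = logits[:, :-1, :]          # position t predicts token t+1
    shift_labels = input_ids[:, 1:].clone()
    shift_labels[:, : prompt_len - 1] = -100  # ignore predictions of prompt tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )


class ScalarRewardHead(nn.Module):
    """Linear head mapping the final hidden state to a single scalar score."""

    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Use the hidden state of the last token as the sequence summary.
        return self.score(last_hidden_state[:, -1, :]).squeeze(-1)


def rm_pairwise_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r(x, y_w) - r(x, y_l)) averaged over comparison pairs."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Each ranking of K outputs for one prompt yields a (chosen, rejected) pair for every pair of outputs, and those pairs feed `rm_pairwise_loss`.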
- Contribution:
- Provide a procedure to align LLMs with user instructions using SFT and RLHF
- Importance:
- Labellers prefer InstructGPT outputs over 175B GPT-3 outputs 85% of the time, and 71% of the time over a few-shot-prompted 175B GPT-3; even the 1.3B InstructGPT model is preferred over the 175B GPT-3 model.
- InstructGPT generalizes well to held-out labellers who did not produce any of the training data. It also follows instructions for tasks and languages that were rare or absent in the training data
- Takeaways: 3-step fine-tuning with human feedback is all you need to align LLMs with the user’s instructions, making the models usable and helpful while reducing harm and toxicity
- Improvements:
- 40 labellers are not representative of the user population that will be affected by the model(s)
- Comparisons for a given prompt are mostly ranked by a single labeller. We may need multiple labellers per prompt, weighting labellers who belong to the group affected by the response more heavily than the others
- Responses that look toxic/harmful in isolation may not be so given the context
- Sometimes it is desirable to generate harmful responses as per user’s instruction. Should this be allowed?
- The model still hallucinates and makes up facts, and it can still generate toxic/biased responses
#nlp #llm #fine-tuning