Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Thesis: Chain-of-thought prompting elicits chain-of-thought reasoning, an emergent ability of large-scale LLMs. Through chain-of-thought prompting, we can significantly improve reasoning capabilities, which in turn improves accuracy on arithmetic, commonsense, and symbolic reasoning tasks.
- Methods:
- Chain-of-thought prompting template: few-shot exemplars whose answers show the intermediate reasoning steps before the final answer (see the example below)
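- A minimal sketch of such a template, written as a Python string for concreteness; the tennis-balls exemplar is the canonical one from the paper, while the single-exemplar format and the `{question}` placeholder are simplifications (the paper uses eight exemplars):

```python
# Few-shot CoT prompt: the exemplar answer spells out the intermediate
# reasoning steps before stating the final answer.
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can
has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

# Illustrative usage; the question below is made up.
prompt = COT_PROMPT.format(
    question="A baker makes 3 trays of 12 cookies and sells 20. How many are left?"
)
```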
- Contribution:
- Chain-of-thought prompting technique
- CoT allows for easier debugging, since the model provides the intermediate steps it used to arrive at the final answer. This also helps interpretability
- CoT spends more computation solving intermediate/smaller sub-problems before producing the final answer
- CoT prompting works across different LLMs
- There is no meaningful difference in CoT prompting performance based on the number of examples, their order, or the type of examples
- Takeaways: We can improve LLMs' performance on reasoning tasks using few-shot CoT prompting without the need for training from scratch or fine-tuning. This is especially true for large-scale LLMs, since chain-of-thought reasoning is an emergent ability: such models acquire better semantic understanding and reasoning capabilities with scale
- Improvements:
- CoT prompting works for sufficiently large LLMs; it may not work for smaller LLMs, or may yield smaller gains
- We need a task-specific prompt template, where examples and intermediate steps (chains of thought) must be carefully chosen to realize the performance boost
- There is decent variance in performance across different CoT prompting examples, which requires more human effort to make sure we're realizing the full potential of this approach. Even with this variance, CoT prompting performed better than few-shot prompting, so prompt engineering STILL MATTERS
- There is no guarantee that the intermediate steps generated by the model are correct; they may also lead to the wrong answer
- We don't know whether the chain of thought the model is following constitutes actual reasoning
- Few-shot CoT prompting increases inference cost for large-scale LLMs
- Notes:
- A chain of thought is a series of intermediate reasoning steps that lead to a final answer
- Standard few-shot prompting (answers without rationales) doesn't help on reasoning tasks (see the contrast sketch after these notes)
- CoT reasoning abilities improve with scale
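- For contrast with the CoT exemplar above, a sketch of the standard few-shot baseline, which shows only the final answer with no intermediate steps (same caveats: single exemplar, made-up placeholder):

```python
# Standard few-shot prompt: the exemplar gives the answer directly,
# without any intermediate reasoning steps.
STANDARD_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can
has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.

Q: {question}
A:"""
```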
#nlp #llm #agents
Chain-of-Thought Reasoning without Prompting
- Thesis: Provides an unsupervised method to elicit the reasoning capabilities of LLMs in the decoding space. Instead of relying on CoT prompting, which is task-specific and requires humans to keep iterating on and optimizing the prompt to yield the intended results, CoT-decoding inspects the top-k alternative tokens to find the best CoT path: for each of the top-k tokens at the first decoding step, decoding continues greedily (a sketch follows below).
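- A minimal sketch of this idea, assuming a HuggingFace-style causal LM; the model name is a placeholder (the paper uses PaLM 2, Mistral-7B, and Gemma-7B), and scoring confidence over all generated steps rather than only the answer tokens is a simplification:

```python
# CoT-decoding sketch: branch over the top-k first tokens, continue
# greedily from each branch, and score every path by model confidence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder model, not the paper's setup
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def cot_decode(question: str, k: int = 10, max_new_tokens: int = 100):
    input_ids = tokenizer(question, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = model(input_ids).logits[0, -1]  # step-1 logits
    top_k = torch.topk(next_logits, k).indices        # branch point
    paths = []
    for token in top_k:
        branch = torch.cat([input_ids, token.view(1, 1)], dim=-1)
        out = model.generate(                         # greedy continuation
            branch,
            do_sample=False,
            max_new_tokens=max_new_tokens,
            output_scores=True,
            return_dict_in_generate=True,
        )
        # Confidence: average gap between the top-2 token probabilities at
        # each generated step (the paper averages over the answer tokens
        # only; using every step is a simplification).
        gaps = []
        for step_scores in out.scores:
            top2 = torch.topk(torch.softmax(step_scores[0], dim=-1), 2).values
            gaps.append((top2[0] - top2[1]).item())
        confidence = sum(gaps) / len(gaps)
        text = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
        paths.append((confidence, text))
    return max(paths)  # the most confident path

print(cot_decode("Q: I have 3 apples and eat one. How many are left?\nA:"))
```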
- Methods:
- Models: PaLM 2, Mistral-7B, and Gemma-7B
- Contribution:
- Elicits LLMs' inherent reasoning capabilities without prompting, simply by changing the decoding process
- Avoids the task-specific human priors that few-shot CoT prompting uses to force the LLM to solve a task a particular way
- Improves the model's answer confidence by traversing the top-k alternative paths
- Takeaways: We can elicit the reasoning capabilities of LLMs, both pre-trained and instruction-tuned, by operating in the decoding space, which requires no human intervention or extensive resources to tune such models for reasoning-intensive tasks
- Improvements:
- CoT-decoding adds computational cost at inference time
- The paper only branches at the first token to explore possible paths
- It is harder to use CoT-decoding for open-ended answers, where the answer span is difficult to identify
- The gains from CoT-decoding start to drop as tasks get harder. LLMs still struggle with tasks that require \(\geq 3\) manipulation steps; one possible explanation is that the pre-training data distribution consists mostly of simple-to-medium difficulty tasks
- Notes:
- LLM reasoning is typically elicited by prompting, which comes in various forms:
- CoT with few-shot examples
- Zero-shot prompting with detailed instructions on how to perform the task, asking the model to show the steps it used to come up with the answer
- Training or instruction fine-tuning with CoT data
- LLMs struggle with reasoning when greedy decoding is used
- A large difference between the probabilities of the top two tokens at the \(t^{th}\) decoding step along the \(k^{th}\) decoding path indicates that the LLM's confidence in that answer is high
- The answer's confidence is the average, over the answer's time steps, of the difference between the top two tokens' probabilities: \(\Delta_{k,\text{answer}} = \frac{1}{|\text{answer}|} \sum_{x_t \in \text{answer}} \left( p(x_t^1 \mid x_{<t}) - p(x_t^2 \mid x_{<t}) \right)\), where \(x_t^1\) and \(x_t^2\) are the top two tokens at step \(t\)
- The model's own probability is not a reliable indicator of the confidence/correctness of the answer
- LLMs typically rush directly into problem solving when asked math/commonsense reasoning questions, which results in a wrong answer most of the time. This can be partially fixed with prompting, but that doesn't yield results as good as CoT-decoding
- Branching at the first token leads to more diverse paths than branching at later stages of the decoding process
- It is recommended to aggregate the top-k paths to get stable results (see the aggregation sketch at the end of these notes)
- There is a very high correlation between the answer confidence and the existence of CoT paths in the answer
- A model's inherent reasoning varies with task difficulty: models show weaker reasoning abilities on harder tasks
- Decoding over the top-k alternative tokens reveals the existence of CoT reasoning paths, whose presence correlates with the model's confidence in the final answer
- Both pre-trained and instruction-tuned models showed improvements in accuracy from CoT-decoding
- Even though instruction-tuned models have seen many CoT examples during instruction fine-tuning, they still try to jump directly to problem solving, so CoT-decoding boosted their performance tremendously
- CoT-decoding closes the gap between pre-trained and instruction-tuned models in terms of reasoning capabilities
- \(k\) has a significant effect on pre-trained models but a negligible effect on instruction-tuned models after \(k = 2\). For pre-trained models, the best paths may sit at a lower rank (requiring a larger \(k\)), but instruction-tuned models are already trained to bring the best paths to the top
- Combining CoT prompting with CoT-decoding yields even better performance by eliciting more reasoning capabilities
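- A minimal sketch of the aggregation step recommended above, assuming a list of `(confidence, text)` pairs like the `paths` list in the `cot_decode` sketch earlier; `extract_answer` is an illustrative heuristic, not the paper's exact extraction rule:

```python
# Aggregate decoded paths: sum the confidence of all paths that end in the
# same final answer and pick the answer with the largest total
# (a confidence-weighted vote).
import re
from collections import defaultdict

def extract_answer(text: str) -> str:
    # Naive extraction for math problems: take the last number in the text.
    numbers = re.findall(r"-?\d+\.?\d*", text)
    return numbers[-1] if numbers else text.strip()

def aggregate(paths: list[tuple[float, str]]) -> str:
    totals = defaultdict(float)
    for confidence, text in paths:
        totals[extract_answer(text)] += confidence
    return max(totals, key=totals.get)
```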
#nlp #llm #agents