Chain-of-thought finetuning Proposal #3467
Replies: 4 comments
-
This could be very useful, although I am unsure how you would go about integrating it into the current dataset. It might require sorting the existing datasets by how much chain-of-thought they exhibit, which could prove problematic. It might also conflict with the fact that you would probably want the initial data the model is trained on to be simple.
-
As is, this would not involve any change to the existing datasets: the objective is to include chain-of-thought style datapoints in the SFT stage, not to replace the existing dataset with a chain-of-thought one. Mind you, nobody is actually aiming for the CoT material to appear in our output; if we can do without it, the outputs are nicer. In all of this I am assuming that the distributions of CoT text and assistant replies are not completely disjoint (or at least that the model does not treat them as completely disjoint), which I think is a fair assumption, and one we are making anyway when training on lots of different datasets.
-
When preparing the dataset, remember to also include non-maths problems.
-
@MattAlexMiracle I'd like to draw your attention to Reasoning via Planning and Tree of Thought as alternatives to Chain of Thought.
-
Background and Rationale
Recent work on LLMs has shown large performance increases on prompted tasks from externalising intermediate reasoning into the data stream (see https://arxiv.org/pdf/2205.11916.pdf, https://arxiv.org/pdf/2301.13379.pdf, https://arxiv.org/abs/2201.11903).
This process helps the model ground its answers by first drawing relevant facts into the foreground and then continuing with the answer.
Consider a prompt that requires factual recall or multi-step reasoning: while contemporary models may attempt to answer such a question directly, chain-of-thought models first retrieve the relevant information and only then answer the task.
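As a purely illustrative example (the question and wording are made up, not taken from the proposal), the two answer styles might look like:

```
Prompt: How many legs do three spiders have in total?

Direct answer:            24.
Chain-of-thought answer:  A spider has 8 legs, so three spiders have 3 × 8 = 24 legs.
                          Therefore, the answer is 24.
```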
While the externalisation of such chains-of-thought may not always be necessary or appropriate, empirically it does seem to improve the overall performance of LLMs by giving models a temporary write-once-read-many memory.
It is unclear whether existing chat models such as ChatGPT are trained on such tasks explicitly, but the overall structure of "summarize the task appropriately -> answer the task" seems to be ingrained into ChatGPT.
Research Question
Because RLHF reinforces existing behaviour, it seems logical that the initial state of the model strongly determines downstream behaviour: after all, the reward model can only boost or penalize already existing signals, as nonexistent signals will not receive direct feedback.
The hypothesis is that a model finetuned on chain-of-thought data will also carry these chain-of-thought style answers forward, which seems desirable both from a performance point of view (through the write-once-read-many memory) and from an interpretability point of view (externalised reasoning can be checked more easily than internalised reasoning).
Research Methodology
Gather appropriate explicit-reasoning datasets and finetune on them jointly with the assistant feedback data.
These datasets should contain correct and explicit reasoning chains.
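A minimal sketch of how such a mixture could be built, assuming plain prompt/response records; the formatting helper, field names, and the 10% mixing ratio are illustrative assumptions, not an agreed-upon pipeline:

```python
import random

def format_cot_example(question, reasoning, answer):
    """Render one explicit-reasoning datapoint as a plain prompt/response pair."""
    return {
        "prompt": f"User: {question}\nAssistant:",
        "response": f" {reasoning}\nTherefore, the answer is {answer}.",
    }

def build_sft_mixture(assistant_data, cot_data, cot_fraction=0.1, seed=0):
    """Add a small fraction of chain-of-thought datapoints to the existing
    assistant SFT data instead of replacing it."""
    rng = random.Random(seed)
    n_cot = min(int(len(assistant_data) * cot_fraction), len(cot_data))
    mixture = list(assistant_data) + rng.sample(list(cot_data), k=n_cot)
    rng.shuffle(mixture)
    return mixture
```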
Once this finetuning is done, existing reward models could be used to perform standard RLHF training.
In RLHF training, we may also be able to increase the utilization of explicit reasoning by using common seed prompts like "Let's think step by step" to encourage the model to produce explicit argument chains.
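For the seed-prompt idea, something as simple as the following could be used when sampling rollouts (the 50% seeding ratio and the exact cue are assumptions):

```python
import random

SEED_CUE = "\nLet's think step by step."

def seed_rollout_prompt(prompt, rng=random.Random(0), p=0.5):
    """Append the step-by-step cue to a fraction of RLHF rollout prompts so the
    policy is more likely to emit explicit reasoning for the reward model to score."""
    return prompt + SEED_CUE if rng.random() < p else prompt
```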
Possible extensions
In the long run it might also be interesting to have these "additional information" memory components be delimited and written in an efficient way, e.g. wrapped in explicit tags such as `<scratch>...</scratch>`.
This would allow these memory components to be filtered from the user-facing output stream and, should the scratch be sufficiently well formatted, could also allow retrieval-based information or explicit computation to be added (e.g. the model might place
6/12 = 0.7
in the scratch, and an explicit validator first runs through the `<scratch/>` content to fix math mistakes before the answer is shown).
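A rough sketch of that extension, under the assumption that the model wraps its working memory in `<scratch>...</scratch>` tags (the tag name, regexes, and validator are hypothetical, not an agreed-upon convention):

```python
import re

SCRATCH_RE = re.compile(r"<scratch>(.*?)</scratch>", re.DOTALL)
DIVISION_RE = re.compile(r"(\d+)\s*/\s*(\d+)\s*=\s*(\d+(?:\.\d+)?)")

def split_scratch(model_output):
    """Separate the user-facing text from the delimited scratch blocks."""
    scratch_blocks = SCRATCH_RE.findall(model_output)
    visible = SCRATCH_RE.sub("", model_output).strip()
    return visible, scratch_blocks

def check_divisions(scratch_block, tol=1e-6):
    """Flag simple division mistakes such as '6/12 = 0.7' inside a scratch block."""
    mistakes = []
    for num, den, claimed in DIVISION_RE.findall(scratch_block):
        true_value = int(num) / int(den)
        if abs(true_value - float(claimed)) > tol:
            mistakes.append((f"{num}/{den}", float(claimed), true_value))
    return mistakes
```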