Chain-of-thought finetuning Proposal #3467
Replies: 4 comments
-
This could be very useful, although I am unsure how you would go about integrating it into the current dataset. It might require sorting the existing datasets by how much chain-of-thought they exhibit, which could prove problematic. It might also conflict with the fact that you would probably want the initial data the model is trained on to be simple.
-
As is, this would not involve any change to the existing datasets: the objective is to include chain-of-thought style datapoints in the SFT stage, not to replace the existing dataset with a chain-of-thought one. Mind you, nobody is actually aiming for the CoT material to appear in our output; if we can do without it, the outputs are nicer. In all of this I am assuming that the distributions of CoT text and assistant replies are not completely disjoint (or at least that the model does not treat them as completely disjoint), which I think is a fair assumption, and one we are making anyway when training on lots of different datasets.
-
When preparing the dataset, remember to also include non-maths problems.
-
@MattAlexMiracle I'd like to draw your attention to Reasoning via Planning and Tree of Thought as alternatives to Chain of Thought.
-
Background and Rationale
Recent work on LLMs has shown large performance increases on prompted tasks from externalising intermediate reasoning into the data stream (see https://arxiv.org/pdf/2205.11916.pdf, https://arxiv.org/pdf/2301.13379.pdf, https://arxiv.org/abs/2201.11903).
This process helps the model ground its answers by first drawing relevant facts into the foreground and then continuing with the answer.
Consider a prompt that requires factual recall or multi-step reasoning: while contemporary models may attempt to answer such a question directly, chain-of-thought models first retrieve the relevant information and only then answer the task.
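As a purely illustrative example (the question and wording are made up, not taken from the proposal), the two answer styles might look like:

```
Prompt: How many legs do three spiders have in total?

Direct answer:            24.
Chain-of-thought answer:  A spider has 8 legs, so three spiders have 3 × 8 = 24 legs.
                          Therefore, the answer is 24.
```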
While the externalisation of such chains-of-thought may not always be necessary or appropriate, empirically it does seem to improve the overall performance of LLMs by giving models a temporary write-once-read-many memory.
It is unclear whether existing chat models such as ChatGPT are trained on such tasks explicitly, but the overall structure of "summarize the task appropriately -> answer the task" seems to be ingrained into ChatGPT.
Research Question
Because RLHF reinforces existing behaviour, it seems logical that the initial state of the model strongly determines downstream behaviour: after all, the reward model can only boost or penalize already existing signals, as nonexistent signals will not receive direct feedback.
The hypothesis is that a model finetuned on chain-of-thought data will also carry these chain-of-thought style answers forward, which seems desirable both from a performance point of view (through the write-once-read-many memory) and from an interpretability point of view (externalised reasoning can be checked more easily than internalised reasoning).
Research Methodology
Gather appropriate explicit-reasoning datasets and finetune on them jointly with the assistant feedback data.
These datasets should contain correct and explicit reasoning chains.
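A minimal sketch of how such a mixture could be built, assuming plain prompt/response records; the formatting helper, field names, and the 10% mixing ratio are illustrative assumptions, not an agreed-upon pipeline:

```python
import random

def format_cot_example(question, reasoning, answer):
    """Render one explicit-reasoning datapoint as a plain prompt/response pair."""
    return {
        "prompt": f"User: {question}\nAssistant:",
        "response": f" {reasoning}\nTherefore, the answer is {answer}.",
    }

def build_sft_mixture(assistant_data, cot_data, cot_fraction=0.1, seed=0):
    """Add a small fraction of chain-of-thought datapoints to the existing
    assistant SFT data instead of replacing it."""
    rng = random.Random(seed)
    n_cot = min(int(len(assistant_data) * cot_fraction), len(cot_data))
    mixture = list(assistant_data) + rng.sample(list(cot_data), k=n_cot)
    rng.shuffle(mixture)
    return mixture
```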
Once this finetuning is done, existing reward models could be used to perform standard RLHF training.
In RLHF training, we may also be able to increase the utilization of explicit reasoning by using common seed prompts like "Let's think step by step" to encourage the model to produce explicit argument chains.
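For the seed-prompt idea, something as simple as the following could be used when sampling rollouts (the 50% seeding ratio and the exact cue are assumptions):

```python
import random

SEED_CUE = "\nLet's think step by step."

def seed_rollout_prompt(prompt, rng=random.Random(0), p=0.5):
    """Append the step-by-step cue to a fraction of RLHF rollout prompts so the
    policy is more likely to emit explicit reasoning for the reward model to score."""
    return prompt + SEED_CUE if rng.random() < p else prompt
```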
Possible extensions
In the long run it might also be interesting to have these "additional information" memory components be delimited and written in an efficient way, e.g. wrapped in explicit tags such as `<scratch>...</scratch>`.
This would allow these memory components to be filtered from the user-facing output stream and, should the scratch be sufficiently well formatted, could also allow retrieval-based information or explicit computation to be added (e.g. the model might place
6/12 = 0.7
in the scratch, and an explicit validator first runs through the `<scratch/>` content to fix math mistakes before the answer is shown).
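A rough sketch of that extension, under the assumption that the model wraps its working memory in `<scratch>...</scratch>` tags (the tag name, regexes, and validator are hypothetical, not an agreed-upon convention):

```python
import re

SCRATCH_RE = re.compile(r"<scratch>(.*?)</scratch>", re.DOTALL)
DIVISION_RE = re.compile(r"(\d+)\s*/\s*(\d+)\s*=\s*(\d+(?:\.\d+)?)")

def split_scratch(model_output):
    """Separate the user-facing text from the delimited scratch blocks."""
    scratch_blocks = SCRATCH_RE.findall(model_output)
    visible = SCRATCH_RE.sub("", model_output).strip()
    return visible, scratch_blocks

def check_divisions(scratch_block, tol=1e-6):
    """Flag simple division mistakes such as '6/12 = 0.7' inside a scratch block."""
    mistakes = []
    for num, den, claimed in DIVISION_RE.findall(scratch_block):
        true_value = int(num) / int(den)
        if abs(true_value - float(claimed)) > tol:
            mistakes.append((f"{num}/{den}", float(claimed), true_value))
    return mistakes
```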