
[WIP] Reward ranked finetuning (RAFT) and Reinforced Self-Training (ReST) #704

Closed
wants to merge 23 commits

Conversation

kashif
Collaborator

@kashif commented Aug 29, 2023

https://arxiv.org/pdf/2304.06767.pdf RAFT
https://arxiv.org/pdf/2308.08998.pdf ReST

Please let me know if anyone wants to pair up to complete this.

Some suggestions from the RAFT author:

The implementations of ReST and RAFT follow basically the same recipe, so they can be implemented in a unified manner. Some implementation details:

  • One distinct feature of rejection-sampling-based algorithms is that they are off-policy, so data generation (inference), reward computation, and fine-tuning can be implemented as separate stages. The main advantage is that we only need to load ONE model at a time, although this requires loading and saving models sequentially. As TRL already has an SFT trainer, one option may be to
    1. implement an inference script; and
    2. implement a reward computation script.
    Then ReST/RAFT = alternating among inference, reward computation, and SFT, driven by a main script (a rough sketch of such a loop is shown below).
  • The main bottleneck, in both time and performance, is the inference stage. Batch inference can largely accelerate this process. We should also provide an interface for users to change the inference configuration (temperature, top-k, top-p) and to enable special generation modes.
  • It would be a good idea to record the mean reward of the filtered set (the best-of-K set), and also to save the filtered dataset itself. In practice, reward hacking can often be spotted early by monitoring these quantities.
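
To make the three stages concrete, here is a rough, illustrative sketch (not TRL code) of batched inference, reward scoring, and best-of-K filtering. It assumes a causal-LM policy and a sequence-classification reward model that outputs a single scalar per sequence; the model paths, sampling settings, and K are placeholders, and the SFT stage would simply reuse the existing SFTTrainer on the returned texts:

```python
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

POLICY_PATH = "facebook/opt-350m"        # placeholder policy checkpoint
REWARD_PATH = "path/to/reward-model"     # placeholder reward model checkpoint


def generate_candidates(prompts, k=4, max_new_tokens=64, temperature=1.0, top_p=0.9):
    """Inference stage: batched sampling of K candidate completions per prompt."""
    tok = AutoTokenizer.from_pretrained(POLICY_PATH, padding_side="left")
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(POLICY_PATH)
    inputs = tok(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            num_return_sequences=k,
            max_new_tokens=max_new_tokens,
            pad_token_id=tok.pad_token_id,
        )
    texts = tok.batch_decode(out, skip_special_tokens=True)
    # generate() returns the K samples of each prompt consecutively; regroup them.
    return [texts[i * k:(i + 1) * k] for i in range(len(prompts))]


def best_of_k(candidate_groups):
    """Reward stage + filtering: score all candidates, keep the best one per prompt."""
    tok = AutoTokenizer.from_pretrained(REWARD_PATH)
    reward_model = AutoModelForSequenceClassification.from_pretrained(REWARD_PATH)
    kept, kept_rewards = [], []
    for group in candidate_groups:
        inputs = tok(group, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            rewards = reward_model(**inputs).logits.squeeze(-1)
        best = int(rewards.argmax())
        kept.append(group[best])
        kept_rewards.append(float(rewards[best]))
    # Logging the mean reward of the best-of-K set helps spot reward hacking early.
    print(f"mean filtered reward: {sum(kept_rewards) / len(kept_rewards):.3f}")
    return kept


# filtered = best_of_k(generate_candidates(["Explain rejection sampling in one line."]))
# ...fine-tune the policy on `filtered` with the existing SFTTrainer, save, and repeat.
```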

@kashif marked this pull request as draft August 29, 2023 13:34
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@metric-space
Contributor

@kashif I can't pass up an opportunity to collaborate. I'm interested. FWIW, I've started reading the paper. Any chance we could talk about the timeline for getting this done?

@kashif
Collaborator Author

kashif commented Aug 31, 2023

Very cool @metric-space, there is no deadline and I like to work at a relaxed pace, so yes... let's do it! I am currently working on getting the algorithm hashed out in a Jupyter notebook...

@metric-space
Contributor

@kashif re: deadline, I'm glad there's no deadline and I too like to work at a relaxed pace. I'll continue to read the paper and I'll try to do the same, i.e. get the algorithm hashed out in a Jupyter notebook (for my own amusement). If you have any idea of what I can do in the meanwhile to assist you and/or make this a more fruitful collaboration, please let me know.

@kashif
Collaborator Author

kashif commented Aug 31, 2023

Awesome @metric-space, I added you as a collaborator to my fork. Ideally it would be nice to have a small reward model and base model, together with an appropriate dataset, in order to start... I have an initial facebook/opt-350m base and reward model done... but something smaller might be even more helpful...

@metric-space
Contributor

@kashif Nothing to report from my end; I'm being very slow at the moment. Hoping to cover satisfactory ground by the end of the weekend.

@gaetanlop
Contributor

gaetanlop commented Sep 3, 2023

Hello @kashif @metric-space, I have worked on something similar for a project. Would love to collaborate on this too :)

@1485840691-eng
Contributor

Hello @kashif @metric-space @gaetanlop, I would love to collaborate on this too :). From my understanding of the paper, we could aim to implement the interleaving Grow and Improve steps with a simple BC loss; there is no need to introduce extra offline RL. The model that generates samples in the Grow step and the reward model that produces the reward scores could leverage the existing models in TRL. Please correct me if I am wrong. I would like to know which stage you have developed so far and what extra I could contribute. Thanks
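
For concreteness, the Improve step described in the ReST paper filters the grown dataset with an increasing reward threshold across Improve iterations and fine-tunes on what survives. A minimal sketch of that filtering follows; the threshold values are made-up placeholders, and the actual fine-tuning would just be a plain NLL/SFT run:

```python
def improve_filter(samples, rewards, thresholds=(0.0, 0.5, 0.7, 0.9)):
    """Yield one progressively smaller, higher-reward training subset per Improve step."""
    for tau in thresholds:
        yield tau, [s for s, r in zip(samples, rewards) if r >= tau]


# for tau, subset in improve_filter(generated_samples, reward_scores):
#     run a standard supervised (BC / NLL) fine-tuning pass on `subset`,
#     e.g. with the existing SFTTrainer, before moving to the next threshold.
```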

@kashif
Collaborator Author

kashif commented Sep 3, 2023

@1485840691-eng @gaetanlop so yes, at the moment I am thinking we do not need a dedicated trainer, and I have been cooking up a training script but haven't got far... I have added you to my fork as collaborators.

@gaetanlop
Contributor

gaetanlop commented Sep 3, 2023

I think adding an IterativeTrainer with a step and a generate function, instead of using the HF or the SFT Trainer, could be useful here (only the Seq2SeqTrainer has a way to generate from the model). It could be used for ReST, Rejection Sampling (#576), and for other works that alternate between generation and training, such as GKD (https://arxiv.org/abs/2306.13649). @kashif @younesbelkada what do you think?
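
A rough sketch of what such an interface could look like (illustrative only, not the actual #737 implementation; it assumes a Hugging Face causal LM and a tokenizer that already has a pad token set):

```python
from typing import List

import torch


class IterativeTrainerSketch:
    """Illustrative interface only: a trainer that exposes generate() and step()."""

    def __init__(self, model, tokenizer, optimizer):
        self.model = model
        self.tokenizer = tokenizer
        self.optimizer = optimizer

    @torch.no_grad()
    def generate(self, prompts: List[str], **generation_kwargs) -> List[str]:
        """Sample completions from the current policy (the generation / Grow stage)."""
        # Assumes the tokenizer is configured with left padding for decoder-only generation.
        inputs = self.tokenizer(prompts, return_tensors="pt", padding=True)
        out = self.model.generate(**inputs, **generation_kwargs)
        return self.tokenizer.batch_decode(out, skip_special_tokens=True)

    def step(self, texts: List[str]) -> float:
        """One supervised update on an externally filtered batch (the Improve stage)."""
        batch = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        labels = batch["input_ids"].clone()
        labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
        loss = self.model(**batch, labels=labels).loss
        loss.backward()
        self.optimizer.step()
        self.optimizer.zero_grad()
        return loss.item()
```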

@lvwerra
Member

lvwerra commented Sep 4, 2023

I'd be in favour of having an iterative version of the SFTTrainer which has a step function and can be fed a batch, which could then be reused for rejection sampling or ReST. Those approaches could then even be example scripts rather than dedicated trainers, which would be a bit easier to maintain.
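
As a sketch of how small such an example script could then be for rejection sampling, using the hypothetical IterativeTrainerSketch interface above and an arbitrary reward_fn callable (all names illustrative):

```python
def rejection_sampling_example(trainer, reward_fn, prompts, iterations=3, k=8):
    """Best-of-K rejection sampling on top of a step()-style trainer (illustrative)."""
    for _ in range(iterations):
        for prompt in prompts:
            # Decoded candidates already include the prompt text.
            candidates = trainer.generate([prompt] * k, do_sample=True, max_new_tokens=64)
            best = max(candidates, key=reward_fn)  # keep the highest-reward sample
            trainer.step([best])                   # supervised update on the winner
```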

@gaetanlop
Contributor

I have started to work on an IterativeTrainer in #737

@metric-space
Contributor

metric-space commented Sep 5, 2023

I wanted to confirm something. I know @kashif was working on the script, but given the progression of the PR, do we have something that shows the algorithm is working? Maybe I missed something in the commits.

Perhaps this commit by @gaetanlop: d08fa3e?

@kashif
Collaborator Author

kashif commented Sep 5, 2023

@metric-space I haven't tested the script from @gaetanlop, so perhaps trying it out is my next todo.

@metric-space
Contributor

metric-space commented Sep 5, 2023

So I took a quick look; I think the improve step depends on the work taken up in #737, so as of now it is incomplete. That said, it looks like things are moving pretty fast, so it will most likely be done soon by @gaetanlop. I guess it's either waiting on that work, or throwing everything that exists in said script, with an ad-hoc training step, into your committed notebook and checking the running metrics.

@metric-space
Contributor

Almost forgot, but @1485840691-eng, could you explain this a bit more? It's likely I missed this in the paper:

From my understanding of the paper, we could aim to implement the interleaving Grow and Improve steps with a simple BC loss; there is no need to introduce extra offline RL.

@gaetanlop
Contributor

@metric-space @kashif the script I have added is a work in progress. It is missing the "improve step", which would require #737, so until we introduce the iterative trainer in trl, I don't think the ReST script can be finished. I still have not finalized the script for the IterativeTrainer, and I am unsure if what I have done in #737 is similar to what was expected for this trainer.

@gaetanlop
Contributor

Almost forgot, but @1485840691-eng, could you explain this a bit more? It's likely I missed this in the paper:

From my understanding of the paper, we could aim to implement the interleaving Grow and Improve steps with a simple BC loss; there is no need to introduce extra offline RL.

I think @1485840691-eng meant that we can use behavioral cloning (i.e. NLL training) instead of offline RL losses like GOLD or BVMPO for the example of how to use ReST. In the paper they show that the highest reward was achieved using BC.
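
Concretely, the BC objective is just the usual token-level negative log-likelihood on the filtered samples, typically with the prompt tokens masked out of the loss. A minimal sketch, assuming a Hugging Face causal LM and tokenizer:

```python
import torch


def bc_loss(model, tokenizer, prompt: str, completion: str) -> torch.Tensor:
    """Behavioural-cloning (NLL) loss for one filtered sample, masking the prompt tokens."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100  # only completion tokens contribute to the loss
    # (Masking by prompt length is approximate at token boundaries; fine for a sketch.)
    return model(input_ids=full_ids, labels=labels).loss
```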

@1485840691-eng
Contributor

@metric-space @gaetanlop @kashif Yes, I mean that simple BC with an NLL loss suffices to achieve comparable performance, according to the paper. I have some thoughts to bring up for discussion: 1) instead of creating a step() function in the trainer, could we create a brand-new Trainer instance at each Improve step, get the new combined dataset, and train the LM? 2) The original paper filters the combined dataset (original + generated samples) with the reward model; why not filter only the generated samples based on reward model scores and keep the original training dataset intact? I summarize my suggestions in this text diff based on example/rest.py: https://www.diffchecker.com/RUD0YB6C/ (left is the original rest.py, right is the updated rest.py).
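
A rough sketch of suggestion (2) using the datasets library; the column names "text" and "reward" are placeholders for illustration:

```python
from datasets import Dataset, concatenate_datasets


def build_improve_dataset(original: Dataset, generated: Dataset, threshold: float) -> Dataset:
    """Filter only the model-generated samples by reward; keep the original SFT data intact."""
    kept = generated.filter(lambda ex: ex["reward"] >= threshold)
    return concatenate_datasets([original, kept.remove_columns(["reward"])])
```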

@kashif changed the title from "[WIP] Reinforced Self-Training (ReST)" to "[WIP] Reward ranked finetuning (RAFT) and Reinforced Self-Training (ReST)" on Sep 16, 2023
@WeiXiongUST

Keeping the previous samples would be a good idea. I noticed that the Llama 2 authors conducted rejection sampling for the first several iterations and observed the following issue (see around page 14 of the Llama 2 paper):


In earlier versions of our model, up to RLHF V3, our approach was to confine answer selection solely to the “bag” of samples gathered from the preceding iteration. For example, RLHF V3 was trained using only samples from RLHF V2. However, despite continuous improvement, this method led to a regression in some capabilities. For example, RLHF V3 struggled more than previous versions to compose rhyming lines in poems, as discerned through qualitative analysis, suggesting that further investigation into the causes of and mitigations for forgetting (Kirkpatrick et al., 2017; Nguyen et al., 2019; Ramasesh et al., 2021) could be a fruitful area for additional future research.

In response, on subsequent iterations, we modified our strategy, incorporating top-performing samples from all prior iterations, such as those used in RLHF-V1 and RLHF-V2. Although we do not present specific figures, this adjustment demonstrated considerable enhancements in performance and effectively addressed the previously noted issues. This mitigation can be seen as analogous to Synnaeve et al. (2019) and Vinyals et al. (2019) in the RL literature.


In the implementation, we may also provide an option so that the user can combine the samples from the current iteration with

  • the samples from the 1st or 2nd iterations;
  • a fixed dataset provided by the user.

What do you think of it?
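
A rough sketch of what such an option could look like (pure-Python pooling; all names and the default of keeping the last two iterations are illustrative):

```python
def build_training_pool(current_samples, history, fixed_dataset=None, keep_last=2):
    """Combine the current iteration's filtered samples with earlier iterations
    and/or a fixed user-provided dataset, as suggested above (illustrative)."""
    pool = list(current_samples)
    for past in history[-keep_last:]:       # e.g. samples kept from the previous iterations
        pool.extend(past)
    if fixed_dataset is not None:
        pool.extend(fixed_dataset)          # optional fixed dataset provided by the user
    history.append(list(current_samples))   # remember this iteration for future rounds
    return pool
```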

@freQuensy23-coder

Dear collaborators,

I am writing to propose an optimization for the training process of our reinforcement-learning agent. Currently, we generate a full reward dataset and then select a percentage of good responses to form the training set. However, this is inefficient because we recompute, during loss computation, forward passes that could be avoided.

Instead, I suggest that we save the computation graphs for each reward-dataset example while generating the dataset. Then we can prune the graph to include only the operations needed for the successful examples that will be used for training. This should improve efficiency by preventing redundant loss calculation on unused examples.

Please let me know if you would like to discuss this further. I believe this change could meaningfully improve training time. Thank you for considering this suggestion.

Best regards,
Alexey Mametyev
t.me/freQuensy23
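
A heavily hedged sketch of what this idea could look like: sample token by token while keeping the per-token log-probabilities (and hence the computation graph) attached, then call backward only on the sequences that survive reward filtering. Retaining these graphs for a whole dataset is very memory-hungry, so this is only a small-scale illustration of the proposal, not TRL code; all names are illustrative.

```python
import torch


def sample_with_logprob(model, tokenizer, prompt, max_new_tokens=32):
    """Sampling loop that keeps the summed log-prob of the generated tokens with grad attached.
    WARNING: the graph of every forward pass is retained until backward/filtering, which is
    memory-prohibitive beyond small batches and short completions."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    log_probs = []
    for _ in range(max_new_tokens):
        logits = model(ids).logits[:, -1, :]              # no torch.no_grad(): graph is kept
        dist = torch.distributions.Categorical(logits=logits)
        token = dist.sample()
        log_probs.append(dist.log_prob(token))
        ids = torch.cat([ids, token.unsqueeze(-1)], dim=-1)
        if token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True), torch.stack(log_probs).sum()


# After reward filtering, reuse the stored graphs instead of a fresh forward pass:
# loss = -torch.stack([lp for _, lp in kept_samples]).mean()
# loss.backward()
```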


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@github-actions bot closed this Nov 27, 2023
@lvwerra
Member

lvwerra commented Nov 27, 2023

Let's keep it open :)
