
[WIP] Reward ranked finetuning (RAFT) and Reinforced Self-Training (ReST) #704

Closed
wants to merge 23 commits

Conversation

kashif
Collaborator

@kashif commented Aug 29, 2023

https://arxiv.org/pdf/2304.06767.pdf RAFT
https://arxiv.org/pdf/2308.08998.pdf ReST

Please let me know if anyone wants to pair up to complete this.

Some suggestions from the RAFT author:

The implementations of ReST and RAFT follow basically the same recipe, so they can be implemented in a unified manner. Some implementation details:

  • One distinct feature of rejection-sampling-based algorithms is that they are off-policy, so data generation (inference), reward computation, and fine-tuning can be implemented as separate stages. The main advantage is that we only need to load ONE model at a time, although this requires loading and saving models sequentially. As TRL already has an SFT trainer, one option may be to
    1. implement an inference script; and
    2. implement a reward computation script.
    Then ReST/RAFT = alternating among inference, reward computation, and SFT, driven by a main script (a rough sketch of such a loop is shown below).
  • The main bottleneck, in both time and performance, is the inference stage. Batch inference can largely accelerate this process. We should also provide an interface for users to change the inference configuration (temperature, top-k, top-p) and to enable special generation modes.
  • It would be a good idea to record the mean reward of the filtered set (the best-of-K set), and also to save the filtered dataset itself. In practice, reward hacking can often be spotted early by monitoring these quantities.
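
To make the three stages concrete, here is a rough, illustrative sketch (not TRL code) of batched inference, reward scoring, and best-of-K filtering. It assumes a causal-LM policy and a sequence-classification reward model that outputs a single scalar per sequence; the model paths, sampling settings, and K are placeholders, and the SFT stage would simply reuse the existing SFTTrainer on the returned texts:

```python
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

POLICY_PATH = "facebook/opt-350m"        # placeholder policy checkpoint
REWARD_PATH = "path/to/reward-model"     # placeholder reward model checkpoint


def generate_candidates(prompts, k=4, max_new_tokens=64, temperature=1.0, top_p=0.9):
    """Inference stage: batched sampling of K candidate completions per prompt."""
    tok = AutoTokenizer.from_pretrained(POLICY_PATH, padding_side="left")
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(POLICY_PATH)
    inputs = tok(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            num_return_sequences=k,
            max_new_tokens=max_new_tokens,
            pad_token_id=tok.pad_token_id,
        )
    texts = tok.batch_decode(out, skip_special_tokens=True)
    # generate() returns the K samples of each prompt consecutively; regroup them.
    return [texts[i * k:(i + 1) * k] for i in range(len(prompts))]


def best_of_k(candidate_groups):
    """Reward stage + filtering: score all candidates, keep the best one per prompt."""
    tok = AutoTokenizer.from_pretrained(REWARD_PATH)
    reward_model = AutoModelForSequenceClassification.from_pretrained(REWARD_PATH)
    kept, kept_rewards = [], []
    for group in candidate_groups:
        inputs = tok(group, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            rewards = reward_model(**inputs).logits.squeeze(-1)
        best = int(rewards.argmax())
        kept.append(group[best])
        kept_rewards.append(float(rewards[best]))
    # Logging the mean reward of the best-of-K set helps spot reward hacking early.
    print(f"mean filtered reward: {sum(kept_rewards) / len(kept_rewards):.3f}")
    return kept


# filtered = best_of_k(generate_candidates(["Explain rejection sampling in one line."]))
# ...fine-tune the policy on `filtered` with the existing SFTTrainer, save, and repeat.
```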

@kashif marked this pull request as draft August 29, 2023 13:34
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@metric-space
Contributor

@kashif I can't pass up an opportunity to collaborate. I'm interested. FWIW, I've started reading the paper. Any chance we could talk about the timeline for getting this done?

@kashif
Collaborator Author

kashif commented Aug 31, 2023

Very cool @metric-space, there is no deadline and I like to work at a relaxed pace, so yes... let's do it! I am currently working on getting the algorithm hashed out in a Jupyter notebook...

@metric-space
Contributor

@kashif re: deadline, I'm glad there's no deadline and I too like to work at a relaxed pace. I'll continue to read the paper and I'll try to do the same, i.e. get the algorithm hashed out in a Jupyter notebook (for my own amusement). If you have any idea of what I can do in the meanwhile to assist you and/or make this a more fruitful collaboration, please let me know.

@kashif
Collaborator Author

kashif commented Aug 31, 2023

Awesome @metric-space, I added you as a collaborator to my fork. Ideally it would be nice to have a small reward model and base model, together with an appropriate dataset, in order to start... I have an initial facebook/opt-350m base and reward model done... but something smaller might be even more helpful...

@metric-space
Contributor

@kashif Nothing to report from my end; I'm being very slow at the moment. Hoping to cover satisfactory ground by the end of the weekend.

@gaetanlop
Contributor

gaetanlop commented Sep 3, 2023

Hello @kashif @metric-space, I have worked on something similar for a project. Would love to collaborate on this too :)

@1485840691-eng
Contributor

Hello @kashif @metric-space @gaetanlop, I would love to collaborate on this too :). From my understanding of the paper, we could aim to implement the interleaving Grow and Improve steps with a simple BC loss; there is no need to introduce extra offline RL. The model that generates samples in the Grow step and the reward model that produces the reward scores could leverage the existing models in TRL. Please correct me if I am wrong. I would like to know which stage you have developed so far and what extra I could contribute. Thanks
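
For concreteness, the Improve step described in the ReST paper filters the grown dataset with an increasing reward threshold across Improve iterations and fine-tunes on what survives. A minimal sketch of that filtering follows; the threshold values are made-up placeholders, and the actual fine-tuning would just be a plain NLL/SFT run:

```python
def improve_filter(samples, rewards, thresholds=(0.0, 0.5, 0.7, 0.9)):
    """Yield one progressively smaller, higher-reward training subset per Improve step."""
    for tau in thresholds:
        yield tau, [s for s, r in zip(samples, rewards) if r >= tau]


# for tau, subset in improve_filter(generated_samples, reward_scores):
#     run a standard supervised (BC / NLL) fine-tuning pass on `subset`,
#     e.g. with the existing SFTTrainer, before moving to the next threshold.
```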

@kashif
Collaborator Author

kashif commented Sep 3, 2023

@1485840691-eng @gaetanlop so yes, at the moment I am thinking we do not need a dedicated trainer, and I have been cooking up a training script but haven't got far... I have added you to my fork as collaborators.

@gaetanlop
Contributor

gaetanlop commented Sep 3, 2023

I think adding an IterativeTrainer with a step and a generate function, instead of using the HF or the SFT Trainer, could be useful here (only the Seq2SeqTrainer has a way to generate from the model). It could be used for ReST, Rejection Sampling (#576), and for other works that alternate between generation and training, such as GKD (https://arxiv.org/abs/2306.13649). @kashif @younesbelkada what do you think?
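
A rough sketch of what such an interface could look like (illustrative only, not the actual #737 implementation; it assumes a Hugging Face causal LM and a tokenizer that already has a pad token set):

```python
from typing import List

import torch


class IterativeTrainerSketch:
    """Illustrative interface only: a trainer that exposes generate() and step()."""

    def __init__(self, model, tokenizer, optimizer):
        self.model = model
        self.tokenizer = tokenizer
        self.optimizer = optimizer

    @torch.no_grad()
    def generate(self, prompts: List[str], **generation_kwargs) -> List[str]:
        """Sample completions from the current policy (the generation / Grow stage)."""
        # Assumes the tokenizer is configured with left padding for decoder-only generation.
        inputs = self.tokenizer(prompts, return_tensors="pt", padding=True)
        out = self.model.generate(**inputs, **generation_kwargs)
        return self.tokenizer.batch_decode(out, skip_special_tokens=True)

    def step(self, texts: List[str]) -> float:
        """One supervised update on an externally filtered batch (the Improve stage)."""
        batch = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        labels = batch["input_ids"].clone()
        labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
        loss = self.model(**batch, labels=labels).loss
        loss.backward()
        self.optimizer.step()
        self.optimizer.zero_grad()
        return loss.item()
```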

@lvwerra
Member

lvwerra commented Sep 4, 2023

I'd be in favour of having an iterative version of the SFTTrainer which has a step function and can be fed a batch, which could then be reused for rejection sampling or ReST. Those approaches could then even be example scripts rather than dedicated trainers, which would be a bit easier to maintain.
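
As a sketch of how small such an example script could then be for rejection sampling, using the hypothetical IterativeTrainerSketch interface above and an arbitrary reward_fn callable (all names illustrative):

```python
def rejection_sampling_example(trainer, reward_fn, prompts, iterations=3, k=8):
    """Best-of-K rejection sampling on top of a step()-style trainer (illustrative)."""
    for _ in range(iterations):
        for prompt in prompts:
            # Decoded candidates already include the prompt text.
            candidates = trainer.generate([prompt] * k, do_sample=True, max_new_tokens=64)
            best = max(candidates, key=reward_fn)  # keep the highest-reward sample
            trainer.step([best])                   # supervised update on the winner
```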

@gaetanlop
Contributor

I have started to work on an IterativeTrainer in #737

@metric-space
Contributor

metric-space commented Sep 5, 2023

I wanted to confirm something. I know @kashif was working on the script, but given the progression of the PR, do we have something that shows the algorithm is working? Maybe I missed something in the commits.

Perhaps this commit by @gaetanlop: d08fa3e?

@kashif
Collaborator Author

kashif commented Sep 5, 2023

@metric-space I haven't tested the script from @gaetanlop, so perhaps trying it out is my next todo.

@metric-space
Contributor

metric-space commented Sep 5, 2023

So I took a quick look; I think the improve step depends on the work taken up in #737, so as of now it is incomplete. That said, it looks like things are moving pretty fast, so it will most likely be done soon by @gaetanlop. I guess it's either waiting on that work, or throwing everything that exists in said script, with an ad-hoc training step, into your committed notebook and checking the running metrics.

@metric-space
Contributor

Almost forgot, but @1485840691-eng, could you explain this a bit more? It's likely I missed this in the paper:

From my understanding of the paper, we could aim to implement the interleaving Grow and Improve steps with a simple BC loss; there is no need to introduce extra offline RL.

@gaetanlop
Contributor

@metric-space @kashif the script I have added is a work in progress. It is missing the "improve step", which would require #737, so until we introduce the iterative trainer in trl, I don't think the ReST script can be finished. I still have not finalized the script for the IterativeTrainer, and I am unsure if what I have done in #737 is similar to what was expected for this trainer.

@gaetanlop
Contributor

Almost forgot, but @1485840691-eng, could you explain this a bit more? It's likely I missed this in the paper:

From my understanding of the paper, we could aim to implement the interleaving Grow and Improve steps with a simple BC loss; there is no need to introduce extra offline RL.

I think @1485840691-eng meant that we can use behavioral cloning (i.e. NLL training) instead of offline RL losses like GOLD or BVMPO for the example of how to use ReST. In the paper they show that the highest reward was achieved using BC.
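
Concretely, the BC objective is just the usual token-level negative log-likelihood on the filtered samples, typically with the prompt tokens masked out of the loss. A minimal sketch, assuming a Hugging Face causal LM and tokenizer:

```python
import torch


def bc_loss(model, tokenizer, prompt: str, completion: str) -> torch.Tensor:
    """Behavioural-cloning (NLL) loss for one filtered sample, masking the prompt tokens."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100  # only completion tokens contribute to the loss
    # (Masking by prompt length is approximate at token boundaries; fine for a sketch.)
    return model(input_ids=full_ids, labels=labels).loss
```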

@1485840691-eng
Contributor

@metric-space @gaetanlop @kashif Yes, I mean that simple BC with an NLL loss suffices to achieve comparable performance, according to the paper. I have some thoughts to bring up for discussion: 1) instead of creating a step() function in the trainer, could we create a brand-new Trainer instance at each Improve step, get the new combined dataset, and train the LM? 2) The original paper filters the combined dataset (original + generated samples) with the reward model; why not filter only the generated samples based on reward model scores and keep the original training dataset intact? I summarize my suggestions in this text diff based on example/rest.py: https://www.diffchecker.com/RUD0YB6C/ (left is the original rest.py, right is the updated rest.py).
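
A rough sketch of suggestion (2) using the datasets library; the column names "text" and "reward" are placeholders for illustration:

```python
from datasets import Dataset, concatenate_datasets


def build_improve_dataset(original: Dataset, generated: Dataset, threshold: float) -> Dataset:
    """Filter only the model-generated samples by reward; keep the original SFT data intact."""
    kept = generated.filter(lambda ex: ex["reward"] >= threshold)
    return concatenate_datasets([original, kept.remove_columns(["reward"])])
```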

@kashif changed the title from "[WIP] Reinforced Self-Training (ReST)" to "[WIP] Reward ranked finetuning (RAFT) and Reinforced Self-Training (ReST)" on Sep 16, 2023
@WeiXiongUST

Keeping the previous samples would be a good idea. I noticed that the Llama 2 authors conducted rejection sampling for the first several iterations and observed the following issue (see around page 14 of the Llama 2 paper):


In earlier versions of our model, up to RLHF V3, our approach was to confine answer selection solely to the “bag” of samples gathered from the preceding iteration. For example, RLHF V3 was trained using only samples from RLHF V2. However, despite continuous improvement, this method led to a regression in some capabilities. For example, RLHF V3 struggled more than previous versions to compose rhyming lines in poems, as discerned through qualitative analysis, suggesting that further investigation into the causes of and mitigations for forgetting (Kirkpatrick et al., 2017; Nguyen et al., 2019; Ramasesh et al., 2021) could be a fruitful area for additional future research.

In response, on subsequent iterations, we modified our strategy, incorporating top-performing samples from all prior iterations, such as those used in RLHF-V1 and RLHF-V2. Although we do not present specific figures, this adjustment demonstrated considerable enhancements in performance and effectively addressed the previously noted issues. This mitigation can be seen as analogous to Synnaeve et al. (2019) and Vinyals et al. (2019) in the RL literature.


In the implementation, we may also provide an option so that the user can combine the samples from the current iteration with

  • the samples from the 1st or 2nd iterations;
  • a fixed dataset provided by the user.

What do you think of it?
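
A rough sketch of what such an option could look like (pure-Python pooling; all names and the default of keeping the last two iterations are illustrative):

```python
def build_training_pool(current_samples, history, fixed_dataset=None, keep_last=2):
    """Combine the current iteration's filtered samples with earlier iterations
    and/or a fixed user-provided dataset, as suggested above (illustrative)."""
    pool = list(current_samples)
    for past in history[-keep_last:]:       # e.g. samples kept from the previous iterations
        pool.extend(past)
    if fixed_dataset is not None:
        pool.extend(fixed_dataset)          # optional fixed dataset provided by the user
    history.append(list(current_samples))   # remember this iteration for future rounds
    return pool
```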

@freQuensy23-coder

Dear collaborators,

I am writing to propose an optimization for the training process of our reinforcement-learning agent. Currently, we generate a full reward dataset and then select a percentage of good responses to form the training set. However, this is inefficient because we recompute, during loss computation, forward passes that could be avoided.

Instead, I suggest that we save the computation graphs for each reward-dataset example while generating the dataset. Then we can prune the graph to include only the operations needed for the successful examples that will be used for training. This should improve efficiency by preventing redundant loss calculation on unused examples.

Please let me know if you would like to discuss this further. I believe this change could meaningfully improve training time. Thank you for considering this suggestion.

Best regards,
Alexey Mametyev
t.me/freQuensy23
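
A heavily hedged sketch of what this idea could look like: sample token by token while keeping the per-token log-probabilities (and hence the computation graph) attached, then call backward only on the sequences that survive reward filtering. Retaining these graphs for a whole dataset is very memory-hungry, so this is only a small-scale illustration of the proposal, not TRL code; all names are illustrative.

```python
import torch


def sample_with_logprob(model, tokenizer, prompt, max_new_tokens=32):
    """Sampling loop that keeps the summed log-prob of the generated tokens with grad attached.
    WARNING: the graph of every forward pass is retained until backward/filtering, which is
    memory-prohibitive beyond small batches and short completions."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    log_probs = []
    for _ in range(max_new_tokens):
        logits = model(ids).logits[:, -1, :]              # no torch.no_grad(): graph is kept
        dist = torch.distributions.Categorical(logits=logits)
        token = dist.sample()
        log_probs.append(dist.log_prob(token))
        ids = torch.cat([ids, token.unsqueeze(-1)], dim=-1)
        if token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True), torch.stack(log_probs).sum()


# After reward filtering, reuse the stored graphs instead of a fresh forward pass:
# loss = -torch.stack([lp for _, lp in kept_samples]).mean()
# loss.backward()
```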


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@github-actions bot closed this Nov 27, 2023
@lvwerra
Member

lvwerra commented Nov 27, 2023

Let's keep it open :)
