[WIP] Reward ranked finetuning (RAFT) and Reinforced Self-Training (ReST) #704
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
@kashif I can't pass up an opportunity to collaborate. I'm interested. FWIW, I've started to read the paper. Any chance we could talk about the timeline of getting this done?
Very cool @metric-space, there is no deadline and I like to work at a relaxed pace, so yes... let's do it! I am currently working on getting the algorithm hashed out in a Jupyter notebook...
@kashif re: deadline, I'm glad there's no deadline and I too like to work at a relaxed pace. I'll continue to read the paper and I'll try to do the same, i.e. getting the algorithm hashed out in a Jupyter notebook (for my amusement). If you have any idea as to what I can do in the meanwhile to assist you and/or to make this a more fruitful collaboration, please let me know.
Awesome @metric-space, I added you as a collaborator to my fork. So yes, ideally it would be nice to have some small reward and base model together with an appropriate dataset in order to start... I have an initial facebook/opt-350m base and reward model done... but something smaller might be even more helpful...
@kashif Nothing to report from my end, I'm being very slow at the moment. Hoping to cover satisfactory ground by the end of the weekend.
Hello @kashif @metric-space, I have worked on something similar for a project. Would love to collaborate on this too :)
Hello @kashif @metric-space @gaetanlop, I would love to collaborate on this too :). From my understanding of the paper, we could aim to implement the interleaved Grow and Improve steps with simple BC losses, with no need to introduce extra offline RL. The model used to generate samples in the Grow step and the reward model used to compute reward scores could leverage the existing models in TRL. Please correct me if I am wrong. I would like to know which stage you have reached so far and what extra I could contribute. Thanks
@1485840691-eng @gaetanlop so yes, at the moment I am thinking we do not need a dedicated trainer; I was cooking up a training script but haven't gotten far... I have added you to my fork as collaborators.
I think adding an IterativeTrainer with a step and a generate function instead of using the HF or the SFT Trainer could be useful here (only the Seq2SeqTrainer has a way to generate from the model). It could be used for ReST, Rejection Sampling (#576) and other works that alternate between generation and training, such as GKD (https://arxiv.org/abs/2306.13649). @kashif @younesbelkada what do you think?
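[Editor's sketch] As a very rough illustration of the interface being proposed here: the class name, method names, and signatures below are purely illustrative (they are not the API that eventually landed in #737 or in trl), and the sketch assumes the tokenizer has a pad token set.

```python
# Hypothetical sketch of an iterative trainer interface: a `generate` method
# for the Grow step and a `step` method that runs one NLL update on
# externally prepared (already filtered) texts. Names are illustrative only.
import torch


class IterativeTrainerSketch:
    def __init__(self, model, tokenizer, optimizer, device="cpu"):
        self.model = model.to(device)
        self.tokenizer = tokenizer
        self.optimizer = optimizer
        self.device = device

    @torch.no_grad()
    def generate(self, prompts, **gen_kwargs):
        """Grow step: sample completions for a list of prompt strings."""
        inputs = self.tokenizer(prompts, return_tensors="pt", padding=True).to(self.device)
        outputs = self.model.generate(**inputs, **gen_kwargs)
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)

    def step(self, texts):
        """Improve step: one NLL (behavioral-cloning) update on filtered texts."""
        self.model.train()
        batch = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(self.device)
        # Mask padding positions out of the loss.
        labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)
        loss = self.model(**batch, labels=labels).loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
```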
I'd be in favour of having an iterative version of the [...]
I have started to work on an IterativeTrainer in #737
I wanted to confirm something. I know @kashif was working on the script, but given the progression of the PR, do we have something that shows the algorithm is working? Maybe I missed something in the commits. Perhaps this one by @gaetanlop: d08fa3e?
@metric-space I haven't tested the script from @gaetanlop, so perhaps that is my next todo: to try it out.
So I took a quick look, I think the [...] Edit: I meant [...]
Almost forgot, but @1485840691-eng, could you expand on the point about simple BC losses (vs. extra offline RL)? It's likely I missed this in the paper.
@metric-space @kashif the script I have added is a work in progress. It is missing the "improve step", which would require #737, so until we introduce the iterative trainer in trl, I don't think the ReST script can be finished. I still have not finalized the script for the IterativeTrainer, and I am unsure whether what I have done in #737 is similar to what was expected for this trainer.
I think @1485840691-eng meant that we can use behavioral cloning (i.e. NLL training) instead of offline RL losses like GOLD or BVMPO for the example of how to use ReST. In the paper they show that the highest reward was achieved using BC.
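[Editor's note] For reference, the Improve-step objective in that case reduces to the standard NLL over the reward-filtered data, something along these lines (a paraphrase, not a formula copied from the paper):

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{filtered}}}\left[\sum_{t} \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right)\right]
$$

where $\mathcal{D}_{\text{filtered}}$ contains the generations whose reward-model score passes the selected threshold.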
@metric-space @gaetanlop @kashif Yes, I mean that simple BC with the NLL loss suffices to achieve comparable performance according to the paper. I have some thoughts to bring up for discussion: 1) instead of creating a step() function in the trainer, could we create a brand-new Trainer instance at each Improve step, build the new combined dataset, and train the LM on it? 2) The original paper filters the combined dataset (original + generated samples) with reward-model scores; why not filter only the generated samples based on reward-model scores and keep the original training dataset intact? I summarize my suggestions in this text diff based on example/rest.py: https://www.diffchecker.com/RUD0YB6C/ (left is the original rest.py, right is the updated rest.py).
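[Editor's sketch] A rough sketch of the loop as described in this suggestion: generate from the current policy, score the generations with a reward model, keep only the top fraction of the generated samples (leaving the original data intact), and train with a fresh Trainer instance at every Improve step. The model and reward-model identifiers, the 50% keep fraction, and all hyperparameters are placeholders; this is not the script in the PR.

```python
# Grow/Improve loop sketch: generate -> score with a reward model ->
# keep top generations -> retrain with a fresh Trainer per Improve step.
import torch
from datasets import Dataset, concatenate_datasets
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments,
                          DataCollatorForLanguageModeling)

policy_name = "facebook/opt-350m"                       # base model mentioned earlier in the thread
reward_name = "path-or-id-of-a-seq-classification-reward-model"  # placeholder
tok = AutoTokenizer.from_pretrained(policy_name)
policy = AutoModelForCausalLM.from_pretrained(policy_name)
reward_tok = AutoTokenizer.from_pretrained(reward_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)

prompts = ["..."]                                        # Grow-step prompts (placeholder)
original_sft = Dataset.from_dict({"text": ["..."]})      # original training data, kept intact

def grow(prompts, num_return_sequences=4):
    """Grow step: sample completions from the current policy."""
    texts = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        out = policy.generate(**inputs, do_sample=True, max_new_tokens=128,
                              num_return_sequences=num_return_sequences,
                              pad_token_id=tok.eos_token_id)
        texts.extend(tok.batch_decode(out, skip_special_tokens=True))
    return texts

def score(texts):
    """Reward-model score for each generated text."""
    with torch.no_grad():
        batch = reward_tok(texts, return_tensors="pt", padding=True, truncation=True)
        return reward_model(**batch).logits.squeeze(-1).tolist()

for improve_step in range(3):
    generations = grow(prompts)
    rewards = score(generations)
    # Keep the top half of the *generated* samples only.
    keep = sorted(zip(rewards, generations), reverse=True)[: len(generations) // 2]
    filtered = Dataset.from_dict({"text": [t for _, t in keep]})
    train_ds = concatenate_datasets([original_sft, filtered])
    tokenized = train_ds.map(lambda ex: tok(ex["text"], truncation=True), remove_columns=["text"])
    # Fresh Trainer instance per Improve step, as suggested above.
    trainer = Trainer(
        model=policy,
        args=TrainingArguments(output_dir=f"rest_step_{improve_step}", num_train_epochs=1,
                               per_device_train_batch_size=2, report_to=[]),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()
```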
Keeping the previous samples would be a good idea. I noticed that the Llama 2 authors conducted rejection sampling for the first several iterations and observed the following issue (see around page 14 of the Llama 2 paper): "In earlier versions of our model, up to RLHF V3, our approach was to confine answer selection solely to the 'bag' of samples gathered from the preceding iteration. For example, RLHF V3 was trained using only samples from RLHF V2. However, despite continuous improvement, this method led to a regression in some capabilities. For example, RLHF V3 struggled more than previous versions to compose rhyming lines in poems, as discerned through qualitative analysis, suggesting that further investigation into the causes of and mitigations for forgetting (Kirkpatrick et al., 2017; Nguyen et al., 2019; Ramasesh et al., 2021) could be a fruitful area for additional future research. In response, on subsequent iterations, we modified our strategy, incorporating top-performing samples from all prior iterations, such as those used in RLHF-V1 and RLHF-V2. Although we do not present specific figures, this adjustment demonstrated considerable enhancements in performance and effectively addressed the previously noted issues. This mitigation can be seen as analogous to Synnaeve et al. (2019) and Vinyals et al. (2019) in the RL literature." In the implementation, we may also provide an option so that the user can combine the samples from the current iteration with those kept from previous iterations. What do you think?
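[Editor's sketch] One way such an option could look, as a small modification of the loop above: maintain a pool of scored samples across all iterations and select the top fraction from the pool instead of only from the latest Grow step. The function and flag names are hypothetical.

```python
# Hypothetical option: keep top-scoring samples from all prior iterations
# (as the Llama 2 report suggests) instead of only the latest Grow step.
sample_pool = []  # persists across Improve steps


def select_training_samples(new_generations, new_rewards,
                            keep_fraction=0.5, use_all_iterations=True):
    scored = list(zip(new_rewards, new_generations))
    if use_all_iterations:
        sample_pool.extend(scored)       # accumulate across iterations
        candidates = sample_pool
    else:
        candidates = scored              # current iteration only
    candidates = sorted(candidates, key=lambda x: x[0], reverse=True)
    n_keep = max(1, int(len(candidates) * keep_fraction))
    return [text for _, text in candidates[:n_keep]]
```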
Dear collaborators,

I am writing to propose an optimization for the training process of our reinforcement learning agent. Currently, we generate a full reward dataset and then keep only a percentage of good responses as the training set. However, this is inefficient because we recalculate forward passes during loss computation that could be avoided. Instead, I suggest that we save computation graphs for each reward dataset example while generating the dataset. Then, we can prune the graph to only include operations needed for the successful examples that will be used for training. This should improve efficiency by preventing redundant loss calculation on unused examples.

Please let me know if you would like to discuss this further. I believe this change could meaningfully improve training time. Thank you for considering this suggestion.

Best regards,
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Let's keep it open :)
RAFT: https://arxiv.org/pdf/2304.06767.pdf
ReST: https://arxiv.org/pdf/2308.08998.pdf
Please let me know if anyone wants to pair up to complete this.
Some suggestions from the RAFT author:
The implementations of ReST and RAFT basically follow the same line, so they can be implemented in a unified manner. For the implementation details:
1. implement an inference (generation) script;
2. implement a reward computation script.
Then ReST/RAFT = alternate among inference, reward computation, and SFT, driven by a main script (see the sketch below).
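[Editor's sketch] A minimal sketch of what that main driver could look like: it alternates between an inference script, a reward-computation script, and an SFT run. The script names, flags, and file formats below are placeholders, not existing trl entry points.

```python
# Hypothetical main script: alternate inference -> reward computation -> SFT.
import subprocess

NUM_ITERATIONS = 3
model_path = "facebook/opt-350m"  # starting checkpoint (placeholder)

for it in range(NUM_ITERATIONS):
    gen_file = f"generations_{it}.jsonl"
    scored_file = f"scored_{it}.jsonl"
    next_model = f"checkpoints/iter_{it}"

    # 1. Inference: sample completions from the current model.
    subprocess.run(["python", "inference.py", "--model", model_path,
                    "--prompts", "prompts.jsonl", "--output", gen_file], check=True)

    # 2. Reward computation: score the generations and keep the top fraction.
    subprocess.run(["python", "compute_rewards.py", "--input", gen_file,
                    "--reward-model", "reward-model-path", "--top-fraction", "0.5",
                    "--output", scored_file], check=True)

    # 3. SFT on the filtered samples; the result becomes the next iteration's model.
    subprocess.run(["python", "sft.py", "--model", model_path,
                    "--dataset", scored_file, "--output-dir", next_model], check=True)

    model_path = next_model
```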