RewardTrainer Producing Same Reward Score During and After Training #2265

deepakpandita57 · 2024-10-24T01:24:15Z

I'm training a reward model with the TRL library's RewardTrainer, following the example from the trl repo. However, the model consistently produces the same reward score, both during training and in evaluation afterwards. Here are the relevant details of my setup:

Model and Setup:

Model Name: meta-llama/Meta-Llama-3-8B-Instruct
Training Epochs: 3
Batch Size: 8 (for both training and evaluation)
Learning Rate: 1e-5
Max Length: 1024
Dataset: CarperAI/openai_summarize_comparisons
Training Set Size: started with a sample of 1,000 training examples and gradually increased to 20,000; validated on 100 examples as a quick test
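
To make the scale of these runs concrete, here is a rough sketch that collects the settings above and derives the number of optimizer steps. `RunSettings` and `total_steps` are hypothetical names used only for illustration, not TRL classes, and the step count assumes a single device, no gradient accumulation, and that a partial final batch is kept.

```python
import math
from dataclasses import dataclass


@dataclass
class RunSettings:
    """Hypothetical container mirroring the setup listed above (not a TRL class)."""
    model_name: str = "meta-llama/Meta-Llama-3-8B-Instruct"
    dataset_name: str = "CarperAI/openai_summarize_comparisons"
    num_train_epochs: int = 3
    batch_size: int = 8
    learning_rate: float = 1e-5
    max_length: int = 1024

    def total_steps(self, num_examples: int) -> int:
        # Assumes one device and no gradient accumulation,
        # so one batch corresponds to one optimizer step.
        steps_per_epoch = math.ceil(num_examples / self.batch_size)
        return steps_per_epoch * self.num_train_epochs


settings = RunSettings()
small_run_steps = settings.total_steps(1_000)   # 125 steps per epoch, 3 epochs
large_run_steps = settings.total_steps(20_000)  # 2,500 steps per epoch, 3 epochs
print(small_run_steps, large_run_steps)
```

Under these assumptions the smallest run takes only a few hundred updates at a learning rate of 1e-5, while the 20,000-example run takes several thousand.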

Observations:

The reward scores are identical for the training and validation data points, indicating no learning is occurring.
The metrics during training do not show any improvement over epochs.
I’ve monitored the training and validation loss, and both remain stagnant throughout. With some settings (with and without PEFT), the training loss does change, and trainer.evaluate() shows different logits for the chosen and rejected text, which suggests the model is learning. However, the model still outputs the same score for the chosen and rejected text of a given pair, although the score sometimes differs across data points.
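
The observations above can be checked numerically with a small sketch. It assumes the usual pairwise (Bradley-Terry style) objective, `-log(sigmoid(chosen - rejected))`; under that assumption, identical chosen and rejected scores give a loss of exactly ln 2 (about 0.693) per pair, no matter how the scores vary across data points. The helper names `pairwise_loss` and `summarize` and the example scores are made up for illustration.

```python
import math


def pairwise_loss(chosen: float, rejected: float) -> float:
    """Pairwise loss -log(sigmoid(chosen - rejected)) for one comparison."""
    margin = chosen - rejected
    # Equivalent to log(1 + exp(-margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def summarize(pairs, tol=1e-6):
    """Summarize a list of (chosen_score, rejected_score) tuples."""
    n = len(pairs)
    tied = sum(1 for c, r in pairs if abs(c - r) <= tol)
    correct = sum(1 for c, r in pairs if c - r > tol)
    mean_loss = sum(pairwise_loss(c, r) for c, r in pairs) / n
    return {
        "tied_fraction": tied / n,   # pairs where both texts get the same score
        "accuracy": correct / n,     # pairs where chosen scores strictly higher
        "mean_loss": mean_loss,
    }


# Made-up scores: equal within each pair, different across data points.
stuck_pairs = [(0.31, 0.31), (-1.20, -1.20), (2.00, 2.00)]
report = summarize(stuck_pairs)
print(report)
```

If the logged loss sits near 0.693 and `tied_fraction` is close to 1.0 on real model outputs, the two scores in each pair are effectively identical, which matches the behaviour described above.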

Questions:

What could be the possible reasons for the model not learning effectively?
Are there specific hyperparameters or configurations in the TRL library that I should consider adjusting?
Could there be any issues with how I’m implementing the reward model or the training process?

I understand you may need more details to pinpoint the issue, and I'm happy to provide whatever is needed.

Any insights or suggestions would be greatly appreciated!
Thank you!
