rewardbench.py results are different for different batch sizes for beaver-7b #137
Comments
This matters because sometimes when we're ranking two (prompt, response) pairs, the output difference is large enough that the ordinal preference changes. Thank you. |
Hey @andrewsiah can you confirm this works for other models? We want to make sure this is isolated to this model. When it comes to the beaver models, the only code I added is actually https://github.com/allenai/reward-bench/blob/e59cf242c316f18f73d77568653f56e99255658e/rewardbench/models/beaver.py#L482C1-L504C35 The rest is copied directly from the safe RLHF repo. https://github.com/PKU-Alignment/safe-rlhf |
To be clear, the beaver models aren't designed to be used at inference like this, so if it's isolated to this model it is interesting but not surprising to me. They're designed to be used for a training signal, so maybe there is some uncertainty built in at inference. |
Documenting my results:
May be an issue with accelerate.prepare() |
Note this is a different model. I'm guessing this is related to padding, where different reward models handle pad tokens differently. |
This is for running:
The results are:
|
This is for batch_size = [1, 2, 3]. Results are different as well. |
I think it might not be due to accelerate, because I wrapped your pipeline in our code without using accelerate (I wrote a custom multiprocess pipeline), and the reward difference is still there. |
Unless model_pipeline from transformers uses accelerate internally? I'm unsure. |
I think it's padding. I'm testing with
The changes (pushing to #138):
Which makes sense, as we already needed this to make models score correctly.
Results with change:
Results without change:
Seems like that's not it, but trying on one of the models you just shared. |
Yeah padding changes didn't help |
I really think it is padding-related, but it's hard to spin up a minimal example. When reward models are trained, they're getting tokens in a very different manner than we are handing them now. So we need to make sure that inference is invariant under the addition of pad tokens (and it wouldn't be that surprising if it isn't). |
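A minimal sketch of that kind of pad-token invariance check, using a placeholder reward model (the model id and prompt here are only examples, not the exact setup from this thread): score one sequence as-is, score it again padded out to a fixed length with the attention mask set accordingly, and compare.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder reward model -- any sequence-classification RM from the leaderboard
# could be substituted here; this is not the exact setup used in reward-bench.
model_id = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "Question: What is 2 + 2?\n\nAnswer: 4."

# One encoding with no padding, one padded out to a fixed length.
plain = tokenizer(text, return_tensors="pt")
padded = tokenizer(text, return_tensors="pt", padding="max_length", max_length=64)

with torch.no_grad():
    score_plain = model(**plain).logits.item()
    score_padded = model(**padded).logits.item()

# If inference were invariant to pad tokens (the attention mask is supposed to
# guarantee this), the two scores would match exactly.
print(score_plain, score_padded, abs(score_plain - score_padded))
```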
Trying the simple thing, |
Better but not completely there? |
Intuitively, given that the models are trained in different ways, padding = False with truncation = left is the most sensible setting, but there is still variance. |
Note: this is both for models with a chat template in the tokenizer and for fastchat chat templates, so that's (most likely) not the cause. |
@ValentinaPy is checking how this impacts benchmark scores. |
Relevant to padding: AutoModelForSequenceClassification takes the score of the first padding index. Left padding shouldn't work then. https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1376 Here are some examples of one reward model, no padding, with different batches.
The last number is particularly damning: different numbers for the same input. |
Could this be a dropout issue? You can try
|
I did set model.eval(), which should disable dropout? Granted, I didn't test it on the side. |
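A quick way to test that on the side, as a sketch with the same placeholder model as above: run the identical input through the model twice after .eval() and check the logits are bit-identical. Active dropout (or any other stochastic op) would show up immediately.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Same placeholder reward model as in the earlier sketch; swap in the model under test.
model_id = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # puts dropout layers in eval mode

inputs = tokenizer("Same input, scored twice in a row.", return_tensors="pt")
with torch.no_grad():
    out1 = model(**inputs).logits
    out2 = model(**inputs).logits

# If dropout were still active after .eval(), these two forward passes would differ.
print(torch.equal(out1, out2))
```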
Update: We've learned that Deberta models do not work with batch sizes >1. We've confirmed that my new pipeline works deterministically for pythia RMs. E.g. when passing in the same prompt 5 times.
And on Deberta
|
also for run_rm.py (but with deberta ahahah) |
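The repeated-prompt check is easy to reproduce as a sketch (placeholder model again, not the pipeline from this repo): batch the identical prompt five times and verify all five scores agree; a deterministic model returns five identical numbers.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder reward model; substitute a pythia RM or a DeBERTa RM to compare behavior.
model_id = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

prompt = "Question: What is the capital of France?\n\nAnswer: Paris."
batch = tokenizer([prompt] * 5, return_tensors="pt", padding=True)

with torch.no_grad():
    scores = model(**batch).logits.squeeze(-1)

# All five rows are the same input, so all five scores should be identical.
print(scores)
print(bool((scores == scores[0]).all()))
```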
@andrewsiah the difference is definitely due to something with batching. I set up things like this with different configurations, and all of them have slightly different results in a batch >1 than alone.
No configuration of padding/truncation I've used makes them identical. Will try a couple more models. Some are pretty close, e.g. with the beaver reward model with padding to max length
|
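For reference, a minimal version of that solo-vs-batched comparison (with a placeholder RM and made-up prompts, not the exact configurations tried above) looks roughly like this; in exact arithmetic every diff should be zero.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder reward model id; the comparison is the same for any RM in the thread.
model_id = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

texts = [
    "Question: What is the capital of France?\n\nAnswer: Paris.",
    "Question: What is the capital of France?\n\nAnswer: I'm not sure.",
    "Question: Name a prime number.\n\nAnswer: 7 is a prime number.",
]

with torch.no_grad():
    # Batch size 1: no padding is ever added, each sequence is seen "clean".
    solo = [model(**tokenizer(t, return_tensors="pt")).logits.item() for t in texts]
    # Batch size 3: shorter sequences are padded up to the longest one.
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    batched = model(**batch).logits.squeeze(-1).tolist()

for s, b in zip(solo, batched):
    print(f"solo={s:.6f}  batched={b:.6f}  diff={abs(s - b):.2e}")
```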
Thanks for working on this publicly! I suspect it might be one of the configs we passed into model_kwargs, or something in how pipeline_builder is initialized. One possible debugging idea is to use the same prompts across different batch_size values (i.e., what you did), then print out the tokens (after the tokenizer) that are passed into the model (by editing the pipeline code in our local lib with a print somewhere). That would isolate whether the problem is before model.forward (tokenizer) or inside model.forward (model setup/config). |
@andrewsiah I don't think it's the config, because it happens on multiple types of models and I'm hardcoding (and have checked within the pipeline a bit), but yes, I'm looking at the tokenizer now. |
The tokenizer looks like a promising source of the issue.
2 is the padding token (also the EOS token index); all sequences should have one of those at the end of the sequence to then predict the reward. 1 is the BOS token, which every sequence should have. 28723 is ., which makes sense. Examples 3 and 4 seem weirdly truncated to 512 tokens. Checking. The examples are:
|
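One way to surface this kind of silent truncation is to tokenize an over-length example at a few max_length values and look at where the ids get cut. A sketch (the tokenizer id is one of the models mentioned later in the thread, and the text is synthetic):

```python
from transformers import AutoTokenizer

# Tokenizer from an RM referenced later in the thread; any RM tokenizer with a
# 512-token default would show the same effect.
tokenizer = AutoTokenizer.from_pretrained("weqweasdas/RM-Mistral-7B")

long_text = "word " * 1000  # stand-in for a (prompt, response) pair over 512 tokens

for max_length in (512, 1024, 2048):
    enc = tokenizer(long_text, truncation=True, max_length=max_length)
    # Print the length actually handed to the model and the last few token ids.
    print(max_length, len(enc["input_ids"]), enc["input_ids"][-3:])
```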
Ah yeah, in our pipeline we also increased max_token_length to a higher number, because some of the (prompt, response) pairs sum to more than 512 tokens. |
@andrewsiah with |
ahh, and the truncation will be passed on to the reward model? |
the truncation only ever applies to the tokenizer for things like this. Models will error if given tokens of incorrect size. |
Lol, bad news: changing the padding of a single input with the correct attention mask changes the outputs. |
Wouldn't setting |
For many of these models, re: truncation, maybe; we should try it @andrewsiah. Much more deterministic. |
Also @andrewsiah the RLHFFLOW RM you sent cannot be applied in this case. It has 10 classes and outputs 10 logits, which isn't suited to the RB use case. Not sure how averaging was happening, but that one seems out of scope. |
Potentially related: huggingface/transformers#2401, huggingface/transformers#25921, huggingface/transformers#31267
I tested FP32 and FP16 with this model, https://huggingface.co/weqweasdas/RM-Mistral-7B, which has the correct configuration, and the logit differences are pretty minor (~1%). This is not great, but at this level of investigation it seems like this is a fundamental implementation problem with most reward models and not specific to reward bench. Truncation and tokenization aside (there may be minor issues there), I think they're not the root cause we are looking at. I would put truncation in a separate issue from the "numerical weirdness." |
My debugging code:
|
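The debugging script itself isn't reproduced above; a minimal sketch of this kind of fp32-vs-fp16 comparison on the model linked in that comment (an assumed approach, not the code that produced the ~1% figure; it assumes the model loads as a standard AutoModelForSequenceClassification and that accelerate is installed for device_map) might look like:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "weqweasdas/RM-Mistral-7B"  # model linked in the comment above
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Question: What is 2 + 2?\n\nAnswer: 4.", return_tensors="pt")

scores = {}
for dtype in (torch.float32, torch.float16):
    # Reload the model in each precision and score the same input.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, torch_dtype=dtype, device_map="auto"
    ).eval()
    with torch.no_grad():
        batch = {k: v.to(model.device) for k, v in inputs.items()}
        scores[str(dtype)] = model(**batch).logits.item()
    del model
    torch.cuda.empty_cache()

print(scores)  # compare the relative difference between the two dtypes
```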
Thanks for sharing the above. Is your finding saying that switching between fp32 and fp16 is interrelated with the batch-size issue, or that the output score changes but it isn't related? |
@andrewsiah I think it's all related to the underlying handling of compute + weird positional embeddings. Better RMs seem to have less variance. |
Ahh, thanks for that. |
Seems like this is expected behavior. |
Hello, I have a question here about training a reward model. In the code from transformers, the code takes the first pad_token (or eos_token, if they are the same). However, when we set tokenizer.padding_side="left", the tokenizer will pad all sentences with pad_token from the left. I guess the padded input_ids may look as follows:
It will find the position of the first PAD_TOKEN. However, when we use batch_size=2 and left padding in reward model training, the sequence_lengths is wrong, and therefore the position is wrong. Then does the training fail? In the PPOv2 trainer from trl, we set the tokenizer with left padding. https://github.com/huggingface/trl/blob/b68ff96f0c74368961e194081e122959cd1f4d4d/examples/scripts/ppo/ppo_tldr.py#L57 |
I got it. If the first token is PAD, then argmax returns 0, so the position becomes -1, which indexes the last token. Brilliant code! |
|
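A toy illustration of the index arithmetic being discussed (paraphrased from the transformers sequence-classification pooling with made-up token ids, not the exact library source): with left padding, the first pad token sits at position 0, so argmax gives 0, minus one gives -1, and the modulo wraps that to the last position, which is the last real token.

```python
import torch

pad_token_id = 2  # same pad/EOS id as in the tokenizer output discussed earlier

input_ids = torch.tensor([
    [2, 2, 1, 15, 16, 17],    # left-padded sequence
    [1, 15, 16, 17, 18, 19],  # unpadded sequence (no pad token at all)
])

# Index of the first pad token, minus one ...
sequence_lengths = torch.eq(input_ids, pad_token_id).int().argmax(-1) - 1
# ... wrapped by the sequence length, so -1 becomes "last position".
sequence_lengths = sequence_lengths % input_ids.shape[-1]
print(sequence_lengths)  # tensor([5, 5]) -> last token of each row

# The pooled score is the logit at that position for each row.
logits = torch.randn(2, input_ids.shape[-1], 1)
pooled = logits[torch.arange(2), sequence_lengths]
print(pooled.shape)  # torch.Size([2, 1])
```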
Thank you for the great work on rewardbench, as it's been super helpful in evaluating/researching reward models.
I've been wrapping your rewardbench.py code to run the reward models published on the leaderboard.
I noticed, however, that my reward scores are different when my batch sizes are different; the only difference between the runs below is batch_size = 1, 2, 3. E.g., running:
rewardbench --model=PKU-Alignment/beaver-7b-v1.0-cost --dataset=allenai/ultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw --save_all --batch_size=1
rewardbench --model=PKU-Alignment/beaver-7b-v1.0-cost --dataset=allenai/ultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw --save_all --batch_size=2
rewardbench --model=PKU-Alignment/beaver-7b-v1.0-cost --dataset=allenai/ultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw --save_all --batch_size=3
results in these different scores:
Please help; I've been trying to debug, and I think it has to do with the model pipeline itself? I tracked it to make sure the texts that go in are the same, but when the batch sizes are different, the output scores are different.
Does it have to do with padding or truncation?
I made sure the max_length is the same.
@natolambert