Batch size affecting output. #2401
Comments
It is possible to get slightly different results. Could you share more details on which evaluation script you are running and which model/configuration you are using? |
I'm having the same issue, but with XLM-R. I decided to write a simple script to demonstrate the difference between encoding sentences individually and encoding them in a batch:
Script Output:
For BERT the results don't change that much... But for XLM-R the results are shockingly different! Am I missing something? |
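Since the script itself isn't shown above, here is a minimal sketch of such a comparison (my reconstruction, not the original poster's code); the checkpoint names and sentences are illustrative, and swapping "bert-base-uncased" for "xlm-roberta-base" covers the XLM-R case:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # or "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentences = ["The cat sat on the mat.", "Batch size should not matter."]

with torch.no_grad():
    # Encode each sentence on its own (batch size 1) and keep the first token's embedding.
    single = [
        model(**tokenizer(s, return_tensors="pt")).last_hidden_state[0, 0]
        for s in sentences
    ]
    # Encode all sentences together (batch size 2), padding to a common length.
    batched = model(**tokenizer(sentences, return_tensors="pt", padding=True)).last_hidden_state[:, 0]

for i in range(len(sentences)):
    print(f"sentence {i}: max |single - batched| = {(single[i] - batched[i]).abs().max().item():.2e}")
```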
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
unstale |
I think I'm getting a similar issue. I'm using DistilBERT in this case, but depending on the batch size, I see different outputs. The differences are slight, but confusing nonetheless. It seems like the difference appears once the batch size goes beyond 3: all batch sizes beyond 3 are identical to each other, but batch sizes <=3 and >3 give different results. My example:
Outputs
Anyone have thoughts here? This causes some confusion when I run an individual sample through the model, as it's not the same as if I run it with 3 other samples. |
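Along the same lines, a minimal sketch (my addition, using distilbert-base-uncased and dummy token ids rather than the commenter's data) that runs one fixed sequence alone and then inside progressively larger batches:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()

torch.manual_seed(0)
base = torch.randint(1000, 2000, (1, 8))    # one fixed sequence of token ids
filler = torch.randint(1000, 2000, (7, 8))  # extra sequences of the same length

with torch.no_grad():
    reference = model(base).last_hidden_state[0]  # the sequence on its own (batch size 1)
    for batch_size in (2, 3, 4, 8):
        batch = torch.cat([base, filler[: batch_size - 1]], dim=0)
        out = model(batch).last_hidden_state[0]  # same sequence, now inside a batch
        print(batch_size, (out - reference).abs().max().item())
```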
Also experienced the same issue using BertForPreTraining. This doesn't make sense to me: there's no component in BERT that depends on the batch size. Things like BatchNorm in training mode output different results when the batch size changes, but no such component exists in BERT AFAIK. Anything I missed? |
I'm also facing this issue. BERT returns different embeddings if I change the batch size. This happens only in train() mode. Did anyone figure out the reason? |
Same problem over here, any thoughts about it? |
I'm having the same issue with BERT: slightly different outputs while only changing the batch size. It's driving me crazy because I don't understand where the mistake is. |
I'm not working with BERT, but I also see this phenomenon on a transformer I am working on. |
Deleted, there is a bug 😂 |
Having the same issue with the T5 model. |
I'm seeing similar issues on a fine-tuned distilbert-base-uncased model; sometimes the norm of the difference between tensors goes up to 0.2, which seems huge to me (for semantic search applications it means hundreds of items would move around in the ranking depending on the size of the batch used to compute the embeddings). |
Having the same issue. Any update? |
Met the same issue. In transformers/models/roberta/modeling_roberta.py, in RobertaEncoder, calling it one way gives results that differ from the other version; the only difference between the two versions for me is the input batch size. I wonder if this is a bug in Hugging Face? |
Having the same issue with the BART model. |
Hi! @osanseviero this is the bug I mentioned to you at Khipu. I can reproduce the behavior using @bpben's code with transformers 4.27.1 and torch 2.0.0 on an RTX 3090 GPU. At least for me, it results in inconsistent generations for models such as Flan-T5 XL, although I haven't been able to reproduce it with a minimal enough example. Nevertheless, the issue opened by @infinitylogesh mentioning this one shows that more people are struggling with it. Let me know if I should open a new issue for this. |
The issue still exists, any solutions, @gsarti? |
I don't fully understand it yet, but it is not a Hugging Face issue. It seems that PyTorch's matrix multiplication (used inside the linear layers) already returns different results for batched inputs in combination with a transpose, both on a Ryzen 5 2500U and on Colab:

```python
import torch

x = torch.randn(3, 4)
y = torch.randn(5, 4).t()
torch.set_printoptions(precision=10)

# batch size 1
print(x[0].matmul(y))
# batch size 3, but only printing the first row
print(x.matmul(y)[0])
# element-wise comparison of batch size 1 with the first row of the batched result
print(x[0].matmul(y) == x.matmul(y)[0])
```

Output:
All comparisons with batch sizes > 1 return identical results on the Ryzen 5 2500U, but not on Colab:

```python
# comparing batch size 2 with the first two rows of the batched result
print(x[0:2].matmul(y) == x.matmul(y)[0:2])
```

Output:
Maybe I have made a mistake, because on the Ryzen 5 2500U I only get different results when I also use a transpose; without it (as below) the results match there, while they still differ on Colab:

```python
import torch

x = torch.randn(3, 4)
y = torch.randn(4, 5)
torch.set_printoptions(precision=10)

# batch size 1
print(x[0].matmul(y))
# batch size 3, but only printing the first row
print(x.matmul(y)[0])
# element-wise comparison of batch size 1 with the first row of the batched result
print(x[0].matmul(y) == x.matmul(y)[0])
```

Output:
Can someone please check my logic? I don't understand how the transpose can have such an effect, and I am afraid I have made a mistake. Otherwise, it looks like the hardware's matrix multiplication implementation is the root cause of the differences we get. This paper, even though it didn't pass review, also seems to point in that direction: it investigated this issue for cuBLAS (CUDA matrix multiplication). |
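As a quick sanity check (my addition, not from the original comment): if the batched-vs-single-row discrepancy is floating-point reduction-order noise, it should typically shrink by orders of magnitude when the same multiplication is repeated in float64. The tensors below mirror the snippet above.

```python
import torch

torch.manual_seed(0)
x = torch.randn(3, 4)
y = torch.randn(5, 4).t()

for dtype in (torch.float32, torch.float64):
    xd, yd = x.to(dtype), y.to(dtype)
    row_by_row = xd[0].matmul(yd)   # "batch size 1": multiply a single row
    full_batch = xd.matmul(yd)[0]   # same row taken from the batched product
    # Any nonzero gap here comes purely from the order of floating-point operations.
    print(dtype, (row_by_row - full_batch).abs().max().item())
```

Whether the float32 gap is nonzero at all depends on the BLAS backend and the CPU/GPU, which matches the hardware-dependent observations above.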
I believe I may also be experiencing this issue. Changing the contents of a batch, even if its size remains static, changes the resulting embedding for the same text. |
Also got a difference when encoding the column in one batch, but no difference when using the apply method on the Series:

```python
import pandas as pd

s = [str(i**2) for i in range(10)]
df = pd.DataFrame({"num": s})       # DataFrame with the "num" column referenced below
embed1 = encoder(df["num"])         # encoder is the commenter's embedding function (not shown)
embed2 = df["num"].apply(encoder)   # encode the values one at a time
```

Method 1 difference: 3.997695879885389e-07 |
I have the same issue; are there any updates? |
Same issue here with LLaMA, though the difference is tiny. The logits keep changing when setting different batch sizes. |
@dts1fl @a514514772 Very tiny numerical differences are expected when varying batch sizes. Differences on the order of 1e-7 can easily be seen just by using a different GPU, or even from slight differences in the running environment. For context, when we port models into the library, a difference of 1e-5 between the original and the ported model is considered acceptable. Differences of this order shouldn't manifest as very large differences in the output, e.g. in generated sequences. |
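A minimal sketch (my addition) of comparing outputs with a tolerance rather than exact equality, in line with the ~1e-5 porting threshold mentioned above; the tensors here are stand-ins for logits produced at different batch sizes.

```python
import torch

torch.manual_seed(0)
logits_bs1 = torch.randn(3, 50257)                      # stand-in for logits at batch size 1
logits_bs4 = logits_bs1 + 1e-7 * torch.randn(3, 50257)  # simulated batched run with tiny noise

print("max abs diff:", (logits_bs1 - logits_bs4).abs().max().item())
print("within 1e-5 tolerance:", torch.allclose(logits_bs1, logits_bs4, atol=1e-5))
```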
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m", trust_remote_code=True)

input1 = [
    [2000, 3000, 4000],
]
input2 = [
    [2000, 3000, 4000],
    [5000, 6000, 7000],
]
input3 = [
    [2000, 3000, 4000],
    [5000, 6000, 7000],
    [8000, 9000, 10000],
]

model.eval()
output1 = model(torch.tensor(input1)).logits
output2 = model(torch.tensor(input2)).logits
output3 = model(torch.tensor(input3)).logits

print(torch.max(torch.abs(output1[0] - output2[0])))  # tensor(0.0005, grad_fn=<MaxBackward1>)
print(torch.max(torch.abs(output2[0] - output3[0])))  # tensor(0., grad_fn=<MaxBackward1>)
```

I personally get differences of up to 5e-4 on logits. I tend to agree that in practice this shouldn't be a problem (and yet, has anyone quantified it?), but still, it's very surprising and deserves to be investigated. |
Even more surprising: using CUDA, I can get a difference greater than 1e-3 (between batch sizes 2 and 3 this time):

```python
model.to("cuda")  # the model from the previous snippet, moved to the GPU

output1 = model(torch.tensor(input1, device="cuda")).logits
output2 = model(torch.tensor(input2, device="cuda")).logits
output3 = model(torch.tensor(input3, device="cuda")).logits

print(torch.max(torch.abs(output1[0] - output2[0])))  # tensor(0., device='cuda:0', grad_fn=<MaxBackward1>)
print(torch.max(torch.abs(output2[0] - output3[0])))  # tensor(0.0015, device='cuda:0', grad_fn=<MaxBackward1>)
```

GPU: NVIDIA H100 80GB |
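For readers hitting larger GPU-side gaps, a hedged sketch of standard PyTorch knobs that can influence these numerics (none of the flags below comes from the thread, and none guarantees batch-size invariance, since reduction order can still change with the batch shape):

```python
import torch

# TF32 matmuls on Ampere/Hopper GPUs trade precision for speed; disabling them
# can shrink, though not eliminate, cross-batch-size differences.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

# Prefer deterministic kernels where they exist (this targets run-to-run
# reproducibility, not batch-size invariance); warn_only avoids hard errors
# for ops without a deterministic implementation.
torch.use_deterministic_algorithms(True, warn_only=True)
print(torch.are_deterministic_algorithms_enabled())
```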
@qgallouedec Thanks for looking into this and sharing these results! Are differences of this order also observed for non-remote models? |
By "non-remote", do you mean: config = AutoConfig.from_pretrained("EleutherAI/pythia-160m", trust_remote_code=True)
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True) ? Is so the diff seems smaller: print(torch.max(torch.abs(output1[0] - output2[0]))) # tensor(0., device='cuda:0', grad_fn=<MaxBackward1>)
print(torch.max(torch.abs(output2[0] - output3[0]))) # tensor(2.7418e-06, device='cuda:0', grad_fn=<MaxBackward1>) I've double-checked with other inputs. |
@qgallouedec Sorry, I wasn't clear. In this case, for this checkpoint, I can see it uses code from transformers itself. The reason I ask is that there might be many reasons for differences, but if the architecture definition lives on the Hub, then there isn't much we can do apart from raising a discussion on the checkpoint page itself.
Great! |
My bad, that was a misleading term on my part. So yes, I can see the same diff (around 1e-6) with other models as well. Based on #2401 (comment), this PyTorch discussion, and this other PyTorch discussion, we can conclude that any difference under roughly 1e-6 is "expected numerical noise". However, this noise can accumulate, and we can end up with potentially annoying differences, as in #2401 (comment). I'm not sure there's anything to do but keep that in mind. |
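To see where such accumulated differences build up, a follow-up sketch (my addition, reusing the pythia-160m example from earlier in the thread) compares per-layer hidden states for the same sequence run at batch size 1 and batch size 2:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")
model.eval()

ids = torch.tensor([[2000, 3000, 4000], [5000, 6000, 7000]])

with torch.no_grad():
    h1 = model(ids[:1], output_hidden_states=True).hidden_states  # batch size 1
    h2 = model(ids, output_hidden_states=True).hidden_states      # batch size 2

# Print the max absolute gap for the shared first sequence at every layer,
# which shows how the batch-size-dependent noise evolves with depth.
for layer, (a, b) in enumerate(zip(h1, h2)):
    print(layer, (a[0] - b[0]).abs().max().item())
```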
❓ Questions & Help
When running evaluation, why am I getting slightly different output with a batch size of 1 compared to a batch size greater than 1?