Batch size affecting output. #2401

Closed
eriher opened this issue Jan 4, 2020 · 30 comments

@eriher

eriher commented Jan 4, 2020

❓ Questions & Help

When running evaluation, why am I getting slightly different outputs with a batch size of 1 compared to a batch size greater than 1?

@eriher eriher changed the title Batch size affecting embeddings. Batch size affecting output. Jan 4, 2020
@NaxAlpha

NaxAlpha commented Jan 5, 2020

It is possible to get slightly different results. Could you share more details on which evaluation script you are running, and for which model/configuration, etc.?
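
In the meantime, a minimal sketch along these lines can help narrow it down (the checkpoint name here is just a placeholder, and it assumes a recent transformers version whose outputs expose last_hidden_state): encode one sentence alone, encode it again inside a padded batch, and compare only the real token positions.

import torch
from transformers import AutoModel, AutoTokenizer

# placeholder checkpoint; substitute the model you are evaluating
name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

texts = ["hello world!", "a second, longer sentence that forces padding"]
with torch.no_grad():
    # batch size 1: encode the first sentence on its own
    single = model(**tokenizer(texts[:1], return_tensors="pt")).last_hidden_state[0]
    # batch size 2: encode the same sentence inside a padded batch
    batched = model(**tokenizer(texts, padding=True, return_tensors="pt")).last_hidden_state[0]

# compare only the real (non-padding) positions of the first sentence
n = single.shape[0]
print((single - batched[:n]).abs().max())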

@ricardorei

ricardorei commented Jan 9, 2020

I'm having the same issue, but with XLM-R.

I decided to write a simple script to demonstrate the difference between encoding individually and encoding with a batch:

import torch
from torchnlp.encoders.text import stack_and_pad_tensors
from torchnlp.utils import lengths_to_mask
from transformers import (BertModel, BertTokenizer, XLMRobertaModel,
                          XLMRobertaTokenizer)

torch.set_printoptions(precision=6)

def batch_encoder(samples, tokenizer):
    batch = []
    for sequence in samples:
        batch.append(torch.tensor(tokenizer.encode(sequence)))
    return stack_and_pad_tensors(batch, tokenizer.pad_token_id)

xlm = XLMRobertaModel.from_pretrained(
            'xlm-roberta-base', output_hidden_states=True
        )

bert = BertModel.from_pretrained(
            'bert-base-multilingual-cased', output_hidden_states=True
        )


xlm.eval()
bert.eval()
with torch.no_grad():
    bert_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
    xlm_tokenizer  = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

    samples = ["hello world!", "This is a batch and the first sentence will be padded"]

    bert_tokens, bert_lengths = batch_encoder(samples, bert_tokenizer)
    bert_attention_mask = lengths_to_mask(bert_lengths)

    xlm_tokens, xlm_lengths = batch_encoder(samples, xlm_tokenizer)
    xlm_attention_mask = lengths_to_mask(xlm_lengths)

    # Forward
    bert_out = bert(input_ids=bert_tokens, attention_mask=bert_attention_mask)
    xlm_out = xlm(input_ids=xlm_tokens, attention_mask=xlm_attention_mask)
    bert_last_hidden_states, bert_pooler_output, bert_all_layers = bert_out
    xlm_last_hidden_states, xlm_pooler_output, xlm_all_layers = xlm_out

    # Testing by comparing pooler_out
    bert_first_sample_tokens = torch.tensor(bert_tokenizer.encode(samples[0])).unsqueeze(0)
    xlm_first_sample_tokens = torch.tensor(xlm_tokenizer.encode(samples[0])).unsqueeze(0)
    bert_out = bert(input_ids=bert_first_sample_tokens)
    xlm_out = xlm(input_ids=xlm_first_sample_tokens)
    _, bert_pooler_output_1 , _ = bert_out
    _, xlm_pooler_output_1 , _ = xlm_out

    print (bert_pooler_output_1[0][:5])
    print (bert_pooler_output[0][:5])
    print ()
    #assert torch.equal(bert_pooler_output_1[0], bert_pooler_output[0])

    print (xlm_pooler_output_1[0][:5])
    print (xlm_pooler_output[0][:5])

    #assert torch.equal(xlm_pooler_output_1[0], xlm_pooler_output[0])

Script Output:

tensor([ 0.264619,  0.191050,  0.120784, -0.024288, -0.186887])
tensor([ 0.264619,  0.191049,  0.120784, -0.024288, -0.186887])

tensor([-0.114997, -0.025624, -0.171540,  0.725383,  0.318024])
tensor([-0.042580,  0.237069,  0.136827,  0.484221,  0.019779])

For BERT the results don't change that much... But for XLM-R the results are shockingly different!

Am I missing something?
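
One way to separate ordinary float noise from a real mismatch is to compare with a tolerance instead of torch.equal; a small sketch, reusing the tensors from the script above:

# torch.equal demands bit-exact equality; allclose allows a small tolerance
print(torch.allclose(bert_pooler_output_1[0], bert_pooler_output[0], atol=1e-5))
print(torch.allclose(xlm_pooler_output_1[0], xlm_pooler_output[0], atol=1e-5))
# the maximum absolute difference is often more informative than a bare assert
print((xlm_pooler_output_1[0] - xlm_pooler_output[0]).abs().max())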

@stale

stale bot commented Mar 12, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Mar 12, 2020
@stale stale bot closed this as completed Mar 19, 2020
@AdityaSoni19031997
Contributor

unstale

@bpben

bpben commented Jun 2, 2020

I think I'm seeing a similar issue. I'm using DistilBERT in this case, but depending on the batch size I get different outputs. The differences are slight, but confusing nonetheless. The difference seems to appear once the batch size goes beyond 3: all batch sizes above 3 give identical outputs, but batch sizes <=3 and >3 differ from each other. My example:

import torch
from transformers import DistilBertModel, DistilBertTokenizer

MODEL_NAME = 'distilbert-base-uncased'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
distil_model = DistilBertModel.from_pretrained(MODEL_NAME).to(device)
distil_tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)

distil_model.eval()
torch.set_printoptions(precision=6)
samples = ["hello world!",
           "goodbye world!",
           "hello hello!",
           "And so on and so on.",
           "And so on and so forth."]
cond_output = {}
for cond in [2, 3, 5]:
  tokens = distil_tokenizer.batch_encode_plus(
          samples[:cond],
          pad_to_max_length=True,
          return_tensors="pt")
  tokens = tokens.to(device)
  outputs = distil_model(**tokens)
  # just taking the first ten values of the first token of the first sample
  cond_output[cond] = outputs[0][:,0][0][:10].cpu().detach().numpy()
print(cond_output)

Outputs

{2: array([-0.18292062, -0.12333887,  0.1573697 , -0.1744302 , -0.25663155,
       -0.20508605,  0.31887087,  0.45650607, -0.21000467, -0.14479966],
      dtype=float32), 3: array([-0.18292062, -0.12333887,  0.1573697 , -0.1744302 , -0.25663155,
       -0.20508605,  0.31887087,  0.45650607, -0.21000467, -0.14479966],
      dtype=float32), 5: array([-0.1829206 , -0.12333884,  0.15736982, -0.1744302 , -0.25663146,
       -0.20508616,  0.318871  ,  0.45650616, -0.21000458, -0.14479981],
      dtype=float32)}

Anyone have thoughts here? This causes some confusion when I run an individual sample through the model, as it's not the same as if I run it with 3 other samples.
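
For reference, a quick check over the cond_output dict from the snippet above (names reused from that example) shows whether the gaps stay within ordinary float32 noise:

import numpy as np

# pairwise maximum absolute differences between the batch-size conditions
for a, b in [(2, 3), (2, 5), (3, 5)]:
    diff = np.abs(cond_output[a] - cond_output[b]).max()
    print(a, b, diff, np.allclose(cond_output[a], cond_output[b], atol=1e-5))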

@askerlee

askerlee commented Oct 23, 2020

(Quoting @ricardorei's script and output from the comment above.)

I also experienced the same issue using BertForPreTraining. This doesn't make sense to me: there is no component in BERT that depends on the batch size. Things like BatchNorm in training mode output different results when the batch size changes, but no such component exists in BERT, AFAIK. Anything I missed?
Another thing I noticed is that with FP16, some instances yield quite different embeddings while others have totally identical embeddings across different batch sizes. With FP32, all instances have only slightly different embeddings (but none of them are identical).
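
The FP16-vs-FP32 gap is consistent with plain floating-point precision rather than anything in BERT itself; here is a small standalone sketch (no transformers involved) that compares one row computed alone against the same row taken out of a batched matmul, at two precisions. Whether the gap is exactly zero depends on which kernels your BLAS dispatches to; on a GPU you can add torch.float16 to the tuple to see a much larger gap.

import torch

torch.manual_seed(0)
base_x = torch.randn(8, 1024, dtype=torch.float64)
base_w = torch.randn(1024, 1024, dtype=torch.float64)

# the same row computed alone vs. taken out of a batched matmul;
# lower precision generally means a larger discrepancy
for dtype in (torch.float64, torch.float32):
    x, w = base_x.to(dtype), base_w.to(dtype)
    alone = x[:1] @ w         # "batch size 1"
    batched = (x @ w)[:1]     # same row from the full batch
    print(dtype, (alone - batched).abs().max().item())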

@MMesgar

MMesgar commented Jun 1, 2021

I'm also facing this issue. BERT returns different embeddings if I change the batch size. This happens only in train() mode. Did anyone figure out the reason?
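
If the differences only show up in train() mode, dropout is the most likely explanation: it is active in train() and samples a new mask on every forward pass, so even the exact same batch gives different outputs. A tiny self-contained sketch:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(128, 128), nn.Dropout(p=0.1), nn.Linear(128, 16))
x = torch.randn(4, 128)

model.train()
print(torch.equal(model(x), model(x)))   # almost surely False: a new dropout mask is drawn each call

model.eval()
with torch.no_grad():
    print(torch.equal(model(x), model(x)))   # True: dropout is disabled in eval mode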

@CaesarWWK

Same problem over here, any thoughts about it?

@Syndorik

I'm having the same issue with BERT: slightly different outputs while only changing the batch size. It's driving me crazy because I don't understand where the mistake is.

@BrianG13

I'm not working with BERT, but I also see this phenomenon on a transformer I am working on.
Any news?

@zhezh

zhezh commented Nov 23, 2021

Deleted, there was a bug 😂

@yongzx

yongzx commented Apr 27, 2022

Having the same issue with the T5 model.

@GerardBurnside

I'm seeing similar issues on a fine-tuned distilbert-base-uncased model. Sometimes the norm of the difference between tensors goes up to 0.2, which seems huge to me (for semantic search applications it means hundreds of items would move around in the ranking depending on the size of the batch used to compute the embeddings).
Is this issue closed?
PS: I tried using float64 precision but it makes no difference.

@zachflam

Having the same issue. Any update?

@CSerxy

CSerxy commented Sep 22, 2022

Met same issue.

In transformers/models/roberta/modeling_roberta.py, inside RobertaEncoder, if I call

layer_outputs = layer_module(
    hidden_states[:2],
    attention_mask[:2],
    layer_head_mask,
    encoder_hidden_states,
    encoder_attention_mask,
    past_key_value,
    output_attentions,
)

and print

hidden_states = layer_outputs[0]
print(hidden_states[0, 0, :10])

the results are different from the version below:

layer_outputs = layer_module(
    hidden_states,
    attention_mask,
    layer_head_mask,
    encoder_hidden_states,
    encoder_attention_mask,
    past_key_value,
    output_attentions,
)

I wonder if this is a bug in huggingface Transformers? The only difference between the two versions is the input batch size.
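
For what it's worth, the same effect can be reproduced with a single linear layer and no Transformer at all, so it is not specific to RobertaEncoder; a minimal sketch (depending on hardware and BLAS kernels, the gap may even come out as exactly zero):

import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(768, 768).eval()
x = torch.randn(8, 50, 768)   # (batch, seq_len, hidden)

with torch.no_grad():
    small = layer(x[:2])      # run only the first two samples
    full = layer(x)[:2]       # run the whole batch, then slice

print((small - full).abs().max())
print(torch.allclose(small, full, atol=1e-5))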

@iz2late

iz2late commented Feb 15, 2023

Having the same issue with the BART model.

@alceballosa
Contributor

Hi! @osanseviero this is the bug I mentioned to you at Khipu. I can reproduce the behavior using @bpben's code with transformers 4.27.1 and torch 2.0.0 on an RTX 3090 GPU. At least for me, it results in inconsistent generations for models such as Flan-T5 XL, although I haven't been able to reproduce it with a minimal enough example. Nevertheless, the issue opened by @infinitylogesh mentioning this one shows that more people are struggling with it.

Let me know if I should open a new issue for this.

@gustavz

gustavz commented Oct 6, 2023

The issue still exists, any solutions @gsarti?

@cronoik
Contributor

cronoik commented Nov 1, 2023

I don't fully understand it yet, but it is not a huggingface issue. It seems that PyTorch's matrix multiplication (used inside the linear layers) already returns different results for batched inputs in combination with a transpose, both on a Ryzen 5 2500U and on Colab:

import torch
x = torch.randn(3,4)
y = torch.randn(5,4).t()
torch.set_printoptions(precision=10)

# batch size 1
print(x[0].matmul(y))
# batch size 3 but only returning the first row
print(x.matmul(y)[0])
# element-wise comparison batch-size 1 with first row of the result
print(x[0].matmul(y) == x.matmul(y)[0])

Output:

tensor([ 1.4397521019, -1.0296567678, -0.9089178443,  0.3109838367,
         0.2965016961])
tensor([ 1.4397521019, -1.0296568871, -0.9089177847,  0.3109837770,
         0.2965016961])
tensor([ True, False, False, False,  True])

All comparisons with batch sizes >1 return identical results on a Ryzen 5 2500U but not on Colab:

# comparing batch size 2 with the first two rows of the result
print(x[0:2].matmul(y) == x.matmul(y)[0:2])

Output:

tensor([[True, True, True, True, True],
        [True, True, True, True, True]])

Maybe I have made a mistake, because I only get different results when I also use a transpose on the Ryzen 5 2500U (results still differ on Colab):

import torch
x = torch.randn(3,4)
y = torch.randn(4,5)
torch.set_printoptions(precision=10)

# batch size 1
print(x[0].matmul(y))
# batch size 3 but only returning the first row
print(x.matmul(y)[0])
# element-wise comparison batch-size 1 with first row of the result
print(x[0].matmul(y) == x.matmul(y)[0])

Output:

tensor([-1.9365643263, -1.9082145691,  4.3417339325, -0.4087761641,
         1.2496384382])
tensor([-1.9365643263, -1.9082145691,  4.3417339325, -0.4087761641,
         1.2496384382])
tensor([True, True, True, True, True])

Can someone please check my logic? I don't understand how the transpose can have such an effect. I am afraid that I have made a mistake.

Otherwise, it looks like the matrix multiplication implementation on the hardware is the root cause of the differences we get. This paper, even though it didn't pass review, also seems to point in that direction; it investigated this issue for cuBLAS (CUDA matrix multiplication).
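
The underlying mechanism is that floating-point addition is not associative, so any kernel that changes its reduction or blocking strategy with the input shape (batched vs. single row, transposed vs. contiguous) can legitimately return slightly different values. A tiny illustration with nothing but summation:

import torch

torch.manual_seed(0)
v = torch.randn(100000)

s_forward = v.sum()                                             # one reduction order
s_chunked = torch.stack([c.sum() for c in v.chunk(64)]).sum()   # another reduction order
print(s_forward.item(), s_chunked.item())
print((s_forward - s_chunked).abs().item())   # usually a tiny nonzero difference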

@umarbutler
Contributor

I believe I may also be experiencing this issue. Changing the contents of a batch, even if its size remains static, changes the resulting embedding for the same text.

@dts1fl

dts1fl commented Dec 18, 2023

I also got a difference when I used encode on the whole series, but no difference when using the apply method on the series.

import pandas as pd
from sentence_transformers import SentenceTransformer

sentence_transformer_path = "distiluse-base-multilingual-cased-v1"
encoder = SentenceTransformer(sentence_transformer_path).encode

s = [str(i**2) for i in range(10)]
df = pd.DataFrame()
df["num"] = s
row = 1

# Method 1: encode the whole series as one batch
embed1 = encoder(df["num"])
difference = encoder(df.loc[row, "num"]) - embed1[row, :]
print('Method 1 difference: ', (sum(difference**2))**0.5)

# Method 2: encode row by row via apply
embed2 = df["num"].apply(encoder)
difference = encoder(df.loc[row, "num"]) - embed2[row]
print('Method 2 difference: ', (sum(difference**2))**0.5)

Method 1 difference: 3.997695879885389e-07
Method 2 difference: 0.0

@hyang0511

Having the same issue with the T5 model.

I have the same issue, are there any updates?

@hui-po-wang

Same issue here with LLaMA, though the difference is tiny. The logits keep changing when I set different batch sizes.

@amyeroberts
Collaborator

@dts1fl @a514514772 Very tiny numerical differences can be expected when varying batch sizes. Differences on the order of 1e-7 can easily be seen just by using a different GPU, or even from slight differences in the running environment. For context, when we port models into the library, a difference of 1e-5 between the original and ported model is considered acceptable. Differences of this order shouldn't manifest in very large differences in output, e.g. generated sequences.

@qgallouedec
Member

qgallouedec commented May 1, 2024

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m", trust_remote_code=True)

input1 = [
    [2000, 3000, 4000],
]
input2 = [
    [2000, 3000, 4000],
    [5000, 6000, 7000],
]
input3 = [
    [2000, 3000, 4000],
    [5000, 6000, 7000],
    [8000, 9000, 10000],
]
model.eval()
output1 = model(torch.tensor(input1)).logits
output2 = model(torch.tensor(input2)).logits
output3 = model(torch.tensor(input3)).logits

print(torch.max(torch.abs(output1[0] - output2[0])))  # tensor(0.0005, grad_fn=<MaxBackward1>)
print(torch.max(torch.abs(output2[0] - output3[0])))  # tensor(0., grad_fn=<MaxBackward1>)

I personally get differences of up to 5e-4 on the logits. I tend to agree that in practice this shouldn't be a problem (and yet, has anyone quantified it?), but it's still very surprising and deserves to be investigated.
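
One quick way to quantify the practical impact is to check whether the noise ever changes the argmax token; a sketch reusing output1 and output2 from the snippet above:

# do the per-position argmax tokens agree despite the ~5e-4 logit noise?
tokens1 = output1[0].argmax(dim=-1)
tokens2 = output2[0].argmax(dim=-1)
print((tokens1 != tokens2).sum().item(), "positions disagree out of", tokens1.numel())

# and how large is the top-1 vs top-2 logit margin compared to the noise level?
top2 = output2[0].topk(2, dim=-1).values
print("smallest top-1/top-2 margin:", (top2[:, 0] - top2[:, 1]).min().item())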

@qgallouedec
Member

qgallouedec commented May 1, 2024

Even more surprising: using CUDA, I can get a difference greater than 1e-3 (between batch sizes 2 and 3 this time):

output1 = model(torch.tensor(input1, device="cuda")).logits
output2 = model(torch.tensor(input2, device="cuda")).logits
output3 = model(torch.tensor(input3, device="cuda")).logits

print(torch.max(torch.abs(output1[0] - output2[0])))  # tensor(0., device='cuda:0', grad_fn=<MaxBackward1>)
print(torch.max(torch.abs(output2[0] - output3[0])))  # tensor(0.0015, device='cuda:0', grad_fn=<MaxBackward1>)

GPU: NVIDIA H100 80GB

@amyeroberts
Collaborator

@qgallouedec Thanks for looking into this and sharing these results!

Is this order of difference also observed for non-remote models?

@qgallouedec
Member

By "non-remote", do you mean:

config = AutoConfig.from_pretrained("EleutherAI/pythia-160m", trust_remote_code=True)
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

?

If so, the diff seems smaller:

print(torch.max(torch.abs(output1[0] - output2[0])))  # tensor(0., device='cuda:0', grad_fn=<MaxBackward1>)
print(torch.max(torch.abs(output2[0] - output3[0])))  # tensor(2.7418e-06, device='cuda:0', grad_fn=<MaxBackward1>)

I've double-checked with other inputs.

@amyeroberts
Collaborator

amyeroberts commented May 1, 2024

@qgallouedec Sorry, I wasn't clear: by non-remote I mean a model whose architecture is defined in the transformers library rather than on the Hub, i.e. it's not necessary to pass trust_remote_code=True to load it.

In this case, for this checkpoint, I can see it uses transformers.

The reason I ask is that there could be many reasons for differences, but if the architecture definition is on the Hub, then there isn't much we can do apart from raising a discussion on the checkpoint page itself.

If so, the diff seems smaller:

Great!

@qgallouedec
Member

My bad, misleading use of trust_remote_code.

So yes, I can see the same diff (around 1e-6) with other models like MistralForCausalLM.

Based on #2401 (comment), this PyTorch discussion and this other PyTorch discussion, we can conclude that any difference under 1e-6 is the "expected numerical noise". However, this noise can accumulate, and we can end up with potentially annoying noise, as in #2401 (comment). I'm not sure there's anything to do but keep that in mind.
