
Fix Llava for 0-embeddings #30473

Merged: 1 commit, merged on Apr 25, 2024
Conversation

zucchini-nlp (Member)

What does this PR do?

Fixes #29835. In llava-next the embedding weights of some tokens are rounded to 0 when cast to fp16, which results in an incorrect calculation of image_positions. This PR fixes it by getting image positions as "anything that was not in text_positions", so that we do not rely on embedding values.

All llava tests (+ slow) are passing locally, and I added one test for the <unk> token in llava-next. <unk> is one of the tokens whose embeddings get cast to 0, but there are around 200 such tokens in the llava-mistral-7b version.
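
Roughly, the idea is to stop relying on embedding values when locating image positions. A toy sketch of the contrast (not the actual diff; the ids and shapes are made up):

import torch

# Toy setup: 32000 stands in for the image placeholder token id.
input_ids = torch.tensor([[1, 3148, 1001, 32000, 13, 22933]])
image_token_index = 32000
embed_dim = 4
inputs_embeds = torch.randn(1, input_ids.shape[1], embed_dim, dtype=torch.float16)
inputs_embeds[0, 5] = 0.0  # a text token whose fp16 embedding rounded to 0

# Old, value-based approach: treat all-zero rows as "image" slots -> miscounts here.
value_based = (inputs_embeds == 0).all(-1)

# New, id-based approach: image positions are "anything that is not a text token",
# computed from input_ids, so zero-valued embeddings no longer matter.
text_positions = input_ids != image_token_index
id_based = ~text_positions

print(value_based)  # wrongly flags position 5 as an image slot
print(id_based)     # only position 3, regardless of embedding values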

HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

amyeroberts (Collaborator) left a comment

Very nice handling - thanks for digging into this and adding a test! 🚀

NielsRogge (Contributor) left a comment

Very clean, thanks for looking into this! cc @ArthurZucker

ArthurZucker (Collaborator) left a comment

good call thanks

zucchini-nlp merged commit e60491a into huggingface:main on Apr 25, 2024
20 checks passed
itazap pushed a commit that referenced this pull request May 14, 2024
DingYX0731

It seems that the problem still exists for llava when using the new code from #30473. I upgraded transformers to version 4.41.0.

The error still occurs:
ValueError: The input provided to the model are wrong. The number of image tokens is 0 while the number of image given to the model is 2. This prevents correct indexing and breaks batch generation.

My code is:

from transformers import LlavaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from PIL import Image
import torch

torch_device = "cuda" if torch.cuda.is_available() else "cpu"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaForConditionalGeneration.from_pretrained("llava-v1.5-7b", quantization_config=quantization_config, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-v1.5-7b")
prompts = [
    "USER: <image>\nWhat are the things I should be cautious about when I visit this place? What should I bring with me?\nASSISTANT:",
    "USER: <image>\nPlease describe this image\nASSISTANT:",
]
image_file_1 = "image_1.png"
image_file_2 = "image_2.png"
image1 = Image.open(image_file_1)
image2 = Image.open(image_file_2)
inputs = processor(prompts, images=[image1, image2], padding=True, return_tensors="pt").to(torch_device)
output = model.generate(**inputs, max_new_tokens=20)

Is it because I use quantization? I followed the tutorial from @NielsRogge, and I couldn't run through the code.

zucchini-nlp (Member, Author)

@DingYX0731 hey! I couldn't reproduce the error on the latest version of transformers; the tutorial code works fine for me if I run it in Colab as-is. Note that the tutorial already installs the latest transformers from main in the first cell.

Can you check if it works for you in Colab? If yes, the problem might be in your local setup/hardware. For example, in #30294 the problem was in using "mps" as the device.

DingYX0731

The tutorial code works fine in Colab for me as well (with transformers==4.42.0.dev0, and also with 4.41.0, which is the same version as my local setup). The problem is very likely caused by my local setup, which is Ubuntu 20.04 with an RTX 4090. But I am still confused... I have also tried an older version, transformers==4.37.2, which also works in Colab but not locally...

zucchini-nlp (Member, Author)

Hmm, then it would be hard for me to help you locate the bug. Let's try the following:

  1. Verify that the correct version of transformers is installed and being used by adding this:

import transformers
print(transformers.__version__)

  2. Print input_ids, as it seems like the special image tokens in input_ids are not being found:

inputs = processor(prompts, images=[image1, image2], padding=True, return_tensors="pt").to(model.device)

print(inputs.input_ids)
print(model.config.image_token_index)

  3. Verify that the error occurs at the pre-fill stage, i.e. the first forward call. If yes, the following will fail:

outputs = model(**inputs, use_cache=True)

DingYX0731

Sorry to bother you so much @zucchini-nlp

For the old version of transformers (4.37.2):
When running:

inputs = processor(prompts, images=[image1, image2], padding=True, return_tensors="pt").to(torch_device)

print(inputs.input_ids)
print(model.config.image_token_index)

the output is:

tensor([[    1,  3148,  1001, 29901,   529,  3027, 29958,    13,  5618,   526,
           278,  2712,   306,   881,   367,   274,  1300,  2738,  1048,   746,
           306,  6493,   445,  2058, 29973,  1724,   881,   306,  6963,   411,
           592, 29973,    13, 22933,  9047, 13566, 29901],
        [    1,  3148,  1001, 29901,   529,  3027, 29958,    13, 12148,  8453,
           445,  1967,    13, 22933,  9047, 13566, 29901,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]], device='cuda:0')
32000

There appear to be lots of zeros. Do they indicate the existence of special tokens?
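
A quick sanity check here (assuming inputs and model are defined as above) is to count how many times image_token_index actually appears in input_ids:

num_image_tokens = (inputs.input_ids == model.config.image_token_index).sum().item()
print(num_image_tokens)  # 0 for the tensor above, which matches the ValueError message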

And the code:

outputs = model(**inputs, use_cache=True)

still fails:

ValueError: The input provided to the model are wrong. The number of image tokens is 0 while the number of image given to the model is 2. This prevents correct indexing and breaks batch generation.

For the newer version of transformers (4.41.0), input_ids and image_token_index are exactly the same as before, and the error is the same:

    332 image_to_overwrite &= image_to_overwrite.cumsum(-1) - 1 >= nb_image_pad[:, None].to(target_device)
    334 if image_to_overwrite.sum() != image_features.shape[:-1].numel():
--> 335     raise ValueError(
    336         f"The input provided to the model are wrong. The number of image tokens is {torch.sum(special_image_token_mask)} while"
    337         f" the number of image given to the model is {num_images}. This prevents correct indexing and breaks batch generation."
    338     )
    340 final_embedding[image_to_overwrite] = image_features.contiguous().reshape(-1, embed_dim).to(target_device)
    341 final_attention_mask |= image_to_overwrite

ValueError: The input provided to the model are wrong. The number of image tokens is 0 while the number of image given to the model is 2. This prevents correct indexing and breaks batch generation.

The modified code from this PR is indeed present in the installed package:
[screenshot of the installed modeling code]

Note that when I upgraded to transformers==4.41.0, the following messages occurred:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llava 1.2.2.post1 requires tokenizers==0.15.1, but you have tokenizers 0.19.1 which is incompatible.
llava 1.2.2.post1 requires transformers==4.37.2, but you have transformers 4.41.0 which is incompatible.

But the llava-torch version is already the latest. Could this lead to the problem?

DingYX0731

I just found out the problem and solved it!
The issue was not inside transformers, but the model I used.
I used the original version of llava (llava-v1.5-7b) rather than llava-v1.5-7b-hf.
The problem was solved when I switched to the latter one.
Thank you so much for your time and contribution!!! @zucchini-nlp
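
For reference, a minimal sketch of loading the HF-converted checkpoint instead (assuming llava-hf/llava-1.5-7b-hf is the converted Hub id):

from transformers import LlavaForConditionalGeneration, AutoProcessor

model_id = "llava-hf/llava-1.5-7b-hf"  # converted checkpoint, not the original llava-v1.5-7b weights
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)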

hxhcreate

In transformers version 4.41 I still encounter the error below:
ValueError: The input provided to the model are wrong. The number of image tokens is 4 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.
The cause of this error is that the data itself contains <image> tokens.
Could this be properly solved, or should I change the transformers version?
thx

zucchini-nlp (Member, Author)

Hey @hxhcreate! What do you mean by "the data itself has <image> tokens"? LLaVa models expect as many image tokens as there are images, so if you have 4 image tokens in the text you need to pass 4 images.

hxhcreate

I mean the text data itself wrongly contains several <image> strings, while I only need to input one image.

Could the correct number of images be inferred from the input images themselves, rather than from the text?

zucchini-nlp (Member, Author)

@hxhcreate Ah I see. Unfortunately we can't infer that during processing. While it is doable, I think it would cause more errors in the future, so it is better to delegate to users to prepare their text and images correctly. You can preprocess your dataset manually by checking how many images you have and replacing the extra image tokens with an empty string, as in the sketch below.
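
A minimal sketch of that preprocessing (a hypothetical helper, not part of transformers):

def trim_image_tokens(text: str, num_images: int, token: str = "<image>") -> str:
    # Keep only as many "<image>" placeholders as there are images; drop the rest.
    parts = text.split(token)
    keep = min(num_images, len(parts) - 1)
    return token.join(parts[: keep + 1]) + "".join(parts[keep + 1:])

prompt = "USER: <image> <image>\nPlease describe this image\nASSISTANT:"
print(trim_image_tokens(prompt, num_images=1))  # keeps a single <image> placeholder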

hxhcreate

I see, that's easy to do, thanks for your help
