
Fix Llava for 0-embeddings #30473

Merged: 1 commit, merged on Apr 25, 2024
Conversation

zucchini-nlp (Member)

What does this PR do?

Fixes #29835. In llava-next the embedding weights of some tokens are rounded to 0 when cast to fp16, which results in an incorrect calculation of image_positions. This PR fixes it by getting image positions as "anything that was not in text_positions", so that we do not rely on embedding values.

All llava tests (+ slow) are passing locally, and I added one test for the <unk> token in llava-next. <unk> is one of the tokens whose embeddings get cast to 0, but there are around 200 such tokens in the llava-mistral-7b version.
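
Roughly, the idea is to stop relying on embedding values when locating image positions. A toy sketch of the contrast (not the actual diff; the ids and shapes are made up):

import torch

# Toy setup: 32000 stands in for the image placeholder token id.
input_ids = torch.tensor([[1, 3148, 1001, 32000, 13, 22933]])
image_token_index = 32000
embed_dim = 4
inputs_embeds = torch.randn(1, input_ids.shape[1], embed_dim, dtype=torch.float16)
inputs_embeds[0, 5] = 0.0  # a text token whose fp16 embedding rounded to 0

# Old, value-based approach: treat all-zero rows as "image" slots -> miscounts here.
value_based = (inputs_embeds == 0).all(-1)

# New, id-based approach: image positions are "anything that is not a text token",
# computed from input_ids, so zero-valued embeddings no longer matter.
text_positions = input_ids != image_token_index
id_based = ~text_positions

print(value_based)  # wrongly flags position 5 as an image slot
print(id_based)     # only position 3, regardless of embedding values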

HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

amyeroberts (Collaborator) left a comment

Very nice handling - thanks for digging into this and adding a test! 🚀

NielsRogge (Contributor) left a comment

Very clean, thanks for looking into this! cc @ArthurZucker

ArthurZucker (Collaborator) left a comment

good call thanks

zucchini-nlp merged commit e60491a into huggingface:main on Apr 25, 2024
20 checks passed
itazap pushed a commit that referenced this pull request May 14, 2024
DingYX0731

It seems that the problem still exists for llava when using the new code from #30473. I upgraded transformers to version 4.41.0.

The error still occurs:
ValueError: The input provided to the model are wrong. The number of image tokens is 0 while the number of image given to the model is 2. This prevents correct indexing and breaks batch generation.

My code is:

from transformers import LlavaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from PIL import Image
import torch

torch_device = "cuda" if torch.cuda.is_available() else "cpu"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaForConditionalGeneration.from_pretrained("llava-v1.5-7b", quantization_config=quantization_config, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-v1.5-7b")
prompts = [
    "USER: <image>\nWhat are the things I should be cautious about when I visit this place? What should I bring with me?\nASSISTANT:",
    "USER: <image>\nPlease describe this image\nASSISTANT:",
]
image_file_1 = "image_1.png"
image_file_2 = "image_2.png"
image1 = Image.open(image_file_1)
image2 = Image.open(image_file_2)
inputs = processor(prompts, images=[image1, image2], padding=True, return_tensors="pt").to(torch_device)
output = model.generate(**inputs, max_new_tokens=20)

Is it because I use quantization? I followed the tutorial from @NielsRogge, and I couldn't run through the code.

zucchini-nlp (Member, Author)

@DingYX0731 hey! I couldn't reproduce the error on the latest version of transformers; the tutorial code works fine for me if I run it in Colab as-is. Note that the tutorial already installs the latest transformers from main in the first cell.

Can you check if it works for you in Colab? If yes, the problem might be in your local setup/hardware. For example, in #30294 the problem was in using "mps" as the device.

DingYX0731

The tutorial code works fine in Colab for me as well (with transformers==4.42.0.dev0, and also with 4.41.0, which is the same version as my local setup). The problem is very likely caused by my local setup, which is Ubuntu 20.04 with an RTX 4090. But I am still confused... I have also tried an older version, transformers==4.37.2, which also works in Colab but not locally...

zucchini-nlp (Member, Author)

Hmm, then it would be hard for me to help you locate the bug. Let's try the following:

  1. Verify that the correct version of transformers is installed and being used by adding this:

import transformers
print(transformers.__version__)

  2. Print input_ids, as it seems like the special image tokens in input_ids are not being found:

inputs = processor(prompts, images=[image1, image2], padding=True, return_tensors="pt").to(model.device)

print(inputs.input_ids)
print(model.config.image_token_index)

  3. Verify that the error occurs at the pre-fill stage, i.e. the first forward call. If yes, the following will fail:

outputs = model(**inputs, use_cache=True)

DingYX0731

Sorry to bother you so much @zucchini-nlp

For the old version of transformers (4.37.2):
When running:

inputs = processor(prompts, images=[image1, image2], padding=True, return_tensors="pt").to(torch_device)

print(inputs.input_ids)
print(model.config.image_token_index)

the output is:

tensor([[    1,  3148,  1001, 29901,   529,  3027, 29958,    13,  5618,   526,
           278,  2712,   306,   881,   367,   274,  1300,  2738,  1048,   746,
           306,  6493,   445,  2058, 29973,  1724,   881,   306,  6963,   411,
           592, 29973,    13, 22933,  9047, 13566, 29901],
        [    1,  3148,  1001, 29901,   529,  3027, 29958,    13, 12148,  8453,
           445,  1967,    13, 22933,  9047, 13566, 29901,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]], device='cuda:0')
32000

There appear to be lots of zeros. Do they indicate the existence of special tokens?
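
A quick sanity check here (assuming inputs and model are defined as above) is to count how many times image_token_index actually appears in input_ids:

num_image_tokens = (inputs.input_ids == model.config.image_token_index).sum().item()
print(num_image_tokens)  # 0 for the tensor above, which matches the ValueError message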

And the code:

outputs = model(**inputs, use_cache=True)

still fails:

ValueError: The input provided to the model are wrong. The number of image tokens is 0 while the number of image given to the model is 2. This prevents correct indexing and breaks batch generation.

For the newer version of transformers (4.41.0), input_ids and image_token_index are exactly the same as before, and the error is the same:

    332 image_to_overwrite &= image_to_overwrite.cumsum(-1) - 1 >= nb_image_pad[:, None].to(target_device)
    334 if image_to_overwrite.sum() != image_features.shape[:-1].numel():
--> 335     raise ValueError(
    336         f"The input provided to the model are wrong. The number of image tokens is {torch.sum(special_image_token_mask)} while"
    337         f" the number of image given to the model is {num_images}. This prevents correct indexing and breaks batch generation."
    338     )
    340 final_embedding[image_to_overwrite] = image_features.contiguous().reshape(-1, embed_dim).to(target_device)
    341 final_attention_mask |= image_to_overwrite

ValueError: The input provided to the model are wrong. The number of image tokens is 0 while the number of image given to the model is 2. This prevents correct indexing and breaks batch generation.

The modified code from this PR is indeed present in the installed package:
[screenshot of the installed modeling code]

Note that when I upgraded to transformers==4.41.0, the following messages occurred:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llava 1.2.2.post1 requires tokenizers==0.15.1, but you have tokenizers 0.19.1 which is incompatible.
llava 1.2.2.post1 requires transformers==4.37.2, but you have transformers 4.41.0 which is incompatible.

But the llava-torch version is already the latest. Could this lead to the problem?

DingYX0731

I just found out the problem and solved it!
The issue was not inside transformers, but the model I used.
I used the original version of llava (llava-v1.5-7b) rather than llava-v1.5-7b-hf.
The problem was solved when I switched to the latter one.
Thank you so much for your time and contribution!!! @zucchini-nlp
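
For reference, a minimal sketch of loading the HF-converted checkpoint instead (assuming llava-hf/llava-1.5-7b-hf is the converted Hub id):

from transformers import LlavaForConditionalGeneration, AutoProcessor

model_id = "llava-hf/llava-1.5-7b-hf"  # converted checkpoint, not the original llava-v1.5-7b weights
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)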

hxhcreate

In transformers version 4.41 I still encounter the error below:
ValueError: The input provided to the model are wrong. The number of image tokens is 4 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.
The cause of this error is that the data itself contains <image> tokens.
Could this be properly solved, or should I change the transformers version?
thx

zucchini-nlp (Member, Author)

Hey @hxhcreate! What do you mean by "the data itself has <image> tokens"? LLaVa models expect as many image tokens as there are images, so if you have 4 image tokens in the text you need to pass 4 images.

hxhcreate

I mean the text data itself wrongly contains several <image> strings, while I only need to input one image.

Could the correct number of images be inferred from the input images themselves, rather than from the text?

zucchini-nlp (Member, Author)

@hxhcreate Ah I see. Unfortunately we can't infer that during processing. While it is doable, I think it would cause more errors in the future, so it is better to delegate to users to prepare their text and images correctly. You can preprocess your dataset manually by checking how many images you have and replacing the extra image tokens with an empty string, as in the sketch below.
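
A minimal sketch of that preprocessing (a hypothetical helper, not part of transformers):

def trim_image_tokens(text: str, num_images: int, token: str = "<image>") -> str:
    # Keep only as many "<image>" placeholders as there are images; drop the rest.
    parts = text.split(token)
    keep = min(num_images, len(parts) - 1)
    return token.join(parts[: keep + 1]) + "".join(parts[keep + 1:])

prompt = "USER: <image> <image>\nPlease describe this image\nASSISTANT:"
print(trim_image_tokens(prompt, num_images=1))  # keeps a single <image> placeholder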

hxhcreate

I see, that's easy to do, thanks for your help
