Video-Llava model's generation error due to causal mask shape mismatch #34696

jiqing-feng · 2024-11-12T04:17:12Z

System Info

The regression happens after transformers==4.45.2.

- `transformers` version: 4.47.0.dev0
- Platform: Linux-6.6.0-gnr.bkc.6.6.9.3.15.x86_64-x86_64-with-glibc2.34
- Python version: 3.10.15
- Huggingface_hub version: 0.26.1
- Safetensors version: 0.4.5
- Accelerate version: 1.1.0
- Accelerate config:    not found
- PyTorch version (GPU?): 2.6.0.dev20241014+cpu (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>

Who can help?

@ArthurZucker

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

The code is from LanguageBind/Video-LLaVA-7B-hf.
It is also the official example code in modeling_video_llava.

python

from PIL import Image
import requests
import numpy as np
import av
from huggingface_hub import hf_hub_download
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

prompt = "USER: <video>Why is this video funny? ASSISTANT:"
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)

# sample uniformly 8 frames from the video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)

inputs = processor(text=prompt, videos=clip, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=80)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

# Generate from images and videos mix
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = [
    "USER: <image> How many cats are there in the image? ASSISTANT:",
    "USER: <video>Why is this video funny? ASSISTANT:"
]
inputs = processor(text=prompt, images=image, videos=clip, padding=True, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=50)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True))

Traceback:

Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.44.
Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
USER: Why is this video funny? ASSISTANT: The and? and??????????? [? [ and, [ [ [ [ [ [ [ [ [ [, [, [ and, [ and, and, and, and, and, and, and, and, and, and, and, and, [
Traceback (most recent call last):
  File "/home/jiqing/test_llava.py", line 58, in <module>
    generate_ids = model.generate(**inputs, max_length=50)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/generation/utils.py", line 2231, in generate
    result = self._sample(
  File "/home/jiqing/transformers/src/transformers/generation/utils.py", line 3222, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/models/video_llava/modeling_video_llava.py", line 663, in forward
    outputs = self.language_model(
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 1204, in forward
    outputs = self.model(
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 955, in forward
    layer_outputs = decoder_layer(
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 685, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 611, in forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: The size of tensor a (2332) must match the size of tensor b (22) at non-singleton dimension 3

The causal mask shape: [2, 1, 1, 22]

Expected behavior

With transformers==4.45.2, the same script generates the correct text:

Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.44.
Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
USER: Why is this video funny? ASSISTANT: The video is funny because it shows a baby sitting on a bed and playing with a Wii remote, which is an unusual sight.Ъ because babies are not typically known for playing video games. The baby's actions with the remote control create a humorous and unexpected scene, making it entertain
['USER:  How many cats are there in the image? ASSISTANT: There are two cats in the image. (or three, depending on the interpretation of the image).', 'USER: Why is this video funny? ASSISTANT: The video is funny because it shows a baby sitting on a bed and playing with a Wii remote... The baby is holding the']

The causal mask shape [2, 1, 1, 2332]
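To illustrate why the two mask shapes reported above behave differently, here is a minimal pure-Python sketch of the broadcasting rule that the attention mask has to satisfy. Only the last-dimension sizes (22 and 2332) and the batch size of 2 come from this issue; the helper `broadcast_shape`, the head count of 32, and the query length of 1 are assumptions made for illustration, not values read from the model.

```python
def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError.

    Follows the usual NumPy/PyTorch rule: compare dimensions from the
    right; two sizes are compatible if they are equal or one of them is 1.
    """
    ndim = max(len(a), len(b))
    result = []
    for i in range(1, ndim + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(
                f"size {da} must match size {db} at non-singleton dimension {ndim - i}"
            )
        result.append(db if da == 1 else da)
    return tuple(reversed(result))


# (batch, heads, query_len, key_len); heads=32 and query_len=1 are assumed.
ATTENTION_SCORES = (2, 32, 1, 2332)
MASK_FAILING = (2, 1, 1, 22)      # mask shape reported with 4.47.0.dev0
MASK_WORKING = (2, 1, 1, 2332)    # mask shape reported with 4.45.2

print(broadcast_shape(ATTENTION_SCORES, MASK_WORKING))

try:
    broadcast_shape(ATTENTION_SCORES, MASK_FAILING)
except ValueError as err:
    print("mismatch:", err)
```

The mask built from the unexpanded prompt (22 positions) cannot be broadcast against keys that already include the expanded image and video tokens (2332 positions), which matches the dimension named in the RuntimeError.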


cw235 · 2024-11-12T19:47:03Z

Thanks for providing those details and the traceback.

It seems like the core issue is related to the tensor sizes not matching during the attention mechanism in the model. Here are some steps to potentially resolve this:

Update Model's Processing Config: Ensure that patch_size and vision_feature_select_strategy are set in the processor's config.
```
processor.patch_size = <appropriate_patch_size>
processor.vision_feature_select_strategy = <appropriate_strategy>
```
Debugging Shapes: Check the shapes of the inputs and the masks before passing them to the model to make sure they match.
```
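# `inputs` is a dict-like batch, so print the shape of each tensor in it
print({name: tensor.shape for name, tensor in inputs.items()})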
```
Compare Configurations: Ensure that your configurations in the newer version of transformers align with those used in version 4.45.2. Sometimes, even minor changes in default settings can lead to such issues.
Review Model Changes: Look into the changes made in the modeling_llama.py and modeling_video_llava.py files between these versions to spot differences in the implementation of forward calls or handling of inputs.
Community Help: If none of the above steps work, it might be useful to raise this issue on the official GitHub repository or the Hugging Face forums. Including the traceback and detailed description you’ve provided here will be very helpful for anyone trying to assist.

I hope these steps help you narrow down and resolve the issue! If you need more specific advice or further assistance, don't hesitate to ask.

LysandreJik · 2024-11-15T10:34:51Z

@cw235, is this response pasted from ChatGPT? It doesn't seem helpful to the question asked.

cc @zucchini-nlp on the initial question

zucchini-nlp · 2024-11-18T11:38:23Z

@jiqing-feng For video-llava we should already have removed the legacy path, but unfortunately I cannot get in contact with the author/repo owner. For now, I suggest adding these two lines to the code after loading the processor:

processor.vision_feature_select_strategy = "default"
processor.patch_size = 14
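A minimal sketch of how these two lines could be applied defensively, assuming the reproduction script above. The helper name `apply_legacy_processor_defaults` is hypothetical and not part of transformers; it only fills in the two attributes when they are missing or `None`, so it should be harmless on versions where the defaults are already set. A stand-in object replaces the real processor so that the sketch runs without downloading the model.

```python
from types import SimpleNamespace

# Values suggested in this thread for LanguageBind/Video-LLaVA-7B-hf.
SUGGESTED_DEFAULTS = {
    "patch_size": 14,
    "vision_feature_select_strategy": "default",
}


def apply_legacy_processor_defaults(processor, defaults=SUGGESTED_DEFAULTS):
    """Set each attribute on `processor` only if it is missing or None.

    Hypothetical helper for illustration; returns the names it changed.
    """
    changed = []
    for name, value in defaults.items():
        if getattr(processor, name, None) is None:
            setattr(processor, name, value)
            changed.append(name)
    return changed


# Stand-in for the loaded VideoLlavaProcessor (assumption: the legacy
# config leaves both attributes unset).
processor = SimpleNamespace(patch_size=None, vision_feature_select_strategy=None)
changed = apply_legacy_processor_defaults(processor)
print(changed, processor.patch_size, processor.vision_feature_select_strategy)
```

In the real script, the call would go right after `VideoLlavaProcessor.from_pretrained(...)` and before the first `processor(text=..., ...)` call, so that the image and video tokens are expanded during processing.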

zucchini-nlp · 2024-12-11T13:31:21Z

Done, we have added these values as defaults in v4.47, so it should be working now.

jiqing-feng added the bug label Nov 12, 2024

jiqing-feng changed the title ~~Llava model's generation error due to causal mask shape mismatch~~ Video-Llava model's generation error due to causal mask shape mismatch Nov 12, 2024

jiqing-feng mentioned this issue Nov 15, 2024

LLaVA: latency issues #34460

Merged

zucchini-nlp mentioned this issue Nov 15, 2024

Video-LLaVa now available in the Transformers library! PKU-YuanGroup/Video-LLaVA#156

Open

zucchini-nlp closed this as completed Dec 11, 2024
