Video-Llava model's generation error due to causal mask shape mismatch #34696
Comments
Thanks for providing those details and the traceback. It seems like the core issue is related to the tensor sizes not matching during the attention mechanism in the model. Here are some steps to potentially resolve this:
I hope these steps help you narrow down and resolve the issue! If you need more specific advice or further assistance, don't hesitate to ask.
@cw235, is this response pasted from ChatGPT? It doesn't seem helpful to the question asked. cc @zucchini-nlp on the initial question
@jiqing-feng For video-llava we need to get rid of the legacy path already, but unfortunately I cannot get in contact with the author/repo owner. For now, I suggest adding these two lines to the code after loading the processor:
processor.vision_feature_select_strategy = "default"
processor.patch_size = 14
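The workaround above can be sketched as a small helper. This is only an illustration: `apply_video_llava_workaround` is a hypothetical name, not part of transformers, and a `SimpleNamespace` stands in for the real processor so the snippet runs without downloading the LanguageBind/Video-LLaVA-7B-hf checkpoint.

```python
from types import SimpleNamespace


def apply_video_llava_workaround(processor, patch_size=14, strategy="default"):
    """Set the two attributes suggested in this thread on a loaded processor.

    Hypothetical helper for illustration; the values 14 and "default"
    are the ones quoted in the comment above.
    """
    processor.vision_feature_select_strategy = strategy
    processor.patch_size = patch_size
    return processor


# Stand-in object (assumption): in real use this would be the processor
# loaded for LanguageBind/Video-LLaVA-7B-hf.
dummy_processor = SimpleNamespace(patch_size=None, vision_feature_select_strategy=None)
apply_video_llava_workaround(dummy_processor)
print(dummy_processor.patch_size, dummy_processor.vision_feature_select_strategy)
```

According to the follow-up comment, these values ship as defaults from v4.47, so the helper would only matter on versions before that.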
Done, we have added the values as defaults in v4.47, so it should be working now.
System Info
The regression happens after transformers==4.45.2.
Who can help?
@ArthurZucker
Reproduction
The code is from LanguageBind/Video-LLaVA-7B-hf.
It is also the official example code in modeling_video_llava.
Traceback:
The causal mask shape: [2, 1, 1, 22]
Expected behavior
With transformers==4.45.2, generation produces the correct text:
The causal mask shape: [2, 1, 1, 2332]
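To illustrate why the two mask shapes behave differently, here is a minimal pure-Python sketch of the usual broadcasting rule (two dimensions are compatible when they are equal or one of them is 1). The attention-score shape `(2, 32, 1, 2332)` is an assumption made for the example (batch 2, a hypothetical 32 heads, one query position, 2332 key positions); only the two mask shapes come from this issue.

```python
def broadcast_shape(a, b):
    """Return the broadcast result of two shapes, or raise ValueError."""
    n = max(len(a), len(b))
    # Right-align the shapes by padding the shorter one with leading 1s.
    a = (1,) * (n - len(a)) + tuple(a)
    b = (1,) * (n - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"cannot broadcast {a} with {b}: {x} vs {y}")
        out.append(max(x, y))
    return tuple(out)


scores_shape = (2, 32, 1, 2332)  # assumed attention-score shape, not from the issue

# Mask shape reported with transformers==4.45.2: compatible with the scores.
print(broadcast_shape((2, 1, 1, 2332), scores_shape))

# Mask shape reported after the regression: the last dimension (22) cannot
# be matched against 2332 key positions.
try:
    broadcast_shape((2, 1, 1, 22), scores_shape)
except ValueError as err:
    print("mismatch:", err)
```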