System Info

transformers version: 4.45.0.dev0

Who can help?
@ArthurZucker @amyeroberts
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Hi, I'm fine-tuning the newly released Qwen2VLForConditionalGeneration model with LoRA.

I found that attn_implementation="flash_attention_2" activates Qwen2VLFlashAttention2, which throws an out-of-index error during training. When I switch to attn_implementation="sdpa", the error does not occur and training runs smoothly.
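For context, here is a minimal sketch of the kind of setup I mean; the model name, dtype, and LoRA hyperparameters are illustrative rather than my exact code.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Illustrative setup: load Qwen2-VL with FlashAttention-2 and wrap it with a LoRA adapter.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # switching this to "sdpa" avoids the error
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```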
After some debugging, I traced the problem to this line, where rotary_seq_len does not properly reflect the length of the input sequence but rather the real length minus 1. I changed the line to rotary_seq_len = cache_position[-1] + 1 in my local transformers installation, and training with flash_attention_2 then runs smoothly.
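For reference, this is the shape of the local patch; a sketch only, assuming the surrounding code in Qwen2VLFlashAttention2.forward is left unchanged.

```python
# Patched line in Qwen2VLFlashAttention2.forward (modeling_qwen2_vl.py):
# cache_position holds 0-indexed positions of the current tokens, so its last entry is
# seq_len - 1; adding 1 recovers the true sequence length for the rotary embedding.
rotary_seq_len = cache_position[-1] + 1
```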
My input batch to the model is as follows:

batch
    input_ids: Tensor (B, seq_len)
    attention_mask: Tensor (B, seq_len)
    labels: Tensor (B, seq_len)
    pixel_values: Tensor (B, res_h, res_w)  # res_h and res_w are the shape of the image after processor()
    image_grid_thw: Tensor (B, 3)
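Roughly, the batch is assembled as in the sketch below; the processor call and the padding-only labels mask are illustrative assumptions, not my exact collate code.

```python
from transformers import AutoProcessor

# Illustrative collate function producing a batch like the one described above.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

def collate(samples):
    # each sample: {"text": chat-template-formatted string, "image": PIL.Image}
    texts = [s["text"] for s in samples]
    images = [s["image"] for s in samples]
    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")

    # Supervise every non-padding token; a real setup would usually also mask the
    # prompt and image tokens to -100.
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100
    batch["labels"] = labels
    # batch now holds input_ids, attention_mask, labels, pixel_values, image_grid_thw
    return batch
```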
I believe my input batch has the correct shape, so I'm wondering whether my small workaround is the right fix for this problem. I'd really appreciate any suggestions for a better solution.
Expected behavior
As described in the Reproduction section. Thanks for your patience with my issue.