
Batched inference for OS-Atlas-Base-7B is broken with attn_implementation="flash_attention_2" #17

Open
jasonlee-sf opened this issue Nov 19, 2024 · 0 comments

@jasonlee-sf

Hi, it seems that batched inference is broken when flash attention is used. When running inference on the first example of the ScreenSpot test set with attn_implementation="flash_attention_2", the output changes depending on the batch size:

  • Batch_size = 1: <|object_ref_start|>close button<|object_ref_end|><|box_start|>(954,148),(988,196)<|box_end|><|im_end|>

  • Batch_size = 4: 降序<|im_end|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>

When I disable flash_attention_2, the results look fine regardless of batch size:

  • Batch_size = 1: <|object_ref_start|>close button<|object_ref_end|><|box_start|>(954,148),(988,196)<|box_end|><|im_end|>

  • Batch_size = 4: <|object_ref_start|>close button<|object_ref_end|><|box_start|>(954,148),(988,196)<|box_end|><|im_end|>

Flash attention with batch_size=1 is fast enough that this bug is not a deal breaker for me, but it would be nice if it were addressed.
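
For reference, a minimal sketch of the kind of batched-inference setup that reproduces this, assuming the standard Qwen2-VL-style usage for OS-Atlas-Base-7B (this is not the exact script used above; the image path and prompt text are placeholders):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "OS-Copilot/OS-Atlas-Base-7B",            # OS-Atlas-Base-7B is Qwen2-VL based
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # batched outputs are correct without this flag
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("OS-Copilot/OS-Atlas-Base-7B")
processor.tokenizer.padding_side = "left"     # left-pad for batched generation

image = Image.open("screenspot_example_0.png")  # placeholder: 1st ScreenSpot test screenshot
instruction = 'In this UI screenshot, what is the position of the "close button"?'  # placeholder prompt

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": instruction},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

batch_size = 4                                # 1 works; 4 produces garbage with flash_attention_2
inputs = processor(
    text=[text] * batch_size,
    images=[image] * batch_size,
    padding=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
generated = out[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
print(processor.batch_decode(generated, skip_special_tokens=False))
```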
