AWQ performance on RTX3090, with flash_attn2 #8
Comments
2 tokens/s is significantly lower than the speed benchmark we have tested, which is around 30 tokens/s. Your environment may not be configured correctly. Maybe you can try setting up your environment with:

pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/huggingface/transformers.git@27903de
pip install autoawq==0.2.6

Also, it seems there isn't a precompiled flash-attn wheel for cu121. If you prefer not to compile it yourself, you can simply do:

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
)
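As a quick check of the versions mentioned above, a minimal sketch (assuming each package exposes a standard __version__ attribute, including the "awq" package installed by autoawq):

import torch
import transformers

print("torch:", torch.__version__)          # expect 2.3.1+cu121
print("cuda build:", torch.version.cuda)    # expect 12.1
print("transformers:", transformers.__version__)

# autoawq installs the "awq" package
import awq
print("autoawq:", awq.__version__)          # expect 0.2.6

# flash-attn is optional; without it, drop attn_implementation="flash_attention_2"
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed")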
Thank you for the response and advice. I tried it as you suggested: I was on torch 2.4.0 and autoawq 0.2.3, and am now on torch 2.3.1 and autoawq 0.2.6. Anyway, great new VL model; aside from my performance issues, the results are great!
I get ~44 seconds of inference time using the demo code on Qwen2-VL-7B-Instruct-AWQ with a 774x772 image input, using flash_attn2 and bfloat16, with an output of 104 tokens:
['The image shows a round wall clock with a black frame and a white face. The clock has black hour, minute, and second hands. The numbers from 1 to 12 are displayed in black, with each number being bold and easy to read. The clock is mounted on a black bracket that is attached to the wall. The brand name "VISIPLEX" is printed in blue at the bottom of the clock face. The background is plain and neutral, which makes the clock the focal point of the image.']
Inference time: 43.6408166885376

That is 104 tokens in ~43.6 s, or roughly 2.4 tokens/s. Is that expected performance? Am I missing something?
import torch
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,  # local path to Qwen2-VL-7B-Instruct-AWQ
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)
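For reference, a minimal sketch of how the timing can be measured end to end, assuming the model loaded as above, the qwen_vl_utils helper from the official demo, and a hypothetical local image path:

import time
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info  # helper shipped with the official demo

processor = AutoProcessor.from_pretrained(model_path)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/clock.jpg"},  # hypothetical image path
        {"type": "text", "text": "Describe this image."},
    ],
}]

# Build the prompt and vision inputs as in the demo code.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to("cuda")

# Time only the generation step and derive tokens/s from the newly generated tokens.
start = time.time()
generated_ids = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start

new_tokens = generated_ids.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens} tokens in {elapsed:.1f} s -> {new_tokens / elapsed:.2f} tokens/s")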