AWQ performance on RTX3090, with flash_attn2 #8

Open
nmandic78 opened this issue Aug 29, 2024 · 2 comments

@nmandic78 commented Aug 29, 2024

I get ~44 seconds of inference time using the demo code on Qwen2-VL-7B-Instruct-AWQ with a 774x772 image input, using flash_attn2 and bfloat16, with an output of 104 tokens:
['The image shows a round wall clock with a black frame and a white face. The clock has black hour, minute, and second hands. The numbers from 1 to 12 are displayed in black, with each number being bold and easy to read. The clock is mounted on a black bracket that is attached to the wall. The brand name "VISIPLEX" is printed in blue at the bottom of the clock face. The background is plain and neutral, which makes the clock the focal point of the image.']
Inference time: 43.6408166885376
Is that expected performance? Am I missing something?
import torch
from transformers import Qwen2VLForConditionalGeneration

# model_path points to Qwen2-VL-7B-Instruct-AWQ
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)

@kq-chen (Collaborator) commented Aug 31, 2024

2 tokens/s is significantly lower than the ~30 tokens/s we measured in our speed benchmark.

Your environment may not be configured correctly. You can check whether autoawq-kernel is installed correctly. Note that the dependencies for autoawq/autoawq-kernel are quite strict, and it appears that the latest version of autoawq only supports torch 2.3.1.
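
For reference, a minimal environment check along these lines can help confirm the AWQ kernels are actually loadable (a sketch; the PyPI names autoawq / autoawq-kernels and the extension module name awq_ext are assumptions based on the AutoAWQ project layout):

import importlib.metadata as md

# Print installed versions of the relevant packages (names assumed from PyPI)
for pkg in ("torch", "transformers", "autoawq", "autoawq-kernels", "flash-attn"):
    try:
        print(f"{pkg}: {md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")

# AutoAWQ's fast GEMM path lives in the awq_ext extension (assumed module name);
# if this import fails, generation falls back to a much slower implementation.
try:
    import awq_ext
    print("awq_ext loaded OK")
except Exception as ex:
    print("awq_ext failed to load:", ex)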

Maybe you can try setting up your environment with:

pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/huggingface/transformers.git@27903de
pip install autoawq==0.2.6

Here, we use the 27903de commit of transformers because of a recent bug: in short, #31502 causes the error 'torch.nn' has no attribute 'RMSNorm' for torch<2.4. The corresponding fix has not yet been merged, so we install a version of transformers from before #31502.
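
A quick sanity check of the pinned setup might look like this (a sketch; nothing model-specific assumed):

import torch
import transformers

# The pinned transformers commit should predate #31502, which relies on
# torch.nn.RMSNorm (only available in torch>=2.4).
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("torch.nn has RMSNorm:", hasattr(torch.nn, "RMSNorm"))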

It seems there isn't a precompiled flash-attn wheel for cu121. If you prefer not to compile it yourself, you can simply do:

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
)
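
For reference, a throughput check along the lines of the demo code might look like this (a sketch; it assumes the qwen-vl-utils helper package is installed, reuses model and model_path from above, and the image path and prompt are placeholders):

import time
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

processor = AutoProcessor.from_pretrained(model_path)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "file:///path/to/clock.jpg"},  # placeholder image path
    {"type": "text", "text": "Describe this image."},
]}]

# Build the chat prompt and preprocess the image exactly like the demo code
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# Time generation and report tokens/s for comparison with the benchmark
start = time.time()
output_ids = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start
new_tokens = output_ids.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens} new tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.1f} tokens/s")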

@nmandic78 (Author)

Thank you for the response and advice.

I tried as you suggested. I was on torch-2.4.0 and autoawq-0.2.3, and am now on torch-2.3.1 and autoawq-0.2.6.
I tried without flash_attn and the result is the same (46.81 sec for the example above, ~2 t/s).
I see some warnings (warnings.warn(f"AutoAWQ could not load ExLlama kernels extension. Details: {ex}")) and I don't know whether they have any influence.
I tried again with flash_attn and got almost the same speed (43.40 sec, same input).
Maybe I missed something, but I am happy to wait a week until all the interdependencies are resolved.

Anyway, great new VL model; aside from my performance issues, the results are great!
