Please make it clear in the install guide it doesn't work for sm_75 GPUs yet #421

Closed
horiacristescu opened this issue Aug 5, 2024 · 10 comments

@horiacristescu commented Aug 5, 2024

I wasted a lot of time trying to install flashinfer, only to find out that it doesn't actually support sm_75.

It would be a good idea to state this up front so people know from the start not to go down that path.

@Amrabdelhamed611 commented Aug 5, 2024

They are working on it. From the flashinfer installation docs:

Supported GPU architectures: sm80, sm86, sm89, sm90 (sm75 / sm70 support is work in progress).

I think they need to add this info here on GitHub as well; I also spent hours trying to get it to work 😅🤡
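For reference, a quick way to check which architecture your GPU reports before installing anything. This is just a generic sketch (not part of flashinfer) and assumes PyTorch with CUDA is already installed:

```python
# Check the CUDA compute capability PyTorch reports, so you know up front
# whether the GPU is sm_75 (Turing, e.g. T4 / RTX 20xx) or sm_80+ (Ampere and newer).
import torch

major, minor = torch.cuda.get_device_capability(0)  # e.g. (7, 5) on a T4
print(f"Detected sm_{major}{minor}")
if (major, minor) < (8, 0):
    print("Pre-Ampere GPU: check the flashinfer docs for current sm_75 support status.")
```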

@yzh119 (Collaborator) commented Aug 5, 2024

Thanks for the suggestion. The codebase used to work with sm_75 (#128), but since then a lot of new features have been introduced and I haven't tested them on sm_75.

Do you have any concrete error messages from compiling flashinfer from source on sm_75? That would help me fix the issues (I don't have an sm_75 dev machine at the moment).

@Amrabdelhamed611 commented Aug 6, 2024

Mainly these 2 errors:

RuntimeError: FlashAttention only supports Ampere GPUs or newer. (as my GPU is not supported yet)

CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1), which I solved by installing flashinfer==0.1.2 through pip:
pip install flashinfer==0.1.2 -i https://flashinfer.ai/whl/cu121/torch2.3
The solution was from vLLM issue vllm-project/vllm#7070.

@yzh119 (Collaborator) commented Aug 7, 2024

@Amrabdelhamed611 the first error was not reported by flashinfer; it is likely from the flash-attn package, which flashinfer does not depend on.

Regarding the second issue, check my reply here vllm-project/vllm#7070 (comment)

@esmeetu (Contributor) commented Aug 8, 2024

Currently FlashInfer only supports the decode function with #128, not the prefill function. A condition check fails when running prefill:

Invalid configuration : num_frags_x=1 num_frags_y=4 num_frags_z=1 num_warps_x=1 num_warps_z=4

After tuning some parameters it ran, but produced wrong results. I also figured that lowering those parameters would hurt performance, and FlashAttention-2, which FlashInfer uses for prefill, doesn't support sm_75, so I lowered my expectations for a performance boost.
I then tried refactoring the vLLM flashinfer backend to use xformers' prefill function and FlashInfer's decode (https://github.com/esmeetu/vllm/tree/sm75-flashinfer). It works well and gives correct results, but doesn't perform better than xformers (almost the same).

@yzh119 (Collaborator) commented Aug 8, 2024

@esmeetu thanks for confirming. I'll take a look at the correctness issue of the prefill kernels on sm75, but I don't expect much from their performance either, given sm75's small shared memory and its lack of the async-copy feature introduced in Ampere.

@yzh119 (Collaborator) commented Aug 8, 2024

In general I think having the decode kernels and sampling kernels available for sm75 is great. I'm considering releasing an sm75 wheel for these kernels in v0.1.5.

Update: flashinfer v0.1.6 officially supports sm75, not only the decode/sampling kernels but also the prefill kernels.
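A quick sanity check (not from the thread): confirm the installed version is at least 0.1.6. This assumes the package is installed under the distribution name flashinfer, as in the pip command earlier in this thread:

```python
# Sketch: verify the installed flashinfer version is >= 0.1.6, the first
# release reported in this thread to officially support sm75.
from importlib.metadata import version

installed = version("flashinfer")
print("flashinfer", installed)
# Strip any local version suffix (e.g. "+cu121torch2.3") before comparing.
parts = tuple(int(x) for x in installed.split("+")[0].split(".")[:3])
if parts < (0, 1, 6):
    print("Older than 0.1.6: official sm75 support may be missing.")
```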

@zhyncs (Member) commented Aug 27, 2024

fixed with #449

@zhyncs closed this as completed Aug 27, 2024
@qism commented Sep 5, 2024

> @Amrabdelhamed611 the first error was not reported by flashinfer; it is likely from the flash-attn package, which flashinfer does not depend on.
>
> Regarding the second issue, check my reply here vllm-project/vllm#7070 (comment)

@yzh119 I am using vLLM 0.5.5 and FlashInfer 0.1.6 on a T4 and met the same error: RuntimeError: FlashAttention only supports Ampere GPUs or newer.


@yzh119 (Collaborator) commented Sep 5, 2024

@qism, that error is reported by the flash-attn package, which flashinfer does not rely on. If you see it, it suggests you are not using the flashinfer backend.
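One way to make sure the flashinfer path is actually selected is vLLM's attention-backend environment variable. This is a sketch, not from this thread: it assumes vLLM's VLLM_ATTENTION_BACKEND setting, and facebook/opt-125m is only a placeholder model:

```python
# Sketch: force vLLM to use the FlashInfer attention backend, so that any
# "FlashAttention only supports Ampere GPUs or newer" error clearly comes
# from a different backend than flashinfer. Set the variable before importing vLLM.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams  # import after setting the env var

llm = LLM(model="facebook/opt-125m")  # small placeholder model, just to exercise the backend
out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```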
