flashinfer paged attention #2772
base: main
Conversation
I got an error on a T4 GPU with a half-dtype model.
Are you using the kvcache2 branch? Also, try setting the dtype to float16 instead of half.
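For reference, a minimal sketch of pinning the dtype through vLLM's Python API (the model name below is just a placeholder); T4 has no bfloat16 support, so float16 is the safe choice there:

```python
from vllm import LLM, SamplingParams

# T4 (compute capability 7.5) does not support bfloat16,
# so request float16 explicitly.
llm = LLM(model="meta-llama/Llama-2-13b-hf", dtype="float16")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```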
Hi @sumo43, yes I am.
Also, when I test with tp=2, the engine gets stuck; it seems tensor parallelism is not supported.
@sumo43 Please feel free to ping me when the PR is ready for review!
Sounds good. So far I've made the KV cache compatible with FlashInfer and checked that the outputs are coherent. I'm currently debugging a few issues, like the sampler potentially taking longer to run, but I'll make it ready for review soon. Thanks!
So, I tested the core functionality and it works. However, my code doesn't support CUDA graphs, so those tests fail (with eager mode they pass). Also, FlashInfer only ships Python 3.10 and 3.11 wheels, so the Docker tests using Python 3.8 don't pass.
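For anyone reproducing the passing runs, eager mode can be forced through vLLM's Python API; a minimal sketch (model name is a placeholder, not the CI configuration):

```python
from vllm import LLM

# enforce_eager=True skips CUDA graph capture, which is the mode
# the FlashInfer path currently works under.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",
    dtype="float16",
    enforce_eager=True,
)
```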
Regarding CUDA graphs, this PR should help (though it may not be the only thing needed) - flashinfer-ai/flashinfer#111
Hi @sumo43, thanks for submitting the PR! To accelerate the merge, we'd like to directly push some modifications to the PR. For example, we'd like to use FlashInfer's C++ APIs rather than the Python APIs. Would you allow us to directly commit the changes to this PR? Of course, you'll remain as a co-author of the PR.
Hi @WoosukKwon. Absolutely, feel free to make any changes you need. |
Hi @sumo43, I tried running the kvcache2 branch and found an error as follows. Script command: Error log:
Got the same error.
I used 300 requests to test LLaMA-13B with FlashInfer and with the original PagedAttention; the original PagedAttention throughput was about 10% higher than FlashInfer's. Does FlashInfer only help with GQA models?
@pythonononer Yeah, I noticed it too. I'm looking into whether the C++ API is faster or not. Also, @shanshanpt, I'd recommend using the
I think the C++ API performs about the same as the Python API; wrapping the C++ interface with pybind only improves things a little. Maybe we need the author to optimize it. Some restrictions I've found: 1. Python >= 3.9, torch >= 2.1, CUDA > 11.8. 2. Eager mode must be enabled and tp == 1, so LLaMA-70B doesn't work.
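A quick environment check against the restrictions listed above (thresholds taken from that comment, not verified against FlashInfer's release notes):

```python
import sys
import torch

# As stated above: Python >= 3.9, torch >= 2.1, CUDA > 11.8.
assert sys.version_info >= (3, 9), f"Python {sys.version} is too old"

torch_version = tuple(int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert torch_version >= (2, 1), f"torch {torch.__version__} is too old"

assert torch.version.cuda is not None, "CPU-only torch build"
cuda_version = tuple(int(x) for x in torch.version.cuda.split(".")[:2])
assert cuda_version > (11, 8), f"CUDA {torch.version.cuda} is too old"

print("Environment meets the listed FlashInfer requirements.")
```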
Is this PR still active? I also get the same error.
Please close as stale. AFAIK FlashInfer support is now merged, right?
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has merge conflicts that must be resolved before it can be merged.
Description
This PR integrates FlashInfer's GQA PagedAttention implementation, which is up to 2-3x faster than vLLM's version. I use FlashInfer for both prefill and decoding, while still using vLLM's existing cache_ops.
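For readers unfamiliar with FlashInfer, here is a rough sketch (not the code in this PR) of driving its batched decode over a paged KV cache through the Python API. The class and method names come from the early FlashInfer Python wrappers; the exact argument order and cache layout are my assumptions and may differ between FlashInfer versions:

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
max_num_pages, batch_size = 64, 2

# Paged KV cache in "NHD" layout:
# [num_pages, 2 (K/V), page_size, num_kv_heads, head_dim]
kv_cache = torch.randn(max_num_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")

# CSR-style page table: which pages each sequence owns,
# plus how full its last page is.
kv_page_indptr = torch.tensor([0, 3, 5], dtype=torch.int32, device="cuda")
kv_page_indices = torch.tensor([0, 1, 2, 3, 4], dtype=torch.int32, device="cuda")
kv_last_page_len = torch.tensor([7, 16], dtype=torch.int32, device="cuda")

# One query token per sequence (decode step).
q = torch.randn(batch_size, num_qo_heads, head_dim,
                dtype=torch.float16, device="cuda")

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")
wrapper.begin_forward(kv_page_indptr, kv_page_indices, kv_last_page_len,
                      num_qo_heads, num_kv_heads, head_dim, page_size)
out = wrapper.forward(q, kv_cache)  # [batch_size, num_qo_heads, head_dim]
wrapper.end_forward()
```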
https://github.com/flashinfer-ai/flashinfer/
https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
Performance Results
I used the following setup:
Throughput with flashinfer: 2.63 requests/s, 5258.27 tokens/s
Throughput without flashinfer: 1.82 requests/s, 3642.17 tokens/s
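For a quick sanity check, the end-to-end gain on this workload works out to roughly 1.44x; the 2-3x figure above presumably applies to the attention kernel itself rather than whole-request throughput:

```python
# Throughput numbers reported above (requests/s).
with_flashinfer = 2.63
without_flashinfer = 1.82
print(f"End-to-end speedup: {with_flashinfer / without_flashinfer:.2f}x")  # ~1.44x
```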
TODOS