[Kernel] Add FP8 support with FlashMLA backend #22668
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request adds FP8 support for the FlashMLA attention backend. The changes include updating the FlashMLA sources in CMake, modifying the FlashMLA Python ops to handle FP8 scaling factors, and adding FP8 data types to the FlashMLA tests. The logic for enabling FP8 in the attention backend seems correct and is consistently applied across the codebase. I have one suggestion to improve code clarity in the test suite.
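For readers unfamiliar with how FP8 scaling factors interact with a KV cache, here is a minimal, self-contained PyTorch sketch of the general idea (per-tensor e4m3 quantization with a scale/descale factor). It is purely illustrative: it is not the FlashMLA kernel or vLLM's actual implementation, and the helper names are made up.

```python
# Illustrative sketch of FP8 KV-cache scaling (NOT the FlashMLA kernel or vLLM code).
# Assumes PyTorch >= 2.1 for torch.float8_e4m3fn; names and shapes are hypothetical.
import torch

def quantize_kv_fp8(kv: torch.Tensor):
    """Quantize a KV-cache tensor to FP8 (e4m3) with a per-tensor scale."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = kv.abs().amax().float().clamp(min=1e-12) / fp8_max
    kv_fp8 = (kv.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return kv_fp8, scale  # the kernel consumes the scale (or its reciprocal, the "descale")

def dequantize_kv_fp8(kv_fp8: torch.Tensor, scale: torch.Tensor, dtype=torch.bfloat16):
    """Recover an approximate high-precision KV tensor for the attention compute."""
    return (kv_fp8.float() * scale).to(dtype)

# Toy shapes: [num_blocks, block_size, head_dim]
kv = torch.randn(4, 64, 576, dtype=torch.bfloat16)
kv_fp8, scale = quantize_kv_fp8(kv)
kv_approx = dequantize_kv_fp8(kv_fp8, scale)
print("max abs quantization error:", (kv - kv_approx).abs().max().item())
```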
@alexm-redhat Thanks for your review!
LGTM thanks!
…supported (#96) Fixes vllm-project/vllm#22668 - we need to take one more arg. Signed-off-by: Marcin Swiniarski <mswiniarski@habana.ai>
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Purpose
Enable FP8 KV cache with MLA
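As a minimal usage sketch of what this enables (model name and flags are taken from the test plan below; treat the exact configuration as illustrative rather than authoritative):

```python
# Minimal sketch: run an MLA model with an FP8 KV cache on the FlashMLA backend.
# Model name and flags mirror the test plan; defaults here are illustrative assumptions.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHMLA"  # select the FlashMLA attention backend

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",
    trust_remote_code=True,
    kv_cache_dtype="fp8",  # the capability this PR enables for FlashMLA
)
outputs = llm.generate(["Briefly explain what an FP8 KV cache is."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```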
Test Plan
Correctness
Accuracy
With kv_cache_dtype = "auto":

```
VLLM_ATTENTION_BACKEND=FLASHMLA lm_eval --model vllm --model_args '{"pretrained": "deepseek-ai/DeepSeek-V2-Lite-Chat", "trust_remote_code": true, "kv_cache_dtype": "auto"}' --tasks gsm8k --batch_size auto
```

With kv_cache_dtype = "fp8":

```
VLLM_ATTENTION_BACKEND=FLASHMLA lm_eval --model vllm --model_args '{"pretrained": "deepseek-ai/DeepSeek-V2-Lite-Chat", "trust_remote_code": true, "kv_cache_dtype": "fp8"}' --tasks gsm8k --batch_size auto
```

Performance

With kv_cache_dtype = "auto":

```
VLLM_ATTENTION_BACKEND=FLASHMLA chg run --gpus 1 -- vllm bench throughput --model=deepseek-ai/DeepSeek-V2-Lite-Chat --dataset-name=random --input-len=512 --output-len=512 --num-prompts=10000 --kv-cache-dtype=auto
```

With kv_cache_dtype = "fp8":

```
VLLM_ATTENTION_BACKEND=FLASHMLA chg run --gpus 1 -- vllm bench throughput --model=deepseek-ai/DeepSeek-V2-Lite-Chat --dataset-name=random --input-len=512 --output-len=512 --num-prompts=10000 --kv-cache-dtype=fp8
```

Test Result
Correctness
Tests pass
Accuracy
With kv_cache_dtype = "auto":

With kv_cache_dtype = "fp8":

Performance
On 1x H100:
Here are the results for 512/512:

- `--kv-cache-dtype=auto`: Throughput: 26.37 requests/s, 26975.81 total tokens/s, 13499.72 output tokens/s
- `--kv-cache-dtype=fp8`: Throughput: 27.99 requests/s, 28635.39 total tokens/s, 14330.23 output tokens/s

Here are the results for 8192/1024:

- `--kv-cache-dtype=auto`: Throughput: 2.40 requests/s, 22143.47 total tokens/s, 2460.56 output tokens/s
- `--kv-cache-dtype=fp8`: Throughput: 3.25 requests/s, 29971.81 total tokens/s, 3330.44 output tokens/s
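Relative to the auto baseline, that is roughly a 6% throughput gain at 512/512 (27.99 / 26.37 ≈ 1.06) and roughly a 35% gain at 8192/1024 (3.25 / 2.40 ≈ 1.35).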
(Optional) Documentation Update