[Attention] Flash MLA for V1 #13867

Conversation
This PR is co-authored with Lucas Wilkinson.

```
VLLM_USE_V1="1" lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True --task gsm8k --num_fewshot=5 --limit 100
...
vllm (pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True), gen_kwargs: (None), limit: 100.0, num_fewshot: 5, batch_size: 1

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.66|±  |0.0476|
|     |       |strict-match    |     5|exact_match|↑  | 0.66|±  |0.0476|
```

Signed-off-by: Yang Chen <yangche@fb.com>
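For anyone reproducing this, the same configuration can also be exercised through vLLM's Python API. The sketch below mirrors the lm_eval settings above; the prompt is illustrative and not part of the original PR.

```python
# A minimal V1 smoke test (a sketch mirroring the lm_eval settings above).
import os

os.environ["VLLM_USE_V1"] = "1"  # select the V1 engine before importing vllm

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",
    tensor_parallel_size=2,
    trust_remote_code=True,
    max_model_len=16384,
    gpu_memory_utilization=0.9,
    enforce_eager=True,
)
params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(["What is 12 * 7? Answer with a number."], params)
print(out[0].outputs[0].text)
```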
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
QQ: just out of curiosity, what's the main reason that V1 is slower than V0 (say both use Flash MLA, and we look at ITL)? Is it because of chunked prefill?
It's most likely because CUDA graphs are not used for attention in V1; we need to keep optimizing away a lot of the small operations in MLA (that's for low QPS; for the throughput case it may be chunked prefill).
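For readers following along: the eager-vs-CUDA-graph behavior mentioned here is controlled by vLLM's `enforce_eager` flag. A rough, illustrative way to A/B it (run each configuration in its own process; the timings are only indicative, not a proper benchmark):

```python
# Illustrative sketch: time a generation with eager vs. CUDA-graph execution.
# Run once per configuration (separate processes) so two engines never share
# GPU memory; pass --eager to disable CUDA graph capture.
import sys
import time

from vllm import LLM, SamplingParams

enforce_eager = "--eager" in sys.argv
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",  # example model from this PR
    trust_remote_code=True,
    enforce_eager=enforce_eager,
)
params = SamplingParams(max_tokens=128)
start = time.perf_counter()
llm.generate(["Hello, world"], params)
print(f"enforce_eager={enforce_eager}: {time.perf_counter() - start:.2f}s")
```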
This looks clean to me, nice work! Have you run an accuracy smoke test?
Signed-off-by: Yang Chen <yangche@fb.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Yang Chen <yangche@fb.com>
@LucasWilkinson Hello! Which GPU hardware were these tests run on? Out-of-memory (OOM) errors seem to occur on most NVIDIA GPUs when running the "Long context" test.
Hi @LucasWilkinson, I got errors while running DeepSeek R1 AWQ with FlashMLA and the V1 engine on 8 × H100. Command:
@zuozi2810 Can you please provide the full log and a Hugging Face link to the failing model? That would be really helpful for debugging.
8xH200. What setup are you seeing OOM on?
8xH200 has enough GPU memory; my setup is 8xH20. Thank you for your reply.
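Not from this PR, but the usual vLLM knobs for avoiding OOM on smaller-memory GPUs like the H20 are the context length, the KV-cache memory fraction, and eager mode. A sketch with illustrative values, not a verified working configuration:

```python
# Sketch of memory-related settings for smaller-memory GPUs (e.g. 8xH20).
# Values are illustrative starting points, not a verified configuration.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",  # substitute the failing model
    tensor_parallel_size=8,
    max_model_len=8192,           # shorter context -> smaller KV cache
    gpu_memory_utilization=0.85,  # leave headroom for activations
    enforce_eager=True,           # avoid CUDA-graph memory overhead
    trust_remote_code=True,
)
```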


Use via:

```
VLLM_ATTENTION_BACKEND=FLASHMLA VLLM_USE_V1=1
```

Results:
https://docs.google.com/spreadsheets/d/1toxQVaA7UPhmY57kv08Wdq0xtIcU8oBbqb9RE3Wy2_E/edit?usp=sharing
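For completeness, the same selection can be made from Python, as long as the environment variables are set before vllm is imported (a sketch; the shell form above is equivalent):

```python
# Equivalent to the shell env vars above: pick the FlashMLA backend and the
# V1 engine before vllm is imported. vLLM logs the chosen attention backend
# at engine startup, which is a quick way to confirm the selection took.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHMLA"
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite-Chat", trust_remote_code=True)
print(llm.generate(["Hello"])[0].outputs[0].text)
```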