[Attention] Flash MLA for V1 #13867

Conversation
This PR is co-authored with Lucas Wilkinson.

```
VLLM_USE_V1="1" lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True --task gsm8k --num_fewshot=5 --limit 100
...
vllm (pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True), gen_kwargs: (None), limit: 100.0, num_fewshot: 5, batch_size: 1

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.66|±  |0.0476|
|     |       |strict-match    |     5|exact_match|↑  | 0.66|±  |0.0476|
```

Signed-off-by: Yang Chen <yangche@fb.com>
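For anyone reproducing this, the same configuration can also be exercised through vLLM's Python API. The sketch below mirrors the lm_eval settings above; the prompt is illustrative and not part of the original PR.

```python
# A minimal V1 smoke test (a sketch mirroring the lm_eval settings above).
import os

os.environ["VLLM_USE_V1"] = "1"  # select the V1 engine before importing vllm

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",
    tensor_parallel_size=2,
    trust_remote_code=True,
    max_model_len=16384,
    gpu_memory_utilization=0.9,
    enforce_eager=True,
)
params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(["What is 12 * 7? Answer with a number."], params)
print(out[0].outputs[0].text)
```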
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
QQ: just out of curiosity, what's the main reason that V1 is slower than V0 (say both use Flash MLA, and we look at ITL)? Is it because of chunked prefill?
It's most likely because CUDA graphs are not used for attention in V1; we need to keep optimizing away a lot of the small operations in MLA (that's for low QPS; for the throughput case it may be chunked prefill).
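For readers following along: the eager-vs-CUDA-graph behavior mentioned here is controlled by vLLM's `enforce_eager` flag. A rough, illustrative way to A/B it (run each configuration in its own process; the timings are only indicative, not a proper benchmark):

```python
# Illustrative sketch: time a generation with eager vs. CUDA-graph execution.
# Run once per configuration (separate processes) so two engines never share
# GPU memory; pass --eager to disable CUDA graph capture.
import sys
import time

from vllm import LLM, SamplingParams

enforce_eager = "--eager" in sys.argv
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",  # example model from this PR
    trust_remote_code=True,
    enforce_eager=enforce_eager,
)
params = SamplingParams(max_tokens=128)
start = time.perf_counter()
llm.generate(["Hello, world"], params)
print(f"enforce_eager={enforce_eager}: {time.perf_counter() - start:.2f}s")
```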
This looks clean to me, nice work! Have you run an accuracy smoke test?
Signed-off-by: Yang Chen <yangche@fb.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Yang Chen <yangche@fb.com>
@LucasWilkinson Hello! Which GPU hardware were these tests run on? Out-of-memory (OOM) errors seem to occur on most NVIDIA GPUs when running the "Long context" test.
Hi @LucasWilkinson, I got errors while running DeepSeek R1 AWQ with FlashMLA and the V1 engine on 8 × H100. Command:
@zuozi2810 Can you please provide the full log and a Hugging Face link to the failing model? That would be really helpful for debugging.
8xH200. What setup are you seeing OOM on?
8xH200 has enough GPU memory; my setup is 8xH20. Thank you for your reply.
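Not from this PR, but the usual vLLM knobs for avoiding OOM on smaller-memory GPUs like the H20 are the context length, the KV-cache memory fraction, and eager mode. A sketch with illustrative values, not a verified working configuration:

```python
# Sketch of memory-related settings for smaller-memory GPUs (e.g. 8xH20).
# Values are illustrative starting points, not a verified configuration.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",  # substitute the failing model
    tensor_parallel_size=8,
    max_model_len=8192,           # shorter context -> smaller KV cache
    gpu_memory_utilization=0.85,  # leave headroom for activations
    enforce_eager=True,           # avoid CUDA-graph memory overhead
    trust_remote_code=True,
)
```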


Use via:

```
VLLM_ATTENTION_BACKEND=FLASHMLA VLLM_USE_V1=1
```

Results:
https://docs.google.com/spreadsheets/d/1toxQVaA7UPhmY57kv08Wdq0xtIcU8oBbqb9RE3Wy2_E/edit?usp=sharing
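For completeness, the same selection can be made from Python, as long as the environment variables are set before vllm is imported (a sketch; the shell form above is equivalent):

```python
# Equivalent to the shell env vars above: pick the FlashMLA backend and the
# V1 engine before vllm is imported. vLLM logs the chosen attention backend
# at engine startup, which is a quick way to confirm the selection took.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHMLA"
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite-Chat", trust_remote_code=True)
print(llm.generate(["Hello"])[0].outputs[0].text)
```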