Conversation

@LucasWilkinson (Collaborator) commented Feb 26, 2025

This PR is co-authored with Lucas Wilkinson.

```
VLLM_USE_V1="1" lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True --task gsm8k --num_fewshot=5 --limit 100
...
vllm (pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True), gen_kwargs: (None), limit: 100.0, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.66|±  |0.0476|
|     |       |strict-match    |     5|exact_match|↑  | 0.66|±  |0.0476|
```

Signed-off-by: Yang Chen <yangche@fb.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

@LiuXiaoxuanPKU (Collaborator) commented

QQ: Just curious, what's the main reason that V1 is slower than V0 (say both use FlashMLA, and we look at ITL)? Is it because of chunked prefill?

@LucasWilkinson (Collaborator, Author) commented Feb 27, 2025

> QQ: Just curious, what's the main reason that V1 is slower than V0 (say both use FlashMLA, and we look at ITL)? Is it because of chunked prefill?

It's most likely because CUDA graphs are not used for attention in V1; we need to keep optimizing away a lot of the small operations in MLA (that matters at low QPS; for the throughput case it may be chunked prefill).
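
As a rough way to see the CUDA-graph effect locally, here is a minimal sketch assuming the standard `vllm.LLM` offline API; the model name, prompt, and script name are illustrative placeholders, and `benchmark_serving.py` remains the right tool for proper ITL measurements. Run it once per configuration from a fresh process, e.g. `VLLM_USE_V1=1 python bench_decode.py --eager` vs. `VLLM_USE_V1=1 python bench_decode.py`:

```python
# bench_decode.py -- rough single-configuration decode timing (illustrative only;
# use benchmark_serving.py for real ITL measurements).
import argparse
import time

from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="deepseek-ai/DeepSeek-V2-Lite-Chat")
parser.add_argument("--eager", action="store_true",
                    help="enforce_eager=True, i.e. no CUDA graph capture")
args = parser.parse_args()

# Select V0/V1 and the attention backend via env vars before launching, e.g.
#   VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHMLA python bench_decode.py
llm = LLM(model=args.model, trust_remote_code=True,
          enforce_eager=args.eager, max_model_len=4096)
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
out = llm.generate(["Explain KV caches in one paragraph."], params)
elapsed = time.perf_counter() - start
n_tokens = len(out[0].outputs[0].token_ids)
print(f"eager={args.eager}: {n_tokens} tokens in {elapsed:.2f}s "
      f"({n_tokens / elapsed:.1f} tok/s)")
```

Comparing the eager and non-eager runs on V0 against V1 gives a crude read on how much of the gap comes from losing CUDA-graph capture around attention.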

@mgoin added the `ready` label Feb 27, 2025

@mgoin (Member) left a comment

This looks clean to me, nice work! Have you run an accuracy smoke test?

@LucasWilkinson (Collaborator, Author) commented

> This looks clean to me, nice work! Have you run an accuracy smoke test?

VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHMLA lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True --task gsm8k --num_fewshot=5 --limit 100

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.66|±  |0.0476|
|     |       |strict-match    |     5|exact_match|↑  | 0.66|±  |0.0476|

@mgoin enabled auto-merge (squash) February 27, 2025 22:48
@mgoin moved this from Backlog to Done in DeepSeek V3/R1 Feb 27, 2025
@mgoin merged commit 2e94b9c into vllm-project:main Feb 27, 2025
47 of 48 checks passed
Akshat-Tripathi pushed a commit to krai/vllm that referenced this pull request Mar 3, 2025
Signed-off-by: Yang Chen <yangche@fb.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Yang Chen <yangche@fb.com>
@samuellees commented Mar 10, 2025

> use via: VLLM_ATTENTION_BACKEND=FLASHMLA VLLM_USE_V1=1
>
> Results:
>
> https://docs.google.com/spreadsheets/d/1toxQVaA7UPhmY57kv08Wdq0xtIcU8oBbqb9RE3Wy2_E/edit?usp=sharing

@LucasWilkinson Hello! Which GPU hardware did these tests run on? It seems that out-of-memory (OOM) errors occur on most NVIDIA GPUs when performing the "Long context" test.

@zuozi2810 commented Mar 10, 2025

Hi @LucasWilkinson, I got errors while running DeepSeek-R1 AWQ with FlashMLA and the V1 engine on 8×H100:
(screenshot of the error attached: WX20250310-185401@2x)

command:
VLLM_USE_V1=1 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_ATTENTION_BACKEND=FLASHMLA python benchmark_throughput.py --model /workspace/model --trust-remote-code --input-len 2000 --output-len 1000 --num-prompts 100 -tp 8 --gpu_memory_utilization 0.97 --max-model-len 5120

@LucasWilkinson (Collaborator, Author) commented

> Hi @LucasWilkinson, I got errors while running DeepSeek-R1 AWQ with FlashMLA and the V1 engine on 8×H100: (screenshot of the error attached: WX20250310-185401@2x)
>
> command: VLLM_USE_V1=1 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_ATTENTION_BACKEND=FLASHMLA python benchmark_throughput.py --model /workspace/model --trust-remote-code --input-len 2000 --output-len 1000 --num-prompts 100 -tp 8 --gpu_memory_utilization 0.97 --max-model-len 5120

@zuozi2810 can you please provide the full log and a Hugging Face link to the failing model? That would be really helpful for debugging.

@LucasWilkinson (Collaborator, Author) commented

> use via: VLLM_ATTENTION_BACKEND=FLASHMLA VLLM_USE_V1=1
> Results:
> https://docs.google.com/spreadsheets/d/1toxQVaA7UPhmY57kv08Wdq0xtIcU8oBbqb9RE3Wy2_E/edit?usp=sharing
>
> @LucasWilkinson Hello! Which GPU hardware did these tests run on? It seems that out-of-memory (OOM) errors occur on most NVIDIA GPUs when performing the "Long context" test.

8xH200, what setup are you seeing OOM on?

@samuellees commented

> > use via: VLLM_ATTENTION_BACKEND=FLASHMLA VLLM_USE_V1=1
> > Results:
> > https://docs.google.com/spreadsheets/d/1toxQVaA7UPhmY57kv08Wdq0xtIcU8oBbqb9RE3Wy2_E/edit?usp=sharing
> >
> > @LucasWilkinson Hello! Which GPU hardware did these tests run on? It seems that out-of-memory (OOM) errors occur on most NVIDIA GPUs when performing the "Long context" test.
>
> 8xH200, what setup are you seeing OOM on?

8xH200 has enough GPU memory. My setup is 8xH20. Thank you for your reply.
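
For context, a rough back-of-envelope, assuming the long-context runs used DeepSeek-R1 (671B parameters stored as FP8, 61 layers, MLA cache of 512 latent + 64 RoPE dims per token) and the public HBM figures of 96 GB per H20 and 141 GB per H200, illustrates why 8xH200 has far more headroom than 8xH20:

```python
# Back-of-envelope only; every figure below is an assumption taken from public
# specs and the DeepSeek-R1 config, not a measurement from this PR.
GiB = 1024 ** 3

weights_gib = 671e9 / GiB                      # ~625 GiB of FP8 weights (671B params)
node_hbm_gib = {"8xH200": 8 * 141, "8xH20": 8 * 96}

# MLA caches a 512-dim latent plus a 64-dim RoPE key per token per layer
# (61 layers), so the KV cache costs ~70 KB per token at bf16.
kv_bytes_per_token = (512 + 64) * 2 * 61
kv_gib_per_128k_seq = 128 * 1024 * kv_bytes_per_token / GiB   # ~8.6 GiB

for node, hbm in node_hbm_gib.items():
    headroom = hbm - weights_gib
    print(f"{node}: ~{headroom:.0f} GiB left for KV cache and activations "
          f"(~{headroom / kv_gib_per_128k_seq:.0f} concurrent 128K-token sequences)")
```

With activations, CUDA-graph buffers, and fragmentation on top, the 8xH20 margin shrinks quickly for long-context batches, which would be consistent with the OOM behaviour reported above.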

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: Yang Chen <yangche@fb.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Yang Chen <yangche@fb.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
Signed-off-by: Yang Chen <yangche@fb.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Yang Chen <yangche@fb.com>

Labels: ci/build, ready, v1

Project status: Done
