
Conversation

@KuntaiDu (Collaborator) commented May 9, 2025

This PR fixes a performance regression on A100. Related PR: #17073

| Configuration | Token throughput (before this PR) | Token throughput (after this PR) |
|---|---|---|
| Llama 70B, TP=2 | 1054.01 | 1438.49 (+36%) |
| Llama 70B, TP=4 | 3053.03 | 2994.03 (-2%) |

vLLM launching command:
TP=2:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --port 8000 \
    -tp 2 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.95 \
    --swap-space 0

TP=4:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --port 8000 \
    -tp 4 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.95 \
    --swap-space 0

Benchmarking command:

python benchmark_serving.py \
    --port 8000 \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 400 \
    --request-rate inf \
    --model meta-llama/Llama-3.1-70B-Instruct

github-actions bot commented May 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@hmellor (Member) commented May 9, 2025

It might make sense to move all these heuristics into the respective platform file?

As it stands, we have side effects where non-GPU systems (except TPU) have their defaults set based on memory heuristics meant for GPUs, i.e. if you have a CPU platform with >70GB RAM you'll get the large-GPU defaults for max_num_batched_tokens...

edit: I've just noticed this is V1 only (no CPU support yet), but this will probably cause weird behaviour once we do support CPU in V1
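
For illustration, here is a minimal sketch of the kind of platform-scoped, memory-based default being discussed; the 70 GiB threshold, the helper name, and the token budgets are placeholders rather than vLLM's actual values:

```python
import torch

# Placeholder threshold and budgets, not vLLM's real numbers.
_LARGE_GPU_MEM_GIB = 70

def default_max_num_batched_tokens() -> int:
    """Pick a default token budget from GPU memory, only on CUDA platforms."""
    if not torch.cuda.is_available():
        # Non-GPU platforms should fall back to their own platform default
        # instead of inheriting a GPU memory heuristic.
        return 2048
    total_gib = torch.cuda.get_device_properties(0).total_memory / (1024 ** 3)
    return 16384 if total_gib >= _LARGE_GPU_MEM_GIB else 8192
```

Keeping a helper like this inside each platform file, rather than in the generic config path, would avoid a CPU host with >70GB RAM picking up GPU defaults.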

@DarkLight1337 (Member)

Can you merge from main to fix pre-commit?

@heheda12345 (Collaborator)

Would cudaDeviceProp.multiProcessorCount be a better metric? We can get this metric from numba.
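
For reference, a minimal sketch of reading that property via numba (assuming numba is installed with CUDA support):

```python
from numba import cuda

# Reads cudaDeviceProp.multiProcessorCount for the current CUDA device.
device = cuda.get_current_device()
print("SM count:", device.MULTIPROCESSOR_COUNT)
```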

@KuntaiDu (Collaborator, Author) commented May 9, 2025

Let me fix the pre-commit.

@KuntaiDu KuntaiDu enabled auto-merge (squash) May 9, 2025 18:52
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) May 9, 2025
@KuntaiDu (Collaborator, Author)

The CI seems to have stalled; rebuilding the CI to try to make it green.

@KuntaiDu KuntaiDu merged commit 9112155 into vllm-project:main May 11, 2025
72 checks passed
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request May 14, 2025
@KuntaiDu KuntaiDu deleted the kuntai-fix-a100-perf branch May 14, 2025 17:47
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
@MingzhenHan (Contributor)

@KuntaiDu Hello Kuntai, why do we need to use a smaller max_num_batched_tokens on A100? What is the reason for this? Looking forward to your guidance 🙏
