
Conversation

@KuntaiDu (Collaborator) commented May 9, 2025

This PR fixes a performance regression on A100. Related PR: #17073

| Configuration | Token throughput (before this PR) | Token throughput (after this PR) |
|---|---|---|
| Llama 70B, TP=2 | 1054.01 | 1438.49 (+36%) |
| Llama 70B, TP=4 | 3053.03 | 2994.03 (-2%) |

vLLM launching command:
TP=2:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --port 8000 \
    -tp 2 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.95 \
    --swap-space 0

TP=4:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --port 8000 \
    -tp 4 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.95 \
    --swap-space 0

Benchmarking command:

python benchmark_serving.py \
    --port 8000 \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 400 \
    --request-rate inf \
    --model meta-llama/Llama-3.1-70B-Instruct

github-actions bot commented May 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@hmellor (Member) commented May 9, 2025

It might make sense to move all these heuristics into the respective platform file?

As it stands, we have side effects where non-GPU systems (except TPU) have their defaults set based on memory heuristics meant for GPUs, i.e. if you have a CPU platform with >70GB RAM you'll get the large-GPU defaults for max_num_batched_tokens...

edit: I've just noticed this is V1 only (no CPU support yet), but this will probably cause weird behaviour once we do support CPU in V1
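
For illustration, here is a minimal sketch of the kind of platform-scoped, memory-based default being discussed; the 70 GiB threshold, the helper name, and the token budgets are placeholders rather than vLLM's actual values:

```python
import torch

# Placeholder threshold and budgets, not vLLM's real numbers.
_LARGE_GPU_MEM_GIB = 70

def default_max_num_batched_tokens() -> int:
    """Pick a default token budget from GPU memory, only on CUDA platforms."""
    if not torch.cuda.is_available():
        # Non-GPU platforms should fall back to their own platform default
        # instead of inheriting a GPU memory heuristic.
        return 2048
    total_gib = torch.cuda.get_device_properties(0).total_memory / (1024 ** 3)
    return 16384 if total_gib >= _LARGE_GPU_MEM_GIB else 8192
```

Keeping a helper like this inside each platform file, rather than in the generic config path, would avoid a CPU host with >70GB RAM picking up GPU defaults.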

@DarkLight1337 (Member)

Can you merge from main to fix pre-commit?

@heheda12345 (Collaborator)

Would cudaDeviceProp.multiProcessorCount be a better metric? We can get this metric from numba.
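
For reference, a minimal sketch of reading that property via numba (assuming numba is installed with CUDA support):

```python
from numba import cuda

# Reads cudaDeviceProp.multiProcessorCount for the current CUDA device.
device = cuda.get_current_device()
print("SM count:", device.MULTIPROCESSOR_COUNT)
```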

@KuntaiDu (Collaborator, Author) commented May 9, 2025

Let me fix the pre-commit.

@KuntaiDu KuntaiDu enabled auto-merge (squash) May 9, 2025 18:52
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) May 9, 2025
@KuntaiDu (Collaborator, Author)

The CI seems to have stalled; rebuilding the CI to try to make it green.

@KuntaiDu KuntaiDu merged commit 9112155 into vllm-project:main May 11, 2025
72 checks passed
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request May 14, 2025
@KuntaiDu KuntaiDu deleted the kuntai-fix-a100-perf branch May 14, 2025 17:47
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
@MingzhenHan (Contributor)

@KuntaiDu Hello Kuntai, why do we need to use a smaller max_num_batched_tokens on A100? What is the reason for this? Looking forward to your guidance 🙏
