
Commit 20189d9

KuntaiDu authored and mawong-amd committed
[Perf] Use small max_num_batched_tokens for A100 (vllm-project#17885)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
1 parent 17072c8 commit 20189d9

File tree

1 file changed: +5 -1 lines changed


vllm/engine/arg_utils.py

Lines changed: 5 additions & 1 deletion
@@ -1438,11 +1438,15 @@ def _set_default_args_v1(self, usage_context: UsageContext) -> None:
         from vllm.platforms import current_platform
         try:
             device_memory = current_platform.get_device_total_memory()
+            device_name = current_platform.get_device_name().lower()
         except Exception:
             # This is only used to set default_max_num_batched_tokens
             device_memory = 0
 
-        if device_memory >= 70 * GiB_bytes:
+        # NOTE(Kuntai): Setting large `max_num_batched_tokens` for A100 reduces
+        # throughput, see PR #17885 for more details.
+        # So here we do an extra device name check to prevent such regression.
+        if device_memory >= 70 * GiB_bytes and "a100" not in device_name:
             # For GPUs like H100 and MI300x, use larger default values.
             default_max_num_batched_tokens = {
                 UsageContext.LLM_CLASS: 16384,
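
For reference, a minimal self-contained sketch of the default-selection behavior this commit produces. The function name and the smaller fallback value below are illustrative assumptions, not vLLM API; only the 70 GiB threshold, the "a100" name check, and the 16384 value for UsageContext.LLM_CLASS come from the diff.

    # Sketch of the post-commit default selection (assumed helper name and fallback).
    GiB_bytes = 1 << 30  # 1 GiB

    def pick_default_max_num_batched_tokens(device_memory: int, device_name: str) -> int:
        # Large-memory GPUs such as H100 and MI300X get a larger default, but A100
        # is excluded because the larger default reduced throughput (PR #17885).
        if device_memory >= 70 * GiB_bytes and "a100" not in device_name.lower():
            return 16384
        return 8192  # assumed smaller default for other devices

    print(pick_default_max_num_batched_tokens(80 * GiB_bytes, "NVIDIA A100-SXM4-80GB"))  # 8192
    print(pick_default_max_num_batched_tokens(80 * GiB_bytes, "NVIDIA H100 80GB HBM3"))  # 16384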

0 commit comments