
Conversation

tomasruizt (Contributor) commented Sep 10, 2025

Purpose

The goal is to make it easy to profile a throughput workload. Concretely, the command vllm bench throughput --profile ... will generate a profiling file, which is useful for debugging performance gaps with the https://ui.perfetto.dev/ UI.

Test Plan

The vllm bench throughput command can take 4 different code paths. Here is how I tested them.
First, I set environment variables to enable profiling and, optionally, CUDA launch blocking so that long GPU ops show their full runtime.

export VLLM_USE_V1=1
export VLLM_TORCH_PROFILER_DIR=./profiles/
export CUDA_LAUNCH_BLOCKING=1
1. The run_vllm() path
vllm bench throughput --model=Qwen/Qwen3-1.7B --dataset-name=hf --dataset-path=likaixin/InstructCoder --max-num-seqs=100 --num-prompts=10 --input-len=1000 --output-len=10 --max-model-len=2048 --gpu-memory-utilization=0.6 --profile --enforce-eager
2. The run_vllm_async() path
vllm bench throughput --model=Qwen/Qwen3-1.7B --dataset-name=hf --dataset-path=likaixin/InstructCoder --max-num-seqs=100 --num-prompts=10 --input-len=1000 --output-len=10 --max-model-len=2048 --gpu-memory-utilization=0.6 --profile --enforce-eager --async-engine
3. The backend=vllm-chat path for multimodal models
vllm bench throughput --model=Qwen/Qwen2.5-VL-3B-Instruct --dataset-name=hf --dataset-path=lmarena-ai/VisionArena-Chat --max-num-seqs=100 --num-prompts=10 --input-len=1000 --output-len=10 --max-model-len=2048 --gpu-memory-utilization=0.6 --profile --enforce-eager --backend=vllm-chat
4. The backend=hf path
vllm bench throughput --model=Qwen/Qwen3-1.7B --dataset-name=sharegpt --max-num-seqs=100 --num-prompts=10 --input-len=1000 --output-len=10 --profile --enforce-eager --backend=hf --hf-max-batch-size=10

Test Result

All paths except backend=hf generate a profile file, like this one: file.pt.trace.json.gz. As mentioned, they can be opened with https://ui.perfetto.dev/.

The reason is that the AutoModelForCausalLM class used by backend=hf does not implement .start_profile() and .stop_profile(). Therefore, I raise a NotImplementedError so the user knows to remove the --profile flag.
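
Roughly, the guard on the hf path looks like the sketch below (illustrative names and a simplified signature, not the exact diff):

def run_hf(requests: list, model_name: str, profile: bool = False) -> float:
    if profile:
        # transformers.AutoModelForCausalLM exposes no start_profile() /
        # stop_profile(), so fail loudly instead of silently ignoring --profile.
        raise NotImplementedError(
            "Profiling is not supported with backend=hf. "
            "Please remove the --profile flag.")
    elapsed = 0.0
    # ... run generation with AutoModelForCausalLM as before, measuring elapsed ...
    return elapsed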

Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com>
mergify bot added the performance (Performance-related issues) label on Sep 10, 2025
gemini-code-assist bot left a comment

Code Review

This pull request adds support for profiling throughput benchmarks using the --profile flag. The implementation correctly wraps the model execution calls with start_profile() and stop_profile(). My main feedback is to ensure the profiler is always stopped, even if an error occurs during model execution. This can be achieved by using try...finally blocks, which will make the profiling logic more robust. I've added specific suggestions for this improvement.
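
Concretely, the suggested wrapping would look roughly like this (a minimal sketch with illustrative names; llm, prompts, and sampling_params stand in for the benchmark's actual objects):

from vllm import LLM, SamplingParams

def generate_with_profile(llm: LLM, prompts: list[str],
                          sampling_params: SamplingParams, profile: bool):
    # Start the torch profiler only when --profile was passed.
    if profile:
        llm.start_profile()
    try:
        return llm.generate(prompts, sampling_params)
    finally:
        # Runs whether generate() succeeds or raises, so the profiler is
        # always stopped and a trace is always dumped.
        if profile:
            llm.stop_profile()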

tomasruizt (Contributor, Author) commented Sep 10, 2025

I don't agree with Gemini's review. One problem with its suggestion that I've seen in practice: the generation fails, the profiler is gracefully stopped, and it still dumps a profile result (which is obviously very short). Based on that short profile, it then looks as if the generation had succeeded and been extremely fast.

That is why, if the generation fails, the profiler should not dump a profiling result, IMO. It should not be gracefully stopped.
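
In other words, the code keeps the profiler calls on the success path only, so a failed generation leaves no misleading trace behind (again just a sketch with the same illustrative names):

from vllm import LLM, SamplingParams

def generate_with_profile(llm: LLM, prompts: list[str],
                          sampling_params: SamplingParams, profile: bool):
    if profile:
        llm.start_profile()
    outputs = llm.generate(prompts, sampling_params)
    # If generate() raises, stop_profile() is never reached and no
    # (misleadingly short) trace file is written.
    if profile:
        llm.stop_profile()
    return outputs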

Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com>
DarkLight1337 enabled auto-merge (squash) on September 10, 2025
github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Sep 10, 2025
vllm-bot merged commit ee0bc5e into vllm-project:main on Sep 11, 2025
36 of 38 checks passed
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com>
dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
sducouedic pushed a commit to sducouedic/vllm that referenced this pull request Oct 16, 2025
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
