Enable --profile in 'vllm bench throughput' #24575
Conversation
Code Review
This pull request adds support for profiling throughput benchmarks using the --profile flag. The implementation correctly wraps the model execution calls with start_profile() and stop_profile(). My main feedback is to ensure the profiler is always stopped, even if an error occurs during model execution. This can be achieved by using try...finally blocks, which will make the profiling logic more robust. I've added specific suggestions for this improvement.
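For illustration, a minimal sketch of the try...finally pattern suggested here, assuming a vLLM `LLM` instance with the `start_profile()`/`stop_profile()` hooks the PR wraps (model name and prompt are illustrative, and `VLLM_TORCH_PROFILER_DIR` must be set so the profiler has an output directory):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=64)

llm.start_profile()
try:
    # The finally block guarantees the profiler is stopped (and a
    # trace dumped) even if generation raises.
    outputs = llm.generate(["Hello, my name is"], params)
finally:
    llm.stop_profile()
```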
I don't agree with Gemini's review. One problem with its suggestion, which I have seen in real life: the generation fails, the profiler is gracefully stopped, and it dumps a profile result (which is obviously very short). Based on this short profile, it then appears as if the generation had been successful and very fast. That is why, if the generation fails, the profiler should not dump a profiling result IMO. It should not be gracefully stopped.
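By contrast, a sketch of the flow argued for here: `stop_profile()` is reached only when generation succeeds, so a failed run leaves no misleading trace behind (setup as in the sketch above; model name and prompt are illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=64)

llm.start_profile()
# Deliberately no try/finally: if generate() raises, stop_profile() is
# never called and no short, misleading trace is dumped.
outputs = llm.generate(["Hello, my name is"], params)
llm.stop_profile()  # only runs after a successful generation
```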
Purpose
The goal is to easily profile a throughput workload. Concretely, the command `vllm bench throughput --profile ...` will generate a profiling file, which is useful for debugging performance gaps using the https://ui.perfetto.dev/ UI.

Test Plan
The `vllm bench` command can take 4 different paths; here is how I tested them. First, I set env vars to enable profiling, and also enabled CUDA blocking (optional) to make long GPU ops show their full runtime.
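A minimal sketch of that setup, assuming the profiler output directory variable vLLM uses (`VLLM_TORCH_PROFILER_DIR`; the directory path is illustrative):

```bash
# Point vLLM's torch profiler at an output directory; setting this
# env var is what enables profiling support.
export VLLM_TORCH_PROFILER_DIR=/tmp/vllm_profile

# Optional: run CUDA kernels synchronously so long GPU ops show their
# real duration in the trace.
export CUDA_LAUNCH_BLOCKING=1
```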
- `run_vllm()` path
- `run_vllm_async()` path
- `backend=vllm-chat` path for multimodal models
- `backend=hf` path
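For illustration, invocations along these lines exercise the four paths (the `--async-engine` and `--hf-max-batch-size` flags, the input/output lengths, and the model names are my assumptions about the benchmark CLI, not taken from this PR):

```bash
# run_vllm() path (default synchronous engine)
vllm bench throughput --model facebook/opt-125m \
    --input-len 128 --output-len 128 --profile

# run_vllm_async() path
vllm bench throughput --model facebook/opt-125m \
    --input-len 128 --output-len 128 --async-engine --profile

# backend=vllm-chat path, for multimodal models
vllm bench throughput --backend vllm-chat --model Qwen/Qwen2-VL-2B-Instruct \
    --input-len 128 --output-len 128 --profile

# backend=hf path (the one that cannot be profiled; see Test Result)
vllm bench throughput --backend hf --model facebook/opt-125m \
    --input-len 128 --output-len 128 --hf-max-batch-size 8 --profile
```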
Test Result

All paths except `backend=hf` generate a profile file, like this one: file.pt.trace.json.gz. As mentioned, these can be opened with https://ui.perfetto.dev/.

The reason is that the class `AutoModelForCausalLM` used in `backend=hf` does not implement `.start_profile()` and `.stop_profile()`. Therefore, I raise a `NotImplementedError`, so the user knows to remove the `--profile` flag.
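A minimal sketch of such a guard (the function name and signature are hypothetical; the real check lives in the benchmark script):

```python
def validate_profile_flag(backend: str, profile: bool) -> None:
    # The hf path drives AutoModelForCausalLM directly, which exposes
    # no start_profile()/stop_profile() hooks, so profiling is refused.
    if profile and backend == "hf":
        raise NotImplementedError(
            "--profile is not supported with backend=hf; "
            "remove the --profile flag."
        )
```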