@simon-mo
Collaborator

@simon-mo simon-mo commented Sep 25, 2025

@rizar found that the DeepEP build in our container doesn't work on Blackwell. Thank you!

Summary

  • define TORCH_CUDA_ARCH_LIST in the vllm-base stage so EP kernel installation defaults to Hopper and Blackwell support
  • revert the EP kernel installer to rely on the environment instead of forcing its own CUDA architectures
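To illustrate what "Hopper and Blackwell support" means for a space-separated `TORCH_CUDA_ARCH_LIST`, here is a minimal sketch of a coverage check. The helper name `arch_list_covers` is hypothetical and not part of this PR, and the value `9.0a 10.0a+PTX` is borrowed from the fallback in the diff; the value the Dockerfile's ENV actually sets may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): succeeds when the space-separated
# arch list in $1 has an entry for compute capability $2, allowing the
# common "a" and "+PTX" suffixes.
arch_list_covers() {
    for entry in $1; do
        case "$entry" in
            "${2}"|"${2}a"|"${2}+PTX"|"${2}a+PTX") return 0 ;;
        esac
    done
    return 1
}

# Assumed value, copied from the fallback in the diff under review.
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST:-9.0a 10.0a+PTX}"

# 9.0 is Hopper, 10.0 is Blackwell.
for cap in 9.0 10.0; do
    if arch_list_covers "$TORCH_CUDA_ARCH_LIST" "$cap"; then
        echo "arch $cap: covered"
    else
        echo "arch $cap: missing"
    fi
done
```

A check like this could run before the kernel build to fail early when the inherited list lacks a target architecture, though the PR itself does not add one.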

Testing

  • not run

https://chatgpt.com/codex/tasks/task_e_68d4c5364338832997fa34fc45f06432

@mergify mergify bot added the ci/build label Sep 25, 2025
@simon-mo simon-mo changed the title Define EP kernel arch list in Dockerfile [Misc] Define EP kernel arch list in Dockerfile Sep 25, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request defines the TORCH_CUDA_ARCH_LIST in the vllm-base stage of the Dockerfile, ensuring that EP kernel installation defaults to supporting Hopper and Blackwell architectures. It also updates a fallback value for the architecture list. The changes are logical and improve the build process's configurability and defaults. However, there's a redundancy that can be cleaned up.

Comment on lines +465 to 466
RUN export TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST:-9.0a 10.0a+PTX}" \
&& bash install_python_libraries.sh
high

Since TORCH_CUDA_ARCH_LIST is now set via an ENV instruction earlier in this build stage (line 288), this export command with a fallback is redundant. The environment variable will already be available to the install_python_libraries.sh script. You can simplify this RUN command by removing the export part.

RUN bash install_python_libraries.sh
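The redundancy argument rests on how the shell's `${VAR:-default}` expansion behaves. The sketch below exercises the three cases; the function name `resolve_arch_list` and the stand-in value `9.0a` are illustrative assumptions, not taken from the Dockerfile.

```shell
#!/usr/bin/env bash
# Resolve the arch list the same way the RUN line under review does:
# keep the inherited value, otherwise use the fallback.
# (resolve_arch_list is a hypothetical name used only for this sketch.)
resolve_arch_list() {
    echo "${TORCH_CUDA_ARCH_LIST:-9.0a 10.0a+PTX}"
}

# Case 1: value already set, as it would be after an ENV instruction.
# "9.0a" is an arbitrary stand-in value.
TORCH_CUDA_ARCH_LIST="9.0a"
echo "set:   $(resolve_arch_list)"    # prints "set:   9.0a"

# Case 2: variable set but empty; ":-" treats this like unset.
TORCH_CUDA_ARCH_LIST=""
echo "empty: $(resolve_arch_list)"    # prints "empty: 9.0a 10.0a+PTX"

# Case 3: variable unset.
unset TORCH_CUDA_ARCH_LIST
echo "unset: $(resolve_arch_list)"    # prints "unset: 9.0a 10.0a+PTX"
```

With a non-empty ENV value in place, only case 1 applies and the `export` changes nothing. The fallback would matter only if the variable reached the RUN step unset or empty, for example if a build overrode it with an empty value (an assumption about usage, not something this PR shows).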

@simon-mo simon-mo enabled auto-merge (squash) October 6, 2025 21:40
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 6, 2025
@simon-mo simon-mo merged commit 8229280 into main Oct 7, 2025
88 checks passed
@simon-mo simon-mo deleted the codex/fix-torch_cuda_arch_list-in-dockerfile branch October 7, 2025 00:05
Labels

ci/build, codex, ready (ONLY add when PR is ready to merge/full CI is needed)
