@yewentao256
Member

@yewentao256 yewentao256 commented Sep 11, 2025

Purpose

Fixes #24694

#24111 should be tested with the data-parallel (DP) case before it is enabled by default again.

Test

(APIServer pid=454020) INFO:     Started server process [454020]
(APIServer pid=454020) INFO:     Waiting for application startup.
(APIServer pid=454020) INFO:     Application startup complete.

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@mergify

mergify bot commented Sep 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @yewentao256.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 11, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly disables the VLLM_ALLREDUCE_USE_SYMM_MEM feature by default by changing its environment variable's default value from True to False. This is a sensible approach to temporarily mitigate a bug as described in the PR. The changes are consistently applied, and the existing tests for this feature are correctly configured to explicitly enable it, ensuring continued test coverage. The implementation looks good.
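A minimal sketch of what an off-by-default boolean environment flag looks like, for readers who want to see the effect of the change. The helper names `read_bool_env` and `symm_mem_allreduce_enabled` are hypothetical and are not taken from the vLLM source; only the variable name `VLLM_ALLREDUCE_USE_SYMM_MEM` and its new default of `False` come from this PR.

```python
import os

# Name of the flag discussed in this PR.
ENV_NAME = "VLLM_ALLREDUCE_USE_SYMM_MEM"


def read_bool_env(name: str, default: bool = False) -> bool:
    """Hypothetical helper: parse an integer-style boolean env var.

    Returns `default` when the variable is unset. Values such as "0" and
    "1" are converted through int(); non-integer strings raise ValueError.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return bool(int(raw))


def symm_mem_allreduce_enabled() -> bool:
    """Hypothetical accessor mirroring the new default (False)."""
    return read_bool_env(ENV_NAME, default=False)
```

Under this sketch, the feature stays off unless a user or a test opts in explicitly, for example by setting `os.environ["VLLM_ALLREDUCE_USE_SYMM_MEM"] = "1"` before the flag is read.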

…-default

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Member

@mgoin mgoin left a comment

Thanks, let's get this bugfix in for now to fix the release.

@mgoin mgoin changed the title Set VLLM_ALLREDUCE_USE_SYMM_MEM default to False [Bugfix] Set VLLM_ALLREDUCE_USE_SYMM_MEM default to False Sep 11, 2025
@mgoin mgoin added the bug Something isn't working label Sep 11, 2025
@mergify mergify bot removed the needs-rebase label Sep 11, 2025
@simon-mo simon-mo merged commit 1ec2035 into vllm-project:main Sep 11, 2025
5 of 9 checks passed
@yewentao256 yewentao256 deleted the wye-set-VLLM_ALLREDUCE_USE_SYMM_MEM-to-false-default branch September 11, 2025 21:33
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
@nvpohanh
Contributor

@ilmarkov Could you fix the issue and re-enable VLLM_ALLREDUCE_USE_SYMM_MEM by default so that we can benefit from the faster AllReduce without any env vars? Thanks!

@ilmarkov
Contributor

@nvpohanh Yes, I am working on this. The easiest solution would be to disable symm mem when DP is used (i.e. keep it only when all devices do just TP or PP), but I am trying to find a way to enable it for all cases. The problem is that torch incorrectly detects overlapping devices here when multiple DP processes are running on the same node.
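The "easiest solution" described in this comment can be sketched as a simple gate. This is only an illustration under stated assumptions: the function `should_use_symm_mem` and its parameters are hypothetical, not vLLM APIs, and the eventual fix may look quite different.

```python
def should_use_symm_mem(env_enabled: bool, data_parallel_size: int) -> bool:
    """Hypothetical gate: allow symmetric-memory AllReduce only without DP.

    env_enabled: whether the user opted in (e.g. through
        VLLM_ALLREDUCE_USE_SYMM_MEM).
    data_parallel_size: number of data-parallel replicas; 1 means the
        devices do only TP or PP.
    """
    if data_parallel_size < 1:
        raise ValueError("data_parallel_size must be >= 1")
    # With more than one DP replica, several processes can share a node,
    # which is the situation where the overlapping-devices error appears.
    return env_enabled and data_parallel_size == 1
```

The trade-off of such a gate is that DP deployments would lose the faster AllReduce entirely, which is why the comment treats it as a fallback rather than the preferred fix.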

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…ject#24696)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…ject#24696)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

[Bug]: RuntimeError: CUDASymmetricMemoryAllocator::rendezvous: detected allocations from overlapping devices from different ranks.

5 participants