
Conversation


@ilmarkov ilmarkov commented Sep 17, 2025

Purpose

Enable torch symmetric memory by default and fix the DataParallel crash. Symmetric memory is enabled only for the TP communicator.

Conflict with DataParallel

In DataParallel, multiple workers are assigned to the same set of physical devices, but each worker sees them through a different CUDA_VISIBLE_DEVICES setting.

In the case of the DP communicator, the local_rank:rank mapping is wrong, e.g.:
unique_name: dp:0, device: cuda:0, rank: 1, world_size: 2.

In the case of the EP communicator in a DP setup, we have two workers with the same device and different ranks within the group:

unique_name: ep:0, device: cuda:0, rank: 0, world_size: 4                                                                                                                                                 
unique_name: ep:0, device: cuda:0, rank: 2, world_size: 4
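The symptom these duplicate mappings trigger can be sketched as a simple collision check (an illustration of the failure mode, not torch's actual internals): the rendezvous sees the logical device index of each rank, and with DP every rank reports index 0.

```python
def detect_overlap(rank_to_device_index: dict[int, int]):
    """Return the first pair of ranks that claim the same logical device
    index, or None if all ranks use distinct devices."""
    seen: dict[int, int] = {}
    for rank, idx in rank_to_device_index.items():
        if idx in seen:
            return (seen[idx], rank)  # two ranks claim the same device
        seen[idx] = rank
    return None

# The EP group from the example above: ranks 0 and 2 both report cuda:0,
# even though they actually sit on different physical GPUs.
assert detect_overlap({0: 0, 2: 0}) == (0, 2)
# A normal TP group with distinct devices has no overlap:
assert detect_overlap({0: 0, 1: 1}) is None
```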

So in the following line, we end up using the same self.device within the same group. That makes torch think that the devices overlap, even though the allocated tensors live in different environments. Torch symm mem fails to rendezvous in this setup, so we basically cannot use symm mem with DataParallel. This PR only avoids the crash.

        self.buffer = torch_symm_mem.empty(
            self.max_size // self.dtype.itemsize,
            device=self.device,
            dtype=self.dtype,
        )
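One way to avoid the rendezvous failure in a DP setup is to tell torch that overlapping logical device indices are expected, via the TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES environment variable. A minimal sketch of such a guard (the helper name is hypothetical, not the PR's actual code):

```python
import os


def maybe_allow_overlapping_devices(dp_world_size: int) -> None:
    """With DataParallel, different ranks report the same logical device
    (e.g. cuda:0), so torch's symmetric-memory rendezvous must be told
    that overlapping device indices are expected."""
    if dp_world_size > 1:
        os.environ["TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES"] = "1"


maybe_allow_overlapping_devices(dp_world_size=2)
assert os.environ["TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES"] == "1"
```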

Fixes #24694.

Test Plan

Added test to test_symm_mem_allreduce.py

Test Result

The added test passes.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables symmetric memory all-reduce by default and fixes a crash when used with DataParallel. The core fix correctly sets TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES when a DataParallel setup is detected, which is a sound approach. The new test case validates this fix. The changes are generally good, but I have identified a critical bug in the test cleanup logic that could lead to resource leaks, and a high-severity issue regarding environment variable modification that should be addressed to prevent potential side effects.

Comment on lines 122 to 124
if val is not None:
    pytest.skip(val)
cleanup_dist_env_and_memory()

critical

There is a critical bug in the cleanup logic. pytest.skip() raises a special exception to stop the test execution. Because it's called before cleanup_dist_env_and_memory() within the finally block, the cleanup function will not be executed if a worker process sends a skip message. This will lead to resource leaks (e.g., dangling distributed processes), which can cause subsequent tests to fail or hang. The cleanup must be guaranteed to run before the test is skipped.

Suggested change
-if val is not None:
-    pytest.skip(val)
-cleanup_dist_env_and_memory()
+cleanup_dist_env_and_memory()
+if val is not None:
+    pytest.skip(val)
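The ordering issue can be demonstrated without pytest: any call that raises an exception prevents the statements after it in the same block from running, so the cleanup call must come first. A minimal sketch with stand-in names (Skipped, skip, cleanup_dist_env_and_memory are illustrations, not the real helpers):

```python
cleaned_up = False


def cleanup_dist_env_and_memory() -> None:
    """Stand-in for the real cleanup helper."""
    global cleaned_up
    cleaned_up = True


class Skipped(Exception):
    """Stand-in for the exception pytest.skip() raises."""


def skip(reason: str) -> None:
    raise Skipped(reason)


val = "worker requested skip"
try:
    cleanup_dist_env_and_memory()  # runs before any skip can be raised
    if val is not None:
        skip(val)
except Skipped:
    pass

assert cleaned_up  # cleanup happened even though the test was skipped
```

Reversing the two statements inside the try block would leave cleaned_up False, which is exactly the resource leak the review describes.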


@mgoin mgoin left a comment


LGTM!

@mgoin mgoin added the bug (Something isn't working) and ready (ONLY add when PR is ready to merge/full CI is needed) labels Sep 18, 2025
@mgoin mgoin changed the title Fix DataParallel symm mem crash Fix DataParallel crash with VLLM_ALLREDUCE_USE_SYMM_MEM Sep 18, 2025
@mergify mergify bot added the ci/build label Sep 18, 2025
Signed-off-by: ilmarkov <markovilya197@gmail.com>
@ilmarkov ilmarkov force-pushed the imarkov/fix_torch_symm_mem branch from 29a56e9 to 0b3cb29 Compare September 19, 2025 09:34
@mgoin mgoin changed the title Fix DataParallel crash with VLLM_ALLREDUCE_USE_SYMM_MEM Enable symmetric memory all reduce by default only enabling for TP Sep 23, 2025
@mgoin mgoin merged commit 8bdd8b5 into vllm-project:main Sep 23, 2025
78 checks passed
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
…llm-project#25070)

Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
…25070)

Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
gjc0824 pushed a commit to gjc0824/vllm that referenced this pull request Oct 10, 2025
…llm-project#25070)

Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: gaojc <1055866782@qq.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…llm-project#25070)

Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
…llm-project#25070)

Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
@yewentao256 yewentao256 deleted the imarkov/fix_torch_symm_mem branch October 15, 2025 16:33
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…llm-project#25070)

Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…llm-project#25070)

Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

bug (Something isn't working), ci/build, ready (ONLY add when PR is ready to merge/full CI is needed)


Development

Successfully merging this pull request may close these issues.

[Bug]: RuntimeError: CUDASymmetricMemoryAllocator::rendezvous: detected allocations from overlapping devices from different ranks.

3 participants