Enable symmetric memory all reduce by default only enabling for TP #25070
Conversation
Code Review
This pull request enables symmetric memory all-reduce by default and fixes a crash when used with DataParallel. The core fix correctly sets TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES when a DataParallel setup is detected, which is a sound approach. The new test case validates this fix. The changes are generally good, but I have identified a critical bug in the test cleanup logic that could lead to resource leaks, and a high-severity issue regarding environment variable modification that should be addressed to prevent potential side effects.
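The summary above mentions setting TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES when a DataParallel setup is detected. A minimal sketch of that kind of guard, assuming a hypothetical helper and a simple dp_size check rather than the actual vLLM code:

```python
import os


def allow_overlapping_devices_for_dp(dp_size: int) -> None:
    """Hypothetical helper: with DataParallel, several workers can report the
    same local CUDA device index, so torch symmetric memory is told that
    overlapping devices are expected before it attempts rendezvous."""
    if dp_size > 1:
        os.environ["TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES"] = "1"
```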
    if val is not None:
        pytest.skip(val)
    cleanup_dist_env_and_memory()
There is a critical bug in the cleanup logic. pytest.skip() raises a special exception to stop the test execution. Because it's called before cleanup_dist_env_and_memory() within the finally block, the cleanup function will not be executed if a worker process sends a skip message. This will lead to resource leaks (e.g., dangling distributed processes), which can cause subsequent tests to fail or hang. The cleanup must be guaranteed to run before the test is skipped.
Suggested change:
    -    if val is not None:
    -        pytest.skip(val)
    -    cleanup_dist_env_and_memory()
    +    cleanup_dist_env_and_memory()
    +    if val is not None:
    +        pytest.skip(val)
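A minimal, self-contained sketch of the ordering the suggestion aims for; the worker and cleanup functions here are illustrative stand-ins, not the actual test code:

```python
import pytest


def _run_test_worker():
    """Illustrative stand-in: returns a skip reason string or None."""
    return None


def cleanup_dist_env_and_memory():
    """Illustrative stand-in for vLLM's distributed/memory cleanup helper."""


def test_symm_mem_allreduce_ordering_sketch():
    val = None
    try:
        val = _run_test_worker()
    finally:
        # Cleanup runs first: pytest.skip() raises a Skipped exception, so if
        # it came before this call the distributed environment would leak.
        cleanup_dist_env_and_memory()
        if val is not None:
            pytest.skip(val)
```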
LGTM!
Purpose
Enable torch symm mem by default, but only for the TP communicator, and fix the crash with DataParallel.
Conflict with DataParallel
In DataParallel, multiple workers are assigned to the same set of devices in their respective environments (each with a different CUDA_VISIBLE_DEVICES).
For the DP communicator this yields a wrong local_rank:rank mapping, e.g. unique_name: dp:0, device: cuda:0, rank: 1, world_size: 2. For the EP communicator in a DP setup, two workers in the same group end up with the same device but different ranks.
As a result, the same self.device ends up registered within the same group, which makes torch think the devices overlap, even though the allocated tensors live in different environments. Torch symm mem then fails the rendezvous, so symm mem cannot be used with DataParallel; this PR only avoids the crash.
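A small illustration (not the vLLM code path) of why two DP workers appear to share a device, assuming each worker is launched with its own CUDA_VISIBLE_DEVICES:

```python
import os

import torch

# Worker A is launched with CUDA_VISIBLE_DEVICES=0, worker B with
# CUDA_VISIBLE_DEVICES=1. Inside each process the single visible GPU has
# index 0, so both workers construct the same-looking device object.
device = torch.device("cuda", 0)
print(os.environ.get("CUDA_VISIBLE_DEVICES"), device)
# Worker A prints: 0 cuda:0
# Worker B prints: 1 cuda:0
# When both register this device in one symm-mem group, torch sees overlapping
# devices and the rendezvous fails unless overlapping devices are allowed.
```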
Fixes #24694.
Test Plan
Added a test to test_symm_mem_allreduce.py.
Test Result
The added test passes.