
@panpan0000 panpan0000 commented Sep 18, 2025

Add a unit test, just like vllm-project#20759 does.

I don't have a UCC-compiled PyTorch build to verify either the CPU or GPU side.

Can you help? @ikryukov @lengrongfu

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small but essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀


panpan0000 commented Sep 19, 2025

Detailed log (the debug logging has been removed from this PR):

VLLM_USE_UCC=1 pytest tests/distributed/test_ucc_communicator-3.py -vvs --log-cli-level=DEBUG

INFO 09-19 05:30:27 [__init__.py:216] Automatically detected platform cuda.
========================================================================================================================== test session starts ==========================================================================================================================
platform linux -- Python 3.12.3, pytest-8.1.1, pluggy-1.6.0 -- /usr/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/workspace/vllm/.hypothesis/examples'))
rootdir: /workspace/vllm
configfile: pyproject.toml
plugins: xdist-3.6.1, rerunfailures-15.1, hypothesis-6.130.8, shard-0.1.2, xdoctest-1.0.2, flakefinder-1.1.0, anyio-4.9.0, typeguard-4.3.0
collecting ... WARNING 09-19 05:30:28 [interface.py:533] Current platform cuda does not have '_pytestfixturefunction' attribute.
WARNING 09-19 05:30:28 [interface.py:533] Current platform cuda does not have '__test__' attribute.
WARNING 09-19 05:30:28 [interface.py:533] Current platform cuda does not have '__bases__' attribute.
WARNING 09-19 05:30:28 [interface.py:533] Current platform cuda does not have '__test__' attribute.
collected 4 items                                                                                                                                                                                                                                                       
Running 4 items in this shard: tests/distributed/test_ucc_communicator-3.py::test_ucc_allreduce[1-2], tests/distributed/test_ucc_communicator-3.py::test_ucc_availability[1-2], tests/distributed/test_ucc_communicator-3.py::test_ucc_communicator_initialization, tests/distributed/test_ucc_communicator-3.py::test_ucc_static_methods

tests/distributed/test_ucc_communicator-3.py::test_ucc_allreduce[1-2] INFO 09-19 05:30:35 [__init__.py:216] Automatically detected platform cuda.
2025-09-19 05:30:35,141 - ucc_test - DEBUG - Starting ucc_allreduce_worker with rank 0, world_size 2
2025-09-19 05:30:35,141 - ucc_test - DEBUG - Selected device: cuda:0, dtype: torch.bfloat16
INFO 09-19 05:30:35 [__init__.py:216] Automatically detected platform cuda.
2025-09-19 05:30:35,360 - ucc_test - DEBUG - Starting ucc_allreduce_worker with rank 1, world_size 2
2025-09-19 05:30:35,360 - ucc_test - DEBUG - Selected device: cuda:1, dtype: torch.bfloat16
WARNING 09-19 05:30:35 [__init__.py:3864] Current vLLM config is not set.
WARNING 09-19 05:30:36 [__init__.py:3864] Current vLLM config is not set.
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
WARNING 09-19 05:30:36 [__init__.py:3864] Current vLLM config is not set.
WARNING 09-19 05:30:36 [__init__.py:3864] Current vLLM config is not set.
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
WARNING 09-19 05:30:36 [__init__.py:3864] Current vLLM config is not set.
WARNING 09-19 05:30:36 [__init__.py:3864] Current vLLM config is not set.
INFO 09-19 05:30:36 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-19 05:30:36 [pynccl.py:70] vLLM is using nccl==2.26.5
INFO 09-19 05:30:36 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-19 05:30:36 [pynccl.py:70] vLLM is using nccl==2.26.5
INFO 09-19 05:30:37 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 09-19 05:30:37 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 09-19 05:30:37 [ucc_communicator.py:56] UCCCommunicator initialized successfully with UCC backend on device cuda:1, world size: 2
INFO 09-19 05:30:37 [ucc_communicator.py:56] UCCCommunicator initialized successfully with UCC backend on device cuda:0, world size: 2
INFO 09-19 05:30:37 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_dd0f2668'), local_subscribe_addr='ipc:///tmp/5e555a05-e12f-43b5-a76e-577bc47cfa7b', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
WARNING 09-19 05:30:37 [__init__.py:3864] Current vLLM config is not set.
WARNING 09-19 05:30:37 [__init__.py:3864] Current vLLM config is not set.
INFO 09-19 05:30:37 [ucc_communicator.py:56] UCCCommunicator initialized successfully with UCC backend on device cuda:1, world size: 2
INFO 09-19 05:30:37 [parallel_state.py:1165] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
2025-09-19 05:30:37,630 - ucc_test - DEBUG - Rank 1: Checking if UCC is available
2025-09-19 05:30:37,631 - ucc_test - DEBUG - Rank 1: UCC available: True
2025-09-19 05:30:37,631 - ucc_test - DEBUG - Rank 1: Getting tensor model parallel group
2025-09-19 05:30:37,631 - ucc_test - DEBUG - Rank 1: Got tensor model parallel group: <torch.distributed.distributed_c10d.ProcessGroup object at 0x7f1fc1c6bf30>
2025-09-19 05:30:37,631 - ucc_test - DEBUG - Rank 1: Creating UCC process group
2025-09-19 05:30:37,631 - ucc_test - DEBUG - Rank 1: Created UCC process group: <torch.distributed.distributed_c10d.ProcessGroup object at 0x7f1fc01119b0>
INFO 09-19 05:30:37 [ucc_communicator.py:56] UCCCommunicator initialized successfully with UCC backend on device cuda:1, world size: 2
INFO 09-19 05:30:37 [ucc_communicator.py:56] UCCCommunicator initialized successfully with UCC backend on device cuda:0, world size: 2
INFO 09-19 05:30:37 [parallel_state.py:1165] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
2025-09-19 05:30:37,631 - ucc_test - DEBUG - Rank 0: Checking if UCC is available
2025-09-19 05:30:37,631 - ucc_test - DEBUG - Rank 0: UCC available: True
2025-09-19 05:30:37,631 - ucc_test - DEBUG - Rank 0: Getting tensor model parallel group
2025-09-19 05:30:37,631 - ucc_test - DEBUG - Rank 0: Got tensor model parallel group: <torch.distributed.distributed_c10d.ProcessGroup object at 0x7f610547c870>
2025-09-19 05:30:37,632 - ucc_test - DEBUG - Rank 0: Creating UCC process group
2025-09-19 05:30:37,632 - ucc_test - DEBUG - Rank 0: Created UCC process group: <torch.distributed.distributed_c10d.ProcessGroup object at 0x7f610410d430>
INFO 09-19 05:30:37 [ucc_communicator.py:56] UCCCommunicator initialized successfully with UCC backend on device cuda:0, world size: 2

[Rank 1] Input tensor shape: torch.Size([4194304]), dtype: torch.bfloat16, device: cuda:1

[Rank 0] Input tensor shape: torch.Size([4194304]), dtype: torch.bfloat16, device: cuda:0
[Rank 1] Input tensor stats - min: 1.0, max: 22.0, mean: 11.499662399291992
[Rank 0] Input tensor stats - min: 1.0, max: 22.0, mean: 11.499662399291992
[Rank 0] Output tensor stats - min: 2.0, max: 44.0, mean: 22.999324798583984
[Rank 1] Output tensor stats - min: 2.0, max: 44.0, mean: 22.999324798583984

[Rank 0] Testing op: RedOpType.SUM, tensor shape: torch.Size([1024]), dtype: torch.bfloat16
[Rank 0] Input tensor stats - min: 1.0, max: 9.0, mean: 4.994140625

[Rank 1] Testing op: RedOpType.SUM, tensor shape: torch.Size([1024]), dtype: torch.bfloat16
[Rank 1] Input tensor stats - min: 1.0, max: 9.0, mean: 4.994140625
[Rank 1] Output tensor stats after RedOpType.SUM - min: 2.0, max: 18.0, mean: 9.98828125
[Rank 0] Output tensor stats after RedOpType.SUM - min: 2.0, max: 18.0, mean: 9.98828125

[Rank 1] Testing op: RedOpType.MAX, tensor shape: torch.Size([1024]), dtype: torch.bfloat16

[Rank 0] Testing op: RedOpType.MAX, tensor shape: torch.Size([1024]), dtype: torch.bfloat16
[Rank 1] Input tensor stats - min: 1.0, max: 9.0, mean: 5.025390625
[Rank 0] Input tensor stats - min: 1.0, max: 9.0, mean: 5.025390625
[Rank 1] Output tensor stats after RedOpType.MAX - min: 1.0, max: 9.0, mean: 5.025390625
[Rank 0] Output tensor stats after RedOpType.MAX - min: 1.0, max: 9.0, mean: 5.025390625

[Rank 1] Testing op: RedOpType.MIN, tensor shape: torch.Size([1024]), dtype: torch.bfloat16

[Rank 0] Testing op: RedOpType.MIN, tensor shape: torch.Size([1024]), dtype: torch.bfloat16
[Rank 1] Input tensor stats - min: 1.0, max: 9.0, mean: 4.939453125
[Rank 0] Input tensor stats - min: 1.0, max: 9.0, mean: 4.939453125
[Rank 1] Output tensor stats after RedOpType.MIN - min: 1.0, max: 9.0, mean: 4.939453125
[Rank 0] Output tensor stats after RedOpType.MIN - min: 1.0, max: 9.0, mean: 4.939453125
[rank0]:[W919 05:30:40.708318052 ProcessGroupNCCL.cpp:1505] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
PASSED
tests/distributed/test_ucc_communicator-3.py::test_ucc_availability[1-2] INFO 09-19 05:30:48 [__init__.py:216] Automatically detected platform cuda.
INFO 09-19 05:30:49 [__init__.py:216] Automatically detected platform cuda.
WARNING 09-19 05:30:49 [__init__.py:3864] Current vLLM config is not set.
WARNING 09-19 05:30:49 [__init__.py:3864] Current vLLM config is not set.
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
WARNING 09-19 05:30:49 [__init__.py:3864] Current vLLM config is not set.
WARNING 09-19 05:30:49 [__init__.py:3864] Current vLLM config is not set.
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
WARNING 09-19 05:30:49 [__init__.py:3864] Current vLLM config is not set.
WARNING 09-19 05:30:49 [__init__.py:3864] Current vLLM config is not set.
INFO 09-19 05:30:49 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-19 05:30:49 [pynccl.py:70] vLLM is using nccl==2.26.5
INFO 09-19 05:30:49 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-19 05:30:49 [pynccl.py:70] vLLM is using nccl==2.26.5
INFO 09-19 05:30:50 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 09-19 05:30:50 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 09-19 05:30:50 [ucc_communicator.py:56] UCCCommunicator initialized successfully with UCC backend on device cuda:0, world size: 2
INFO 09-19 05:30:50 [ucc_communicator.py:56] UCCCommunicator initialized successfully with UCC backend on device cuda:1, world size: 2
INFO 09-19 05:30:50 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_46082ee1'), local_subscribe_addr='ipc:///tmp/2db55b37-aca4-4d2d-84fc-1365c4102722', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
WARNING 09-19 05:30:50 [__init__.py:3864] Current vLLM config is not set.
WARNING 09-19 05:30:50 [__init__.py:3864] Current vLLM config is not set.
INFO 09-19 05:30:50 [ucc_communicator.py:56] UCCCommunicator initialized successfully with UCC backend on device cuda:0, world size: 2
INFO 09-19 05:30:50 [parallel_state.py:1165] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 09-19 05:30:50 [ucc_communicator.py:56] UCCCommunicator initialized successfully with UCC backend on device cuda:1, world size: 2
INFO 09-19 05:30:50 [parallel_state.py:1165] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
WARNING 09-19 05:30:50 [ucc_communicator.py:38] UCCCommunicator requires a UCC process group backend, but got backend: gloo. Disabling UCC allreduce.
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
WARNING 09-19 05:30:50 [ucc_communicator.py:38] UCCCommunicator requires a UCC process group backend, but got backend: gloo. Disabling UCC allreduce.
[rank0]:[W919 05:30:51.835072704 ProcessGroupNCCL.cpp:1505] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
PASSED
tests/distributed/test_ucc_communicator-3.py::test_ucc_communicator_initialization PASSED
tests/distributed/test_ucc_communicator-3.py::test_ucc_static_methods PASSED
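
For reference, the reduction semantics the allreduce test exercises above can be simulated in plain Python. This is a hypothetical sketch, not the vLLM test code: the `allreduce` helper and rank layout here are illustrative assumptions. It shows why, with identical inputs on both ranks (as in the log), SUM doubles each element at world size 2 while MAX and MIN leave the tensor unchanged.

```python
# Hypothetical sketch of allreduce semantics across two "ranks".
# With identical inputs on every rank at world_size=2:
#   SUM doubles each element; MAX and MIN act as the identity.

def allreduce(rank_tensors, op):
    """Element-wise reduction across ranks; every rank receives the same result."""
    reduced = [op(values) for values in zip(*rank_tensors)]
    return [list(reduced) for _ in rank_tensors]

world_size = 2
inputs = [[1.0, 9.0, 5.0] for _ in range(world_size)]  # identical on both ranks

summed = allreduce(inputs, sum)
maxed = allreduce(inputs, max)
minned = allreduce(inputs, min)

assert summed[0] == [2.0, 18.0, 10.0]  # SUM: values double at world_size=2
assert maxed[0] == inputs[0]           # MAX of identical inputs is identity
assert minned[0] == inputs[0]          # MIN likewise
```

This matches the log: the SUM pass shows the output mean (22.999...) at exactly twice the input mean (11.499...), while the MAX and MIN passes report output stats identical to their inputs.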

@panpan0000

@ikryukov do you have time to review and merge?

@ikryukov

> @ikryukov do you have time to review and merge?

It is ready to merge, thanks! But could you rebase it? I added lengrongfu as a co-author on the previous commit.

@panpan0000

@ikryukov Rebased, thank you.

panpan0000 force-pushed the ucc_integration-pr-ut branch from ed9cf39 to a90900f on September 24, 2025 06:48
@ikryukov ikryukov merged commit cbbb8a8 into ikryukov:ucc_integration Sep 24, 2025
2 checks passed