gloo: support CUDA tensors and verify correctness (CPU reduction, CUDA tensor) #54

@d4l3k

We currently test Gloo only with CPU tensors, but ProcessGroup in c10d supports both CUDA and CPU tensors. We have some basic code to support CUDA tensors, but it hasn't been robustly tested or verified.

The test configurations in CI are controlled via: https://github.com/meta-pytorch/torchcomms/blob/main/comms/torchcomms/scripts/run_tests_integration_py.sh#L26-L32

Testing:

We want to test the gloo backend with the CUDA device.

For example, to run the alltoall single test:

TEST_BACKEND=gloo TEST_DEVICE=cuda torchrun --nnodes 1 --nproc_per_node 4 comms/torchcomms/tests/integration/py/AllToAllSingleTest.py

We can also test end to end with torchtitan, which should give a good indication of whether anything is being handled incorrectly.

Things to watch out for:

  • making sure that we copy the result back to the original device tensor for all operations
  • making sure that we correctly synchronize the streams and do the transfers on the right streams
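
A minimal sketch of the staging pattern these two points describe, assuming the CPU reduction path copies the CUDA tensor to host memory, runs the collective there, and writes the result back into the same device tensor. `run_with_cpu_staging` and `cpu_collective` are hypothetical names used for illustration, not torchcomms APIs, and the lambda in the demo is only a stand-in for a real Gloo collective. The sketch relies on the default blocking copies on the current stream; an implementation that uses non-blocking copies or side streams would need explicit synchronization instead.

```python
import torch


def run_with_cpu_staging(tensor, cpu_collective):
    """Run `cpu_collective` on a host copy of `tensor`, then copy the result back in place.

    Hypothetical helper: `cpu_collective` is any callable that mutates a CPU tensor.
    """
    if tensor.device.type != "cuda":
        # Already on the host: operate on it directly.
        cpu_collective(tensor)
        return tensor

    # The device-to-host copy is enqueued on the current CUDA stream and blocks
    # the host until it completes, so the staged buffer is safe to read afterwards.
    staged = tensor.detach().cpu()
    cpu_collective(staged)

    # Write into the caller's tensor instead of returning a fresh one, so the
    # original device buffer observes the result.
    with torch.no_grad():
        tensor.copy_(staged)
    return tensor


# Demo: falls back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.ones(4, device=device)
ptr_before = x.data_ptr()
run_with_cpu_staging(x, lambda t: t.mul_(2))  # stand-in for a real collective
```

A correctness test for this issue could check the same two properties on every operation: the tensor the caller passed in still lives at the same device address, and it holds the reduced values.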

Not necessarily part of this issue but also relevant:

  • enabling ibverbs backend for Gloo
  • enabling CUDA/GPU reductions for Gloo (this task only covers CPU reduction for tensors originally located on the GPU)
