We currently only test Gloo with CPU tensors. The c10d ProcessGroup supports both CUDA and CPU tensors. We have some basic code to support this, but it hasn't been robustly tested or verified.
The test configurations in CI are controlled via: https://github.com/meta-pytorch/torchcomms/blob/main/comms/torchcomms/scripts/run_tests_integration_py.sh#L26-L32
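The `TEST_BACKEND`/`TEST_DEVICE` environment variables below are the ones used by the integration-test command later in this issue; the loop itself is only a sketch of how the CI matrix might be extended to cover the CUDA device, not the actual contents of the script linked above.

```shell
#!/bin/sh
# Sketch: enumerate the backend/device combinations we would like CI to cover.
# The combination list is illustrative, not the real CI configuration.
for backend in gloo; do
  for device in cpu cuda; do
    echo "TEST_BACKEND=$backend TEST_DEVICE=$device"
  done
done
```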
Testing:
We want to test Gloo with the CUDA backend. For example, to run the all-to-all single test:

```
TEST_BACKEND=gloo TEST_DEVICE=cuda torchrun --nnodes 1 --nproc_per_node 4 comms/torchcomms/tests/integration/py/AllToAllSingleTest.py
```

We can also test end-to-end with torchtitan, which should give a good indication of whether device handling is incorrect.
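The kind of check the integration tests perform can be sketched with stock `torch.distributed` as a stand-in for torchcomms: run a collective on tensors placed on the device under test and compare the result against a CPU reference. The single-process setup and the port below are illustrative, not taken from the test scripts.

```python
import os
import torch
import torch.distributed as dist

# Minimal single-process sanity check (sketch). With world_size=1 the
# all_reduce is an identity reduction, so the result must equal the input.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")  # arbitrary free port for the sketch
dist.init_process_group("gloo", rank=0, world_size=1)

# Use CUDA when available so the Gloo CUDA path is exercised; fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
t = torch.ones(4, device=device)
dist.all_reduce(t)

# Compare against the CPU reference regardless of where `t` lives.
assert torch.equal(t.cpu(), torch.ones(4))
dist.destroy_process_group()
```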
Things to watch out for:
- making sure that we're copying back to the original device tensor for all operations
- making sure that we correctly synchronize the streams and do the transfer on the right stream
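The copy-back and stream-synchronization concerns above can be sketched as follows. `copy_result_back` is a hypothetical helper, not part of torchcomms; it illustrates copying a CPU-side result into the caller's original tensor and making the current CUDA stream wait on the copy when the original lives on the GPU.

```python
import torch

def copy_result_back(result_cpu: torch.Tensor, original: torch.Tensor) -> None:
    """Hypothetical helper: copy a CPU result into the caller's tensor.

    If `original` is a CUDA tensor, the host-to-device copy is issued on a
    side stream and the current stream is made to wait on it, so later
    kernels on the current stream observe the updated data.
    """
    if original.is_cuda:
        stream = torch.cuda.Stream(device=original.device)
        with torch.cuda.stream(stream):
            original.copy_(result_cpu, non_blocking=True)
        # Subsequent work on the current stream must wait for the copy.
        torch.cuda.current_stream(original.device).wait_stream(stream)
    else:
        original.copy_(result_cpu)

# CPU-only demonstration (works without a GPU):
dst = torch.zeros(3)
copy_result_back(torch.tensor([1.0, 2.0, 3.0]), dst)
print(dst.tolist())  # -> [1.0, 2.0, 3.0]
```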
Not necessarily part of this issue but also relevant:
- enabling ibverbs backend for Gloo
- enabling CUDA/GPU reductions for Gloo - this issue covers only CPU-side reductions for tensors that originally live on the GPU