We currently only test Gloo with CPU tensors. The c10d ProcessGroup supports both CUDA and CPU tensors. We have some basic code to support this, but it hasn't been robustly tested or verified.
The test configurations in CI are controlled via: https://github.com/meta-pytorch/torchcomms/blob/main/comms/torchcomms/scripts/run_tests_integration_py.sh#L26-L32
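The `TEST_BACKEND`/`TEST_DEVICE` environment variables below are the ones used by the integration-test command later in this issue; the loop itself is only a sketch of how the CI matrix might be extended to cover the CUDA device, not the actual contents of the script linked above.

```shell
#!/bin/sh
# Sketch: enumerate the backend/device combinations we would like CI to cover.
# The combination list is illustrative, not the real CI configuration.
for backend in gloo; do
  for device in cpu cuda; do
    echo "TEST_BACKEND=$backend TEST_DEVICE=$device"
  done
done
```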
Testing:
We want to test Gloo with the CUDA backend. For example, to run the all-to-all single test:

```
TEST_BACKEND=gloo TEST_DEVICE=cuda torchrun --nnodes 1 --nproc_per_node 4 comms/torchcomms/tests/integration/py/AllToAllSingleTest.py
```

We can also test end-to-end with torchtitan, which should give a good indication of whether device handling is incorrect.
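The kind of check the integration tests perform can be sketched with stock `torch.distributed` as a stand-in for torchcomms: run a collective on tensors placed on the device under test and compare the result against a CPU reference. The single-process setup and the port below are illustrative, not taken from the test scripts.

```python
import os
import torch
import torch.distributed as dist

# Minimal single-process sanity check (sketch). With world_size=1 the
# all_reduce is an identity reduction, so the result must equal the input.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")  # arbitrary free port for the sketch
dist.init_process_group("gloo", rank=0, world_size=1)

# Use CUDA when available so the Gloo CUDA path is exercised; fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
t = torch.ones(4, device=device)
dist.all_reduce(t)

# Compare against the CPU reference regardless of where `t` lives.
assert torch.equal(t.cpu(), torch.ones(4))
dist.destroy_process_group()
```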
Things to watch out for:
- making sure that we're copying back to the original device tensor for all operations
- making sure that we correctly synchronize the streams and do the transfer on the right stream
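The copy-back and stream-synchronization concerns above can be sketched as follows. `copy_result_back` is a hypothetical helper, not part of torchcomms; it illustrates copying a CPU-side result into the caller's original tensor and making the current CUDA stream wait on the copy when the original lives on the GPU.

```python
import torch

def copy_result_back(result_cpu: torch.Tensor, original: torch.Tensor) -> None:
    """Hypothetical helper: copy a CPU result into the caller's tensor.

    If `original` is a CUDA tensor, the host-to-device copy is issued on a
    side stream and the current stream is made to wait on it, so later
    kernels on the current stream observe the updated data.
    """
    if original.is_cuda:
        stream = torch.cuda.Stream(device=original.device)
        with torch.cuda.stream(stream):
            original.copy_(result_cpu, non_blocking=True)
        # Subsequent work on the current stream must wait for the copy.
        torch.cuda.current_stream(original.device).wait_stream(stream)
    else:
        original.copy_(result_cpu)

# CPU-only demonstration (works without a GPU):
dst = torch.zeros(3)
copy_result_back(torch.tensor([1.0, 2.0, 3.0]), dst)
print(dst.tolist())  # -> [1.0, 2.0, 3.0]
```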
Not necessarily part of this issue but also relevant:
- enabling ibverbs backend for Gloo
- enabling CUDA/GPU reductions for Gloo - this issue covers only CPU-side reductions for tensors that originally live on the GPU