Fabric.all_reduce modifies the input value in place on GPU #18228

Comments
Hey @function2-llx This is the default behavior in torch; under the hood we call PyTorch's torch.distributed.all_reduce, which also reduces in place (see the PyTorch docs for all_reduce). Reducing the tensor this way is more efficient than making a copy first, so I think we should consider keeping it this way. The documentation should be clearer though. WDYT?
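For reference, torch.distributed.all_reduce writes its result into the tensor it is given rather than returning a new one. A minimal single-process sketch (Gloo backend with world_size=1, which is enough to observe the in-place semantics; the port number is arbitrary):

```python
import os
import torch
import torch.distributed as dist

# A single-process process group is enough to see that all_reduce
# mutates its argument instead of allocating a result tensor.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.tensor([1.0, 2.0])
ret = dist.all_reduce(t, op=dist.ReduceOp.SUM)  # writes into t in place

print(ret)         # None in synchronous mode: no new tensor is returned
print(t.tolist())  # with world_size=1 the SUM leaves the values unchanged

dist.destroy_process_group()
```

With more than one rank, t would now hold the sum of every rank's values, which is exactly the behavior the issue reports.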
That was my question too, thanks.
@awaelchli Thanks for the clarification. I agree that, most of the time, reducing the tensors without copying first is good for performance, and users can copy the tensors themselves if necessary. So this is generally not a bug, though a clarification in the documentation would be welcome. There is still a small problem currently, though: when calling Fabric.all_reduce with reduce_op="mean" on GPU, the input tensor ends up holding the SUM across processes rather than the mean.
This part could indeed be considered a bug, yes. If my memory is correct, the reason it was done this way is to support backends that don't have the "avg" option; see the note in the torch docs (ReduceOp.AVG is not available on every backend).
That is why the SUM + divide-by-size approach was chosen. This is of course just one way to solve it. Another option would have been to error out explicitly when the backend doesn't support the op, or to pick either implementation depending on the backend.
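The SUM + divide fallback described above can be sketched in plain Python (a conceptual illustration, not Lightning's actual code): every rank contributes its local value through a SUM all-reduce, and dividing by the world size recovers the mean that ReduceOp.AVG would have produced:

```python
def mean_via_sum(per_rank_values, world_size):
    """Conceptual stand-in for all_reduce(op=SUM) followed by a divide.

    per_rank_values simulates the local scalar held by each rank; a real
    SUM all-reduce would leave `total` on every rank simultaneously.
    """
    total = sum(per_rank_values)  # what all_reduce(op=ReduceOp.SUM) computes
    return total / world_size     # recovers the mean without ReduceOp.AVG

# Four ranks holding 0.0, 1.0, 2.0, 3.0 -> sum 6.0, mean 1.5
print(mean_via_sum([0.0, 1.0, 2.0, 3.0], 4))  # 1.5
```

Note that because the SUM step runs in place, the input tensor holds the intermediate sum, which is the surprising behavior the issue describes.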
Bug description

When calling Fabric.all_reduce on tensors that live on the GPU, the input value is modified in place: after the call it holds the sum of all values across processes.

What version are you seeing the problem on?

v2.0
How to reproduce the bug
Run the following code with multiple (e.g., 4) GPUs:
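The original snippet was not preserved in this page; the following is a hypothetical reconstruction from the bug description (Fabric, global_rank, and reduce_op are real Fabric API names, but the exact tensor values are assumed):

```python
WORLD_SIZE = 4  # assumed from "multiple (e.g., 4) GPUs"

def expected_values(world_size):
    # Ranks hold 0..world_size-1; return the expected mean and the sum
    # that the buggy in-place reduction leaves behind on GPU.
    ranks = range(world_size)
    return sum(ranks) / world_size, float(sum(ranks))

def main():
    import torch
    from lightning.fabric import Fabric  # requires `pip install lightning`

    fabric = Fabric(accelerator="cuda", devices=WORLD_SIZE)
    fabric.launch()
    # Each rank holds its own rank index as a tensor on its GPU.
    t = torch.tensor(float(fabric.global_rank), device=fabric.device)
    reduced = fabric.all_reduce(t, reduce_op="mean")
    mean, total = expected_values(WORLD_SIZE)
    # Expected: reduced == 1.5 and t still holds the rank index.
    # Reported on GPU: t has been overwritten with the SUM (6.0).
    fabric.print(f"t={t.item()} reduced={reduced.item()} expected mean={mean}")

if __name__ == "__main__":
    main()
```

Launching this with 4 GPUs should show every rank's t overwritten with the cross-process sum, while CPU tensors are left untouched.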
Error messages and logs
output:
Environment
Current environment
More info
The bug does not occur for tensors on the CPU.
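Until the documentation (or behavior) changes, a caller who still needs the original tensor after the call can pass a copy. A small sketch, where the helper name is hypothetical and a simple in-place multiply stands in for the collective:

```python
import torch

def reduce_preserving_input(t, collective):
    """Run an in-place collective on a copy so the caller's tensor survives.

    `collective` stands in for a call like fabric.all_reduce / dist.all_reduce,
    which writes its result into the tensor it receives.
    """
    buf = t.clone()
    collective(buf)  # mutates buf, not t
    return buf

x = torch.tensor([1.0, 2.0, 3.0])
# The lambda fakes a SUM across 4 identical ranks by multiplying in place.
y = reduce_preserving_input(x, lambda b: b.mul_(4))
print(x.tolist())  # [1.0, 2.0, 3.0] -- original preserved
print(y.tolist())  # [4.0, 8.0, 12.0]
```

The clone costs one extra allocation, which is exactly the overhead the maintainers chose to avoid by default.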
cc @carmocca @justusschock @awaelchli