Offload reduction operations to accelerator devices #12318
base: main
Signed-off-by: Joseph Schuchart <jschuchart@leconte.icl.utk.edu>
Signed-off-by: Joseph Schuchart <jschuchart@xsdk.icl.utk.edu>
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
If the target process is unable to execute an RDMA operation, it instructs the origin to change the communication protocol. When this happens, the origin must be informed so that it cancels all pending RDMA operations and releases the rdma_frag.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
…or allreduce recursive doubling Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
Signed-off-by: Phuong Nguyen <phuong.nguyen@icl.utk.edu>
@devreal thank you for all of this work! Can I make a suggestion? This is a massive PR as it stands. Could we try to break it down into multiple smaller pieces that are more manageable? E.g.
Ideally, even if a new feature in one of the components is not used initially, it can be reviewed and resolved independently, and if we do it right it shouldn't cause any issues as long as it's not used. I am more than happy to assist with that process if you want.
@devreal Can you share how you configure the build? It seems that the C++ dependency is wrong when I build it.
@edgargabriel I agree, this should be split up. I will start with the accelerator framework.
@devreal We built with libfabric and collected some performance data with osu-micro-benchmarks on GPU instances (p4d.24xlarge).
We found ireduce and iallreduce segfault with
On a single node with UCX
This PR is an attempt to offload reduction operations in MPI_Allreduce to accelerator devices if the input buffer is located on a device. A few notes: