opal/mca/ofi: select NIC closest to accelerator if requested #11716
Conversation
Please see #11696 (comment) for a suggested generalized approach to this. I believe you have done the single GPU case described in that comment - you might want to consider the extension to multiple GPUs.
looks good to me as far as I can tell.
Force-pushed from 039e1b5 to 8029e6d.
I need to rebase this change if #11689 is merged.
Force-pushed from 8029e6d to af181c3.
Just a reminder from prior conversations about this topic. We know this algorithm isn't actually correct as the HWLOC "depth" does not correlate to communication distance. We do have a topic scheduled for offline discussion with vendors about this issue. However, in this case, this approach probably won't hurt as you'll just wind up doing a round-robin of the devices (which is what we've observed in the past, unless something has changed since the last time you tried this), and for now that's probably the best you can do anyway.
Force-pushed from af181c3 to 38f7bb1.
@rhc54 You are right. I found that the depth attribute is not that helpful. Instead, I chose a more direct distance measure based on a few assumptions. Then I can use this imperfect metric for NIC selection. This is largely an experimental feature, so I'm adding a switch to disable it by default.
Worth trying - I agree with the default switch, though. It's a rather difficult problem and so far every attempt has failed to produce the desired results. We need a better understanding of the signal flow in the system. For example, intervening objects have no impact on message traversal times because the electronics in each device on the bus don't intercept/relay the signal - it's just a bus that they are all hanging off of, and distance along the bus is largely irrelevant given the speed of the electrons and the physical distances involved. What does seem to matter is any intervening device that actually performs an intercept/relay operation - e.g., to switch from one bus to another where injection into the target bus requires traffic coordination. So moving from the main memory bus to the PCI bus costs you something - but talking to anything on that PCI bus is the same as talking to anything else on the bus. Doesn't matter where on the PCI bus you sit.

Picking the right combination therefore seems to devolve into minimizing bus transitions (e.g., having a NIC on one PCI bus and the GPU on another is probably not good - unless you have a cross-device harness, in which case the two communicate without transferring across PCI) and balancing loads. We can compute the first - the second is less clear without making assumptions on how the application might use the devices. Hope to get some of this clarified and quantified in upcoming months.
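As a purely illustrative aside (not code from this PR): hwloc does expose PCI bridges as `HWLOC_OBJ_BRIDGE` objects, so one plausible proxy for "bus transitions" between two devices is the number of bridge objects on the path through their lowest common ancestor. The helper names below are hypothetical and the topology is assumed to have been loaded with I/O objects enabled.

```c
#include <hwloc.h>

/* Hypothetical helper: count bridge objects strictly between a PCI device
 * and one of its ancestors (the ancestor itself is excluded). */
static unsigned bridges_below_ancestor(hwloc_obj_t obj, hwloc_obj_t ancestor)
{
    unsigned bridges = 0;
    for (hwloc_obj_t cur = obj->parent; cur != NULL && cur != ancestor; cur = cur->parent) {
        if (HWLOC_OBJ_BRIDGE == cur->type)
            ++bridges;
    }
    return bridges;
}

/* Rough proxy for "bus transitions" between a GPU and a NIC: bridges crossed
 * on each side of their lowest common ancestor in the hwloc tree. */
static unsigned bus_transitions(hwloc_topology_t topo, hwloc_obj_t gpu, hwloc_obj_t nic)
{
    hwloc_obj_t ancestor = hwloc_get_common_ancestor_obj(topo, gpu, nic);
    return bridges_below_ancestor(gpu, ancestor)
         + bridges_below_ancestor(nic, ancestor);
}
```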
Force-pushed from 38f7bb1 to a8587c2.
This patch introduces the capability to select the closest NIC to the accelerator device. If the accelerator or NIC PCI information is not available, fall back to selecting the NIC on the closest package. To enable this feature, the application should set the MCA parameter OMPI_MCA_opal_common_ofi_accelerator_rank (default -1) to a non-negative integer, which represents the process rank (0-based) on the same accelerator. The distance between the accelerator device and NIC is the number of hwloc objects in between, which includes the lowest common ancestor on the hwloc topology tree.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Force-pushed from a8587c2 to 66912b9.
Chatted with @naughtont3 offline. I will merge this PR for now. New issues will be filed later based on additional testing.
This patch introduces a new capability to select the NIC closest to the user-requested accelerator (PCI) device. The implementation should suit all accelerator types, i.e. cuda & rocm. This change addresses #11696.

In this patch, we introduce an overriding logic when an accelerator has been initialized - instead of selecting a NIC on the same package, we select the NIC closest to the accelerator (which might be on a different package).
To enable this feature, the application should set the MCA parameter `OMPI_MCA_opal_common_ofi_accelerator_rank` (default -1) to a non-negative integer, which represents the process rank (0-based) on the same accelerator.
The implementation depends on the following APIs:

- `accelerator.get_device_pci_attr`: retrieve the PCI info of the accelerator.
- `hwloc_get_pcidev_by_busid`: get the hwloc objects of the accelerator and the provider (NIC).
- `hwloc_get_common_ancestor_obj`: get the closest common ancestor hwloc object between the accelerator and the provider.
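To make this concrete, here is a minimal sketch (not the code in this PR) of how the two hwloc calls can be combined into the hop-count proximity metric described in the selection logic below. Error handling, the `accelerator.get_device_pci_attr` plumbing, and topology setup with I/O objects enabled are omitted; `pci_distance` and `hops_below_ancestor` are hypothetical names.

```c
#include <limits.h>
#include <hwloc.h>

/* Hypothetical helper: count objects strictly between a PCI device and one of
 * its ancestors (both endpoints excluded). */
static unsigned hops_below_ancestor(hwloc_obj_t obj, hwloc_obj_t ancestor)
{
    unsigned hops = 0;
    for (hwloc_obj_t cur = obj->parent; cur != NULL && cur != ancestor; cur = cur->parent)
        ++hops;
    return hops;
}

/* Proximity of a GPU and a NIC identified by their PCI bus IDs: the number of
 * hwloc objects between them, counting their lowest common ancestor once. */
static unsigned pci_distance(hwloc_topology_t topo,
                             unsigned gpu_domain, unsigned gpu_bus, unsigned gpu_dev, unsigned gpu_func,
                             unsigned nic_domain, unsigned nic_bus, unsigned nic_dev, unsigned nic_func)
{
    hwloc_obj_t gpu = hwloc_get_pcidev_by_busid(topo, gpu_domain, gpu_bus, gpu_dev, gpu_func);
    hwloc_obj_t nic = hwloc_get_pcidev_by_busid(topo, nic_domain, nic_bus, nic_dev, nic_func);
    if (NULL == gpu || NULL == nic)
        return UINT_MAX;  /* PCI info unavailable: fall back to package-based selection */

    hwloc_obj_t ancestor = hwloc_get_common_ancestor_obj(topo, gpu, nic);

    /* distance(GPU, common ancestor) + distance(NIC, common ancestor) */
    return hops_below_ancestor(gpu, ancestor) + hops_below_ancestor(nic, ancestor) + 1;
}
```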
The NIC selection logic can be summarized as follows:

- The device distance is not taken from `pmix_device_distance_t` or `hwloc_distances_s` for practical reasons - they are not computable on every platform, e.g. on AWS EC2 we cannot reliably get such values between GPU and NIC. Instead, the device proximity is measured as `distance(GPU, common ancestor) + distance(NIC, common ancestor)`; see https://www.open-mpi.org/projects/hwloc/doc/v2.9.1/a00359.php.
- Among the nearest providers, the NIC is chosen as `(local rank on the same accelerator) % (number of nearest providers)` (see the sketch after this list). Note that we do not have a good way to calculate `local rank on the same accelerator`, so instead we reuse `local rank on the same package` as a proxy.
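And a sketch of the selection step, again illustrative rather than the PR's actual code: given one proximity value per candidate provider, keep the nearest ones and spread processes over them with the modulo rule above, using the package-local rank as the proxy just described. `select_nearest_nic` and the plain arrays are hypothetical.

```c
#include <stddef.h>

/* Pick a provider index: find the minimum distance, count how many providers
 * share it, then round-robin across those by local rank. */
static size_t select_nearest_nic(const unsigned *distance, size_t num_nics, unsigned local_rank)
{
    /* Smallest GPU<->NIC distance among all candidates. */
    unsigned best = distance[0];
    for (size_t i = 1; i < num_nics; ++i)
        if (distance[i] < best)
            best = distance[i];

    /* Number of "nearest" providers. */
    size_t num_nearest = 0;
    for (size_t i = 0; i < num_nics; ++i)
        if (distance[i] == best)
            ++num_nearest;

    /* (local rank on the same accelerator) % (number of nearest providers),
     * mapped back to an index in the full provider list. */
    size_t pick = (size_t)local_rank % num_nearest;
    size_t seen = 0;
    for (size_t i = 0; i < num_nics; ++i) {
        if (distance[i] == best && seen++ == pick)
            return i;
    }
    return 0;  /* not reached when num_nics > 0 */
}
```

Note that when all distances are equal (for example, when PCI information is unavailable), this degenerates to a plain round-robin over the providers, consistent with the behavior discussed earlier in this conversation.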