opal/mca/ofi: select NIC closest to accelerator if requested #11716
Conversation
Please see #11696 (comment) for a suggested generalized approach to this. I believe you have done the single GPU case described in that comment - you might want to consider the extension to multiple GPUs.
looks good to me as far as I can tell.
Force-pushed from 039e1b5 to 8029e6d.
I need to rebase this change if #11689 is merged.
Force-pushed from 8029e6d to af181c3.
Just a reminder from prior conversations about this topic. We know this algorithm isn't actually correct as the HWLOC "depth" does not correlate to communication distance. We do have a topic scheduled for offline discussion with vendors about this issue. However, in this case, this approach probably won't hurt as you'll just wind up doing a round-robin of the devices (which is what we've observed in the past, unless something has changed since the last time you tried this), and for now that's probably the best you can do anyway.
Force-pushed from af181c3 to 38f7bb1.
@rhc54 You are right. I found that the depth attribute is not that helpful. Instead, I chose a more direct distance measure based on a few assumptions. Then I can use this imperfect metric for NIC selection. This is largely an experimental feature, so I'm adding a switch to disable it by default.
Worth trying - I agree with the default switch, though. It's a rather difficult problem and so far every attempt has failed to produce the desired results. We need a better understanding of the signal flow in the system. For example, intervening objects have no impact on message traversal times because the electronics in each device on the bus don't intercept/relay the signal - it's just a bus that they are all hanging off of, and distance along the bus is largely irrelevant given the speed of the electrons and the physical distances involved. What does seem to matter is any intervening device that actually performs an intercept/relay operation - e.g., to switch from one bus to another where injection into the target bus requires traffic coordination. So moving from the main memory bus to the PCI bus costs you something - but talking to anything on that PCI bus is the same as talking to anything else on the bus. Doesn't matter where on the PCI bus you sit.

Picking the right combination therefore seems to devolve into minimizing bus transitions (e.g., having a NIC on one PCI bus and the GPU on another is probably not good - unless you have a cross-device harness, in which case the two communicate without transferring across PCI) and balancing loads. We can compute the first - the second is less clear without making assumptions on how the application might use the devices. Hope to get some of this clarified and quantified in upcoming months.
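As a purely illustrative aside (not code from this PR): hwloc does expose PCI bridges as `HWLOC_OBJ_BRIDGE` objects, so one plausible proxy for "bus transitions" between two devices is the number of bridge objects on the path through their lowest common ancestor. The helper names below are hypothetical and the topology is assumed to have been loaded with I/O objects enabled.

```c
#include <hwloc.h>

/* Hypothetical helper: count bridge objects strictly between a PCI device
 * and one of its ancestors (the ancestor itself is excluded). */
static unsigned bridges_below_ancestor(hwloc_obj_t obj, hwloc_obj_t ancestor)
{
    unsigned bridges = 0;
    for (hwloc_obj_t cur = obj->parent; cur != NULL && cur != ancestor; cur = cur->parent) {
        if (HWLOC_OBJ_BRIDGE == cur->type)
            ++bridges;
    }
    return bridges;
}

/* Rough proxy for "bus transitions" between a GPU and a NIC: bridges crossed
 * on each side of their lowest common ancestor in the hwloc tree. */
static unsigned bus_transitions(hwloc_topology_t topo, hwloc_obj_t gpu, hwloc_obj_t nic)
{
    hwloc_obj_t ancestor = hwloc_get_common_ancestor_obj(topo, gpu, nic);
    return bridges_below_ancestor(gpu, ancestor)
         + bridges_below_ancestor(nic, ancestor);
}
```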
Force-pushed from 38f7bb1 to a8587c2.
This patch introduces the capability to select the closest NIC to the accelerator device. If the accelerator or NIC PCI information is not available, fall back to selecting the NIC on the closest package. To enable this feature, the application should set the MCA parameter OMPI_MCA_opal_common_ofi_accelerator_rank (default -1) to a non-negative integer, which represents the process rank (0-based) on the same accelerator. The distance between the accelerator device and NIC is the number of hwloc objects in between, which includes the lowest common ancestor on the hwloc topology tree.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Force-pushed from a8587c2 to 66912b9.
Chatted with @naughtont3 offline. I will merge this PR for now. New issues will be filed later based on additional testing.
This patch introduces a new capability to select the NIC closest to the user-requested accelerator (PCI) device. The implementation should suit all accelerator types, i.e. cuda & rocm. This change addresses #11696.

In this patch, we introduce an overriding logic when an accelerator has been initialized - instead of selecting a NIC on the same package, we select the NIC closest to the accelerator (which might be on a different package).
To enable this feature, the application should set the MCA parameter `OMPI_MCA_opal_common_ofi_accelerator_rank` (default -1) to a non-negative integer, which represents the process rank (0-based) on the same accelerator.
The implementation depends on the following APIs:

- `accelerator.get_device_pci_attr`: retrieve the PCI info of the accelerator.
- `hwloc_get_pcidev_by_busid`: get the hwloc objects of the accelerator and the provider (NIC).
- `hwloc_get_common_ancestor_obj`: get the closest common ancestor hwloc object between the accelerator and the provider.
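To make this concrete, here is a minimal sketch (not the code in this PR) of how the two hwloc calls can be combined into the hop-count proximity metric described in the selection logic below. Error handling, the `accelerator.get_device_pci_attr` plumbing, and topology setup with I/O objects enabled are omitted; `pci_distance` and `hops_below_ancestor` are hypothetical names.

```c
#include <limits.h>
#include <hwloc.h>

/* Hypothetical helper: count objects strictly between a PCI device and one of
 * its ancestors (both endpoints excluded). */
static unsigned hops_below_ancestor(hwloc_obj_t obj, hwloc_obj_t ancestor)
{
    unsigned hops = 0;
    for (hwloc_obj_t cur = obj->parent; cur != NULL && cur != ancestor; cur = cur->parent)
        ++hops;
    return hops;
}

/* Proximity of a GPU and a NIC identified by their PCI bus IDs: the number of
 * hwloc objects between them, counting their lowest common ancestor once. */
static unsigned pci_distance(hwloc_topology_t topo,
                             unsigned gpu_domain, unsigned gpu_bus, unsigned gpu_dev, unsigned gpu_func,
                             unsigned nic_domain, unsigned nic_bus, unsigned nic_dev, unsigned nic_func)
{
    hwloc_obj_t gpu = hwloc_get_pcidev_by_busid(topo, gpu_domain, gpu_bus, gpu_dev, gpu_func);
    hwloc_obj_t nic = hwloc_get_pcidev_by_busid(topo, nic_domain, nic_bus, nic_dev, nic_func);
    if (NULL == gpu || NULL == nic)
        return UINT_MAX;  /* PCI info unavailable: fall back to package-based selection */

    hwloc_obj_t ancestor = hwloc_get_common_ancestor_obj(topo, gpu, nic);

    /* distance(GPU, common ancestor) + distance(NIC, common ancestor) */
    return hops_below_ancestor(gpu, ancestor) + hops_below_ancestor(nic, ancestor) + 1;
}
```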
The NIC selection logic can be summarized as follows:

- The device distance is not taken from `pmix_device_distance_t` or `hwloc_distances_s` for practical reasons - they are not computable on every platform, e.g. on AWS EC2 we cannot reliably get such values between GPU and NIC. Instead, the device proximity is measured as `distance(GPU, common ancestor) + distance(NIC, common ancestor)`; see https://www.open-mpi.org/projects/hwloc/doc/v2.9.1/a00359.php.
- Among the nearest providers, the NIC is chosen as `(local rank on the same accelerator) % (number of nearest providers)` (see the sketch after this list). Note that we do not have a good way to calculate `local rank on the same accelerator`, so instead we reuse `local rank on the same package` as a proxy.
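And a sketch of the selection step, again illustrative rather than the PR's actual code: given one proximity value per candidate provider, keep the nearest ones and spread processes over them with the modulo rule above, using the package-local rank as the proxy just described. `select_nearest_nic` and the plain arrays are hypothetical.

```c
#include <stddef.h>

/* Pick a provider index: find the minimum distance, count how many providers
 * share it, then round-robin across those by local rank. */
static size_t select_nearest_nic(const unsigned *distance, size_t num_nics, unsigned local_rank)
{
    /* Smallest GPU<->NIC distance among all candidates. */
    unsigned best = distance[0];
    for (size_t i = 1; i < num_nics; ++i)
        if (distance[i] < best)
            best = distance[i];

    /* Number of "nearest" providers. */
    size_t num_nearest = 0;
    for (size_t i = 0; i < num_nics; ++i)
        if (distance[i] == best)
            ++num_nearest;

    /* (local rank on the same accelerator) % (number of nearest providers),
     * mapped back to an index in the full provider list. */
    size_t pick = (size_t)local_rank % num_nearest;
    size_t seen = 0;
    for (size_t i = 0; i < num_nics; ++i) {
        if (distance[i] == best && seen++ == pick)
            return i;
    }
    return 0;  /* not reached when num_nics > 0 */
}
```

Note that when all distances are equal (for example, when PCI information is unavailable), this degenerates to a plain round-robin over the providers, consistent with the behavior discussed earlier in this conversation.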