ucx: check supported transports for setting priority #8496

Merged
merged 1 commit into from
Mar 3, 2021

Conversation

yosefe
Contributor

@yosefe yosefe commented Feb 17, 2021

Fixes #8489

Results

Log when transports are not found:

  • [jazz01.swx.labs.mlnx:01720] ../../../../../opal/mca/common/ucx/common_ucx.c:171 didn't find matching transports, setting priority to 25

Log when transports are found:

  • [jazz01.swx.labs.mlnx:03174] ../../../../../opal/mca/common/ucx/common_ucx.c:163 found 'ud_verbs' transport on ud_verbs/mlx5_0:1, setting priority to 51
  • [vulcan04.swx.labs.mlnx:32443] ../../../../../opal/mca/common/ucx/common_ucx.c:163 found 'cuda_ipc' transport on cuda_ipc/cuda, setting priority to 51

@yosefe yosefe force-pushed the topic/pml-ucx-set-priority branch from e9f36b1 to 1e14648 Compare February 17, 2021 22:45
@yosefe
Contributor Author

yosefe commented Feb 17, 2021

adding @jsquyres @hppritcha @jladd-mlnx

for (i = 0; i < sizeof(transports) / sizeof(transports[0]); ++i) {
    /* Match "resource 6 : md 5 dev 4 flags -- rc_verbs/mlx5_0:1" */
    snprintf(needle, sizeof(needle), " %s/", transports[i]);
    match = strstr(buffer, needle);
Member

Can't we search for just the transport name, without the /?

Contributor Author

I wanted a stricter check so that it would not match partial words.

Member

ok, maybe add /mlx then? To avoid the case when ud is supported for some non-mlnx HCA which may have better support in other pmls/btls
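
The stricter match discussed in this thread can be sketched as below. This is only an illustration, assuming the resource line format shown in the code comment above; `has_transport` is a hypothetical helper name, not a function from the patch.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the patch): returns true if `line` names
 * transport `tl` as a whole word, i.e. preceded by a space and followed
 * by '/', as in "... -- rc_verbs/mlx5_0:1". */
static bool has_transport(const char *line, const char *tl)
{
    char needle[64];
    int n = snprintf(needle, sizeof(needle), " %s/", tl);

    if (n < 0 || (size_t)n >= sizeof(needle)) {
        return false; /* transport name does not fit in the buffer */
    }
    return strstr(line, needle) != NULL;
}
```

With the leading space and trailing slash in the needle, a search for `verbs` or `rc` does not match a line that only contains `rc_verbs/`.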

{
#if HAVE_DECL_OPEN_MEMSTREAM
    static const char *transports[] = {
        "cuda_ipc", "ud_verbs", "rc_verbs", "dc_mlx5"
Contributor

no rc_mlx5?

Contributor Author

IMO, this is a sufficient check, prefer to reduce init time if possible

@yosefe We can add "rocm_ipc" to the list as well

Member

With ud_verbs in there, aren't we still going to be selecting UCX (over UD) over Libfabric for EFA and/or usnic?

Member

@jsquyres jsquyres left a comment

I'm not sure I understand this approach, nor do I understand how it addresses #8489.

It looks like you are searching for some UCX transports, and if you don't find them, you decrease UCX's priority to 25.

But OB1's priority is 20, so even if you don't find any of the desired transports, you're still outranking OB1, and therefore usNIC still won't get selected.

Additionally, you're not doing this the way any other Open MPI component does it.

  1. Other components have either an include and an exclude list, or a single list that can be negated with a prefix ^. E.g.
    • pml_ucx_transports_include=blah,blah,blah and pml_ucx_transports_exclude=blah,blah,blah, OR
    • pml_ucx_transports=[^]blah,blah,blah.
  2. Additionally, if you don't find any of the desired transports, then UCX should completely exclude itself from selection -- don't just go down to an arbitrarily lower priority level.

@yosefe
Contributor Author

yosefe commented Feb 18, 2021

> It looks like you are searching for some UCX transports, and if you don't find them, you decrease UCX's priority to 25.

@jsquyres I see your point, after looking at other components' code.
Could we do it like this then:

  1. Add a "pml_ucx_tls_include" param.
  2. If none of the transports from (1) is found, set the priority to 0 (so we could still fall back to UCX if none of the other components is available).
  3. Otherwise, if a matching transport is found, set the priority to a high value.

@jsquyres
Member

> @jsquyres I see your point, after looking at other components code.
> Could we do it like this then:
>
>   1. Add "pml_ucx_tls_include" param

If you have an include, you should also have an exclude.

Or just have pml_ucx_tls and allow a ^ to negate the list. I.e.,

  • pml_ucx_tls=a,b,c is an "include" list -- it means that UCX can use ONLY the a, b, or c TLS transports.
  • pml_ucx_tls=^x,y,z is an "exclude" list -- it means that UCX can use any TLS transport EXCEPT x, y, or z.

This is how most (all?) other Open MPI components allow selection.
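
The include/exclude semantics described above can be sketched as follows. This is a minimal illustration that assumes a plain comma-separated list with an optional leading `^`; `list_contains` and `tl_allowed` are hypothetical names, not Open MPI APIs.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: does `name` appear as a whole entry in the
 * comma-separated `list`? */
static bool list_contains(const char *list, const char *name)
{
    size_t len = strlen(name);
    const char *p = list;

    while (*p != '\0') {
        const char *end = strchr(p, ',');
        size_t n = (end != NULL) ? (size_t)(end - p) : strlen(p);

        if (n == len && strncmp(p, name, len) == 0) {
            return true;
        }
        if (end == NULL) {
            break; /* last entry checked */
        }
        p = end + 1;
    }
    return false;
}

/* Hypothetical helper: "a,b,c" is an include list (only a, b, or c are
 * allowed); "^x,y,z" is an exclude list (anything except x, y, or z). */
static bool tl_allowed(const char *spec, const char *name)
{
    bool negate = (spec[0] == '^');
    bool listed = list_contains(negate ? spec + 1 : spec, name);

    return negate ? !listed : listed;
}
```

Under these assumptions, `tl_allowed("a,b,c", "b")` is true and `tl_allowed("^x,y,z", "y")` is false.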

>   2. If none of the transports (1) is found - set priority to 0 (so could still fallback to UCX if none of the other components is available)

If no desired TLS transports are found, then the UCX PML cannot be used.

Put differently: the user (or defaults) has indicated which TLS transports can be used by UCX. If none of those are available, then it doesn't make sense for the UCX PML to be used at all. Meaning: UCX should disqualify itself from being selected.

It would violate the law of least surprise if you say "only use UCX TLS providers A, B, or C" but then, since none of those are available, the UCX PML chose to use TLS provider D anyway.

>   3. Otherwise, if a matching transport is found, set priority to a high value

Yes, if you find one of the UCX "include"d transports, set a high priority value so that the UCX PML will be selected.

@yosefe
Contributor Author

yosefe commented Feb 21, 2021

@jsquyres pushed updated patch, pls LMK if I got your intention correctly

Member

@jsquyres jsquyres left a comment

I confirm that this works for usNIC; I don't know if it's sufficient for other networks. Can other vendors chime in here?

Thanks!

#if HAVE_DECL_OPEN_MEMSTREAM
    char needle[64], line[128], *ptr;
    char **tl_list, **tl_name;
    int found, negate, enable;
Member

FWIW, Open MPI does have the bool type. We tend to use that instead of int for boolean values.

@@ -39,10 +43,13 @@ static void opal_common_ucx_mem_release_cb(void *buf, size_t length,

OPAL_DECLSPEC void opal_common_ucx_mca_var_register(const mca_base_component_t *component)
{
    static const char *default_tls = "rc_verbs,ud_verbs,rc_mlx5,dc_mlx5,cuda_ipc,rocm_ipc";
Member

Do we know that this is a good enough default? I.e., are there devices that export userspace verbs interfaces (RC or UD) that don't want to use UCX? Infinipath comes to mind. I don't know about Omnipath, or others...?

Contributor Author

that's a good question, but how was it handled by OpenIB BTL?

Member

They have their own libraries (psm, psm2) that only support infinipath and omnipath so pml/cm took precedence because their library found their hardware. You could use the openib btl with those networks by simply specifying --mca pml ob1. For UCX you will need to specifically make sure that the priority is low if either network is found (at least lower than pml/cm but probably also lower than pml/ob1).

Contributor Author

Seems what we want is:
(1) prio_ucx_with_mlx = prio_cm_with_opa/ipath > prio_ucx_with_verbs_non_mlx > prio_openib_btl
OR:
(2) prio_ucx_with_mlx = prio_cm_with_opa/ipath > prio_openib_btl > prio_ucx_with_verbs_non_mlx
WDYT?

Member

I'm not sure what you're asking. Remember, the openib BTL no longer exists on master (i.e., what will become v5.0.0).

What we want is:

  1. CM PML is used for EFA-, GNI-, Infinipath-, Omnipath-, iWARP-, and Portals-based networks
  2. UCX PML is used for IB- or RoCE-based networks.
  3. OB1 PML is used for everything else.

See this slide: https://docs.google.com/presentation/d/1AFBgQ-O8YEoqgNjXoQvVFK0SNd7iW899JonxHu6UY4Y/edit#slide=id.g8bc052a5e6_1_18

Contributor Author

The tricky part is that some of the networks in (1) also expose and/or emulate IB capabilities.

For (2), we can have a stricter (but ugly) check that the device we use with UCX belongs to an "mlx**" driver.
Another option for (2) is to add a UCX API for getting more details about devices, vendors, etc., but then it won't work on current/previous UCX versions.
WDYT?

Member

@yosefe That approach sounds fine to me. In the long run I wonder if we should revive the idea of a framework to handle the enabling/disabling of network types, APIs, etc. We could bake this logic into that component. btl/uct, for example, should also only be enabled on Mellanox hardware. Right now it looks for specific transports, but the logic is fragile and could break if UCX changes how transports are named.

Contributor Author

IMO we need to have a common approach of how to decide which component to select, and we can add the necessary APIs in UCX to support that. Indeed, using transport/device/vendor string name does not feel right for the long term...

Member

For 4.1.1, I think I'd propose:

  1. If the UCX PML sees a MLNX device (either by device driver permissions or even "there's any PCI device with an MLNX vendorid"), the UCX PML sets its priority to 52.
  2. If the UCX PML sees any other devices, it sets its priority to 19 (below OB1)
  3. Libfabric MTL / BTL adds verbs to the default provider blocklist, so that the reverse selection problem doesn't occur.

This is not perfect. It really sucks. But it should work well enough for 4.1.x. And, because we're just goofing with priorities instead of disabling the UCX PML, customers can say -mca pml ucx to enable UCX in non-default situations.

Long term, I think we need to get back to some type of centralized network selection infrastructure. I'm not sure we can do that for 5.0.0, but I do think a centralized vendorid map to set priorities of UCX/OFI/PSM/btls should be possible. I volunteered at the telecon today to write up a proposal by end of the week.

@yosefe
Contributor Author

yosefe commented Feb 23, 2021

Maybe, in order to keep things simpler, we can set UCX priority to be more than openib btl, but less than all EFA/OPA/ipath/usnic-specific transports?

@hjelmn
Member

hjelmn commented Feb 23, 2021

@yosefe It doesn't work like that. For USNIC, uGNI, tcp, shared-memory only, etc., ob1 should be selected by default. So you will need to be very specific in your check. That is how it has worked in the past with other transports. Pretty much: if any Mellanox hardware is detected, the default priority is high; if not, it is lower than ob1.

@bwbarrett
Member

> Maybe, in order to keep things simpler, we can set UCX priority to be more than openib btl, but less than all EFA/OPA/ipath/usnic-specific transports?

I think the challenge with that is that we then end up in the opposite world. If Open MPI is built with both UCX and Libfabric and you're running on MLNX hardware, then Open MPI would select Libfabric (running over verbs) on the MLNX hardware, which is probably not what we want.

@jsquyres
Member

I can make at least one variable simpler: usNIC hasn't offered a userspace verbs interface for several years now, so I'm guessing that usNIC doesn't work with UCX ud_verbs. Supporting that guess: I tested this PR, even with ud_verbs included in the UCX defaults, and it disqualified the UCX PML and defaulted to ob1/usnic. That doesn't help with the other networks, but hopefully this is a helpful data point.

@mwheinz

mwheinz commented Feb 24, 2021

This patch definitely blocks the PSM2 and OFI MTLs from being selected despite the presence of OPA hardware and the absence of Mellanox devices - or any other brand of RoCE device.

@gpaulsen
Member

@mwheinz, did you test whether the PSM2 and OFI MTLs are selected on master today without this PR?

@yosefe
Contributor Author

yosefe commented Feb 28, 2021

@jsquyres fixed according to latest comments (filter by mlx* device name). The logic got much more complicated though :(

$ mpirun  --mca pml_ucx_verbose 2 -n 1 ./a.out 
[hostname:192985] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:198 mca_pml_ucx_open
[hostname:192985] ../../../../../opal/mca/common/ucx/common_ucx.c:307 posix/memory: did not match transport list
[hostname:192985] ../../../../../opal/mca/common/ucx/common_ucx.c:307 sysv/memory: did not match transport list
[hostname:192985] ../../../../../opal/mca/common/ucx/common_ucx.c:307 self/memory0: did not match transport list
[hostname:192985] ../../../../../opal/mca/common/ucx/common_ucx.c:307 tcp/eno1: did not match transport list
[hostname:192985] ../../../../../opal/mca/common/ucx/common_ucx.c:307 tcp/p2p2: did not match transport list
[hostname:192985] ../../../../../opal/mca/common/ucx/common_ucx.c:307 tcp/lo: did not match transport list
[hostname:192985] ../../../../../opal/mca/common/ucx/common_ucx.c:307 tcp/ib0: did not match transport list
[hostname:192985] ../../../../../opal/mca/common/ucx/common_ucx.c:307 tcp/ib3: did not match transport list
[hostname:192985] ../../../../../opal/mca/common/ucx/common_ucx.c:307 tcp/ib2: did not match transport list
[hostname:192985] ../../../../../opal/mca/common/ucx/common_ucx.c:202 driver '../../../../bus/pci/drivers/mlx5_core' matched by 'mlx*'
[hostname:192985] ../../../../../opal/mca/common/ucx/common_ucx.c:298 rc_verbs/mlx5_0:1: matched both transport and device list
[hostname:192985] ../../../../../opal/mca/common/ucx/common_ucx.c:314 support level is transports and devices
[hostname:192985] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:282 mca_pml_ucx_init
[hostname:192985] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:120 Pack remote worker address, size 714
[hostname:192985] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:120 Pack local worker address, size 924
[hostname:192985] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:339 created ucp context 0x18f3c00, worker 0x19c4380
[hostname:192985] ../../../../../ompi/mca/pml/ucx/pml_ucx_component.c:114 returning priority 51
[hostname:192985] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:187 Got proc 0 address, size 924
[hostname:192985] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:397 connecting to proc. 0
Hello, world, I am 0 of 1, (Open MPI v5.0.0a1, package: Open MPI username@hostname Distribution, ident: 5.0.0a1, repo rev: v2.x-dev-8395-g723c6b77b1, Unreleased developer copy, 158)
[hostname:192985] ../../../../../opal/mca/common/ucx/common_ucx.c:444 disconnecting from rank 0
[hostname:192985] ../../../../../opal/mca/common/ucx/common_ucx.c:408 waiting for 0 disconnect requests
[hostname:192985] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:353 mca_pml_ucx_cleanup
[hostname:192985] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:261 mca_pml_ucx_close
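
The "driver ... matched by 'mlx*'" line in the log above suggests a glob match on the driver name. The sketch below shows one way such a check could be written with POSIX `fnmatch`. It is an assumption, not the patch's code: the log does not show how the driver link is obtained or which matching function is used, and `driver_matches` is a hypothetical helper name.

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (an assumption, not taken from the patch): match the
 * last path component of a driver link such as
 * "../../../../bus/pci/drivers/mlx5_core" against a glob such as "mlx*". */
static bool driver_matches(const char *driver_link, const char *pattern)
{
    const char *base = strrchr(driver_link, '/');

    base = (base != NULL) ? base + 1 : driver_link;
    return fnmatch(pattern, base, 0) == 0; /* fnmatch returns 0 on a match */
}
```

Matching only the last path component keeps the pattern short ("mlx*") and independent of where the driver directory sits in the path.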

@gpaulsen
Member

gpaulsen commented Mar 1, 2021

@mwheinz, can you please re-test this now?

@mwheinz

mwheinz commented Mar 1, 2021

> @mwheinz, can you please re-test this now?

I’ll try to get to it soon. I could not get master branch to build so I’m cherry-picking this into 4.1.x.

@rajachan
Member

rajachan commented Mar 1, 2021

Tested the latest version of this PR on a system with EFA, and the logic works as intended:

select: initializing pml component cm
select: init returned priority 25
select: initializing pml component ob1
select: init returned priority 20
select: initializing pml component ucx
common_ucx.c:307 self/memory0: did not match transport list
common_ucx.c:302 ud_verbs/efa_0:1: matched transport list but not device list
common_ucx.c:302 ud_verbs/efa_1:1: matched transport list but not device list
common_ucx.c:302 ud_verbs/efa_2:1: matched transport list but not device list
common_ucx.c:302 ud_verbs/efa_3:1: matched transport list but not device list
common_ucx.c:314 support level is transports only
pml_ucx.c:282 mca_pml_ucx_init
pml_ucx.c:118 Pack remote worker address, size 149
pml_ucx.c:118 Pack local worker address, size 178
pml_ucx.c:337 created ucp context 0x1054380, worker 0x12b8ef0
pml_ucx_component.c:114 returning priority 19
select: init returned priority 19
select: component monitoring not in the include list
selected cm best priority 25
select: component cm selected
select: component ob1 not selected / finalized
pml_ucx.c:353 mca_pml_ucx_cleanup
select: component ucx not selected / finalized

@mwheinz

mwheinz commented Mar 1, 2021

[cn-priv-01:3749165] select: initializing mtl component psm2
[cn-priv-01:3749165] select: init returned success
[cn-priv-01:3749165] select: component psm2 selected
[cn-priv-01:3749165] select: init returned priority 40
[cn-priv-01:3749165] select: component monitoring not in the include list
[cn-priv-01:3749165] select: initializing pml component ob1
[cn-priv-01:3749165] select: init returned priority 20
[cn-priv-01:3749165] select: initializing pml component ucx
[cn-priv-01:3749165] select: init returned priority 19
[cn-priv-01:3749165] selected cm best priority 40
[cn-priv-01:3749165] select: component cm selected

So it looks good.

@rhc54
Contributor

rhc54 commented Mar 2, 2021

Before we accept this as "solved" - have we verified that OMPI built against UCX and/or libfabric, but with no fabric device installed, defaults to "ob1" and the BTLs as it should? This is the most common case we encounter when using distros.

@jsquyres
Member

jsquyres commented Mar 2, 2021

> Before we accept this as "solved" - have we verified that OMPI built against UCX and/or libfabric, but with no fabric device installed, defaults to "ob1" and the BTLs as it should? This is the most common case we encounter when using distros.

I'll check while I'm (re)checking usNIC.

Member

@jsquyres jsquyres left a comment

This works correctly for usNIC. I also verified that it falls back to the TCP BTL when compiled with UCX and usNIC is not available.

One minor thing, however: this requires 2 MCA params (because we're distinguishing between transports and devices). This is different than elsewhere in Open MPI. At a minimum, this should probably be documented somewhere.

@yosefe
Contributor Author

yosefe commented Mar 2, 2021

> One minor thing, however: this requires 2 MCA params (because we're distinguishing between transports and devices). This is different than elsewhere in Open MPI. At a minimum, this should probably be documented somewhere.

@jsquyres shall I update the FAQ at ompi-www?
Ok to squash this PR?

@jsquyres
Member

jsquyres commented Mar 2, 2021

Yes, please squash.
Yes, please update the FAQ.

Thanks!

Add "pml_ucx_tls" parameter to control the transports to include or
exclude (with a '^' prefix). UCX will be enabled only if one of the
included transports is present, or if a non-excluded transport is
present (in case of exclude-list with a '^' prefix).

Add "pml_ucx_devices" parameter to control the devices which make UCX
component set a high priority for itself. If none of the listed devices
is present, the priority will be set to 19 - lower than ob1 priority.

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
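
The behavior described in this commit message can be summarized with the sketch below. The values 51 and 19 are taken from the logs and the commit message in this conversation; the enum and function names are hypothetical and only mirror the "support level is ..." log lines, so treat this as an illustration, not the merged code.

```c
#include <stdbool.h>

/* Hypothetical names; the levels mirror the "support level is ..." log lines. */
typedef enum {
    SUPPORT_LEVEL_NONE,                 /* no allowed transport is present */
    SUPPORT_LEVEL_TRANSPORT,            /* allowed transport, no listed device */
    SUPPORT_LEVEL_TRANSPORT_AND_DEVICE  /* allowed transport on a listed device */
} support_level_t;

/* Returns false if the component should disqualify itself; otherwise stores
 * the priority to report in *priority and returns true. */
static bool select_priority(support_level_t level, int *priority)
{
    switch (level) {
    case SUPPORT_LEVEL_TRANSPORT_AND_DEVICE:
        *priority = 51; /* high priority, as in the mlx5 log above */
        return true;
    case SUPPORT_LEVEL_TRANSPORT:
        *priority = 19; /* below ob1 (20), as in the EFA and OPA logs */
        return true;
    default:
        return false;   /* no usable transport: do not offer UCX at all */
    }
}
```

Because the "transports only" case lowers the priority instead of disabling the component, a user can still force UCX with `-mca pml ucx`, as noted earlier in the discussion.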
@yosefe yosefe force-pushed the topic/pml-ucx-set-priority branch from 2023aa3 to 562c57c Compare March 2, 2021 20:33
@gpaulsen
Member

gpaulsen commented Mar 2, 2021

Thanks @yosefe!

Successfully merging this pull request may close these issues.

UCX PML is pre-empting the usNIC BTL