-
Notifications
You must be signed in to change notification settings - Fork 893
ucx: check supported transports for setting priority #8496
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e9f36b1
to
1e14648
Compare
adding @jsquyres @hppritcha @jladd-mlnx |
opal/mca/common/ucx/common_ucx.c
Outdated
for (i = 0; i < sizeof(transports) / sizeof(transports[0]); ++i) { | ||
/* Match "resource 6 : md 5 dev 4 flags -- rc_verbs/mlx5_0:1" */ | ||
snprintf(needle, sizeof(needle), " %s/", transports[i]); | ||
match = strstr(buffer, needle); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we search just transport name without /
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wanted to have a more strict check so it would not match against partial words
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, maybe add /mlx
then? To avoid the case when ud is supported for some non-mlnx HCA which may have better support in other pmls/btls
opal/mca/common/ucx/common_ucx.c
Outdated
{ | ||
#if HAVE_DECL_OPEN_MEMSTREAM | ||
static const char *transports[] = { | ||
"cuda_ipc", "ud_verbs", "rc_verbs", "dc_mlx5" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no rc_mlx5?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, this is a sufficient check, prefer to reduce init time if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yosefe We can add "rocm_ipc" to the list as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With ud_verbs in there, aren't we still going to be selecting UCX (over UD) over Libfabric for EFA and/or usnic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this approach, nor do I understand how it addresses #8489.
It looks like you are searching for some UCX transports, and if you don't find them, you decrease UCX's priority to 25.
But OB1's priority is 20, so even if you don't find any of the desired transports, you're still outranking OB1, and therefore usNIC still won't get selected.
Additionally, you're not doing this the way any other Open MPI component does it.
- Other components have either an include and an exclude list, or a single list that can be negated with a prefix
^
. E.g.pml_ucx_transports_include=blah,blah,blah
andpml_ucx_transports_exclude=blah,blah,blah
, ORpml_ucx_transports=[^]blah,blah,blah
.
- Additionally, if you don't find any of the desired transports, then UCX should completely exclude itself from selection -- don't just go down to an arbitrarily lower priority level.
@jsquyres I see your point, after looking at other components code.
|
If you have an Or just have
This is how most (all?) other Open MPI components allow selection.
If no desired TLS transports are found, then the UCX PML cannot be used. Put differently: the user (or defaults) has indicated which TLS transports can be used by UCX. If none of those are available, then it doesn't make sense for the UCX PML to be used at all. Meaning: UCX should disqualify itself from being selected. It would violate the law of least surprise if you say "only use UCX TLS providers A, B, or C" but then since none of those are available, the UCX PML chose to use TLX provider D anyway.
Yes, if you find one of the UCX "include"d transports, set a high priority value so that the UCX PML will be selected. |
@jsquyres pushed updated patch, pls LMK if I got your intention correctly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirm that this works for usNIC; I don't know if it's sufficient for other networks. Can other vendors chime in here?
Thanks!
opal/mca/common/ucx/common_ucx.c
Outdated
#if HAVE_DECL_OPEN_MEMSTREAM | ||
char needle[64], line[128], *ptr; | ||
char **tl_list, **tl_name; | ||
int found, negate, enable; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, Open MPI does have the bool
type. We tend to use that instead of int
for boolean values.
opal/mca/common/ucx/common_ucx.c
Outdated
@@ -39,10 +43,13 @@ static void opal_common_ucx_mem_release_cb(void *buf, size_t length, | |||
|
|||
OPAL_DECLSPEC void opal_common_ucx_mca_var_register(const mca_base_component_t *component) | |||
{ | |||
static const char *default_tls = "rc_verbs,ud_verbs,rc_mlx5,dc_mlx5,cuda_ipc,rocm_ipc"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know that this is a good enough default? I.e., are there devices that export userspace verbs interfaces (RC or UD) that don't want to use UCX? Infinipath comes to mind. I don't know about Omnipath, or others...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a good question, but how was it handled by OpenIB BTL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They have their own libraries (psm, psm2) that only support infinipath and omnipath so pml/cm took precedence because their library found their hardware. You could use the openib btl with those networks by simply specifying --mca pml ob1
. For UCX you will need to specifically make sure that the priority is low if either network is found (at least lower than pml/cm but probably also lower than pml/ob1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems we want is:
(1) prio_ucx_with_mlx = prio_cm_with_opa/ipath > prio_ucx_with_verbs_non_mlx > prio_openib_btl
OR:
(2) prio_ucx_with_mlx = prio_cm_with_opa/ipath > prio_openib_btl > prio_ucx_with_verbs_non_mlx
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you're asking. Remember, the openib BTL no longer exists on master (i.e., what will become v5.0.0).
What we want is:
- CM PML is used for EFA-, GNI-, Infinipath-, Omnipath-, iWARP-, and Portals-based networks
- UCX PML is used for IB- or RoCE-based networks.
- OB1 PML is used for everything else.
See this slide: https://docs.google.com/presentation/d/1AFBgQ-O8YEoqgNjXoQvVFK0SNd7iW899JonxHu6UY4Y/edit#slide=id.g8bc052a5e6_1_18
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tricky part is that some of (1) networks are also exposing and/or emulating IB capabilities..
We can have a more strict (but ugly) check that the device we use with UCX belongs to a "mlx**" driver for (2)
Another option for (2) is to add UCX API for getting more details about devices, vendors etc- but then it won't work on current/previous UCX versions.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yosefe That approach sounds fine to me. In the long run I wonder if we should revive the idea of a framework to handle the enabling/disabling of network types, APIs, etc. Could bake this logic into that component. btl/uct, for example, also should only be enabling on mellanox hardware. Right now it looks for specific transports but the logic is fragile and could break if UCX changes how transports are named.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we need to have a common approach of how to decide which component to select, and we can add the necessary APIs in UCX to support that. Indeed, using transport/device/vendor string name does not feel right for the long term...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 4.1.1, I think I'd propose:
- If the UCX PML sees a MLNX device (either by device driver permissions or even "there's any PCI device with an MLNX vendorid", UCX PML sets its priority to 52.
- If the UCX PML sees any other devices, it sets its priority to 19 (below OB1)
- Libfabric MTL / BTL adds verbs to the default provider blocklist, so that the reverse selection problem doesn't occur.
This is not perfect. It really sucks. But it should work well enough for 4.1.x. And, because we're just goofing with priorities instead of disabling the UCX PML, customers can say -mca pml ucx
to enable UCX in non-default situations.
Long term, I think we need to get back to some type of centralized network selection infrastructure. I'm not sure we can do that for 5.0.0, but I do think a centralized vendorid map to set priorities of UCX/OFI/PSM/btls should be possible. I volunteered at the telecon today to write up a proposal by end of the week.
opal/mca/common/ucx/common_ucx.c
Outdated
{ | ||
#if HAVE_DECL_OPEN_MEMSTREAM | ||
static const char *transports[] = { | ||
"cuda_ipc", "ud_verbs", "rc_verbs", "dc_mlx5" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With ud_verbs in there, aren't we still going to be selecting UCX (over UD) over Libfabric for EFA and/or usnic.
Maybe, in order to keep things simpler, we can set UCX priority to be more than openib btl, but less than all EFA/OPA/ipath/usnic-specific transports? |
@yosefe It doesn't work like that. For USNIC, uGNI, tcp, shared-memory only, etc ob1 should be selected by default. So you will need to be very specific in your check. That is how it has worked in the past with other transports. Pretty much if any mellanox hardware is detected then default priority == high, if not it is lower than ob1. |
I think the challenge with that is then we end up in the opposite world. If Open MPI is built with both UCX and Libfabric and you're running on MLNX hardware, then Open MPI would select Libfabric over verbs over the MLNX hardware, which is probably not what we want. |
I can make at least one variable simpler: usNIC hasn't offered a userspace verbs interface for several years now. So I'm guessing that usNIC doesn't work with UCX ud_verbs. Maybe showing that is that I tested this PR -- even including |
This patch definitely blocks the PSM2 and OFI MTLs from being selected despite the presence of OPA hardware and the absence of Mellanox devices - or any other brand of RoCE device. |
@mwheinz, did you test if PSM2 and OFI MTLS selected on master today without this PR? |
@jsquyres fixed according to latest comments (filter by mpirun --mca pml_ucx_verbose 2 -n 1 ./a.out
|
@mwheinz, can you please re-test this now? |
I’ll try to get to it soon. I could not get master branch to build so I’m cherry-picking this into 4.1.x. |
Tested the latest version of this PR on a system with EFA, and the logic works as intended:
|
|
Before we accept this as "solved" - have we verified that OMPI built against UCX and/or libfabric, but with no fabric device installed, defaults to "ob1" and the BTLs as it should? This is the most common case we encounter when using distros. |
I'll check while I'm (re)checking usNIC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works correctly for usNIC. I also verify that it works for defaulting over to the TCP BTL when compiled with UCX and usNIC is not available.
One minor thing, however: this requires 2 MCA params (because we're distinguishing between transports and devices). This is different than elsewhere in Open MPI. At a minimum, this should probably be documented somewhere.
@jsquyres shall I update the FAQ at ompi-www? |
Yes, please squash. Thanks! |
Add "pml_ucx_tls" parameter to control the transports to include or exclude (with a '^' prefix). UCX will be enabled only if one of the included transports is present, or if a non-excluded transport is present (in case of exclude-list with a '^' prefix). Add "pml_ucx_devices" parameter to control the devices which make UCX component set a high priority for itself. If none of the listed devices is present, the priority will be set to 19 - lower than ob1 priority. Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2023aa3
to
562c57c
Compare
Thanks @yosefe! |
Fixes #8489
Results
Log when transports are not found:
[jazz01.swx.labs.mlnx:01720] ../../../../../opal/mca/common/ucx/common_ucx.c:171 didn't find matching transports, setting priority to 25
Log when transports are found:
[jazz01.swx.labs.mlnx:03174] ../../../../../opal/mca/common/ucx/common_ucx.c:163 found 'ud_verbs' transport on ud_verbs/mlx5_0:1, setting priority to 51
[vulcan04.swx.labs.mlnx:32443] ../../../../../opal/mca/common/ucx/common_ucx.c:163 found 'cuda_ipc' transport on cuda_ipc/cuda, setting priority to 51