
Openib BTL being selected and used when not desired #7107

Closed

mwheinz opened this issue Oct 25, 2019 · 13 comments

mwheinz commented Oct 25, 2019

Background information

When using the OFI MTL with 4.0.1 and 4.0.2 I am seeing the following warning:

By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
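
For reference, the override that message refers to is an ordinary MCA parameter; a minimal sketch of setting it (either form should be equivalent):

# sketch only: explicitly allow the openib BTL to drive IB/OPA ports
mpirun --mca btl_openib_allow_ib true ...

# or via the environment, using Open MPI's OMPI_MCA_ prefix convention
export OMPI_MCA_btl_openib_allow_ib=true

Of course, the point of this report is that no such override should be needed when openib is not wanted at all.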

After some investigation, I discovered that the problem is that OMPI is selecting a BTL transport even though the OFI MTL was selected. Moreover, attempting to exclude openib as a transport simply causes OMPI to select a different BTL instead of using the selected MTL. If I exclude all BTLs except self, the benchmark begins reporting errors instead, although it then runs to completion anyway.

The problem does not occur when the PSM2 MTL is selected.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

4.0.1, 4.0.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Compiled from git clone of 4.0.[1|2]

Please describe the system on which you are running

  • Operating system/version: RHEL 8.0, RHEL 7.6
  • Computer hardware: Intel Xeon
  • Network type: OPA, TCP

Details of the problem

[LINUX hdsmpriv01 20191104_1617 mpi-benchmarks]# mpirun -allow-run-as-root --mca mtl ofi --mca mtl_ofi_provider_include psm2 --mca btl_base_verbose 10 --mca mtl_base_verbose 10 -H hdsmpriv01,hdsmpriv02 IMB-EXT Accumulate 2>&1 | tee a
[hdsmpriv01:64835] mca: base: components_register: registering framework btl components
[hdsmpriv01:64835] mca: base: components_register: found loaded component tcp
[hdsmpriv01:64835] mca: base: components_register: component tcp register function successful
[hdsmpriv01:64835] mca: base: components_register: found loaded component vader
[hdsmpriv01:64835] mca: base: components_register: component vader register function successful
[hdsmpriv01:64835] mca: base: components_register: found loaded component usnic
[hdsmpriv01:64835] mca: base: components_register: component usnic register function successful
[hdsmpriv01:64835] mca: base: components_register: found loaded component self
[hdsmpriv01:64835] mca: base: components_register: component self register function successful
[hdsmpriv01:64835] mca: base: components_register: found loaded component openib
[hdsmpriv01:64835] mca: base: components_register: component openib register function successful
[hdsmpriv01:64835] mca: base: components_register: found loaded component sm
[hdsmpriv01:64835] mca: base: components_open: opening btl components
[hdsmpriv01:64835] mca: base: components_open: found loaded component tcp
[hdsmpriv01:64835] mca: base: components_open: component tcp open function successful
[hdsmpriv01:64835] mca: base: components_open: found loaded component vader
[hdsmpriv01:64835] mca: base: components_open: component vader open function successful
[hdsmpriv01:64835] mca: base: components_open: found loaded component usnic
[hdsmpriv01:64835] mca: base: components_open: component usnic open function successful
[hdsmpriv01:64835] mca: base: components_open: found loaded component self
[hdsmpriv01:64835] mca: base: components_open: component self open function successful
[hdsmpriv01:64835] mca: base: components_open: found loaded component openib
[hdsmpriv01:64835] mca: base: components_open: component openib open function successful
[hdsmpriv01:64835] select: initializing btl component tcp
[hdsmpriv01:64835] select: init of component tcp returned success
[hdsmpriv01:64835] select: initializing btl component vader
[hdsmpriv01:64835] select: init of component vader returned failure
[hdsmpriv01:64835] mca: base: close: component vader closed
[hdsmpriv01:64835] mca: base: close: unloading component vader
[hdsmpriv01:64835] select: initializing btl component usnic
[hdsmpriv02:01067] mca: base: components_register: registering framework btl components
[hdsmpriv02:01067] mca: base: components_register: found loaded component sm
[hdsmpriv02:01067] mca: base: components_register: found loaded component self
[hdsmpriv02:01067] mca: base: components_register: component self register function successful
[hdsmpriv02:01067] mca: base: components_register: found loaded component vader
[hdsmpriv02:01067] mca: base: components_register: component vader register function successful
[hdsmpriv02:01067] mca: base: components_register: found loaded component usnic
[hdsmpriv02:01067] mca: base: components_register: component usnic register function successful
[hdsmpriv02:01067] mca: base: components_register: found loaded component openib
[hdsmpriv01:64835] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[hdsmpriv01:64835] select: init of component usnic returned failure
[hdsmpriv01:64835] mca: base: close: component usnic closed
[hdsmpriv01:64835] mca: base: close: unloading component usnic
[hdsmpriv01:64835] select: initializing btl component self
[hdsmpriv01:64835] select: init of component self returned success
[hdsmpriv01:64835] select: initializing btl component openib
[hdsmpriv01:64835] Checking distance from this process to device=hfi1_0
[hdsmpriv01:64835] Process is not bound: distance to device is 0.000000
[hdsmpriv02:01067] mca: base: components_register: component openib register function successful
[hdsmpriv02:01067] mca: base: components_register: found loaded component tcp
[hdsmpriv02:01067] mca: base: components_register: component tcp register function successful
[hdsmpriv02:01067] mca: base: components_open: opening btl components
[hdsmpriv02:01067] mca: base: components_open: found loaded component self
[hdsmpriv02:01067] mca: base: components_open: component self open function successful
[hdsmpriv02:01067] mca: base: components_open: found loaded component vader
[hdsmpriv02:01067] mca: base: components_open: component vader open function successful
[hdsmpriv02:01067] mca: base: components_open: found loaded component usnic
[hdsmpriv02:01067] mca: base: components_open: component usnic open function successful
[hdsmpriv02:01067] mca: base: components_open: found loaded component openib
[hdsmpriv02:01067] mca: base: components_open: component openib open function successful
[hdsmpriv02:01067] mca: base: components_open: found loaded component tcp
[hdsmpriv02:01067] mca: base: components_open: component tcp open function successful
[hdsmpriv02:01067] select: initializing btl component self
[hdsmpriv02:01067] select: init of component self returned success
[hdsmpriv02:01067] select: initializing btl component vader
[hdsmpriv02:01067] select: init of component vader returned failure
[hdsmpriv02:01067] mca: base: close: component vader closed
[hdsmpriv02:01067] mca: base: close: unloading component vader
[hdsmpriv02:01067] select: initializing btl component usnic
[hdsmpriv01:64835] rdmacm CPC only supported when the first QP is a PP QP; skipped
[hdsmpriv01:64835] openib BTL: rdmacm CPC unavailable for use on hfi1_0:1; skipped
[hdsmpriv01:64835] [rank=0] openib: using port hfi1_0:1
[hdsmpriv01:64835] select: init of component openib returned success
[hdsmpriv01:64835] mca: base: components_register: registering framework mtl components
[hdsmpriv01:64835] mca: base: components_register: found loaded component ofi
[hdsmpriv01:64835] mca: base: components_register: component ofi register function successful
[hdsmpriv01:64835] mca: base: components_open: opening mtl components
[hdsmpriv01:64835] mca: base: components_open: found loaded component ofi
[hdsmpriv01:64835] mca: base: components_open: component ofi open function successful
[hdsmpriv02:01067] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[hdsmpriv02:01067] select: init of component usnic returned failure
[hdsmpriv02:01067] mca: base: close: component usnic closed
[hdsmpriv02:01067] mca: base: close: unloading component usnic
[hdsmpriv02:01067] select: initializing btl component openib
[hdsmpriv02:01067] Checking distance from this process to device=hfi1_0
[hdsmpriv02:01067] Process is not bound: distance to device is 0.000000
[hdsmpriv01:64835] mca:base:select: Auto-selecting mtl components
[hdsmpriv01:64835] mca:base:select:(  mtl) Querying component [ofi]
[hdsmpriv01:64835] mca:base:select:(  mtl) Query of component [ofi] set priority to 25
[hdsmpriv01:64835] mca:base:select:(  mtl) Selected component [ofi]
[hdsmpriv01:64835] select: initializing mtl component ofi
[hdsmpriv02:01067] rdmacm CPC only supported when the first QP is a PP QP; skipped
[hdsmpriv02:01067] openib BTL: rdmacm CPC unavailable for use on hfi1_0:1; skipped
[hdsmpriv02:01067] [rank=1] openib: using port hfi1_0:1
[hdsmpriv02:01067] select: init of component openib returned success
[hdsmpriv02:01067] select: initializing btl component tcp
[hdsmpriv02:01067] select: init of component tcp returned success
[hdsmpriv02:01067] mca: base: components_register: registering framework mtl components
[hdsmpriv02:01067] mca: base: components_register: found loaded component ofi
[hdsmpriv02:01067] mca: base: components_register: component ofi register function successful
[hdsmpriv02:01067] mca: base: components_open: opening mtl components
[hdsmpriv02:01067] mca: base: components_open: found loaded component ofi
[hdsmpriv02:01067] mca: base: components_open: component ofi open function successful
[hdsmpriv02:01067] mca:base:select: Auto-selecting mtl components
[hdsmpriv02:01067] mca:base:select:(  mtl) Querying component [ofi]
[hdsmpriv02:01067] mca:base:select:(  mtl) Query of component [ofi] set priority to 25
[hdsmpriv02:01067] mca:base:select:(  mtl) Selected component [ofi]
[hdsmpriv02:01067] select: initializing mtl component ofi
[hdsmpriv02:01067] mtl_ofi_component.c:269: mtl:ofi:provider_include = "psm2"
[hdsmpriv02:01067] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[hdsmpriv02:01067] mtl_ofi_component.c:301: mtl:ofi:prov: psm2
[hdsmpriv02:01067] select: init returned success
[hdsmpriv02:01067] select: component ofi selected
[hdsmpriv01:64835] mtl_ofi_component.c:269: mtl:ofi:provider_include = "psm2"
[hdsmpriv01:64835] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[hdsmpriv01:64835] mtl_ofi_component.c:301: mtl:ofi:prov: psm2
[hdsmpriv01:64835] select: init returned success
[hdsmpriv01:64835] select: component ofi selected
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 3, MPI-2 part    
#------------------------------------------------------------
# Date                  : Mon Nov  4 16:17:47 2019
# Machine               : x86_64
# System                : Linux
# Release               : 4.18.0-80.el8.x86_64
# Version               : #1 SMP Wed Mar 13 12:02:46 UTC 2019
# MPI Version           : 3.1
# MPI Thread Environment: 


# Calling sequence was: 

# IMB-EXT Accumulate

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# Accumulate
[hdsmpriv01:64835] mca: bml: Using self btl for send to [[26765,1],0] on node hdsmpriv01
[hdsmpriv02:01067] mca: bml: Using self btl for send to [[26765,1],1] on node hdsmpriv02
[hdsmpriv02:01067] mca: bml: Using openib btl for send to [[26765,1],0] on node hdsmpriv01

#----------------------------------------------------------------
# Benchmarking Accumulate 
# #processes = 2 
#----------------------------------------------------------------
#
#    MODE: AGGREGATE 
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.05         0.06         0.05
[hdsmpriv01:64835] *** Process received signal ***
[hdsmpriv01:64835] Signal: Segmentation fault (11)
[hdsmpriv01:64835] Signal code: Address not mapped (1)
[hdsmpriv01:64835] Failing at address: (nil)
[hdsmpriv01:64835] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f24da4eed80]
[hdsmpriv01:64835] [ 1] /usr/mpi/gcc/openmpi-3.1.4-hfi/lib64/openmpi/mca_btl_openib.so(mca_btl_openib_get+0x141)[0x7f24c767e0d1]
[hdsmpriv01:64835] [ 2] /usr/mpi/gcc/openmpi-3.1.4-hfi/lib64/openmpi/mca_osc_rdma.so(ompi_osc_get_data_blocking+0x18b)[0x7f24c4bbb5db]
[hdsmpriv01:64835] [ 3] /usr/mpi/gcc/openmpi-3.1.4-hfi/lib64/openmpi/mca_osc_rdma.so(+0xc5c6)[0x7f24c4bbf5c6]
[hdsmpriv01:64835] [ 4] /usr/mpi/gcc/openmpi-3.1.4-hfi/lib64/openmpi/mca_osc_rdma.so(ompi_osc_rdma_accumulate+0x4a5)[0x7f24c4bcae75]
[hdsmpriv01:64835] [ 5] /usr/mpi/gcc/openmpi-3.1.4-hfi/lib64/libmpi.so.40(PMPI_Accumulate+0x161)[0x7f24db0b7961]
[hdsmpriv01:64835] [ 6] IMB-EXT[0x43c5d2]
[hdsmpriv01:64835] [ 7] IMB-EXT[0x42ca75]
[hdsmpriv01:64835] [ 8] IMB-EXT[0x4325e3]
[hdsmpriv01:64835] [ 9] IMB-EXT[0x4056f4]
[hdsmpriv01:64835] [10] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f24da13b813]
[hdsmpriv01:64835] [11] IMB-EXT[0x4040ae]
[hdsmpriv01:64835] *** End of error message ***

jsquyres (Member) commented:

I think the help message could be a bit better:

By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

There's no indication there that this is about the openib BTL. Additionally, InfiniBand is capitalized incorrectly.

@mwheinz What is your suggestion here? Specifically: the intent to use UCX is the same, regardless of whether Open MPI is compiled with UCX support or not. Are you asking for a re-wording of the message? Or something else?

mwheinz commented Oct 25, 2019

@mwheinz What is your suggestion here? Specifically: the intent to use UCX is the same, regardless of whether Open MPI is compiled with UCX support or not. Are you asking for a re-wording of the message? Or something else?

Jeff,

My question is: if we aren't using verbs, why am I getting the warning at all?

mwheinz commented Oct 25, 2019

Heck, in the above example, I'm not using OPA +or+ InfiniBand at all!

mwheinz self-assigned this Oct 25, 2019

mwheinz commented Oct 25, 2019

This problem (and a related hang) appears to be caused by the openib btl loading even though it is not desired. Using

--mca btl ^openib

prevents both the warning and the apparently related hang. However, it leaves the question of why the openib btl is being used at all.
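
As a side note (not in the original command lines here), the same exclusion can be made persistent rather than passed on every run; a sketch using Open MPI's MCA parameter files:

# sketch: per-user MCA parameter file, read automatically at startup
# file: $HOME/.openmpi/mca-params.conf
btl = ^openib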

Using this updated command line:

mpirun -allow-run-as-root --mca mtl ofi --mca mtl_ofi_provider_include psm2,shm  --mca btl_base_verbose 10 --mca mtl_base_verbose 99 --mca btl ^openib -H hdsmpriv01,hdsmpriv02 IMB-EXT Accumulate

I still see

[hdsmpriv02:50676] select: initializing btl component tcp
...
[hdsmpriv01:50392] mca: bml: Using self btl for send to [[20835,1],0] on node hdsmpriv01
[hdsmpriv01:50392] mca: bml: Using tcp btl for send to [[20835,1],1] on node hdsmpriv02
[hdsmpriv01:50392] mca: bml: Using tcp btl for send to [[20835,1],1] on node hdsmpriv02
[hdsmpriv02:50676] mca: bml: Using tcp btl for send to [[20835,1],0] on node hdsmpriv01
[hdsmpriv02:50676] mca: bml: Using tcp btl for send to [[20835,1],0] on node hdsmpriv01
[hdsmpriv02:50676] btl:tcp: host hdsmpriv01, process [[20835,1],0] UNREACHABLE
[hdsmpriv02:50676] mca: bml: Using self btl for send to [[20835,1],1] on node hdsmpriv02

And if I use --mca btl ^openib,^tcp I get this:

[hdsmpriv01:50436] mca: bml: Using self btl for send to [[20815,1],0] on node hdsmpriv01
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[20815,1],0]) is on host: hdsmpriv01
  Process 2 ([[20815,1],1]) is on host: hdsmpriv02
  BTLs attempted: self

mwheinz changed the title from "UCX warnings about using verbs even when not using verbs, even when not compiling OFI with UCX support" to "Openib BTL being selected and used when not desired" on Oct 25, 2019

jsquyres (Member) commented:

This problem (and a related hang) appears to be caused by the openib btl loading even though it is not desired. Using

I would rephrase that: the problem is that the OFI MTL is not being used when you expect it to be.

It is understandable that the openib BTL is being used. Specifically:

  1. For some reason the OFI MTL is electing not to be used.
  2. Presumably, no other MTLs elect to be used.
  3. This causes the CM PML to elect not to be used.
  4. The PML selection logic then queries the OB1 PML.
  5. OB1 queries the BTLs.
  6. The openib BTL finds the HFI devices, and therefore elects to be used.
  7. Others probably elect to be used, too (vader, self, tcp).
  8. OB1 therefore elects to be used.

Hence, the real question is the first element in this chain: why did the OFI MTL elect not to be used? I suspect it has something to do with the selection logic in the OFI MTL -- perhaps something specific to do with the TCP OFI provider...? There was definitely discussion of not wanting to use the TCP OFI provider for the OFI MTL (and instead using the TCP BTL). I don't know/remember where all that conversation landed. But that's where I'd look.
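
One way to test that hypothesis (a diagnostic sketch, not a command from the original exchange) is to pin the PML to cm, so that a failing OFI MTL produces a hard error instead of a silent fallback to ob1 and its BTLs:

# diagnostic sketch: force the cm PML; if no MTL is usable the job should abort
# rather than quietly switching to ob1 and whatever BTLs are available
mpirun --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2 \
       --mca mtl_base_verbose 10 -H hdsmpriv01,hdsmpriv02 IMB-EXT Accumulate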

Using this updated command line:

mpirun -allow-run-as-root --mca mtl ofi --mca mtl_ofi_provider_include psm2,shm  --mca btl_base_verbose 10 --mca mtl_base_verbose 99 --mca btl ^openib -H hdsmpriv01,hdsmpriv02 IMB-EXT Accumulate

I still see

[hdsmpriv02:50676] select: initializing btl component tcp
...
[hdsmpriv01:50392] mca: bml: Using self btl for send to [[20835,1],0] on node hdsmpriv01
[hdsmpriv01:50392] mca: bml: Using tcp btl for send to [[20835,1],1] on node hdsmpriv02
[hdsmpriv01:50392] mca: bml: Using tcp btl for send to [[20835,1],1] on node hdsmpriv02
[hdsmpriv02:50676] mca: bml: Using tcp btl for send to [[20835,1],0] on node hdsmpriv01
[hdsmpriv02:50676] mca: bml: Using tcp btl for send to [[20835,1],0] on node hdsmpriv01
[hdsmpriv02:50676] btl:tcp: host hdsmpriv01, process [[20835,1],0] UNREACHABLE
[hdsmpriv02:50676] mca: bml: Using self btl for send to [[20835,1],1] on node hdsmpriv02

Again, I think the question is: why did the OFI MTL elect not to run (and therefore failover to the OB1 PML and whatever BTLs elected to run)?

And if I use --mca btl ^openib,^tcp I get this:

Note that the correct notation is: --mca btl ^openib,tcp (i.e., the ^ negates the whole comma-delimited list -- it never makes sense to mix "include" and "exclude" elements in the same list, so a single ^ to negate the whole list is sufficient).
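
To spell that out (illustrative sketch only), the two valid forms are an exclude list with a single leading ^, or a plain include list:

# exclude form: one leading ^ negates the entire comma-separated list
mpirun --mca btl ^openib,tcp ...

# include form: name only the BTLs that should be considered
mpirun --mca btl self,vader ...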

[hdsmpriv01:50436] mca: bml: Using self btl for send to [[20815,1],0] on node hdsmpriv01
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[20815,1],0]) is on host: hdsmpriv01
  Process 2 ([[20815,1],1]) is on host: hdsmpriv02
  BTLs attempted: self

Same question: why did the OFI MTL elect not to run?

Per our off-issue discussion in Slack, if there's a hang with the openib BTL on HFI devices, that is a secondary issue. You may or may not be interested in addressing that issue (i.e., if PSM/PSM2 is never intended to be supported through openib, it may not be worth digging in to that).

mwheinz commented Oct 28, 2019

Agreed. At this point, I've established the following characteristics:

  1. The problem is not related to using PSM2 - I can create the problem using other non-RDMA OFI transports.
  2. Eliminating openib from the list of available BTLs results in OMPI using the TCP BTL instead.
  3. Excluding TCP causes an error message about processes being unable to communicate with each other. This is particularly interesting because the benchmark in question (IMB-EXT) then actually produces valid output even after the error message appears.

The most astonishing bit about item 3 is that I am running a 2-process B2B benchmark, but I also see this:

[hdsmpriv01:39668] 81 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[hdsmpriv01:39668] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

mwheinz commented Oct 28, 2019

I've updated the problem description to more accurately reflect my current understanding and attached a verbose log of a sample run of the IMB-EXT benchmark.

jsquyres (Member) commented:

It's probably due to the BTLs being used for one-sided MPI operations, while the CM PML / OFI MTL is used for point-to-point.
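
A quick way to confirm that split (a suggested check, not a run from the original thread) is to compare a point-to-point-only benchmark with the one-sided run, using the same settings:

# two-sided traffic only: stays on the cm PML / OFI MTL path
mpirun --mca mtl ofi --mca mtl_ofi_provider_include psm2 \
       --mca btl_base_verbose 10 -H hdsmpriv01,hdsmpriv02 IMB-MPI1 PingPong

# one-sided traffic: the osc/rdma component (see the stack trace above) also pulls in a BTL
mpirun --mca mtl ofi --mca mtl_ofi_provider_include psm2 \
       --mca btl_base_verbose 10 -H hdsmpriv01,hdsmpriv02 IMB-EXT Accumulate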

mwheinz commented Oct 28, 2019

I literally just figured that out - decided to try some other apps to collect more info and, yup, the problem only occurs for one-sided communications.

Does this also occur with OMPI 3.1.x?

jsquyres (Member) commented:

Probably. I.e., I don't think the MTLs support one-sided operations, so this behavior is probably not new.

mwheinz commented Oct 28, 2019

Makes sense.

mwheinz commented Nov 4, 2019

@hppritcha has indicated that this may be fixed in 4.0.2. I will re-test with that and report back.

mwheinz commented Nov 5, 2019

I've confirmed this is fixed in 4.0.2.

mwheinz closed this as completed Nov 5, 2019