
TCP BTL: problems when there are multiple IP interfaces in the same subnet #5818

Open
jsquyres opened this issue Oct 1, 2018 · 21 comments

@jsquyres
Member

jsquyres commented Oct 1, 2018

This issue is split off from #3035 and #5817.

On one of my development servers, which has multiple IP addresses on a single IP interface, I get 100% failure with the TCP BTL. Here are the details:

$ mpirun -np 2 --mca btl tcp,self ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected.  This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

  Local host:          savbu-usnic-a
  Local PID:           4674
  Peer hostname:       savbu-usnic-a ([[17212,1],0])
  Source IP of socket: 10.193.184.48
  Known IPs of peer:   
        10.0.8.254
        10.193.184.49
        10.10.0.254
        10.2.0.254
        10.3.0.252
        10.50.0.254
--------------------------------------------------------------------------
[hang]

Here's the ip addr output from this machine:

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 60:73:5c:68:f3:68 brd ff:ff:ff:ff:ff:ff
    inet 10.0.8.254/16 brd 10.0.255.255 scope global eth0
    inet6 fe80::6273:5cff:fe68:f368/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 60:73:5c:68:f3:69 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 60:73:5c:68:f3:6a brd ff:ff:ff:ff:ff:ff
    inet 10.193.184.48/24 brd 10.193.184.255 scope global eth2
    inet 10.193.184.49/24 brd 10.193.184.255 scope global secondary eth2:cmha
    inet6 fe80::6273:5cff:fe68:f36a/64 scope link 
       valid_lft forever preferred_lft forever
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 60:73:5c:68:f3:6b brd ff:ff:ff:ff:ff:ff
6: eth4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP qlen 1000
    link/ether 24:57:20:fe:20:00 brd ff:ff:ff:ff:ff:ff
    inet 10.10.0.254/16 brd 10.10.255.255 scope global eth4
    inet6 fe80::2657:20ff:fefe:2000/64 scope link 
       valid_lft forever preferred_lft forever
7: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP qlen 1000
    link/ether 24:57:20:fe:21:00 brd ff:ff:ff:ff:ff:ff
    inet 10.2.0.254/16 brd 10.2.255.255 scope global eth5
    inet6 fe80::2657:20ff:fefe:2100/64 scope link 
       valid_lft forever preferred_lft forever
8: eth6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP qlen 1000
    link/ether 24:57:20:fe:50:00 brd ff:ff:ff:ff:ff:ff
    inet 10.3.0.254/16 brd 10.3.255.255 scope global eth6
    inet 10.3.0.252/16 brd 10.3.255.255 scope global secondary eth6:0
    inet6 fe80::2657:20ff:fefe:5000/64 scope link 
       valid_lft forever preferred_lft forever
9: eth7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 24:57:20:fe:51:00 brd ff:ff:ff:ff:ff:ff
    inet 10.50.0.254/16 brd 10.50.255.255 scope global eth7
    inet6 fe80::2657:20ff:fefe:5100/64 scope link 
       valid_lft forever preferred_lft forever

Notice:

  • eth2 has both .48 and .49
  • eth6 has both .252 and .254

The error message above indicates that a connection came in from .48, but it only recognized a peer with address .49.

Here's the same run, but with btl_base_verbose at 100.

...the output is quite long, so I put it in a gist.

Note, too, that the job hung. There are many other IP interfaces where the connection could be made, so I'm not sure why it didn't just fail over to another address/interface.

@jsquyres
Member Author

jsquyres commented Oct 4, 2018

Note that this is still happening on master/HEAD. #5817 has been merged to master, so we now get the correct IP addresses in the show_help message, but running on the same platform described above, I still get the same failure (and it still hangs).

As a slight simplification, I notice that the problem even occurs if I limit the TCP BTL to the subnet containing both IPs on a single IP interface. Specifically:

$ mpirun \
    --mca btl_tcp_if_include 10.193.184.0/24 \
    --mca btl_base_verbose 100 \
    -np 2 --mca btl tcp,self ring_c

(i.e., 10.193.184.48 and 10.193.184.49 are on the same IP interface)

The verbose output from that run is located in this gist -- it's a bit smaller/easier to read (compared to the original gist I posted, above) because it's only those 2 IP addresses -- not 6.

@bosilca
Member

bosilca commented Oct 4, 2018

The selection mechanism in split_and_resolve only picks the first IP that matches, because we break out of the loop at line 645 after finding the first match. In your case, since both of your processes are on the same node, they both select 10.193.184.48, and this should 1) be the only IP they publish in the modex, 2) be the only IP they use to contact their peers, and 3) be the only IP associated with a module.

According to your output we fail at 2), but there is not enough information to pinpoint the root cause. Let's add the following to our module creation and see what we get:

diff --git a/opal/mca/btl/tcp/btl_tcp_component.c b/opal/mca/btl/tcp/btl_tcp_component.c
index d068aecb22..0a6e2b7dd7 100644
--- a/opal/mca/btl/tcp/btl_tcp_component.c
+++ b/opal/mca/btl/tcp/btl_tcp_component.c
@@ -522,6 +522,10 @@ static int mca_btl_tcp_create(int if_kindex, const char* if_name)
         if (addr.ss_family == AF_INET) {
            btl->tcp_ifaddr = addr;
         }
+        opal_output_verbose(10, opal_btl_base_framework.framework_output,
+                            "Create TCP module %d for local address (%s)",
+                            mca_btl_tcp_component.tcp_num_btls-1,
+                            opal_net_get_hostname((struct sockaddr*) &btl->tcp_ifaddr));
         /* allow user to specify interface bandwidth */
         sprintf(param, "bandwidth_%s", if_name);
         mca_btl_tcp_param_register_uint(param, NULL, btl->super.btl_bandwidth, OPAL_INFO_LVL_5, &btl->super.btl_bandwidth);

@jsquyres
Member Author

jsquyres commented Oct 4, 2018

Done.

Short version:

It added the following line to the previous gist:

[savbu-usnic-a:28141] Create TCP module 0 for local address (10.193.184.48)

which matches the output on the prior line ([savbu-usnic-a:28141] btl: tcp: Found match: 10.193.184.48 (eth2)).

I updated the gist with a 2nd file that shows the complete output (that includes George's new opal_output): https://gist.github.com/jsquyres/10026ceda61ff3b18cdcfc4c8f1a4aca#file-mpirun-ouptput-with-georges-additional-output-txt

@bosilca
Member

bosilca commented Oct 4, 2018

I blame mca_btl_tcp_component_exchange, more specifically the ordering mismatch between opal_ifbegin and split_and_resolve. When a process prepares the modex information, we walk over all local TCP BTL modules and compare their kernel interface index (tcp_ifkindex) with the one we obtained during split_and_resolve. The first IP we find on the same physical kernel interface will do, and it becomes our IP for all peers. Note, however, that this can be different from the IP selected during TCP BTL module creation, because we compare only the underlying interface index.

The fix should be simple: remove the entire loop at btl_tcp_component.c line 1145 and instead use the IP already associated with the BTL module. In fact, I already commented on the same problem in another ticket and proposed a solution to only use each interface once. I can take a stab at it tomorrow afternoon.

@newnovice01

I have the same problem: my eth0 has multiple IPs. For an unrelated reason I only get so much bandwidth per IP pair, so I need more than one IP to achieve more bandwidth between the nodes. Since the IPs are in the same subnet, they all go out the default gateway. Is there a way to simply tell MPI not to bother with any IP or host validation and just accept messages and proceed with the job?

I have something like:
mpirun -np 8 -hostfile ~/HostsIpFile -mca plm_rsh_no_tree_spawn 1 -bind-to socket -map-by slot -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -x NCCL_SOCKET_IFNAME=eth0 -mca btl_tcp_if_exclude lo,docker0 -mca pml ob1 -mca btl ^openib python

Will adding --mca btl_tcp_if_include 0.0.0.0 help, since that is the default gateway in my case?

route -n:
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.31.16.0     0.0.0.0         255.255.240.0   U     0      0        0 eth0

@bosilca
Member

bosilca commented Oct 15, 2018

@newnovice01 in your case you don't need multiple IPs to solve this problem. Instead, you can use OMPI's multiple-links support, which will open multiple sockets between each pair of processes over the same IPs. To do this, add "--mca btl_tcp_links <number>" to your mpiexec command line, or add "btl_tcp_links = <number>" to your ${HOME}/.openmpi/mca-params.conf.
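
For illustration, that might look like the following (a sketch only: the link count of 4 is arbitrary, and ring_c stands in for the real application):

# Open 4 TCP sockets per peer pair, set on the command line
$ mpirun --mca btl tcp,self --mca btl_tcp_links 4 -np 2 ring_c
# Or set it once per user in the MCA parameter file
$ echo "btl_tcp_links = 4" >> ${HOME}/.openmpi/mca-params.conf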

@newnovice01

newnovice01 commented Oct 15, 2018

Multiple sockets don't help me, because the platform I am on uses IPsec tunnels at the IP level that encapsulate everything. So the network flow looks like src_ip, ip_proto=esp, dst_ip only. I can't change ip_proto. Even having just the master able to launch jobs on the same peer using two different IPs would help.

If I had multiple network interfaces (eth0, eth1), would it help? e.g.,
-x NCCL_SOCKET_IFNAME=eth0,eth1...

Another idea: it might also help if only the master node had multiple IPs, or if the clients had more than one.

I am on mpirun (Open MPI) 3.1.0.

@bwbarrett
Member

If Open MPI sees multiple network interfaces, it will spread traffic across all of them. Open MPI balances traffic at the device level, not the IP level. So even when we fix the current set of bugs around multiple IPs, it still wouldn't do what you want with multiple IPs per device; you'd have to use multiple network devices to get that traffic spread (or, as George said, use the "btl_tcp_links" parameter, but that doesn't work for your use case).
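
For example, with two active devices on each node you could point the TCP BTL at both by interface name so traffic is striped across the devices (a sketch only; eth0/eth1 are the names from the listings below, ring_c stands in for the real application, and the multi-interface race discussed later in this thread may still apply):

# Stripe TCP BTL traffic across both devices
$ mpirun -np 16 -hostfile hosts --mca btl tcp,self --mca btl_tcp_if_include eth0,eth1 ring_c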

@newnovice01

newnovice01 commented Oct 15, 2018

I have two Ethernet interfaces on both the master and slave nodes:

cat hosts: 
cat multi_eth_mpi_test-hosts-2
ip-192-31-20-48  slots=8
ip-192-31-28-116 slots=8

master: 
eth0      Link encap:Ethernet  HWaddr AA:FF:DD:CC:55:QA  
          inet addr:192.31.20.48  Bcast:192.31.31.255  Mask:255.255.240.0
eth1      Link encap:Ethernet  HWaddr AA:FE:CC:DF:WQ:4A  
          inet addr:192.31.23.189  Bcast:192.31.31.255  Mask:255.255.240.0

slave: 
eth0      Link encap:Ethernet  HWaddr A2:A0:D9:BB:67:4A  
          inet addr:192.31.28.116  Bcast:192.31.31.255  Mask:255.255.240.0
eth1      Link encap:Ethernet  HWaddr 12:12:17:44:DD:AA  
          inet addr:192.31.28.69  Bcast:192.31.31.255  Mask:255.255.240.0

The command I used:

mpirun -np 16 -hostfile hosts -mca plm_rsh_no_tree_spawn 1 -bind-to socket -map-by slot -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0,eth1 -x LD_LIBRARY_PATH -x PATH  -mca pml ob1 -mca btl ^openib python

I get:

Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected.  This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

Local host:          ip-192-31-20-48
Local PID:           51831
Peer hostname:       (null) ([[33822,1],8])
Source IP of socket: 192.31.28.116
Known IPs of peer:   
	0.0.0.0
	0.0.0.0
--

Is there a hosts-file hack I can use to get this working? I am very new to this and am kind of lost.

bwbarrett added a commit to bwbarrett/ompi that referenced this issue Oct 17, 2018
Simplify selection of the address to publish for a given BTL TCP
module in the module exchange code.  Rather than looping through
all IP addresses associated with a node, looking for one that
matches the kindex of a module, loop over the modules and
use the address stored in the module structure.  This also
happens to be the address that the source will use to bind()
in a connect() call, so this should eliminate any confusion
(read: bugs) when an interface has multiple IPs associated with
it.

Refs open-mpi#5818

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this issue Oct 17, 2018 (same commit message as above)
bwbarrett added a commit to bwbarrett/ompi that referenced this issue Oct 17, 2018 (same commit message as above)
bwbarrett added a commit to bwbarrett/ompi that referenced this issue Oct 18, 2018 (same commit message as above)
@bwbarrett bwbarrett self-assigned this Oct 19, 2018
@bwbarrett
Member

@newnovice01 sorry for the delay in responding. Your latest issue is actually #3035, not this problem. There is a race when using multiple network interfaces and threads that causes the error you're seeing. The only current workaround is to use only one network interface for TCP traffic in Open MPI, something like btl_tcp_if_include=eth0.
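
For reference, the workaround looks something like this (a sketch reusing the hostfile and process count from your earlier post; eth0 is whichever single interface you want TCP traffic on, and ring_c stands in for the real application):

# Restrict the TCP BTL to a single interface to sidestep the multi-interface race (#3035)
$ mpirun -np 16 -hostfile hosts --mca btl tcp,self --mca btl_tcp_if_include eth0 ring_c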

The original issue has been fixed in master. We need to discuss whether we should fix it in the active release branches.

bosilca pushed a commit to bosilca/ompi that referenced this issue Dec 3, 2018 (same commit message as above)
@titodalcanton

titodalcanton commented Mar 1, 2021

Hello, I am experiencing this problem using Open MPI 4.0.5. Is the fix already available in 4.1.x?

Edit: I still see it on 4.1.0, so I guess the answer is no.

@gpaulsen
Member

Let's discuss this on the weekly telecon.

@gpaulsen
Member

gpaulsen commented Apr 23, 2021

Fixed in v4.0 here: #8721

EDIT: The race condition mentioned in #3036 (and in some of the comments on this issue) was fixed by #8721 in v4.0.x and #8966 in v4.1.x. Note that those PRs do not fix the initial problem described in this issue.

@SharonBrunett

SharonBrunett commented May 14, 2021

Is this fix folded into the Open MPI 4.1.1 libraries?

A test code built against Open MPI 4.1.1 sees "inbound connection dropped ... try with a different IP interface" messages, despite this problem being reported as fixed in 4.0.

There is one bonded interface per node (the unique IPs 10.9.6.185 and 10.14.6.185 share the interface), which confuses MPI.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 0c:c4:7a:97:06:d6 brd ff:ff:ff:ff:ff:ff
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 9000 qdisc mq master bond0 state DOWN group default qlen 1000
    link/ether 0c:c4:7a:97:06:d7 brd ff:ff:ff:ff:ff:ff
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 0c:c4:7a:97:06:d6 brd ff:ff:ff:ff:ff:ff
    inet 10.9.6.185/16 brd 10.9.255.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet 10.14.6.185/16 brd 10.14.255.255 scope global bond0
       valid_lft forever preferred_lft forever

Adding arguments such as --mca btl_tcp_if_include 10.14.6.185/24, ..... to the mpirun command line does not help.
btl_tcp_if_exclude does not help either. I get:

WARNING: An invalid value was given for btl_tcp_if_include.  This
value will be ignored.

  Local host: node1394
  Value:      10.14.6.144/24
  Message:    Did not find interface matching this subnet

Input on which libraries in v4.1.1 to pull, if any, that contain the fix for multiple IP addresses on a single interface would be much appreciated!

@jsquyres jsquyres changed the title TCP BTL: problems when there are multiple IP addresses on a single interface TCP BTL: problems when there are multiple IP interfaces in the same subnet May 17, 2021
@jsquyres
Member Author

jsquyres commented May 17, 2021

@gpaulsen's comment above is not quite correct (I just added an "EDIT" note to it): the PR he mentions fixes the race condition that was described in #3035 (and in some of the comments above); it does not fix the original problem described in this issue. Hence, this issue is still open. However, @gpaulsen's comments made me realize that we had somehow neglected to cherry-pick the #3035 fix to the v4.1.x branch. Doh! I just created #8966 to bring the fix for #3035 to the v4.1.x branch.

@jsquyres
Member Author

@SharonBrunett I think the issue you describe is different from the other two issues discussed here.

The original issue has to do with having multiple IPs on a single machine that are in the same subnet and/or on the same IP interface (and it has not been fixed). The other issue that came up here was already reported in #3035 (and has been fixed).

You're asking about a bonded interface that has multiple IPs. I'm honestly not sure what Open MPI will do with a bonded interface that has multiple active IP addresses; we've never tried this use case. Things could get weird (note that a bonded interface with multiple active IPs is different from a single interface with multiple IPs).

Note, too, that your btl_tcp_if_include usage wasn't quite correct, either: the subnet of your interface is 10.9.6.185/16, but you specified 10.9.6.185/24, which doesn't exist on that host.
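
To illustrate the CIDR point (a sketch based only on the bond0 addresses shown above; ring_c stands in for the real application): the value passed to btl_tcp_if_include must describe a subnet that actually exists on the host, so a /16 address needs a /16 specification:

# 10.14.0.0/16 is the /16 network containing 10.14.6.185 on bond0
$ mpirun -np 2 --mca btl tcp,self --mca btl_tcp_if_include 10.14.0.0/16 ring_c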

Given that this is a new issue, if you'd like to converse further about it, please open a new github issue.

@gpaulsen
Member

What's the state of this? It looks like #9681 and #9718 (v5.0.x) have been merged. Can we close this issue, or are there remaining items?

@reox

reox commented Jan 16, 2024

We recently switched from mpirun 1.10.7 on CentOS to 4.1.4 on Debian. I got the "The inbound connection has been dropped" error before, as our servers have multiple IPs assigned to a single interface, and fixed it back then by using --mca btl_tcp_if_include 10.20.30.0/24. However, for some reason, this does not work with 4.1.4 anymore.

I tried several other things, such as also adding --mca oob_tcp_if_include 10.20.30.0/24, adding both IP addresses via --mca btl_tcp_if_include 10.20.30.0/24,1.2.3.0/24 (the second one is indeed a public IPv4 /24 network), using the btl_tcp_if_exclude options, etc. I then eventually get other error messages, such as:

WARNING: An invalid value was given for oob_tcp_if_include.  This
value will be ignored.

  Local host: clusternode01
  Value:      10.20.30.0/24
  Message:    Did not find interface matching this subnet

which is wrong, as an IP with the correct subnet exists on this host.

Is this behaviour the bug reported here, and was it introduced somewhere between v1.10 and v4.1?
Is there another workaround I could test?

@rhc54
Contributor

rhc54 commented Jan 17, 2024

from mpirun 1.10.7 on CentOS to 4.1.4 on Debian

Wow - that is one heck of an update!

WARNING: An invalid value was given for oob_tcp_if_include. This value will be ignored.

Sounds like a bug in the opal_net_samenetwork code. Afraid I haven't looked there in years.

This is an old issue, so I don't know how much traction you might get - you might need to open a new issue for this specific request to get someone's attention. Or we can try to ping someone who can at least prod people.

@jsquyres Any ideas on who could look at this? The code hasn't really changed, so whatever problem exists might well be in OMPI v5 (via PMIx) as well.

@reox

reox commented Jan 17, 2024

Sounds like a bug in the opal_net_samenetwork code. Afraid I haven't looked there in years.

Not sure if this oob stuff is actually relevant. While debugging this, we saw in the docs (https://www.open-mpi.org/faq/?category=tcp#tcp-selection) that you might want to set the oob parameter as well. However, if I do that, I get that error.
But, if I just use --mca btl_tcp_if_include 10.20.30.0/24 --mca btl tcp,self, I get the "The inbound connection has been dropped" error - as described by others above.

Then we also saw this FAQ entry: https://www.open-mpi.org/faq/?category=tcp#ip-multiaddress-devices
Would that mean it was just luck that it worked with 1.10?

One interesting thing is that it seems to depend on the number of processes and hosts I use. For example, I was able to start a job on two hosts, but if I add a third one it fails. Is this maybe something that is triggered by chance, so that the more hosts and processes I add, the higher the chance of triggering it?

Edit: I played around with it a bit more and thought that if I cannot include a network, maybe I can exclude it instead. However, this does not work properly either: --mca oob_tcp_if_exclude 1.2.3.0/24 --mca btl_tcp_if_exclude 1.2.3.0/24

A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    cluster01
  Remote host:   cluster02
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.

I also checked without the oob parameter, but then I get the same error.
Sometimes I also get this one:

WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: cluster03
  PID:        285053

@reox

reox commented Jan 22, 2024

To give a short update: we have now tested with Open MPI 5.0.1, and this bug* seems to be resolved in that version. We can start jobs with mpirun on our cluster even without specifying the interface or excluding/including specific networks.

* I'm still not sure whether our bug is the same as this bug report... Nevertheless, we can now start our software without the "The inbound connection has been dropped" error.
