TCP BTL: problems when there are multiple IP interfaces in the same subnet #5818
Note that this is still happening on master/HEAD. #5817 has been merged to master, so the correct IP addresses are now output in the show_help message, but running on the same platform as described above, I still get the same failure (and it still hangs). As a slight simplification, I notice that the problem occurs even if I limit the TCP BTL to the subnet containing both IPs on a single IP interface. Specifically:
(i.e., 10.193.184.48 and 10.193.184.49 are on the same IP interface.) The verbose output from that run is in this gist -- it's a bit smaller and easier to read than the original gist I posted above, because it covers only those 2 IP addresses, not 6.
The selection mechanism in split_and_resolve only picks the first IP that matches, because we break out of the loop at line 645 after finding the first match. In your case, as both your processes are on the same node, they both select 10.193.184.48, and this should 1) be the only IP they publish in the modex, 2) be the only IP they use to contact the others, and 3) be the only IP associated with a module. According to your output we fail 2, but there is not enough information to pinpoint the root cause. Let's add the following to our module creation and see what we get:
diff --git a/opal/mca/btl/tcp/btl_tcp_component.c b/opal/mca/btl/tcp/btl_tcp_component.c
index d068aecb22..0a6e2b7dd7 100644
--- a/opal/mca/btl/tcp/btl_tcp_component.c
+++ b/opal/mca/btl/tcp/btl_tcp_component.c
@@ -522,6 +522,10 @@ static int mca_btl_tcp_create(int if_kindex, const char* if_name)
if (addr.ss_family == AF_INET) {
btl->tcp_ifaddr = addr;
}
+ opal_output_verbose(10, opal_btl_base_framework.framework_output,
+ "Create TCP module %d for local address (%s)",
+ mca_btl_tcp_component.tcp_num_btls-1,
+ opal_net_get_hostname((struct sockaddr*) &btl->tcp_ifaddr));
/* allow user to specify interface bandwidth */
sprintf(param, "bandwidth_%s", if_name);
mca_btl_tcp_param_register_uint(param, NULL, btl->super.btl_bandwidth, OPAL_INFO_LVL_5, &btl->super.btl_bandwidth);
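(Aside, for anyone reproducing this: the verbose message added above is gated by the btl_base_verbose MCA parameter -- a level of 10 or higher should show it, and the runs quoted elsewhere in this thread used 100. The exact command line depends on your job; roughly something like:
mpirun --mca btl_base_verbose 100 ...
with the rest of your usual arguments.)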
Done. Short version: it added the following line to the previous gist:
which matches the output in the prior line. I updated the gist with a 2nd file that shows the complete output (including George's new opal_output): https://gist.github.com/jsquyres/10026ceda61ff3b18cdcfc4c8f1a4aca#file-mpirun-ouptput-with-georges-additional-output-txt
I blame it on mca_btl_tcp_component_exchange, more specifically on the ordering mismatch between opal_ifbegin and split_and_resolve. When a process prepares the modex information, we walk over all local TCP BTL modules and compare their kernel interface index (tcp_ifkindex) with the one we obtained during split_and_resolve. The first IP we find on the same physical kernel interface will do, and it becomes our IP for all peers. Note, however, that this can be different from the IP selected during TCP BTL module creation, because we compare only the underlying interface index. The fix should be simple: remove the entire loop in btl_tcp_component.c at line 1145, and instead use the IP already associated with the BTL module. In fact, I already commented on the same problem on another ticket and proposed a solution to only use each interface once. I can take a stab at it tomorrow afternoon.
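For illustration only, a minimal sketch of that direction in C. The types and names below (tcp_module, published_addr, publish_module_addresses) are simplified stand-ins invented for this sketch, not the real OMPI structures; only the idea -- publish the tcp_ifaddr already stored in each module instead of re-scanning interfaces by kindex -- reflects the proposal:

#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/* Simplified stand-ins for the real OMPI structures -- illustrative only. */
struct tcp_module {
    struct sockaddr_storage tcp_ifaddr;   /* address chosen when the module was created */
};

struct published_addr {
    struct in_addr addr;                  /* IPv4 address advertised in the modex */
    uint16_t family;
};

/*
 * Instead of re-walking every local interface and matching kernel interface
 * indexes (which can land on a *different* IP that shares the same interface),
 * publish exactly the address already stored in each BTL module.
 */
static void publish_module_addresses(const struct tcp_module *modules,
                                     size_t num_modules,
                                     struct published_addr *out)
{
    for (size_t i = 0; i < num_modules; ++i) {
        if (modules[i].tcp_ifaddr.ss_family == AF_INET) {
            const struct sockaddr_in *sin =
                (const struct sockaddr_in *) &modules[i].tcp_ifaddr;
            out[i].addr   = sin->sin_addr;
            out[i].family = AF_INET;
        }
    }
}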
I have the same problem where my eth0 has multiple IPs. I get only so much bandwidth per IP pair (for an unrelated reason) and need more than 1 IP to achieve more bandwidth between the nodes. Since the IPs are in the same subnet, they go out the default gateway. Is there a way to simply tell MPI to not bother with any IP or host validation and just accept messages and proceed with the job? I have something like:
Will adding --mca btl_tcp_if_include 0.0.0.0 help, since that is the default gateway in my case?
@newnovice01 in your case you don't need multiple IPs to solve this problem. Instead you can use the multiple-links support in OMPI, which will open multiple sockets between pairs of processes over the same IPs. To do this, add "--mca btl_tcp_links number" to your mpiexec command line, or add "btl_tcp_links = number" to your ${HOME}/.openmpi/mca-params.conf.
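For example (the value 4 below is purely illustrative -- pick whatever suits your environment), either on the command line:
mpiexec --mca btl_tcp_links 4 ...
or persistently in ${HOME}/.openmpi/mca-params.conf:
btl_tcp_links = 4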
Multiple sockets don't help me, since the platform I am on has IPsec tunnels at the IP level that encapsulate everything. So the network flow looks like src_ip, ip_proto=esp, dst_ip only, and I can't change ip_proto -- even 1 master being able to launch jobs on the same peer using 2 different IPs would help. If I had multiple network interfaces (eth0, eth1), would that help? Another idea: only the master node having multiple IPs could also help, or the clients having more than 1. I am on: mpirun (Open MPI) 3.1.0
If Open MPI sees multiple network interfaces, it will spread traffic across all of them. Open MPI balances traffic at the device level, not the IP level. So even when we fix the current set of bugs around multiple IPs, it still wouldn't do what you want with multiple IPs per device; you'd have to use multiple network devices to get that traffic spread (or, as George said, use the "btl_tcp_links" parameter, but that doesn't work for your use case).
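A tiny conceptual sketch of that point (this is not OMPI code, just an illustration of device-level scheduling): two IPs on the same device still collapse into a single scheduling slot, because the scheduler picks among devices, not addresses.

#include <stddef.h>

/* Conceptual illustration only: fragments are assigned per network device
 * (BTL module), so adding a second IP to an existing device does not add a
 * second slot here. */
static size_t pick_device(size_t fragment_index, size_t num_devices)
{
    /* plain round-robin; a real scheduler would typically also weight by
     * each device's reported bandwidth */
    return fragment_index % num_devices;
}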
I have 2 eths on both the master & slave node:
The command I used:
I get:
Is there a hosts-file hack I can use to get this working? I am very new to this and am kind of lost.
Simplify selection of the address to publish for a given BTL TCP module in the module exchange code. Rather than looping through all IP addresses associated with a node, looking for one that matches the kindex of a module, loop over the modules and use the address stored in the module structure. This also happens to be the address that the source will use to bind() in a connect() call, so this should eliminate any confusion (read: bugs) when an interface has multiple IPs associated with it. Refs open-mpi#5818 Signed-off-by: Brian Barrett <bbarrett@amazon.com>
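As an aside on the bind()-before-connect() detail mentioned in that commit message: below is a generic plain-BSD-sockets sketch (not OMPI code; the function name and placeholder arguments are mine) showing how binding an outgoing socket to a specific local IP guarantees that the peer's accept() sees exactly the address that was advertised:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect to peer_ip:peer_port, forcing the connection's source address to
 * local_ip. Returns the connected socket, or -1 on error. */
static int connect_from(const char *local_ip, const char *peer_ip, uint16_t peer_port)
{
    int sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0) return -1;

    /* Bind to the address we advertised, so the peer sees that exact source
     * IP rather than whatever the routing table would otherwise pick. */
    struct sockaddr_in local;
    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_port = 0;                      /* any ephemeral source port */
    if (inet_pton(AF_INET, local_ip, &local.sin_addr) != 1 ||
        bind(sd, (struct sockaddr *) &local, sizeof(local)) < 0) {
        close(sd);
        return -1;
    }

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(peer_port);
    if (inet_pton(AF_INET, peer_ip, &peer.sin_addr) != 1 ||
        connect(sd, (struct sockaddr *) &peer, sizeof(peer)) < 0) {
        close(sd);
        return -1;
    }
    return sd;
}

Without the bind(), the kernel picks the source address from the routing table, and on a host with several IPs on one interface that may not be the address that was published.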
@newnovice01 sorry for the delay in responding. Your latest issue is actually #3035, not this problem. There is a race when using multiple network interfaces and threads that causes the error you're seeing. The only current workaround is to use only one network interface for TCP traffic in Open MPI, something like:
The original issue has been fixed in master. We need to discuss whether we should fix it in the active release branches.
Hello, I am experiencing this problem using Open MPI 4.0.5. Is the fix already available in 4.1.x? Edit: I still see it on 4.1.0, so I guess the answer is no.
Let's discuss this on the weekly telcon.
Is this fix folded into the Open MPI 4.1.1 libraries? A test code built against Open MPI 4.1.1 sees "inbound connection dropped ... try with a different IP interface" messages, despite this problem being reported as fixed in 4.0. There's one bonded interface per node (unique IPs 0.9.6.185 and 10.14.6.185 share an interface), which confuses MPI.
Adding these arguments to the mpirun command line
Any input on which libraries in v4.1.1, if any, contain the fix for multiple IP addresses on a single interface would be much appreciated!
@gpaulsen's comment above is not quite correct (I just added an "EDIT" note to it): the PR he mentions fixes the race condition described in #3035 (and in some of the comments above) -- it does not fix the original problem described in this issue. Hence, this issue is still open. However, @gpaulsen's comments made me realize that we had somehow neglected to cherry-pick the #3035 fix to the v4.1.x branch. Doh! I just created #8966 to bring the fix for #3035 to the v4.1.x branch.
@SharonBrunett I think the issue you describe is different from the other 2 issues discussed here. The original issue has to do with having multiple IPs on a single machine that are in the same subnet and/or on the same IP interface (and has not been fixed). The other issue that came up here was already reported in #3035 (and has been fixed). You're asking about a bonded interface that has multiple IPs. I'm honestly not sure what Open MPI will do with a bonded interface that has multiple active IP addresses; we've never tried this use case. Things could get weird (note that a bonded interface with multiple active IPs is different from a single interface with multiple IPs). Note, too, that your
Given that this is a new issue, if you'd like to converse further about it, please open a new GitHub issue.
We recently switched from Open MPI 1.10.7 on CentOS to 4.1.4 on Debian. I got the "The inbound connection has been dropped" error before, as our servers have multiple IPs assigned to a single interface, and fixed it back then by using
I tried several other things, such as also adding
which is wrong, as an IP in the correct subnet exists on this host. Is this behaviour the bug described here, and was it introduced somewhere between v1.10 and v4.1?
Wow - that is one heck of an update!
Sounds like a bug in the
This is an old issue, so I don't know how much traction you might get -- you might need to open a new issue for this specific request to get someone's attention. Or we can try to ping someone who can at least prod people. @jsquyres Any ideas on who could look at this? The code hasn't really changed, so whatever problem exists might well be in OMPI v5 (via PMIx) as well.
Not sure if this oob stuff is actually relevant. While debugging this, we saw in the docs (https://www.open-mpi.org/faq/?category=tcp#tcp-selection) that you might want to set the oob parameter as well. However, if I do that, I get that error. Then we also saw this FAQ entry: https://www.open-mpi.org/faq/?category=tcp#ip-multiaddress-devices One interesting thing is that it seems to depend on the number of processes and hosts I use. For example, I was now able to start a job on two hosts, but if I add a third one it fails. Is this maybe something that is triggered by chance, so that the more hosts and processes I add, the higher the chance of triggering it? edit: I played around with it a bit more and thought that if I cannot include a network, maybe I can exclude it instead. However, this does not work properly:
I also checked without the oob parameter, but then I get the same error.
To give a short update: we have now tested with Open MPI 5.0.1 and this bug
This issue is split off from #3035 and #5817.
On one of my development servers that has multiple IP addresses on a single IP interface, I get 100% failure with:
Here are the details:
Here's the ip addr output from this machine:
Notice:
The error message above indicates that a connection came in from .48, but it only recognized a peer with address .49.
Here's the same run, but with btl_base_verbose set to 100. The output is quite long, so I put it in a gist.
Note, too, that the job hung. There are many other IP interfaces where the connection could be made, so I'm not sure why it didn't just fail over to another address/interface.