TCP BTL fails in the presence of virbr0/docker0 interfaces #6377

Closed
mkre opened this issue Feb 11, 2019 · 6 comments

@mkre

mkre commented Feb 11, 2019

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v3.1.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

built from source

Please describe the system on which you are running

  • Operating system/version: CentOS 7
  • Computer hardware: Intel CPUs
  • Network type: TCP

Details of the problem

I have a TCP network of two nodes with different network interfaces. The output of ip addr is as follows:
node1

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 18:66:da:2e:43:4f brd ff:ff:ff:ff:ff:ff
    inet 146.122.240.139/23 brd 146.122.241.255 scope global dynamic eth0
       valid_lft 5066sec preferred_lft 5066sec
    inet6 fe80::1a66:daff:fe2e:434f/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN 
    link/ether 02:42:5c:0f:85:a0 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever

node2

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 18:66:da:2e:43:ae brd ff:ff:ff:ff:ff:ff
    inet 146.122.240.138/23 brd 146.122.241.255 scope global dynamic eth0
       valid_lft 3541sec preferred_lft 3541sec
    inet6 fe80::1a66:daff:fe2e:43ae/64 scope link 
       valid_lft forever preferred_lft forever
3: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 52:54:00:1e:69:de brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever
4: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master virbr0 state DOWN group default qlen 1000
    link/ether 52:54:00:1e:69:de brd ff:ff:ff:ff:ff:ff

Running a simple MPI application (Init + Allreduce + Finalize) with mpirun -np 2 -H node1,node2 -mca orte_base_help_aggregate 0 ./a.out hangs for a while and eventually fails with

--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now fail.

  Local host: node1
  PID:        15830
  Message:    connect() to 192.168.122.1:1040 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now fail.

  Local host: node2
  PID:        25833
  Message:    connect() to 172.17.0.1:1040 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------
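
For reference, a minimal sketch of an Init + Allreduce + Finalize program of the kind described above (not the exact application used here, just an illustrative reproducer):

```c
/* Hypothetical reproducer: a minimal Init + Allreduce + Finalize program
 * of the same shape as the application described above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, in, out;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    in = rank;
    MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: allreduce sum = %d\n", rank, out);

    MPI_Finalize();
    return 0;
}
```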

It seems like there is a connection problem between the virbr0 and docker0 interfaces. I have seen that Open MPI ignores all vir* interfaces, but that's only the case in oob/tcp and not in btl/tcp, right?
Adding -mca btl_tcp_if_include eth0 to the command line causes the program to finish successfully. The same can be achieved with -mca btl_tcp_if_exclude virbr0,docker0,lo.

However, this is not very user-friendly (it requires knowledge of the available network interfaces), and we don't know which network configurations we will come across in the future (so we don't want to hard-code an exclude list in openmpi-mca-params.conf or the like). We are therefore wondering: is there any chance of having this case handled by Open MPI transparently?

Thanks,
Moritz

@bwbarrett
Member

@mkre would you mind running that example with -mca btl_base_verbose 100 and posting the resulting million lines of output as an attachment?

In general, I agree that we should be more customer friendly around virtual and docker interfaces. A straight blacklist is a bit of a problem, because some customers want to use Docker interfaces for MPI (why I don't know, given the performance impact, but they do). Given that, there's no short term "right" fix. Long term, we have half an implementation of using Linux route tables to better select pairings, which should help the default cases significantly.

However, looking at the output of ifconfig and your error messages, our current logic should have handled that case. There's no reason that Open MPI should have tried to connect between 172.17.0.1 and 192.168.122.1; we at least have enough logic (we thought) to not try to route across private ranges. Running with the verbose output I asked for above should help me figure out what went wrong.

@mkre
Author

mkre commented Feb 12, 2019

@bwbarrett, here's the output with increased verbosity:

[node1:40348] mca: base: components_register: registering framework btl components
[node1:40348] mca: base: components_register: found loaded component openib
[node1:40348] mca: base: components_register: component openib register function successful
[node1:40348] mca: base: components_register: found loaded component self
[node1:40348] mca: base: components_register: component self register function successful
[node1:40348] mca: base: components_register: found loaded component sm
[node1:40348] mca: base: components_register: found loaded component tcp
[node1:40348] mca: base: components_register: component tcp register function successful
[node1:40348] mca: base: components_register: found loaded component usnic
[node1:40348] mca: base: components_register: component usnic register function successful
[node1:40348] mca: base: components_register: found loaded component vader
[node1:40348] mca: base: components_register: component vader register function successful
[node1:40348] mca: base: components_open: opening btl components
[node1:40348] mca: base: components_open: found loaded component openib
[node1:40348] mca: base: components_open: component openib open function successful
[node1:40348] mca: base: components_open: found loaded component self
[node1:40348] mca: base: components_open: component self open function successful
[node1:40348] mca: base: components_open: found loaded component tcp
[node1:40348] mca: base: components_open: component tcp open function successful
[node1:40348] mca: base: components_open: found loaded component usnic
[node1:40348] mca: base: components_open: component usnic open function successful
[node1:40348] mca: base: components_open: found loaded component vader
[node1:40348] mca: base: components_open: component vader open function successful
[node1:40348] select: initializing btl component openib
[node1:40348] select: init of component openib returned failure
[node1:40348] mca: base: close: component openib closed     
[node1:40348] mca: base: close: unloading component openib
[node1:40348] select: initializing btl component self                                               
[node1:40348] select: init of component self returned success 
[node1:40348] select: initializing btl component tcp  
[node1:40348] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[node1:40348] btl: tcp: Found match: 127.0.0.1 (lo)   
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1024
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1025
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1026
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1027                
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1028                
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1029                               
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1030                                   
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1031                                
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1032                              
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1033                               
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1034                                
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1035                                   
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1036                              
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1037               
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1038                               
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1039                                
[node1:40348] btl:tcp: Attempting to bind to AF_INET port 1040                                   
[node1:40348] btl:tcp: Successfully bound to AF_INET port 1040                              
[node1:40348] btl:tcp: my listening v4 socket is 0.0.0.0:1040                
[node1:40348] btl:tcp: examining interface eth0                                                 
[node1:40348] btl:tcp: using ipv6 interface eth0                             
[node1:40348] btl:tcp: examining interface docker0                                           
[node1:40348] btl:tcp: using ipv6 interface docker0                                              
[node1:40348] select: init of component tcp returned success                                  
[node1:40348] select: initializing btl component usnic                                      
[node1:40348] btl:usnic: disqualifiying myself because Libfabric does not support v1.3 of the API (v1.3 is *required* for correct usNIC functionality).
[node1:40348] select: init of component usnic returned failure     
[node1:40348] mca: base: close: component usnic closed                                             
[node1:40348] mca: base: close: unloading component usnic          
[node1:40348] select: initializing btl component vader                    
[node1:40348] select: init of component vader returned failure      
[node1:40348] mca: base: close: component vader closed
[node1:40348] mca: base: close: unloading component vader
[node2:13439] mca: base: components_register: registering framework btl components
[node2:13439] mca: base: components_register: found loaded component openib
[node2:13439] mca: base: components_register: component openib register function successful
[node2:13439] mca: base: components_register: found loaded component self
[node2:13439] mca: base: components_register: component self register function successful
[node2:13439] mca: base: components_register: found loaded component sm
[node2:13439] mca: base: components_register: found loaded component tcp
[node2:13439] mca: base: components_register: component tcp register function successful
[node2:13439] mca: base: components_register: found loaded component usnic
[node2:13439] mca: base: components_register: component usnic register function successful
[node2:13439] mca: base: components_register: found loaded component vader
[node2:13439] mca: base: components_register: component vader register function successful
[node2:13439] mca: base: components_open: opening btl components
[node2:13439] mca: base: components_open: found loaded component openib
[node2:13439] mca: base: components_open: component openib open function successful
[node2:13439] mca: base: components_open: found loaded component self
[node2:13439] mca: base: components_open: component self open function successful
[node2:13439] mca: base: components_open: found loaded component tcp
[node2:13439] mca: base: components_open: component tcp open function successful
[node2:13439] mca: base: components_open: found loaded component usnic
[node2:13439] mca: base: components_open: component usnic open function successful
[node2:13439] mca: base: components_open: found loaded component vader
[node2:13439] mca: base: components_open: component vader open function successful
[node2:13439] select: initializing btl component openib
[node2:13439] select: init of component openib returned failure
[node2:13439] mca: base: close: component openib closed
[node2:13439] mca: base: close: unloading component openib
[node2:13439] select: initializing btl component self
[node2:13439] select: init of component self returned success
[node2:13439] select: initializing btl component tcp
[node2:13439] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[node2:13439] btl: tcp: Found match: 127.0.0.1 (lo)
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1024
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1025
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1026
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1027
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1028
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1029
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1030
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1031
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1032
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1033
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1034
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1035
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1036
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1037
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1038
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1039
[node2:13439] btl:tcp: Attempting to bind to AF_INET port 1040
[node2:13439] btl:tcp: Successfully bound to AF_INET port 1040
[node2:13439] btl:tcp: my listening v4 socket is 0.0.0.0:1040
[node2:13439] btl:tcp: examining interface eth0
[node2:13439] btl:tcp: using ipv6 interface eth0
[node2:13439] btl:tcp: examining interface virbr0
[node2:13439] btl:tcp: using ipv6 interface virbr0
[node2:13439] select: init of component tcp returned success
[node2:13439] select: initializing btl component usnic
[node2:13439] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[node2:13439] select: init of component usnic returned failure
[node2:13439] mca: base: close: component usnic closed
[node2:13439] mca: base: close: unloading component usnic
[node2:13439] select: initializing btl component vader
[node2:13439] select: init of component vader returned failure
[node2:13439] mca: base: close: component vader closed
[node2:13439] mca: base: close: unloading component vader
[node2:13439] mca: bml: Using self btl for send to [[10343,1],0] on node node2
[node1:40348] mca: bml: Using self btl for send to [[10343,1],1] on node node1
[node1:40348] btl:tcp: path from 146.122.240.139 to 146.122.240.138: IPV4 PUBLIC SAME NETWORK
[node1:40348] btl:tcp: path from 146.122.240.139 to 192.168.122.1: IPV4 PRIVATE DIFFERENT NETWORK
[node1:40348] btl:tcp: path from 172.17.0.1 to 146.122.240.138: IPV4 PRIVATE DIFFERENT NETWORK
[node1:40348] btl:tcp: path from 172.17.0.1 to 192.168.122.1: IPV4 PRIVATE DIFFERENT NETWORK
[node2:13439] btl:tcp: path from 146.122.240.138 to 146.122.240.139: IPV4 PUBLIC SAME NETWORK
[node2:13439] btl:tcp: path from 146.122.240.138 to 172.17.0.1: IPV4 PRIVATE DIFFERENT NETWORK
[node2:13439] btl:tcp: path from 192.168.122.1 to 146.122.240.139: IPV4 PRIVATE DIFFERENT NETWORK
[node2:13439] btl:tcp: path from 192.168.122.1 to 172.17.0.1: IPV4 PRIVATE DIFFERENT NETWORK
[node2:13439] mca: bml: Using tcp btl for send to [[10343,1],1] on node node1
[node2:13439] btl:tcp: path from 146.122.240.138 to 146.122.240.139: IPV4 PUBLIC SAME NETWORK
[node2:13439] btl:tcp: path from 146.122.240.138 to 172.17.0.1: IPV4 PRIVATE DIFFERENT NETWORK
[node2:13439] btl:tcp: path from 192.168.122.1 to 146.122.240.139: IPV4 PRIVATE DIFFERENT NETWORK
[node2:13439] btl:tcp: path from 192.168.122.1 to 172.17.0.1: IPV4 PRIVATE DIFFERENT NETWORK
[node2:13439] mca: bml: Using tcp btl for send to [[10343,1],1] on node node1
[node2:13439] btl: tcp: attempting to connect() to [[10343,1],1] address 172.17.0.1 on port 1040
[node1:40348] mca: bml: Using tcp btl for send to [[10343,1],0] on node node2
[node1:40348] btl:tcp: path from 146.122.240.139 to 146.122.240.138: IPV4 PUBLIC SAME NETWORK
[node1:40348] btl:tcp: path from 146.122.240.139 to 192.168.122.1: IPV4 PRIVATE DIFFERENT NETWORK
[node1:40348] btl:tcp: path from 172.17.0.1 to 146.122.240.138: IPV4 PRIVATE DIFFERENT NETWORK
[node1:40348] btl:tcp: path from 172.17.0.1 to 192.168.122.1: IPV4 PRIVATE DIFFERENT NETWORK
[node1:40348] mca: bml: Using tcp btl for send to [[10343,1],0] on node node2
[node2:13439] btl:tcp: would block, so allowing background progress
[node1:40348] btl: tcp: attempting to connect() to [[10343,1],0] address 192.168.122.1 on port 1040
[node1:40348] btl:tcp: would block, so allowing background progress
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now fail.

  Local host: node1
  PID:        40348
  Message:    connect() to 192.168.122.1:1040 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now fail.

  Local host: node2
  PID:        13439
  Message:    connect() to 172.17.0.1:1040 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------

> In general, I agree that we should be more customer friendly around virtual and docker interfaces. A straight blacklist is a bit of a problem, because some customers want to use Docker interfaces for MPI (why I don't know, given the performance impact, but they do). Given that, there's no short term "right" fix. Long term, we have half an implementation of using Linux route tables to better select pairings, which should help the default cases significantly.

That's great news! By the way, is there any risk in globally excluding the lo interface? I'm wondering because my example works if I exclude docker0,lo, but not if I exclude only docker0. Shouldn't Open MPI always ignore lo and rely on self/vader instead? It's a bit confusing, because I don't need to exclude lo explicitly when everything works right away (i.e., when I don't hit the docker0 issue).

Thanks,
Moritz

@bwbarrett bwbarrett added the bug label Feb 12, 2019
@bwbarrett
Member

lo (actually, the 127.0.0.0/8 IPv4 range) is in the default exclude list, which is why it works when you don't explicitly exclude any device. When you set the exclude list to docker0, that overrides the default list, so localhost is no longer excluded. In your case, I'd set the exclude list to docker0,127.0.0.0/8,sppp.

It looks like our selection logic has an issue; it recognizes that virbr0 on node2 and docker0 on node1 are on different private IP networks, but because it tries to pair up every interface, it goes ahead and creates a pairing anyway. As I said, we have a backlog item that should fix this properly, but we don't have an immediate good fix other than updating the exclude list.
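
For illustration, a simplified sketch (not Open MPI's actual selection code) of how the local/remote address pairs from the verbose log above end up classified:

```c
/* Simplified sketch (not Open MPI's actual code) of classifying a
 * local/remote IPv4 pair, mirroring the verbose output lines such as
 * "path from X to Y: IPV4 PRIVATE DIFFERENT NETWORK". */
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

static int is_private(uint32_t a)            /* a in host byte order */
{
    return ((a & 0xff000000u) == 0x0a000000u)    /* 10.0.0.0/8     */
        || ((a & 0xfff00000u) == 0xac100000u)    /* 172.16.0.0/12  */
        || ((a & 0xffff0000u) == 0xc0a80000u);   /* 192.168.0.0/16 */
}

static void classify(const char *local, const char *remote, int prefix)
{
    uint32_t l = ntohl(inet_addr(local));
    uint32_t r = ntohl(inet_addr(remote));
    uint32_t mask = prefix ? 0xffffffffu << (32 - prefix) : 0;

    printf("path from %s to %s: IPV4 %s %s NETWORK\n", local, remote,
           (is_private(l) || is_private(r)) ? "PRIVATE" : "PUBLIC",
           ((l & mask) == (r & mask)) ? "SAME" : "DIFFERENT");
}

int main(void)
{
    classify("146.122.240.139", "146.122.240.138", 23);  /* eth0    <-> eth0   */
    classify("172.17.0.1",      "192.168.122.1",   16);  /* docker0 <-> virbr0 */
    return 0;
}
```

Even the pairs that come out as PRIVATE DIFFERENT NETWORK are still kept as candidates, which is why node1's docker0 and node2's virbr0 end up paired and the connect() attempts fail.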

@mkre
Author

mkre commented Feb 13, 2019

Your explanation of the exclude list makes sense, thanks! I guess we'll go with the manual blacklist approach, then.

One last question: Is there a GitHub issue associated with the backlog item you're talking about? I couldn't find an obvious hit while searching for it. I'm asking because I'd like to stay in the loop on this issue.

@bwbarrett
Member

It really all boils down to the issues in #5818. That issue is one of the ones I'd really like to fix, but I have to figure out how to prioritize it against other things at work.

@wckzhang
Contributor

This issue should be resolved with #7134
