TCP BTL fails in the presence of virbr0/docker0 interfaces #6377
Comments
@mkre would you mind running that example with verbose output enabled? In general, I agree that we should be more customer-friendly around virtual and Docker interfaces. A straight blacklist is a bit of a problem, because some customers want to use Docker interfaces for MPI (why, I don't know, given the performance impact, but they do). Given that, there's no short-term "right" fix. Long term, we have half an implementation of using Linux route tables to better select pairings, which should help the default cases significantly.

However, looking at the output of ifconfig and your error messages, our current logic should have handled that case. There's no reason that Open MPI should have tried to connect between 172.17.0.1 and 192.168.122.1; we at least have enough logic (we thought) to not try to route across private ranges. Running with the verbose output I asked for above should help me figure out what went wrong.
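A typical way to get this kind of verbose output from the TCP BTL, assuming the standard `btl_base_verbose` MCA parameter (the exact flag and level asked for above may differ), is:

```sh
# Sketch only: btl_base_verbose is the usual MCA knob for BTL debug output;
# the exact parameter and level requested in the comment above may differ.
mpirun -np 2 -H node1,node2 \
       -mca orte_base_help_aggregate 0 \
       -mca btl_base_verbose 100 \
       ./a.out
```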
@bwbarrett, here's the output with increased verbosity:
That's great news! By the way, is there any risk in globally disabling the `virbr0` and `docker0` interfaces?

Thanks,
Moritz
It looks like our selection logic has an issue; it recognizes that …
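As an illustration of the kind of information the route-table-based pairing mentioned earlier would use, one can ask the kernel from the shell which route it would pick for the peer addresses from the error report (addresses taken from the report above; this is only a diagnostic sketch, not part of the original discussion):

```sh
# Ask the kernel which interface and source address would be used to reach
# the libvirt bridge address reported in the error on the other node:
ip route get 192.168.122.1

# ...and likewise for the Docker bridge address:
ip route get 172.17.0.1
```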
Your explanation with the exclude list makes sense, thanks! I guess we'll go with the manual blacklist approach, then. One last question: is there a GitHub issue associated with the backlog item you're talking about? I couldn't find an obvious hit while searching for it. I'm asking because I'd like to stay in the loop regarding this issue.
It really all boils down to the issues in #5818. That issue is one of the ones I'd really like to fix, but I have to figure out how to prioritize it against other things at work.
This issue should be resolved with #7134.
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
v3.1.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
built from source
Please describe the system on which you are running
Details of the problem
I have a TCP network of two nodes with different network interfaces. The output of `ip addr` is as follows:

node1:

node2:
Running a simple MPI application (Init + Allreduce + Finalize) with `mpirun -np 2 -H node1,node2 -mca orte_base_help_aggregate 0 ./a.out` hangs for a while and eventually fails with:

It seems like there is a connection problem between the `virbr0` and `docker0` interfaces. I have seen that Open MPI ignores all `vir*` interfaces, but that's only the case in `oob/tcp` and not in `btl/tcp`, right?

Adding `-mca btl_tcp_if_include eth0` to the command line causes the program to finish successfully. The same can be achieved with `-mca btl_tcp_if_exclude virbr0,docker0,lo`; see the sketch below.
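For reference, a minimal sketch of the two command-line workarounds, plus the persistent form the next paragraph refers to (assuming the usual `name = value` syntax of `openmpi-mca-params.conf`; hostnames and interface names are taken from the report):

```sh
# Restrict the TCP BTL to the Ethernet interface only:
mpirun -np 2 -H node1,node2 -mca btl_tcp_if_include eth0 ./a.out

# Or, equivalently, exclude the virtual bridge interfaces and loopback:
mpirun -np 2 -H node1,node2 -mca btl_tcp_if_exclude virbr0,docker0,lo ./a.out

# Making the exclusion persistent (the hard-coding the report would rather
# avoid) would be a single line in openmpi-mca-params.conf:
#   btl_tcp_if_exclude = virbr0,docker0,lo
```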
However, as this is not very user-friendly (it requires knowledge of the available network interfaces, etc.) and we don't know which network configurations we will come across in the future (hence, we don't want to hard-code this in `openmpi-mca-params.conf` or the like), we are wondering: is there any chance of having this case handled by Open MPI transparently?

Thanks,
Moritz