btl tcp: Use reachability and graph solving for global interface matching #7134
Conversation
(force-pushed from f96e297 to d6aa11c)
I opened a different issue at … So far I don't think this PR addresses my bind() mismatch, but I wanted to get your thoughts since this PR is in the same code. My concern with the design is how mca_btl_tcp_create() walks opal_if_list, creating one btl for each interface and storing its selection in btl->tcp_ifaddr. But it looks to me like the endpoint selection here is associated with a specific btl while considering all the local interfaces, so it might pick something that's incompatible with the btl's btl->tcp_ifaddr. Then, if a mismatch is possible, it would show up in the bind() in btl_tcp_endpoint.c. Overall I think I like having the btl look over all the interfaces, so I'm wondering…
Hello, I cannot open the IBM GitHub PR, since I don't have an account.
I think it does address the mismatch. If you are referring to the issue in #7115, it looks like either a reachability problem or a case where the address one side selects does not match the address the other side selects. Both issues are addressed by this patch, by using the reachability framework and by moving the matching into a global matching problem. If possible, can you run with my patch and see if you still see the same problem?
This is okay because we do intend for each btl to represent one interface.
All local interfaces are considered here because it's a global matching problem: we match all local interfaces with all remote interfaces and store this information in the global component structure. This isn't done for each btl separately, so we need to consider all local interfaces; this is required to optimize the pairing. There shouldn't be anything incompatible with the btl's btl->tcp_ifaddr; that is exactly what the reachability component is supposed to address. After we have created a reachability map, we solve for interface pairing by turning it into a flow problem with a source and sink, using negative reachability values as edge weights. It should not be possible to match something that is unreachable, as long as the reachability framework is functioning correctly.
I'm not sure what the original reasoning for one tcp btl per local interface was, but I don't see much wrong with the design.
I'm not sure what you mean... it looks like it binds using btl_endpoint->endpoint_btl->tcp_ifaddr and connects using btl_endpoint->endpoint_addr.
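To make the global-matching idea concrete, here is a small self-contained sketch. It is not the PR's code (which builds an opal_bp_graph and runs a flow-based bipartite assignment solver); it just exhaustively searches assignments over a hypothetical reachability weight matrix, maximizing total weight and never pairing interfaces whose weight is 0 (unreachable):

```c
#include <stdio.h>

#define NL 3  /* local interfaces  */
#define NR 3  /* remote interfaces */

/* Hypothetical reachability weights (0 = unreachable), as the reachable
 * framework might report them for a pair of 3-interface hosts. */
static const int weight[NL][NR] = {
    { 0, 10,  0 },
    { 8,  0,  3 },
    { 0,  5,  9 },
};

static int best_total;
static int best_match[NL], cur_match[NL];
static int used[NR];

/* Exhaustive search over assignments -- fine for tiny n.  The PR instead
 * solves this as a min-cost flow on a bipartite graph (opal_bp_graph_*),
 * which scales polynomially. */
static void solve(int l, int total)
{
    if (l == NL) {
        if (total > best_total) {
            best_total = total;
            for (int i = 0; i < NL; i++) {
                best_match[i] = cur_match[i];
            }
        }
        return;
    }
    cur_match[l] = -1;       /* a local interface may stay unmatched */
    solve(l + 1, total);
    for (int r = 0; r < NR; r++) {
        if (!used[r] && weight[l][r] > 0) {  /* weight 0 => never pair */
            used[r] = 1;
            cur_match[l] = r;
            solve(l + 1, total + weight[l][r]);
            used[r] = 0;
        }
    }
}

int main(void)
{
    for (int i = 0; i < NL; i++) {
        best_match[i] = -1;
    }
    solve(0, 0);
    printf("best total weight: %d\n", best_total);
    for (int l = 0; l < NL; l++) {
        printf("local %d -> remote %d\n", l, best_match[l]);
    }
    return 0;
}
```

A greedy, per-btl point-in-time choice can grab a remote interface that a later btl needed and strand that btl on a low-weight or unreachable pair; solving all pairs at once avoids that.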
Maybe Mark meant to point to #7115.
Ah thanks, I totally had a cut-and-paste error in my question; ignore that reference I made to github.ibm.com. Thanks for the explanations, it sounds plausible. So far I've only browsed the code. It's definitely possible that the matching results in the btl picking endpoints that match what it's going to bind to. I'll try downloading and building it next.
bot:retest
I got a chance to test it: I think I agree now that it does address my bind() mismatch, in addition to doing the rest of what it does. I like that it has one TCP btl per interface and that each uses the interface it's set up with. Previously, each tcp btl was potentially spanning all the interfaces with no particular pattern. I haven't really looked at how the load gets distributed over the multiple BTLs; I have maybe a slight concern that the old code could have lucked into a better distribution of traffic if everything went over btl[0], which seems like it might be the case at least for small messages. I do have a failure though when I went from -np 2 to -np 3 (--host hostA:2,hostB:1).
The above was with … If I instead switch to …
Looking into it, thanks.
@markalle Can you paste your whole mpirun command? (I can't reproduce the issue; how many interfaces do you have on your devices?)
The "bad parameter" non-fatal error message comes from this command The fatal command is |
I still couldn't reproduce it; I ran the following with no issues: …
Also, the OPAL_ERROR didn't change much when I set GRAPH_DEBUG; the message just became: …
I switched to your command line and looped over each interface individually for … I don't know why I didn't just try single host before, but that fails too with vader removed: …
Still can't reproduce with: …
I think it's probably this line that's happening: …
Can you verify, in mca_btl_tcp_proc_create_interface_graph, the size of local_list and remote_list, and how many vertices and edges are created?
@markalle GitHub pro tip: use three backticks on a line by themselves to start and end a verbatim section of text. See https://help.github.com/en/github/writing-on-github/basic-writing-and-formatting-syntax#quoting-code
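For example, wrapping pasted output like this renders it verbatim:

````
```
paste your command and its output here
```
````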
I printed the size of both lists; they were both 1. Then over in the failure location, the return value from opal_bp_graph_bipartite_to_flow() is -5, and the gx structure is: … if that helps.
My first printout was using opal_list_get_size(local_list) and opal_list_get_size(remote_list). I just now printed results->num_local and results->num_remote and they were also 1. And the call to opal_bp_graph_add_edge() never happens in that x,y loop because the cost it sees is 0. |
So maybe wherever it's getting its data from is suspect on my architecture? If I artificially change the cost to 1 in that location, it passes.
That means the reachability framework is incorrectly identifying your interfaces as unreachable; that's a bug in the reachability framework, then. It sets the weight to 0 when there is no connection available.
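For context, the x,y loop being discussed looks roughly like the sketch below. The types and the add_edge callback are simplified stand-ins (the real code uses the reachable framework's result struct and opal_bp_graph_add_edge(), whose exact signatures aren't reproduced here): an edge is created only when the weight is nonzero, with cost = -weight so the min-cost solver prefers the highest reachability.

```c
/* Simplified stand-ins for the real types: the reachable framework's
 * result (num_local, num_remote, a weight per local/remote pair) and a
 * callback in place of opal_bp_graph_add_edge(). */
struct reach_results {
    int num_local;
    int num_remote;
    int **weights;   /* weights[x][y]: local x -> remote y, 0 = unreachable */
};
typedef void (*add_edge_fn)(void *graph, int u, int v, int cost);

/* Sketch of the x,y loop in mca_btl_tcp_proc_create_interface_graph:
 * local vertices are 0..num_local-1, remote vertices follow them. */
static void build_edges(void *graph, add_edge_fn add_edge,
                        const struct reach_results *results)
{
    for (int x = 0; x < results->num_local; x++) {
        for (int y = 0; y < results->num_remote; y++) {
            int w = results->weights[x][y];
            if (0 == w) {
                continue;  /* unreachable pair: no edge, so never matched */
            }
            /* Negative cost: the min-cost solver then maximizes weight. */
            add_edge(graph, x, results->num_local + y, -w);
        }
    }
}
```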
In the netlink get_weights() I added a print for local_if and remote_if, and they both have the right IP, 10.100.3.17, but I did notice local_if->if_name is set to "ib0" while remote_if->if_name is an empty string. Maybe it's not needed, but it made me wonder whether it has all the fields it needs set up. opal_reachable_netlink_rt_lookup() is then returning 113. What can I print that would be useful there? After …
I think what's happening is that it's being told to look for a route using "ib0" for on-host traffic, but it's only seeing routes that use "lo". In the code we end up with the callback opal_reachable_netlink_rt_raw_parse_cb() being triggered with … So it probably only sees "lo" as routing on-host but is trying to find something using "ib0". In the past I've always thought … What do you think? If vader is out of the btl list and tcp is being used on-host, is it invalid to then use an interface list that leaves out "lo"? I tend to think "lo" is a loopback but not the only loopback, and it should have worked with just "ib0". Thoughts?
The remote_if->if_name being empty was intentional. I actually changed the documentation of the reachable function to specify this (since there was no prior specification).
This looks like no route has been found by netlink. I'm guessing it may return that when a non-loopback interface is trying to use loopback; I'm not sure if this is a bug or an intended feature. Try using the weighted reachability component instead of netlink and see if that succeeds without lo.
I'm not sure what the valid behavior should be; hopefully @bwbarrett or @jsquyres can chime in on this. From my perspective, vader should be used for loopback, but in the case where it isn't, if ib0 is capable of loopback, it should be deemed reachable.
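(For reference, selecting the component follows the usual MCA pattern, so something like `mpirun --mca reachable weighted ...` should force the weighted component in place of netlink; this is standard MCA selection syntax, not a command taken from this thread.)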
During the telecon meeting, we discussed the path forward here. I will change the reachability framework documentation to say that an interface is always shown as reachable from itself. Then I will change the netlink component to report reachability when two interfaces are identical.
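A minimal sketch of what that netlink change could look like, assuming opal_if_t's field names and two hypothetical helpers (MAX_WEIGHT and netlink_route_weight()); the actual patch may compare differently:

```c
#include <string.h>
#include "opal/util/if.h"   /* opal_if_t */

/* Hypothetical sketch: before asking netlink for a route, treat an
 * interface as always reachable from itself, per the updated framework
 * documentation.  MAX_WEIGHT and netlink_route_weight() are stand-ins,
 * not names from the real component. */
static int interface_weight(const opal_if_t *local_if, const opal_if_t *remote_if)
{
    if (local_if->af_family == remote_if->af_family &&
        0 == memcmp(&local_if->if_addr, &remote_if->if_addr,
                    sizeof(local_if->if_addr))) {
        /* Identical address: same interface, no route lookup needed. */
        return MAX_WEIGHT;
    }
    /* Otherwise fall back to the existing netlink route lookup. */
    return netlink_route_weight(local_if, remote_if);
}
```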
That sounds great to me. I agree with an earlier comment that vader is the logical thing to use for loopback, and on-host TCP isn't something we generally do; I just like having things like that working for debugging purposes.
Please address my two comments before merging.
opal/mca/btl/tcp/btl_tcp.c:

```diff
@@ -118,7 +114,7 @@ int mca_btl_tcp_add_procs( struct mca_btl_base_module_t* btl,
     }

     tcp_endpoint->endpoint_btl = tcp_btl;
-    rc = mca_btl_tcp_proc_insert(tcp_proc, tcp_endpoint);
+    rc = mca_btl_tcp_proc_insert(tcp_proc, tcp_endpoint, tcp_btl);
```
the BTL is already stored in the endpoint (look at the line above).
Okay, thanks, I'll change that.
```c
    OBJ_DESTRUCT(&tcp_proc->proc_lock);
}

static inline int mca_btl_tcp_proc_is_proc_left(opal_process_name_t a,
                                                opal_process_name_t b)
```
What would be the qualifier for being a `proc_left`?
proc_left is just there so that when two ranks, A and B, are trying to match interfaces, the graphs they create will be identical: if on rank A the bipartite flow goes from A -> B, then on rank B the bipartite flow goes from A -> B as well. Thus, A is the left proc. It has no significance other than allowing the two processes to create identical graphs.
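For illustration, a deterministic ordering with that property can be as simple as the sketch below (assuming opal_process_name_t's jobid/vpid fields; the PR's actual comparison may differ):

```c
#include <stdbool.h>
#include "opal/util/proc.h"   /* opal_process_name_t (jobid, vpid) */

/* Sketch of a strict total order on process names.  Both peers evaluate
 * this with the same two arguments, so they agree on which proc is
 * "left" and therefore construct identical bipartite graphs. */
static inline bool proc_is_left(opal_process_name_t a, opal_process_name_t b)
{
    if (a.jobid != b.jobid) {
        return a.jobid < b.jobid;
    }
    return a.vpid < b.vpid;
}
```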
(force-pushed from debc302 to e958f3c)
Mellanox failure seems unrelated.
Does anyone know how to retrigger Mellanox CI?
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Can someone look into why Mellanox CI is hitting a git checkout issue?
bot:retest
Hmmm. a …
The bot command hasn't been updated to use the new Mellanox Azure CI.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
I've asked Mellanox if they can support the `bot:retest` keyword.
Regarding triggering Mellanox CI with the old keywords: Azure Pipelines by default supports only its own trigger comments (e.g., `/azp run`). CC @amaslenn
@artemry-mlnx Any news on being able to support the Open MPI-usual `bot:retest` keywords?
I'm tracking both issues and will inform you of any progress.
Can we make a new issue to track this (rather than commenting on an old/merged PR)? FWIW: the git clone error "reference is not a tree" is likely because AZP is trying to clone before GitHub has made the merge hash yet. There's a flag in the JSON that they're supposed to be checking to see if GitHub has the merge hash ready; if it's not set, the clone will fail with this error. The Jenkins GitHub plugin had this problem for a while, too.
Previously we used a fairly simple algorithm in
mca_btl_tcp_proc_insert() to pair local and remote modules. This was a
point-in-time solution rather than a global optimization problem (where
global means all modules between two peers). The selection logic would
often fail due to pairing interfaces that are not routable for traffic.
The complexity of the selection logic was Θ(n^n), which was expensive.
Due to poor scalability, this logic was only used when the number of
interfaces was less than MAX_PERMUTATION_INTERFACES (default 8). More
details can be found in this ticket:
https://svn.open-mpi.org/trac/ompi/ticket/2031 (The complexity estimates
in the ticket do not match what I calculated from the function)
As a fallback, when interfaces surpassed this threshold, a brute force
O(n^2) double for loop was used to match interfaces.
This commit solves two problems. First, the point-in-time solution is
turned into a global optimization solution. Second, the reachability
framework was used to create a more realistic reachability map. We
switched from using IP/netmask to using the reachability framework,
which supports route lookup. This will help many corner cases as well as
utilize any future development of the reachability framework.
The solution implemented in this commit has a complexity mainly derived
from the bipartite assignment solver. If the local and remote peer both
have the same number of interfaces (n), the complexity of matching will
be O(n^5).
With the decrease in complexity to O(n^5), I calculated and tested
that initialization costs would be 5000 microseconds with 30 interfaces
per node (likely close to the maximum realistic number of interfaces we
will encounter). For additional datapoints, data for up to 300 interfaces
(a very unrealistic number) was simulated. Up until 150
interfaces, the matching costs will be less than 1 second, climbing to
10 seconds with 300 interfaces. Reflecting on these results, I removed
the suboptimal O(n^2) fallback logic, as it no longer seems necessary.
Data was gathered comparing the scaling of initialization costs with
ranks. For a low number of interfaces, the impact of initialization is
negligible. At an interface count of 7-8, the new code has slightly
faster initialization costs. At an interface count of 15, the new code
has slower initialization costs. However, all initialization costs
scale linearly with the number of ranks.
In order to use the reachable function, we populate local and remote
lists of interfaces. We then convert the interface matching problem
into a graph problem. We create a bipartite graph with the local and
remote interfaces as vertices and use negative reachability weights as
costs. Using the bipartite assignment solver, we generate the matches
for the graph. To ensure that both the local and remote process have
the same output, we ensure we mirror their respective inputs for the
graphs. Finally, we store the endpoint matches that we created earlier
in a hash table. This is stored with the btl_index as the key and a
struct mca_btl_tcp_addr_t* as the value. This is then retrieved during
insertion time to set the endpoint address.
Signed-off-by: William Zhang <wilzhang@amazon.com>
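As an illustration of that last store-and-retrieve step, here is a minimal sketch using OPAL's uint32-keyed hash table. The wrapper names (matched_endpoints, store_match, lookup_match) are invented for the example; only the opal_hash_table_* calls and the btl_index-to-mca_btl_tcp_addr_t* mapping come from the commit message:

```c
#include "opal/class/opal_hash_table.h"
#include "btl_tcp_addr.h"   /* struct mca_btl_tcp_addr_t */

/* Hypothetical wrappers for the pattern the commit describes:
 * key = btl_index, value = the matched remote address. */
static opal_hash_table_t matched_endpoints;

static void matches_init(void)
{
    OBJ_CONSTRUCT(&matched_endpoints, opal_hash_table_t);
    opal_hash_table_init(&matched_endpoints, 16);  /* small initial size */
}

static void store_match(uint32_t btl_index, struct mca_btl_tcp_addr_t *addr)
{
    /* Record the interface pairing chosen by the graph solver. */
    opal_hash_table_set_value_uint32(&matched_endpoints, btl_index, addr);
}

/* Called at mca_btl_tcp_proc_insert() time to set the endpoint address. */
static struct mca_btl_tcp_addr_t *lookup_match(uint32_t btl_index)
{
    void *value = NULL;
    if (OPAL_SUCCESS !=
        opal_hash_table_get_value_uint32(&matched_endpoints, btl_index, &value)) {
        return NULL;   /* no reachable pairing was found for this btl */
    }
    return (struct mca_btl_tcp_addr_t *) value;
}
```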