btl tcp: Use reachability and graph solving for global interface matching #7134
Conversation
(force-pushed from f96e297 to d6aa11c)
I opened a different issue at … So far I don't think this PR addresses my bind() mismatch, but I wanted to get your thoughts since this PR is in the same code. My concern with the design is how mca_btl_tcp_create() walks opal_if_list, creating one btl for each interface and storing its selection in btl->tcp_ifaddr. But it looks to me like the endpoint selection here is associated with a specific btl while considering all the local interfaces, so it might pick something that's incompatible with the btl's btl->tcp_ifaddr. Then, if a mismatch is possible, it would show up in the bind() in btl_tcp_endpoint.c. Overall I think I like having the btl look over all the interfaces, so I'm wondering…
Hello, I cannot open the IBM GitHub PR, since I don't have an account.
I think it does address the mismatch. If you are referring to the issue in #7115, it looks like either a reachability problem or a case where the address one side selects does not match the address the other side selects. Both issues are addressed by this patch, by using the reachability framework and by moving the matching into a global matching problem. If possible, can you run with my patch and see if you still see the same problem?
This is okay because we do intend for each btl to represent one interface.
All local interfaces are considered here because it's a global matching problem: we match all local interfaces with all remote interfaces and store this information in the global component structure. This isn't done for each btl separately, so we need to consider all local interfaces; this is required to optimize the pairing. There shouldn't be anything incompatible with the btl's btl->tcp_ifaddr; that is exactly what the reachability component is supposed to address. After we have created a reachability map, we solve for interface pairing by turning it into a flow problem with a source and sink, using negative reachability values as edge weights. It should not be possible to match something that is unreachable, as long as the reachability framework is functioning correctly.
I'm not sure what the original reasoning for one tcp btl per local interface was, but I don't see much wrong with the design.
I'm not sure what you mean... it looks like it binds using btl_endpoint->endpoint_btl->tcp_ifaddr and connects using btl_endpoint->endpoint_addr.
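To make the global-matching idea concrete, here is a small self-contained sketch. It is not the PR's code (which builds an opal_bp_graph and runs a flow-based bipartite assignment solver); it just exhaustively searches assignments over a hypothetical reachability weight matrix, maximizing total weight and never pairing interfaces whose weight is 0 (unreachable):

```c
#include <stdio.h>

#define NL 3  /* local interfaces  */
#define NR 3  /* remote interfaces */

/* Hypothetical reachability weights (0 = unreachable), as the reachable
 * framework might report them for a pair of 3-interface hosts. */
static const int weight[NL][NR] = {
    { 0, 10,  0 },
    { 8,  0,  3 },
    { 0,  5,  9 },
};

static int best_total;
static int best_match[NL], cur_match[NL];
static int used[NR];

/* Exhaustive search over assignments -- fine for tiny n.  The PR instead
 * solves this as a min-cost flow on a bipartite graph (opal_bp_graph_*),
 * which scales polynomially. */
static void solve(int l, int total)
{
    if (l == NL) {
        if (total > best_total) {
            best_total = total;
            for (int i = 0; i < NL; i++) {
                best_match[i] = cur_match[i];
            }
        }
        return;
    }
    cur_match[l] = -1;       /* a local interface may stay unmatched */
    solve(l + 1, total);
    for (int r = 0; r < NR; r++) {
        if (!used[r] && weight[l][r] > 0) {  /* weight 0 => never pair */
            used[r] = 1;
            cur_match[l] = r;
            solve(l + 1, total + weight[l][r]);
            used[r] = 0;
        }
    }
}

int main(void)
{
    for (int i = 0; i < NL; i++) {
        best_match[i] = -1;
    }
    solve(0, 0);
    printf("best total weight: %d\n", best_total);
    for (int l = 0; l < NL; l++) {
        printf("local %d -> remote %d\n", l, best_match[l]);
    }
    return 0;
}
```

A greedy, per-btl point-in-time choice can grab a remote interface that a later btl needed and strand that btl on a low-weight or unreachable pair; solving all pairs at once avoids that.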
Maybe Mark meant to point to #7115.
Ah thanks, I totally had a cut-and-paste error in my question; ignore that reference I made to github.ibm.com. Thanks for the explanations, it sounds plausible. So far I've only browsed the code. It's definitely possible that the matching results in the btl picking endpoints that match what it's going to bind to. I'll try downloading and building it next.
bot:retest
I got a chance to test it: I think I agree now that it does address my bind() mismatch, in addition to doing the rest of what it does. I like that it has one TCP btl per interface and that each uses the interface it's set up with. Previously, each tcp btl was potentially spanning all the interfaces with no particular pattern. I haven't really looked at how the load gets distributed over the multiple BTLs; I have maybe a slight concern that the old code could have lucked into a better distribution of traffic if everything went over btl[0], which seems like it might be the case at least for small messages. I do have a failure though when I went from -np 2 to -np 3 (--host hostA:2,hostB:1).
The above was with … If I instead switch to …
Looking into it, thanks.
@markalle Can you paste your whole mpirun command? (I can't reproduce the issue; how many interfaces do you have on your devices?)
The "bad parameter" non-fatal error message comes from this command The fatal command is |
I still couldn't reproduce it; I ran the following with no issues: …
Also, the OPAL_ERROR didn't change much when I set GRAPH_DEBUG; the message just became: …
I switched to your command line and looped over each interface individually for … I don't know why I didn't just try single host before, but that fails too with vader removed: …
Still can't reproduce with: …
I think it's probably this line that's happening: …
Can you verify, in mca_btl_tcp_proc_create_interface_graph, the size of local_list and remote_list, and how many vertices and edges are created?
@markalle GitHub pro tip: use three backticks on a line by themselves to start and end a verbatim section of text. See https://help.github.com/en/github/writing-on-github/basic-writing-and-formatting-syntax#quoting-code
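For example, wrapping pasted output like this renders it verbatim:

````
```
paste your command and its output here
```
````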
I printed the size of both lists; they were both 1. Then over in the failure location, the return value from opal_bp_graph_bipartite_to_flow() is -5, and the gx structure is: … if that helps.
My first printout was using opal_list_get_size(local_list) and opal_list_get_size(remote_list). I just now printed results->num_local and results->num_remote and they were also 1. And the call to opal_bp_graph_add_edge() never happens in that x,y loop because the cost it sees is 0. |
So maybe wherever it's getting its data from is suspect on my architecture? If I artificially change the cost to 1 in that location, it passes.
That means the reachability framework is incorrectly identifying your interfaces as unreachable; that's a bug in the reachability framework, then. It sets the weight to 0 when there is no connection available.
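For context, the x,y loop being discussed looks roughly like the sketch below. The types and the add_edge callback are simplified stand-ins (the real code uses the reachable framework's result struct and opal_bp_graph_add_edge(), whose exact signatures aren't reproduced here): an edge is created only when the weight is nonzero, with cost = -weight so the min-cost solver prefers the highest reachability.

```c
/* Simplified stand-ins for the real types: the reachable framework's
 * result (num_local, num_remote, a weight per local/remote pair) and a
 * callback in place of opal_bp_graph_add_edge(). */
struct reach_results {
    int num_local;
    int num_remote;
    int **weights;   /* weights[x][y]: local x -> remote y, 0 = unreachable */
};
typedef void (*add_edge_fn)(void *graph, int u, int v, int cost);

/* Sketch of the x,y loop in mca_btl_tcp_proc_create_interface_graph:
 * local vertices are 0..num_local-1, remote vertices follow them. */
static void build_edges(void *graph, add_edge_fn add_edge,
                        const struct reach_results *results)
{
    for (int x = 0; x < results->num_local; x++) {
        for (int y = 0; y < results->num_remote; y++) {
            int w = results->weights[x][y];
            if (0 == w) {
                continue;  /* unreachable pair: no edge, so never matched */
            }
            /* Negative cost: the min-cost solver then maximizes weight. */
            add_edge(graph, x, results->num_local + y, -w);
        }
    }
}
```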
In the netlink get_weights() I added a print for local_if and remote_if, and they both have the right IP, 10.100.3.17, but I did notice local_if->if_name is set to "ib0" while remote_if->if_name is an empty string. Maybe it's not needed, but it made me wonder whether it has all the fields it needs set up. opal_reachable_netlink_rt_lookup() is then returning 113. What can I print that would be useful there? After …
I think what's happening is that it's being told to look for a route using "ib0" for on-host traffic, but it's only seeing routes that use "lo". In the code we end up with the callback opal_reachable_netlink_rt_raw_parse_cb() being triggered with … So it probably only sees "lo" as routing on-host but is trying to find something using "ib0". In the past I've always thought … What do you think? If vader is out of the btl list and tcp is being used on-host, is it invalid to then use an interface list that leaves out "lo"? I tend to think "lo" is a loopback but not the only loopback, and it should have worked with just "ib0". Thoughts?
The remote_if->if_name being empty was intentional. I actually changed the documentation of the reachable function to specify this (since there was no prior specification).
This looks like no route has been found by netlink. I'm guessing it may return that when a non-loopback interface is trying to use loopback; I'm not sure if this is a bug or an intended feature. Try using the weighted reachability component instead of netlink and see if that succeeds without lo.
I'm not sure what the valid behavior should be; hopefully @bwbarrett or @jsquyres can chime in on this. From my perspective, vader should be used for loopback, but in the case where it isn't, if ib0 is capable of loopback, it should be deemed reachable.
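(For reference, selecting the component follows the usual MCA pattern, so something like `mpirun --mca reachable weighted ...` should force the weighted component in place of netlink; this is standard MCA selection syntax, not a command taken from this thread.)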
During the telecon meeting, we discussed the path forward here. I will change the reachability framework documentation to say that an interface is always shown as reachable from itself. Then I will change the netlink component to report reachability when two interfaces are identical.
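A minimal sketch of what that netlink change could look like, assuming opal_if_t's field names and two hypothetical helpers (MAX_WEIGHT and netlink_route_weight()); the actual patch may compare differently:

```c
#include <string.h>
#include "opal/util/if.h"   /* opal_if_t */

/* Hypothetical sketch: before asking netlink for a route, treat an
 * interface as always reachable from itself, per the updated framework
 * documentation.  MAX_WEIGHT and netlink_route_weight() are stand-ins,
 * not names from the real component. */
static int interface_weight(const opal_if_t *local_if, const opal_if_t *remote_if)
{
    if (local_if->af_family == remote_if->af_family &&
        0 == memcmp(&local_if->if_addr, &remote_if->if_addr,
                    sizeof(local_if->if_addr))) {
        /* Identical address: same interface, no route lookup needed. */
        return MAX_WEIGHT;
    }
    /* Otherwise fall back to the existing netlink route lookup. */
    return netlink_route_weight(local_if, remote_if);
}
```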
That sounds great to me. I agree with an earlier comment that vader is the logical thing to use for loopback, and on-host TCP isn't something we generally do; I just like having things like that working for debugging purposes.
Please address my two comments before merging.
opal/mca/btl/tcp/btl_tcp.c:

```diff
@@ -118,7 +114,7 @@ int mca_btl_tcp_add_procs( struct mca_btl_base_module_t* btl,
     }

     tcp_endpoint->endpoint_btl = tcp_btl;
-    rc = mca_btl_tcp_proc_insert(tcp_proc, tcp_endpoint);
+    rc = mca_btl_tcp_proc_insert(tcp_proc, tcp_endpoint, tcp_btl);
```
the BTL is already stored in the endpoint (look at the line above).
Okay, thanks, I'll change that.
```c
    OBJ_DESTRUCT(&tcp_proc->proc_lock);
}

static inline int mca_btl_tcp_proc_is_proc_left(opal_process_name_t a,
                                                opal_process_name_t b)
```
What would be the qualifier for being a `proc_left`?
proc_left is just there so that when two ranks, A and B, are trying to match interfaces, the graphs they create will be identical: if on rank A the bipartite flow goes from A -> B, then on rank B the bipartite flow goes from A -> B as well. Thus, A is the left proc. It has no significance other than allowing the two processes to create identical graphs.
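For illustration, a deterministic ordering with that property can be as simple as the sketch below (assuming opal_process_name_t's jobid/vpid fields; the PR's actual comparison may differ):

```c
#include <stdbool.h>
#include "opal/util/proc.h"   /* opal_process_name_t (jobid, vpid) */

/* Sketch of a strict total order on process names.  Both peers evaluate
 * this with the same two arguments, so they agree on which proc is
 * "left" and therefore construct identical bipartite graphs. */
static inline bool proc_is_left(opal_process_name_t a, opal_process_name_t b)
{
    if (a.jobid != b.jobid) {
        return a.jobid < b.jobid;
    }
    return a.vpid < b.vpid;
}
```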
(force-pushed from debc302 to e958f3c)
Mellanox failure seems unrelated.
Does anyone know how to retrigger Mellanox CI?
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Can someone look into why Mellanox CI is hitting a git checkout issue?
bot:retest
Hmmm. a …
The bot command hasn't been updated to use the new Mellanox Azure CI.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
I've asked Mellanox if they can support the `bot:retest` keyword.
Regarding triggering Mellanox CI with the old keywords: Azure Pipelines by default supports only its own trigger comments (e.g., `/azp run`). CC @amaslenn
@artemry-mlnx Any news on being able to support the Open MPI-usual `bot:retest` keywords?
I'm tracking both issues and will inform you of any progress.
Can we make a new issue to track this (rather than commenting on an old/merged PR)? FWIW: the git clone error "reference is not a tree" is likely because AZP is trying to clone before GitHub has made the merge hash yet. There's a flag in the JSON that they're supposed to be checking to see if GitHub has the merge hash ready; if it's not set, the clone will fail with this error. The Jenkins GitHub plugin had this problem for a while, too.
Previously we used a fairly simple algorithm in
mca_btl_tcp_proc_insert() to pair local and remote modules. This was a
point-in-time solution rather than a global optimization problem (where
global means all modules between two peers). The selection logic would
often fail due to pairing interfaces that are not routable for traffic.
The complexity of the selection logic was Θ(n^n), which was expensive.
Due to poor scalability, this logic was only used when the number of
interfaces was less than MAX_PERMUTATION_INTERFACES (default 8). More
details can be found in this ticket:
https://svn.open-mpi.org/trac/ompi/ticket/2031 (The complexity estimates
in the ticket do not match what I calculated from the function)
As a fallback, when interfaces surpassed this threshold, a brute force
O(n^2) double for loop was used to match interfaces.
This commit solves two problems. First, the point-in-time solution is
turned into a global optimization solution. Second, the reachability
framework was used to create a more realistic reachability map. We
switched from using IP/netmask to using the reachability framework,
which supports route lookup. This will help many corner cases as well as
utilize any future development of the reachability framework.
The solution implemented in this commit has a complexity mainly derived
from the bipartite assignment solver. If the local and remote peer both
have the same number of interfaces (n), the complexity of matching will
be O(n^5).
With the decrease in complexity to O(n^5), I calculated and tested
that initialization costs would be 5000 microseconds with 30 interfaces
per node (likely close to the maximum realistic number of interfaces we
will encounter). For additional datapoints, data for up to 300 interfaces
(a very unrealistic number) was simulated. Up until 150
interfaces, the matching costs will be less than 1 second, climbing to
10 seconds with 300 interfaces. Reflecting on these results, I removed
the suboptimal O(n^2) fallback logic, as it no longer seems necessary.
Data was gathered comparing the scaling of initialization costs with
ranks. For a low number of interfaces, the impact of initialization is
negligible. At an interface count of 7-8, the new code has slightly
faster initialization costs. At an interface count of 15, the new code
has slower initialization costs. However, all initialization costs
scale linearly with the number of ranks.
In order to use the reachable function, we populate local and remote
lists of interfaces. We then convert the interface matching problem
into a graph problem. We create a bipartite graph with the local and
remote interfaces as vertices and use negative reachability weights as
costs. Using the bipartite assignment solver, we generate the matches
for the graph. To ensure that both the local and remote process have
the same output, we ensure we mirror their respective inputs for the
graphs. Finally, we store the endpoint matches that we created earlier
in a hash table. This is stored with the btl_index as the key and a
struct mca_btl_tcp_addr_t* as the value. This is then retrieved during
insertion time to set the endpoint address.
Signed-off-by: William Zhang <wilzhang@amazon.com>
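As an illustration of that last store-and-retrieve step, here is a minimal sketch using OPAL's uint32-keyed hash table. The wrapper names (matched_endpoints, store_match, lookup_match) are invented for the example; only the opal_hash_table_* calls and the btl_index-to-mca_btl_tcp_addr_t* mapping come from the commit message:

```c
#include "opal/class/opal_hash_table.h"
#include "btl_tcp_addr.h"   /* struct mca_btl_tcp_addr_t */

/* Hypothetical wrappers for the pattern the commit describes:
 * key = btl_index, value = the matched remote address. */
static opal_hash_table_t matched_endpoints;

static void matches_init(void)
{
    OBJ_CONSTRUCT(&matched_endpoints, opal_hash_table_t);
    opal_hash_table_init(&matched_endpoints, 16);  /* small initial size */
}

static void store_match(uint32_t btl_index, struct mca_btl_tcp_addr_t *addr)
{
    /* Record the interface pairing chosen by the graph solver. */
    opal_hash_table_set_value_uint32(&matched_endpoints, btl_index, addr);
}

/* Called at mca_btl_tcp_proc_insert() time to set the endpoint address. */
static struct mca_btl_tcp_addr_t *lookup_match(uint32_t btl_index)
{
    void *value = NULL;
    if (OPAL_SUCCESS !=
        opal_hash_table_get_value_uint32(&matched_endpoints, btl_index, &value)) {
        return NULL;   /* no reachable pairing was found for this btl */
    }
    return (struct mca_btl_tcp_addr_t *) value;
}
```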