Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tcp btl: Fix multiple-link connection establishment. #4247

Merged
merged 1 commit into from
Mar 7, 2018

Conversation

jocherry
Copy link
Contributor

Fix case where the btl_tcp_links MCA parameter is used to create multiple TCP connections between peers.
Two issues were resulting in hangs during large message transfer:

  • The 2nd..btl_tcp_link connections were dropped during establishment because the per-process
    address check was binary, rather than a count
  • The accept handler would not skip a btl module that was already in use, resulting in all
    connections for a given address being vectored to a single btl

Signed-off-by: Jordan Cherry cherryj@amazon.com

@ompiteam-bot
Copy link

Can one of the admins verify this patch?

@jsquyres
Copy link
Member

ok to test

@jocherry
Copy link
Contributor Author

I have reproduced the issue reported by the mellanox build, which is caused by mutliple NICs.
I am working on a fix, will update with that fix.

@jocherry
Copy link
Contributor Author

bot:retest

@bwbarrett
Copy link
Member

bot:ompi:retest

@artpol84
Copy link
Contributor

bot:mellanox:retest

1 similar comment
@jocherry
Copy link
Contributor Author

bot:mellanox:retest

@artpol84
Copy link
Contributor

@bwbarrett did you received ifconfig info that I sent you some time ago? Do you need further assistance?

@jocherry
Copy link
Contributor Author

All bugs have been fixed, this is alright to be looked at now.

@@ -167,7 +167,8 @@ struct mca_btl_tcp_module_t {
#if 0
int tcp_ifindex; /**< BTL interface index */
#endif
struct sockaddr_storage tcp_ifaddr; /**< BTL interface address */
struct sockaddr_storage tcp_ifaddr; /**< First IPv4 address discovered for this interface, bound as sending address for this BTL */
struct sockaddr_storage tcp_ifaddr_6; /**< First IPv6 address discovered for this interface, bound as sending address for this BTL */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IPV6 code should be protected by OPAL_ENABLE_IPV6.

    Fix case where the btl_tcp_links MCA parameter is used to create multiple TCP connections between peers.
    Three issues were resulting in hangs during large message transfer:
      * The 2nd..btl_tcp_link connections were dropped during establishment because the per-process
        address check was binary, rather than a count
      * The accept handler would not skip a btl module that was already in use, resulting in all
        connections for a given address being vectored to a single btl
      * Multiple addresses in the same subnet caused connections to be
        stalled, as the receiver would always use the same (first) address
        found.  Binding the outgoing connection solves this issue
     *  Lastly fix race condition created by connections being started at the exact same time
        by accpeting connections not in the closed state, allowing endpoint_accept to resolve
        dispute

    Signed-off-by: Jordan Cherry <cherryj@amazon.com>
@jocherry
Copy link
Contributor Author

updated based on @bosilca feedback

@jocherry jocherry assigned jocherry and unassigned jocherry Mar 6, 2018
@jocherry jocherry requested review from bosilca and bwbarrett March 6, 2018 19:55
* can properly pair btl modules, even in cases where Linux
* might do something unexpected with routing */
opal_socklen_t sockaddr_addrlen = sizeof(struct sockaddr_storage);
if (endpoint_addr.ss_family == AF_INET) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit, but I would change this to have a struct sockaddr* temporary variable with assignment from either tcp_ifaddr or tcp_ifaddr6, to avoid duplicating the bind() and error handling code.

@jocherry jocherry merged commit 2f0e815 into open-mpi:master Mar 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants