tcp btl: Fix multiple-link connection establishment. #4247
Can one of the admins verify this patch?
ok to test
I have reproduced the issue reported by the Mellanox build, which is caused by multiple NICs.
bot:retest
24c6129
to
e20fc6c
Compare
bot:ompi:retest
bot:mellanox:retest
@bwbarrett did you receive the ifconfig info that I sent you some time ago? Do you need further assistance?
c2ab37a
to
840d7d5
Compare
840d7d5
to
2d60350
Compare
All bugs have been fixed; this is ready to be looked at now.
opal/mca/btl/tcp/btl_tcp.h
Outdated
@@ -167,7 +167,8 @@ struct mca_btl_tcp_module_t {
 #if 0
 int tcp_ifindex; /**< BTL interface index */
 #endif
-struct sockaddr_storage tcp_ifaddr; /**< BTL interface address */
+struct sockaddr_storage tcp_ifaddr; /**< First IPv4 address discovered for this interface, bound as sending address for this BTL */
+struct sockaddr_storage tcp_ifaddr_6; /**< First IPv6 address discovered for this interface, bound as sending address for this BTL */
IPV6 code should be protected by OPAL_ENABLE_IPV6.
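A minimal sketch of what such a guard could look like. The struct and helper below are hypothetical stand-ins rather than the real `mca_btl_tcp_module_t`, and the fallback definition of `OPAL_ENABLE_IPV6` is only an assumption made so the fragment compiles on its own (in a real build the macro would come from configure).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* Assumption: the build system normally defines OPAL_ENABLE_IPV6 as 0 or 1.
 * This fallback only keeps the sketch compilable in isolation. */
#ifndef OPAL_ENABLE_IPV6
#define OPAL_ENABLE_IPV6 0
#endif

/* Hypothetical cut-down module struct, not the real mca_btl_tcp_module_t. */
struct example_tcp_module {
    struct sockaddr_storage tcp_ifaddr;     /* first IPv4 address */
#if OPAL_ENABLE_IPV6
    struct sockaddr_storage tcp_ifaddr_6;   /* first IPv6 address */
#endif
};

/* Return the IPv6 bind address, or NULL when IPv6 support is compiled out,
 * so callers do not need their own preprocessor checks. */
static const struct sockaddr_storage *
example_ipv6_addr(const struct example_tcp_module *m)
{
    assert(m != NULL);
#if OPAL_ENABLE_IPV6
    return &m->tcp_ifaddr_6;
#else
    return NULL;
#endif
}
```

With the guard in place, builds without IPv6 support neither carry the extra field nor reference it.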
Fix case where the btl_tcp_links MCA parameter is used to create multiple TCP connections between peers. Three issues were resulting in hangs during large message transfer:
* The 2nd..btl_tcp_link connections were dropped during establishment because the per-process address check was binary, rather than a count
* The accept handler would not skip a btl module that was already in use, resulting in all connections for a given address being vectored to a single btl
* Multiple addresses in the same subnet caused connections to be stalled, as the receiver would always use the same (first) address found. Binding the outgoing connection solves this issue
* Lastly, fix race condition created by connections being started at the exact same time by accepting connections not in the closed state, allowing endpoint_accept to resolve the dispute
Signed-off-by: Jordan Cherry <cherryj@amazon.com>
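The first issue in the commit message above (a binary per-process address check instead of a count) can be illustrated with a small sketch. All names here are hypothetical and do not come from the Open MPI source; `max_links` stands in for the value of the btl_tcp_links parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-peer-address bookkeeping, not the Open MPI structure. */
struct example_peer_addr {
    unsigned int links_in_use;   /* connections already established */
};

/* Binary check: once one connection exists, every further one is refused.
 * This is the behavior that drops the 2nd and later links. */
static bool example_accept_binary(const struct example_peer_addr *a)
{
    return a->links_in_use == 0;
}

/* Counted check: accept until max_links connections exist for this address. */
static bool example_accept_counted(struct example_peer_addr *a,
                                   unsigned int max_links)
{
    assert(a != NULL);
    if (a->links_in_use >= max_links) {
        return false;
    }
    a->links_in_use++;
    return true;
}
```

With `max_links` set to 2, the counted check admits two connections and refuses the third, whereas the binary check would already refuse the second.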
2d60350
to
d7e7e3a
Compare
updated based on @bosilca feedback
* can properly pair btl modules, even in cases where Linux
* might do something unexpected with routing */
opal_socklen_t sockaddr_addrlen = sizeof(struct sockaddr_storage);
if (endpoint_addr.ss_family == AF_INET) {
Minor nit, but I would change this to have a struct sockaddr* temporary variable with assignment from either tcp_ifaddr or tcp_ifaddr6, to avoid duplicating the bind() and error handling code.
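The suggestion could be sketched roughly as follows. The struct and function names are hypothetical, the field names follow the diff to btl_tcp.h (`tcp_ifaddr`, `tcp_ifaddr_6`), and the use of family-specific address lengths is an assumption of this sketch rather than what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical stand-in for the two address fields of the btl module. */
struct example_btl {
    struct sockaddr_storage tcp_ifaddr;     /* first IPv4 address */
    struct sockaddr_storage tcp_ifaddr_6;   /* first IPv6 address */
};

/* Pick the local address matching the peer's address family.
 * Returns NULL (and sets *len to 0) for an unsupported family. */
static const struct sockaddr *
example_select_bind_addr(const struct example_btl *btl, int peer_family,
                         socklen_t *len)
{
    assert(btl != NULL && len != NULL);
    switch (peer_family) {
    case AF_INET:
        *len = sizeof(struct sockaddr_in);
        return (const struct sockaddr *) &btl->tcp_ifaddr;
    case AF_INET6:
        *len = sizeof(struct sockaddr_in6);
        return (const struct sockaddr *) &btl->tcp_ifaddr_6;
    default:
        *len = 0;
        return NULL;
    }
}

/* One bind() call and one error path, regardless of address family.
 * Returns 0 on success and -1 on failure. */
static int example_bind_outgoing(int sd, const struct example_btl *btl,
                                 int peer_family)
{
    socklen_t len;
    const struct sockaddr *addr = example_select_bind_addr(btl, peer_family, &len);
    if (NULL == addr) {
        return -1;
    }
    return bind(sd, addr, len);
}
```

Selecting the address first keeps the family-specific logic in one place, so any error reporting after bind() only has to be written once.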
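Fix case where the btl_tcp_links MCA parameter is used to create multiple TCP connections between peers.
Two issues were resulting in hangs during large message transfer:
* The 2nd..btl_tcp_link connections were dropped during establishment because the per-process address check was binary, rather than a count
* The accept handler would not skip a btl module that was already in use, resulting in all connections for a given address being vectored to a single btl
Signed-off-by: Jordan Cherry <cherryj@amazon.com>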
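The accept-handler issue described above (all connections for a given address landing on a single btl) can be illustrated with a short sketch. The types and names are hypothetical and greatly simplified compared with the real accept path; the point is only the skip of modules that are already in use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical module record, not the Open MPI structure. */
struct example_module {
    int  addr_id;   /* identifies the local address this module is bound to */
    bool in_use;    /* already paired with a connection to this peer */
};

/* Claim the first free module bound to addr_id.
 * Returns its index, or -1 if every matching module is already taken. */
static int example_claim_module(struct example_module *mods, size_t n,
                                int addr_id)
{
    assert(mods != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (mods[i].addr_id != addr_id) {
            continue;
        }
        if (mods[i].in_use) {
            continue;   /* skip a module that is already in use */
        }
        mods[i].in_use = true;
        return (int) i;
    }
    return -1;
}
```

Without the `in_use` skip, every incoming connection for the same address would be handed to the first matching module.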