Connection refused (src/tcp_address.cpp:172) #683
Comments
Was able to get the same crash with rc1/rc3 disabled at 4096 nodes
Same message catalog backtrace
Is the corefile definitely not truncated? Would it be easy to try running against a different version of libzmq, say libzmq master? |
Definitely not truncated. It was the first time, then I set I wasn't sure if that zmq connection refused message was cause or effect. The SIGABRT does make me wonder if one of the assertions in tcp_address.cpp is the cause though. |
Yeah, I couldn't find the file and line called out in the assertion, which is what made me wonder what version we are running against on opal. |
I reproduced this on opal with current master 20f1f2d at 2048 nodes (my source tree was out of date before). Opal has zeromq-4.1.4-2.ch6.x86_64. Tracking down the right src/tcp_address.cpp, it would appear this is an assertion failure:
const int rc = getifaddrs (&ifa);
errno_assert (rc == 0);
ECONNREFUSED seems like a strange error to get from
That code was touched in upstream src/tcp_address.cpp, but it's still an assertion in our case:
// Get the addresses.
ifaddrs *ifa = NULL;
const int rc = getifaddrs (&ifa);
if (rc != 0 && errno == EINVAL) {
    // Windows Subsystem for Linux compatibility
    LIBZMQ_UNUSED (nic_);
    LIBZMQ_UNUSED (ipv6_);
    errno = ENODEV;
    return -1;
}
errno_assert (rc == 0);
getifaddrs talks on the netlink(7) socket. I wrote a little proggie that calls it and it does this:
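For anyone who wants to repeat the experiment, a test program along these lines is only a few lines long. This is a minimal sketch, not the program referred to above: the function name `count_interfaces` is made up for illustration, and it simply calls `getifaddrs` once, reports any error, and counts the entries returned.

```cpp
#include <ifaddrs.h>

#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper (not from libzmq): calls getifaddrs() once.
// Returns the number of entries in the list, or -1 with errno set on failure.
int count_interfaces ()
{
    ifaddrs *ifa = NULL;
    if (getifaddrs (&ifa) != 0) {
        const int saved = errno; // fprintf may clobber errno
        std::fprintf (stderr, "getifaddrs: %s\n", std::strerror (saved));
        errno = saved;
        return -1;
    }
    int n = 0;
    for (const ifaddrs *p = ifa; p != NULL; p = p->ifa_next)
        n++;
    freeifaddrs (ifa);
    return n;
}
```

Calling `count_interfaces ()` from a small `main` and running it under `strace` should show the netlink socket traffic that `getifaddrs` generates.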
Ah, would it be the |
Seeing this same issue today on opal at 4096 brokers, but I seem to get a valid corefile this time. I don't think it gives us any new information, but this bug in zmq does impact scaling up the number of brokers per node (or seems to), so we might want to see if there is a simple fix or workaround.
Just opened zeromq/libzmq#2051 on this. |
This has been fixed upstream and also backported to the zeromq4-1 repo. I built TOSS3 packages with this fix, which trent has picked up for the next release (zeromq-4.1.5-2). The fix is a backoff-retry algorithm that makes up to 10 tries. Closing for now - we can reopen if we see this again.
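To illustrate the idea of a backoff-retry with up to 10 tries, here is a rough sketch. It is not the upstream patch: the helper name `retry_on_econnrefused`, the 1 ms starting delay, and the doubling schedule are all assumptions chosen for the example. Only the 10-try limit comes from the description above.

```cpp
#include <cerrno>
#include <chrono>
#include <thread>

// Hypothetical helper (not the libzmq fix): calls fn () until it returns 0,
// fails with an errno other than ECONNREFUSED, or max_tries is reached.
// The delay between attempts doubles each time (assumed schedule).
template <typename Fn>
int retry_on_econnrefused (
  Fn fn,
  int max_tries = 10,
  std::chrono::milliseconds delay = std::chrono::milliseconds (1))
{
    int rc = -1;
    for (int attempt = 0; attempt < max_tries; attempt++) {
        rc = fn ();
        if (rc == 0 || errno != ECONNREFUSED)
            break; // success, or an error that retrying will not help
        if (attempt + 1 < max_tries) {
            std::this_thread::sleep_for (delay);
            delay *= 2;
        }
    }
    return rc; // on failure, errno is still the one set by the last fn () call
}
```

With something like this, the call site could become `retry_on_econnrefused ([&] { return getifaddrs (&ifa); })`, leaving the existing `errno_assert (rc == 0)` to catch only persistent failures.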
Scale testing on opal, two ranks failed at 2048 size
cores from both nodes have this useless backtrace: