Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection refused (src/tcp_address.cpp:172) #683

Closed
garlick opened this issue May 25, 2016 · 11 comments
Closed

Connection refused (src/tcp_address.cpp:172) #683

garlick opened this issue May 25, 2016 · 11 comments

Comments

@garlick
Copy link
Member

garlick commented May 25, 2016

Scale testing on opal, two ranks failed at 2048 size

$ time srun -N128 -n2048 ./flux broker flux getattr size
Connection refused (src/tcp_address.cpp:172)
Connection refused (src/tcp_address.cpp:172)
srun: error: opal177: task 1935: Aborted (core dumped)
srun: error: opal178: task 1939: Aborted (core dumped)
srun: First task exited 30s ago
srun: tasks 0-1934,1936-1938,1940-2047: running
srun: tasks 1935,1939: exited abnormally
srun: Terminating job step 1011406.0

cores from both nodes have this useless backtrace:

$ gdb /g/g0/garlick/flux-core/build/src/broker/.libs/lt-flux-broker   opal178-lt-flux-broker-168609.core
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /g/g0/garlick/flux-core/build/src/broker/.libs/lt-flux-broker...done.
[New LWP 168609]
[New LWP 169043]
[New LWP 169042]
[New LWP 169047]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/g/g0/garlick/flux-core/build/src/broker/.libs/lt-flux-broker flux getattr size'.
Program terminated with signal 6, Aborted.
#0  0x00002aaaac4795f7 in __GI___open_catalog (
    cat_name=<error reading variable: Cannot access memory at address 0xffffffffffffff30>, 
    cat_name@entry=<error reading variable: Cannot access memory at address 0x8>, nlspath=<error reading variable: Cannot access memory at address 0x8>, 
    env_var=<error reading variable: Cannot access memory at address 0xffffffffffffff38>, 
    env_var@entry=<error reading variable: Cannot access memory at address 0x8>, 
    catalog=<error reading variable: Cannot access memory at address 0xffffffffffffff18>, 
    catalog@entry=<error reading variable: Cannot access memory at address 0x8>) at open_catalog.c:267
Cannot access memory at address 0x8
@garlick
Copy link
Member Author

garlick commented May 25, 2016

Was able to get the same crash with rc1/rc3 disabled at 4096 nodes

$ export FLUX_RC1_PATH=""
$ export FLUX_RC3_PATH=""
$ time srun -N128 -n4096  ./flux broker flux getattr size
Connection refused (src/tcp_address.cpp:172)
srun: error: opal110: task 1716: Aborted (core dumped)
srun: First task exited 30s ago
srun: tasks 0-1715,1717-4095: running
srun: task 1716: exited abnormally
srun: Terminating job step 1011439.0

Same message catalog backtrace

$ gdb /g/g0/garlick/flux-core/build/src/broker/.libs/lt-flux-broker   opal110-lt-flux-broker-181334.core
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /g/g0/garlick/flux-core/build/src/broker/.libs/lt-flux-broker...done.
[New LWP 181334]
[New LWP 182198]
[New LWP 182200]
[New LWP 182199]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/g/g0/garlick/flux-core/build/src/broker/.libs/lt-flux-broker flux getattr size'.
Program terminated with signal 6, Aborted.
#0  0x00002aaaac4795f7 in __GI___open_catalog (
    cat_name=<error reading variable: Cannot access memory at address 0xffffffffffffff30>, 
    cat_name@entry=<error reading variable: Cannot access memory at address 0x8>, nlspath=<error reading variable: Cannot access memory at address 0x8>, 
    env_var=<error reading variable: Cannot access memory at address 0xffffffffffffff38>, 
    env_var@entry=<error reading variable: Cannot access memory at address 0x8>, 
    catalog=<error reading variable: Cannot access memory at address 0xffffffffffffff18>, 
    catalog@entry=<error reading variable: Cannot access memory at address 0x8>) at open_catalog.c:267
267     __munmap ((void *) catalog->file_ptr, catalog->file_size);

@grondo
Copy link
Contributor

grondo commented May 26, 2016

Is the corefile definitely not truncated?

Would it be easy to try running against a different version of libzmq, say libzmq master?

@garlick
Copy link
Member Author

garlick commented May 26, 2016

Definitely not truncated. It was the first time, then I set ulimit -c unlimited and got this.

I wasn't sure if that zmq connection refused message was cause or effect. The SIGABRT does make me wonder if one of the assertions in tcp_address.cpp is the cause though.

@grondo
Copy link
Contributor

grondo commented May 26, 2016

Yeah, I couldn't find the file and line called out in the assertion, which is what made me wonder what version we are running against on opal.

@garlick
Copy link
Member Author

garlick commented May 27, 2016

I reproduced this on opal with current master 20f1f2d at 2048 nodes (my source tree was out of date before).

Opal has zeromq-4.1.4-2.ch6.x86_64. Tracking down the right src/tcp_address.cpp, it would appear this is an assertion failure:

    const int rc = getifaddrs (&ifa);
    errno_assert (rc == 0);

ECONNREFUSED seems like a strange error to get from getifaddrs(3) although

ERRORS
getifaddrs() may fail and set errno for any of the errors specified for
socket(2), bind(2), getsockname(2), recvmsg(2), sendto(2), malloc(3),
or realloc(3).

@garlick
Copy link
Member Author

garlick commented May 27, 2016

That code was touched in upstream src/tcp_address.cpp but it's still an assertion in our case:

    //  Get the addresses.
    ifaddrs *ifa = NULL;
    const int rc = getifaddrs (&ifa);
    if (rc != 0 && errno == EINVAL) {
        // Windows Subsystem for Linux compatibility
        LIBZMQ_UNUSED (nic_);
        LIBZMQ_UNUSED (ipv6_);

        errno = ENODEV;
        return -1;
    }
    errno_assert (rc == 0);

@garlick
Copy link
Member Author

garlick commented May 27, 2016

getifaddrs talks on the netlink(7) socket. I wrote a little proggie that calls it and it does this:

socket(PF_NETLINK, SOCK_RAW, 0)         = 3
bind(3, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
getsockname(3, {sa_family=AF_NETLINK, pid=70999, groups=00000000}, [12]) = 0
sendto(3, "\24\0\0\0\22\0\1\3\217\275HW\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"`\4\0\0\20\0\2\0\217\275HWW\25\1\0\0\0\4\3\1\0\0\0I\0\1\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 3356
brk(0)                                  = 0x602000
brk(0x623000)                           = 0x623000
brk(0)                                  = 0x623000
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\200\4\0\0\20\0\2\0\217\275HWW\25\1\0\0\0\1\0\4\0\0\0\2\20\0\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 3456
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\217\275HWW\25\1\0\0\0\0\0", 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
sendto(3, "\24\0\0\0\26\0\1\3\220\275HW\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"L\0\0\0\24\0\2\0\220\275HWW\25\1\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 336
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\220\275HWW\25\1\0\0\0\0\0", 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
close(3)                                = 0

@garlick garlick changed the title SIGABRT on opal size=2048 job Connection refused (src/tcp_address.cpp:172) May 27, 2016
@grondo
Copy link
Contributor

grondo commented May 27, 2016

Ah, would it be the sendto that gets Connection refused?
Where does this code get called in zmq? Only 16 brokers are started per node, but is every thread also trying to call getifaddrs at the same time?

@grondo
Copy link
Contributor

grondo commented Jul 1, 2016

Seeing this same issue today on opal at 4096 brokers, but I seem to get a valid corefile this time. I don't think it gives us any new information, but this bug in zmq does impact scaling up the number of brokers per node (or seems to), so we might want to see if there is a simple fix or workaround.
(I didn't put any time into that yet)

gdb) where
#0  0x00002aaaac47b5f7 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00002aaaac47cce8 in __GI_abort () at abort.c:90
#2  0x00002aaaab8cc759 in zmq::zmq_abort(char const*) ()
   from /lib64/libzmq.so.5
#3  0x00002aaaab8fc3bd in zmq::tcp_address_t::resolve_nic_name(char const*, bool, bool) () from /lib64/libzmq.so.5
#4  0x00002aaaab8fc52e in zmq::tcp_address_t::resolve_interface(char const*, bool, bool) () from /lib64/libzmq.so.5
#5  0x00002aaaab8fcc45 in zmq::tcp_address_t::resolve(char const*, bool, bool, bool) () from /lib64/libzmq.so.5
#6  0x00002aaaab9000fe in zmq::tcp_listener_t::set_address(char const*) ()
   from /lib64/libzmq.so.5
#7  0x00002aaaab8f2740 in zmq::socket_base_t::bind(char const*) ()
   from /lib64/libzmq.so.5
#8  0x00002aaaab64f1f6 in zsocket_bind () from /lib64/libczmq.so.3
#9  0x000000000040f3ed in bind_child (ep=0x64c190, ov=0x635260)
    at overlay.c:484
#10 overlay_bind (ov=0x635260) at overlay.c:614
#11 0x000000000040a151 in boot_pmi (ctx=0x7fffffffcf50) at broker.c:1208
#12 0x0000000000407479 in main (argc=<optimized out>, argv=<optimized out>)

@garlick
Copy link
Member Author

garlick commented Jul 1, 2016

Just opened zeromq/libzmq#2051 on this.

@garlick
Copy link
Member Author

garlick commented Aug 1, 2016

This has been fixed upstream and also backported to the zeromq4-1 repo. I built TOSS3 packages with this fix that trent has picked up for the next release (zeromq-4.1.5-2).

The fix is a backoff-retry algorithm, up to 10 tries.

Closing for now - we can reopen if we see this again.

@garlick garlick closed this as completed Aug 1, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants