Connection refused (src/tcp_address.cpp:172) #683

garlick · 2016-05-25T23:21:52Z

Scale testing on opal, two ranks failed at 2048 size

$ time srun -N128 -n2048 ./flux broker flux getattr size
Connection refused (src/tcp_address.cpp:172)
Connection refused (src/tcp_address.cpp:172)
srun: error: opal177: task 1935: Aborted (core dumped)
srun: error: opal178: task 1939: Aborted (core dumped)
srun: First task exited 30s ago
srun: tasks 0-1934,1936-1938,1940-2047: running
srun: tasks 1935,1939: exited abnormally
srun: Terminating job step 1011406.0

cores from both nodes have this useless backtrace:

$ gdb /g/g0/garlick/flux-core/build/src/broker/.libs/lt-flux-broker   opal178-lt-flux-broker-168609.core
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /g/g0/garlick/flux-core/build/src/broker/.libs/lt-flux-broker...done.
[New LWP 168609]
[New LWP 169043]
[New LWP 169042]
[New LWP 169047]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/g/g0/garlick/flux-core/build/src/broker/.libs/lt-flux-broker flux getattr size'.
Program terminated with signal 6, Aborted.
#0  0x00002aaaac4795f7 in __GI___open_catalog (
    cat_name=<error reading variable: Cannot access memory at address 0xffffffffffffff30>, 
    cat_name@entry=<error reading variable: Cannot access memory at address 0x8>, nlspath=<error reading variable: Cannot access memory at address 0x8>, 
    env_var=<error reading variable: Cannot access memory at address 0xffffffffffffff38>, 
    env_var@entry=<error reading variable: Cannot access memory at address 0x8>, 
    catalog=<error reading variable: Cannot access memory at address 0xffffffffffffff18>, 
    catalog@entry=<error reading variable: Cannot access memory at address 0x8>) at open_catalog.c:267
Cannot access memory at address 0x8

The text was updated successfully, but these errors were encountered:

garlick · 2016-05-25T23:46:05Z

Was able to get the same crash with rc1/rc3 disabled at 4096 nodes

$ export FLUX_RC1_PATH=""
$ export FLUX_RC3_PATH=""
$ time srun -N128 -n4096  ./flux broker flux getattr size
Connection refused (src/tcp_address.cpp:172)
srun: error: opal110: task 1716: Aborted (core dumped)
srun: First task exited 30s ago
srun: tasks 0-1715,1717-4095: running
srun: task 1716: exited abnormally
srun: Terminating job step 1011439.0

Same message catalog backtrace

$ gdb /g/g0/garlick/flux-core/build/src/broker/.libs/lt-flux-broker   opal110-lt-flux-broker-181334.core
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /g/g0/garlick/flux-core/build/src/broker/.libs/lt-flux-broker...done.
[New LWP 181334]
[New LWP 182198]
[New LWP 182200]
[New LWP 182199]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/g/g0/garlick/flux-core/build/src/broker/.libs/lt-flux-broker flux getattr size'.
Program terminated with signal 6, Aborted.
#0  0x00002aaaac4795f7 in __GI___open_catalog (
    cat_name=<error reading variable: Cannot access memory at address 0xffffffffffffff30>, 
    cat_name@entry=<error reading variable: Cannot access memory at address 0x8>, nlspath=<error reading variable: Cannot access memory at address 0x8>, 
    env_var=<error reading variable: Cannot access memory at address 0xffffffffffffff38>, 
    env_var@entry=<error reading variable: Cannot access memory at address 0x8>, 
    catalog=<error reading variable: Cannot access memory at address 0xffffffffffffff18>, 
    catalog@entry=<error reading variable: Cannot access memory at address 0x8>) at open_catalog.c:267
267     __munmap ((void *) catalog->file_ptr, catalog->file_size);

grondo · 2016-05-26T01:54:56Z

Is the corefile definitely not truncated?

Would it be easy to try running against a different version of libzmq, say libzmq master?

garlick · 2016-05-26T04:27:06Z

Definitely not truncated. It was the first time, then I set ulimit -c unlimited and got this.

I wasn't sure if that zmq connection refused message was cause or effect. The SIGABRT does make me wonder if one of the assertions in tcp_address.cpp is the cause though.

grondo · 2016-05-26T05:13:33Z

Yeah, I couldn't find the file and line called out in the assertion, which is what made me wonder what version we are running against on opal.

garlick · 2016-05-27T21:17:58Z

I reproduced this on opal with current master 20f1f2d at 2048 nodes (my source tree was out of date before).

Opal has zeromq-4.1.4-2.ch6.x86_64. Tracking down the right src/tcp_address.cpp, it would appear this is an assertion failure:

    const int rc = getifaddrs (&ifa);
    errno_assert (rc == 0);

ECONNREFUSED seems like a strange error to get from getifaddrs(3) although

ERRORS
getifaddrs() may fail and set errno for any of the errors specified for
socket(2), bind(2), getsockname(2), recvmsg(2), sendto(2), malloc(3),
or realloc(3).

garlick · 2016-05-27T21:33:12Z

That code was touched in upstream src/tcp_address.cpp but it's still an assertion in our case:

    //  Get the addresses.
    ifaddrs *ifa = NULL;
    const int rc = getifaddrs (&ifa);
    if (rc != 0 && errno == EINVAL) {
        // Windows Subsystem for Linux compatibility
        LIBZMQ_UNUSED (nic_);
        LIBZMQ_UNUSED (ipv6_);

        errno = ENODEV;
        return -1;
    }
    errno_assert (rc == 0);

garlick · 2016-05-27T21:37:13Z

getifaddrs talks on the netlink(7) socket. I wrote a little proggie that calls it and it does this:

socket(PF_NETLINK, SOCK_RAW, 0)         = 3
bind(3, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
getsockname(3, {sa_family=AF_NETLINK, pid=70999, groups=00000000}, [12]) = 0
sendto(3, "\24\0\0\0\22\0\1\3\217\275HW\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"`\4\0\0\20\0\2\0\217\275HWW\25\1\0\0\0\4\3\1\0\0\0I\0\1\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 3356
brk(0)                                  = 0x602000
brk(0x623000)                           = 0x623000
brk(0)                                  = 0x623000
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\200\4\0\0\20\0\2\0\217\275HWW\25\1\0\0\0\1\0\4\0\0\0\2\20\0\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 3456
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\217\275HWW\25\1\0\0\0\0\0", 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
sendto(3, "\24\0\0\0\26\0\1\3\220\275HW\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"L\0\0\0\24\0\2\0\220\275HWW\25\1\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 336
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\220\275HWW\25\1\0\0\0\0\0", 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
close(3)                                = 0

grondo · 2016-05-27T21:45:21Z

Ah, would it be the sendto that gets Connection refused?
Where does this code get called in zmq? Only 16 brokers are started per node, but is every thread also trying to call getifaddrs at the same time?

grondo · 2016-07-01T21:12:46Z

Seeing this same issue today on opal at 4096 brokers, but I seem to get a valid corefile this time. I don't think it gives us any new information, but this bug in zmq does impact scaling up the number of brokers per node (or seems to), so we might want to see if there is a simple fix or workaround.
(I didn't put any time into that yet)

gdb) where
#0  0x00002aaaac47b5f7 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00002aaaac47cce8 in __GI_abort () at abort.c:90
#2  0x00002aaaab8cc759 in zmq::zmq_abort(char const*) ()
   from /lib64/libzmq.so.5
#3  0x00002aaaab8fc3bd in zmq::tcp_address_t::resolve_nic_name(char const*, bool, bool) () from /lib64/libzmq.so.5
#4  0x00002aaaab8fc52e in zmq::tcp_address_t::resolve_interface(char const*, bool, bool) () from /lib64/libzmq.so.5
#5  0x00002aaaab8fcc45 in zmq::tcp_address_t::resolve(char const*, bool, bool, bool) () from /lib64/libzmq.so.5
#6  0x00002aaaab9000fe in zmq::tcp_listener_t::set_address(char const*) ()
   from /lib64/libzmq.so.5
#7  0x00002aaaab8f2740 in zmq::socket_base_t::bind(char const*) ()
   from /lib64/libzmq.so.5
#8  0x00002aaaab64f1f6 in zsocket_bind () from /lib64/libczmq.so.3
#9  0x000000000040f3ed in bind_child (ep=0x64c190, ov=0x635260)
    at overlay.c:484
#10 overlay_bind (ov=0x635260) at overlay.c:614
#11 0x000000000040a151 in boot_pmi (ctx=0x7fffffffcf50) at broker.c:1208
#12 0x0000000000407479 in main (argc=<optimized out>, argv=<optimized out>)

garlick · 2016-07-01T22:45:23Z

Just opened zeromq/libzmq#2051 on this.

garlick · 2016-08-01T18:42:12Z

This has been fixed upstream and also backported to the zeromq4-1 repo. I built TOSS3 packages with this fix that trent has picked up for the next release (zeromq-4.1.5-2).

The fix is a backoff-retry algorithm, up to 10 tries.

Closing for now - we can reopen if we see this again.

garlick mentioned this issue May 25, 2016

CTS-1 launch flux on full system flux-framework/distribution#13

Open

garlick changed the title ~~SIGABRT on opal size=2048 job~~ Connection refused (src/tcp_address.cpp:172) May 27, 2016

garlick mentioned this issue Jun 8, 2016

use TBON not ring network for RPCs #689

Merged

garlick closed this as completed Aug 1, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Connection refused (src/tcp_address.cpp:172) #683

Connection refused (src/tcp_address.cpp:172) #683

garlick commented May 25, 2016 •

edited

Loading

garlick commented May 25, 2016

grondo commented May 26, 2016

garlick commented May 26, 2016

grondo commented May 26, 2016

garlick commented May 27, 2016

garlick commented May 27, 2016 •

edited

Loading

garlick commented May 27, 2016

grondo commented May 27, 2016

grondo commented Jul 1, 2016

garlick commented Jul 1, 2016

garlick commented Aug 1, 2016

Connection refused (src/tcp_address.cpp:172) #683

Connection refused (src/tcp_address.cpp:172) #683

Comments

garlick commented May 25, 2016 • edited Loading

garlick commented May 25, 2016

grondo commented May 26, 2016

garlick commented May 26, 2016

grondo commented May 26, 2016

garlick commented May 27, 2016

garlick commented May 27, 2016 • edited Loading

garlick commented May 27, 2016

grondo commented May 27, 2016

grondo commented Jul 1, 2016

garlick commented Jul 1, 2016

garlick commented Aug 1, 2016

garlick commented May 25, 2016 •

edited

Loading

garlick commented May 27, 2016 •

edited

Loading