avoid asserting on getifaddrs failure #2051

Closed
garlick opened this issue Jul 1, 2016 · 3 comments · Fixed by #2064

garlick commented Jul 1, 2016

We've hit the following assertion failure in zeromq 4.1.4 on a RHEL 7.2 system (kernel 3.10):

Connection refused (src/tcp_address.cpp:172)

Here's a backtrace from a core file:

(gdb) where
#0  0x00002aaaac47b5f7 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00002aaaac47cce8 in __GI_abort () at abort.c:90
#2  0x00002aaaab8cc759 in zmq::zmq_abort(char const*) ()
   from /lib64/libzmq.so.5
#3  0x00002aaaab8fc3bd in zmq::tcp_address_t::resolve_nic_name(char const*, bool, bool) () from /lib64/libzmq.so.5
#4  0x00002aaaab8fc52e in zmq::tcp_address_t::resolve_interface(char const*, bool, bool) () from /lib64/libzmq.so.5
#5  0x00002aaaab8fcc45 in zmq::tcp_address_t::resolve(char const*, bool, bool, bool) () from /lib64/libzmq.so.5
#6  0x00002aaaab9000fe in zmq::tcp_listener_t::set_address(char const*) ()
   from /lib64/libzmq.so.5
#7  0x00002aaaab8f2740 in zmq::socket_base_t::bind(char const*) ()
   from /lib64/libzmq.so.5
#8  0x00002aaaab64f1f6 in zsocket_bind () from /lib64/libczmq.so.3
#9  0x000000000040f3ed in bind_child (ep=0x64c190, ov=0x635260)
    at overlay.c:484
#10 overlay_bind (ov=0x635260) at overlay.c:614
#11 0x000000000040a151 in boot_pmi (ctx=0x7fffffffcf50) at broker.c:1208
#12 0x0000000000407479 in main (argc=<optimized out>, argv=<optimized out>)

I believe this corresponds to the following assertion in src/tcp_address.cpp (the excerpt is from master, not 4.1.4):

    //  Get the addresses.
    ifaddrs *ifa = NULL;
    const int rc = getifaddrs (&ifa);
    if (rc != 0 && errno == EINVAL) {
        // Windows Subsystem for Linux compatibility
        LIBZMQ_UNUSED (nic_);
        LIBZMQ_UNUSED (ipv6_);

        errno = ENODEV;
        return -1;
    }
    errno_assert (rc == 0);

Apparently getifaddrs can fail. Since it communicates with the kernel using the netlink socket, I suppose it might run out of something when abused. Although I wouldn't say we're abusing it - merely starting a dozen or so copies of the same zeromq based program at the same time.
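
To make the failure mode concrete, here is a minimal sketch of a caller that treats a `getifaddrs` failure as an ordinary error return instead of aborting. The helper name `list_interface_names` is hypothetical and not part of libzmq; it only illustrates checking the return code and leaving `errno` for the caller to inspect.

```cpp
#include <cerrno>
#include <ifaddrs.h>
#include <string>
#include <vector>

// Hypothetical helper (not libzmq code): collects interface names.
// Returns 0 on success, or -1 with errno left as set by getifaddrs.
inline int list_interface_names (std::vector<std::string> &names)
{
    ifaddrs *ifa = NULL;
    if (getifaddrs (&ifa) != 0)
        return -1; // e.g. ECONNREFUSED; the caller decides what to do

    // Walk the linked list; an interface may appear once per address family.
    for (const ifaddrs *ifp = ifa; ifp != NULL; ifp = ifp->ifa_next) {
        if (ifp->ifa_name != NULL)
            names.push_back (ifp->ifa_name);
    }

    freeifaddrs (ifa);
    return 0;
}
```

A caller could then log `strerror (errno)` and fail the bind, or retry, rather than taking down the whole process.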

Perhaps a backoff-retry would be appropriate here instead of an assertion?

bluca commented Jul 8, 2016

Sounds reasonable, as long as it doesn't block forever. Would you send a PR to fix it (master first, and then backport to 4.1 if necessary)?

garlick commented Jul 8, 2016

OK I will put something together. Thanks for the follow up.

bluca commented Jul 8, 2016

Thanks!

garlick added a commit to garlick/libzmq that referenced this issue Jul 20, 2016
getifaddrs() can fail transiently with ECONNREFUSED on Linux.
This has been observed with Linux 3.10 when multiple processes
call zmq::tcp_address_t::resolve_nic_name() simultaneously.

Before asserting in this case, make 10 attempts, with exponential
backoff, given by (1 msec * 2^i), where i is the attempt number.

Fixes zeromq#2051
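
The retry scheme described in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the referenced commit: the names `backoff_delay`, `retry_with_backoff` and `getifaddrs_with_retry` are hypothetical, the sketch retries only when `errno` is `ECONNREFUSED`, and it sleeps only between attempts (not after the last one) so that the final `errno` is not disturbed by the sleep call.

```cpp
#include <cerrno>
#include <chrono>
#include <functional>
#include <ifaddrs.h>
#include <thread>

// Delay before the next try after failed attempt i: 1 msec * 2^i.
inline std::chrono::milliseconds backoff_delay (int attempt)
{
    return std::chrono::milliseconds (1LL << attempt);
}

// Hypothetical helper: calls op() up to max_attempts times. Stops early on
// success (0) or on any error other than ECONNREFUSED. The sleep function is
// injectable so the logic can be exercised without real delays.
inline int retry_with_backoff (
  const std::function<int ()> &op,
  int max_attempts,
  const std::function<void (std::chrono::milliseconds)> &sleep_fn)
{
    int rc = -1;
    for (int i = 0; i < max_attempts; i++) {
        rc = op ();
        if (rc == 0 || errno != ECONNREFUSED)
            break;
        if (i + 1 < max_attempts)
            sleep_fn (backoff_delay (i));
    }
    return rc;
}

// Assumed usage: 10 attempts around getifaddrs, sleeping for real.
inline int getifaddrs_with_retry (ifaddrs **ifa)
{
    return retry_with_backoff (
      [ifa] () { return getifaddrs (ifa); }, 10,
      [] (std::chrono::milliseconds d) { std::this_thread::sleep_for (d); });
}
```

With 10 attempts this sketch sleeps at most 1 + 2 + ... + 256 = 511 msec in total before giving up, so it cannot block forever. The caller still sees the last return code and can assert or return an error as it sees fit.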