Random ZMQ 4.2.3 crash in libzmq.dll!zmq::mailbox_t::recv(...) #3104

Closed
mxcoppell opened this issue May 14, 2018 · 14 comments

Comments

@mxcoppell

Please use this template for reporting suspected bugs or requests for help.

Issue description

The crash happens randomly and we can't reproduce it, but our client keeps reporting it.

We use the Pub-Sub model for two-way asynchronous communication between the client and the server. We spent quite some time checking the application code for possible socket-handle race conditions but didn't find any.

Environment

  • libzmq version (commit hash if unreleased): 4.2.20.1 (ZMQ 4.2.3)
  • OS: Windows 10 32-bit

Minimal test code / Steps to reproduce the issue

N/A

What's the actual result? (include assertion message & call stack if applicable)

KERNELBASE.dll!_RaiseException@16() Unknown

libzmq.dll!zmq::zmq_abort(const char * errmsg_) Line 89 C++
libzmq.dll!zmq::signaler_t::recv_failable() Line 357 C++
libzmq.dll!zmq::mailbox_t::recv(zmq::command_t * cmd_, int timeout_) Line 89 C++
libzmq.dll!zmq::reaper_t::in_event() Line 90 C++
libzmq.dll!zmq::select_t::trigger_events(const std::vector<zmq::select_t::fd_entry_t,std::allocator<zmq::select_t::fd_entry_t> > & fd_entries_, const zmq::select_t::fds_set_t & local_fds_set_, int event_count_) Line 122 C++
libzmq.dll!zmq::select_t::select_family_entry(zmq::select_t::family_entry_t & family_entry_, int max_fd_, bool use_timeout_, timeval & tv_) Line 404 C++
libzmq.dll!zmq::select_t::loop() Line 360 C++
libzmq.dll!thread_routine(void * arg_) Line 46 C++
ucrtbase.dll!thread_start<unsigned int (__stdcall*)(void *)>() Unknown
kernel32.dll!@BaseThreadInitThunk@12() Unknown
ntdll.dll!__RtlUserThreadStart() Unknown
ntdll.dll!__RtlUserThreadStart@8() Unknown

What's the expected result?

In our testing environment we drive the system with very high traffic and it works as designed. We apparently haven't found the way to trigger the crash, which is the most frustrating part.

@bluca
Member

bluca commented May 14, 2018

Are you using a socket from multiple threads?

@bluca
Member

bluca commented May 14, 2018

or messages

@mxcoppell
Author

@bluca Thanks! We are in the process of reviewing the implementation. Will report back soon.

@mxcoppell
Author

Should a ZMQ socket be created only by the thread that calls into it?

@bluca
Member

bluca commented May 16, 2018

Yes, only the context is thread-safe, as the docs say (along with the new draft socket types that are explicitly marked thread-safe).
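
One way to check the one-thread-per-socket rule at run time, rather than by code review alone, is to wrap each socket handle in a small guard that remembers which thread created it. This is only a sketch: `ThreadBound` and `on_owner_thread` are hypothetical names made up for illustration, not part of libzmq, and the guard only detects misuse in debug builds (the `assert` compiles away under `NDEBUG`).

```cpp
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical debugging aid: holds a value (for example a socket handle)
// and records the id of the thread that constructed it.
template <typename T>
class ThreadBound {
public:
    explicit ThreadBound(T value)
        : value_(std::move(value)), owner_(std::this_thread::get_id()) {}

    // True only when called from the thread that constructed this object.
    bool on_owner_thread() const {
        return std::this_thread::get_id() == owner_;
    }

    // Access the wrapped value; trips an assertion on cross-thread use.
    T &get() {
        assert(on_owner_thread() && "value used from a thread that did not create it");
        return value_;
    }

private:
    T value_;
    std::thread::id owner_;
};
```

Since `zmq_socket()` returns a `void *`, an application could hold its sockets as `ThreadBound<void *>` and pass `get()` to every libzmq call, so an accidental cross-thread use fails in application code at the point of misuse instead of much later inside the reaper thread.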

@mxcoppell
Author

mxcoppell commented Jul 17, 2018

We've completed the code review, and all ZMQ sockets are created and used in the calling thread.
The problem is still happening. We ran a dedicated high-volume client-server traffic test tool on the target machine, and the crash also occurs at the following location inside recv_failable():

    zmq_assert (nbytes == sizeof (dummy));
    zmq_assert (dummy == 0);
#endif
    return 0;
}

#ifdef HAVE_FORK
void zmq::signaler_t::forked ()

In the application code, our try-catch block was not able to catch this. It crashed within ZMQ code.
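
The stack traces in this thread show the failure going through zmq::zmq_abort into RaiseException on Windows and into abort() on FreeBSD. Neither of those is a C++ exception, which would explain why a try-catch block never sees it. A minimal POSIX sketch of that behaviour, where `hypothetical_internal_abort` is a made-up stand-in for a failed internal assertion and not a libzmq function:

```cpp
#include <cassert>
#include <csignal>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Stand-in for a failed internal assertion: terminates via abort().
[[noreturn]] void hypothetical_internal_abort() { std::abort(); }

// Forks a child that calls the aborting function inside try/catch.
// Returns true if the child was still killed by SIGABRT, i.e. the
// catch block never got a chance to run.
bool abort_escapes_try_catch() {
    pid_t pid = fork();
    if (pid < 0)
        return false; // fork failed, nothing demonstrated
    if (pid == 0) {
        try {
            hypothetical_internal_abort();
        } catch (...) {
            _exit(0); // reached only if the abort were catchable
        }
        _exit(1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return false;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
}
```

On POSIX systems a SIGABRT handler can at least log before the process dies, but it cannot resume the aborted thread in a usable state.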

@sylvainduchesne

We have a similar problem, and it seems to be related to the OS (we are using Windows) closing sockets when under high load.
I can reproduce the problem with this tool, https://www.nirsoft.net/utils/cports.html, by forcibly closing the zmq-related TCP connections of my process.
While we don't have concrete evidence for the cause of these crashes, the high load on the OS and the identical crash when reproducing it this way lead me to believe that this may be the cause.

It is also possible to get a crash in the reconnection loop, where zmq tries to send the hiccup command but fails in the send:

    unsigned char dummy = 0;
    int nbytes = ::send (w, (char*) &dummy, sizeof (dummy), 0);
    wsa_assert (nbytes != SOCKET_ERROR);
    zmq_assert (nbytes == sizeof (dummy));

In any case, is raising an exception the only recourse in these cases? Is a proper reconnection impossible?

@bluca
Member

bluca commented Sep 19, 2018

In any case, is raising an exception the only recourse in these cases?

Yes. If the internal control mechanism is no longer reliable, any kind of undefined behaviour and broken protocol behaviour might happen. I'd recommend filing a bug with the supplier of the operating system that randomly kills userspace sockets, as that does not sound like something that should happen.

@bluca
Member

bluca commented Sep 19, 2018

The other option is for someone to implement a replacement for the internal pipe that does not use TCP on Windows. There have been many proposals over the years but no implementation. The easiest one would probably be to use the new AF_UNIX support. See #1808

@sylvainduchesne

AF_UNIX seems fine, but availability seems very limited, so I personally won't count on this to be the silver bullet.
I don't know zmq internals, but why is it impossible to return to an initial state (e.g. restart completely) rather than crashing? Our project makes use of dozens of external resources/projects, and yet zmq is the only one that crashes under these circumstances.

@sigiesec
Member

For systems running a recent build of Windows 10, this might be an alternative, but I fear there will be plenty of pre-Win10 systems around for years. Also, AF_UNIX under Win10 does not support socketpair, which is used by libzmq. I don't know whether this can easily be replaced by something else.

In addition, while I expect this to work at least with select (not sure about wepoll, since this uses several internal Windows system functions), I fear this will still perform badly on Windows, compared to the use of IOCP.

But since that is not trivial to implement and integrate with libzmq, I also think it is very much desirable to allow a process to recover from the situation described in the issue above. I think it is conceivable that in such a situation

  • the libzmq context is terminated implicitly, and all future API calls return ETERM
  • the whole libzmq context is set into a broken state, and all future API calls return an appropriate error code, and the user can terminate the context
  • a callback function defined by the user is called to indicate this situation, and they can then go on and terminate the context

However, all of these options require implementing a kind of hard shutdown that does not use the internal signaller, and therefore need some other synchronization mechanism. Maybe the user can be required to close all sockets before terminating the context.
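
To make the second and third options concrete, here is a rough sketch of a "broken state" flag combined with a user callback. Everything in it is hypothetical: `HypotheticalContext`, `report_internal_failure`, `set_fatal_callback` and `kHypotheticalEBroken` are invented for illustration and do not exist in libzmq. It also leaves out the hard part mentioned above, namely shutting down without the internal signaller.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical error code; the proposal only asks for "an appropriate error code".
constexpr int kHypotheticalEBroken = -2;

class HypotheticalContext {
public:
    using FatalCallback = std::function<void(const char *reason)>;

    // Option 3: the user registers a callback for unrecoverable failures.
    void set_fatal_callback(FatalCallback cb) { on_fatal_ = std::move(cb); }

    // Called at the point where an internal assertion would otherwise abort.
    // The flag flips once; the callback fires only on that first transition.
    void report_internal_failure(const char *reason) {
        bool expected = false;
        if (broken_.compare_exchange_strong(expected, true) && on_fatal_)
            on_fatal_(reason);
    }

    // Option 2: every API entry point checks the flag before doing work.
    int send(const char * /*data*/) {
        if (broken_.load())
            return kHypotheticalEBroken;
        return 0; // real work would happen here
    }

    bool broken() const { return broken_.load(); }

private:
    std::atomic<bool> broken_{false};
    FatalCallback on_fatal_;
};
```

The atomic flag is what lets application threads observe the failure without going through the mailbox that has just become unreliable.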

This sounds rather complex, but I agree that it may be unacceptable to use libzmq in production when this situation leads to a forced process termination.

@Steve-Read-Stormshield

I'm seeing exactly this same crash on FreeBSD 11.2 with libzmq 4.3.2. My program only has one thread of its own, plus two threads owned by ZMQ. My thread is trying to disconnect a DEALER socket, and the reaper thread blows up:

Thread 3 (process 100365):  ** ZMQ I/O thread? **
#0  _kevent () at _kevent.S:3
#1  0x0000000804a8dcc2 in __thr_kevent (kq=7, changelist=0x0, nchanges=0, 
    eventlist=0x7fffdfdfaf30, nevents=256, timeout=0x7fffdfdfcf38)
    at /home/build/snsbsd/lib/libthr/thread/thr_syscalls.c:398
#2  0x0000000800cd8bc3 in zmq::kqueue_t::loop ()
   from /usr/Firewall/lib/libzmq.so.5
#3  0x0000000800d063df in thread_routine () from /usr/Firewall/lib/libzmq.so.5
#4  0x0000000804a8ac06 in thread_start (curthread=0x80640fc00)
    at /home/build/snsbsd/lib/libthr/thread/thr_create.c:289
#5  0x0000000000000000 in ?? ()
Current language:  auto; currently asm

Thread 2 (process 100200): ** My thread **
#0  _sendto () at _sendto.S:3
#1  0x0000000804a8da8f in __thr_sendto (s=<value optimized out>, 
    m=<value optimized out>, l=<value optimized out>, 
    f=<value optimized out>, t=<value optimized out>, 
    tl=<value optimized out>)
    at /home/build/snsbsd/lib/libthr/thread/thr_syscalls.c:530
#2  0x0000000800cf2f61 in zmq::signaler_t::send ()
   from /usr/Firewall/lib/libzmq.so.5
#3  0x0000000800cdf5ba in zmq::object_t::send_pipe_term ()
   from /usr/Firewall/lib/libzmq.so.5
#4  0x0000000800ce3ef2 in zmq::pipe_t::terminate ()
   from /usr/Firewall/lib/libzmq.so.5
#5  0x0000000800cf7830 in zmq::socket_base_t::term_endpoint ()
   from /usr/Firewall/lib/libzmq.so.5
#6  0x0000000800893900 in sns::messaging::common::Socket::disconnect ()
   from /usr/Firewall/lib/libfwmessaging++.so
#7  0x0000000800885be5 in sns::messaging::internal::ConnectingSocket::disconnect () from /usr/Firewall/lib/libfwmessaging++.so
#8  0x00000008008744eb in sns::messaging::Client::~Client ()
   from /usr/Firewall/lib/libfwmessaging++.so
#9  0x000000000040e626 in sns::gatewayctl::Commands::~Commands ()
#10 0x000000000040df4e in main ()

Thread 1 (process 100364):  ** Reaper thread **
#0  thr_kill () at thr_kill.S:3
#1  0x0000000805ced304 in __raise (s=6)
    at /home/build/snsbsd/lib/libc/gen/raise.c:52
#2  0x0000000805ced279 in abort ()
    at /home/build/snsbsd/lib/libc/stdlib/abort.c:65
#3  0x0000000800cd4439 in zmq::zmq_abort () from /usr/Firewall/lib/libzmq.so.5
#4  0x0000000800cf3392 in zmq::signaler_t::recv_failable ()
   from /usr/Firewall/lib/libzmq.so.5
#5  0x0000000800cd98e3 in zmq::mailbox_t::recv ()
   from /usr/Firewall/lib/libzmq.so.5
#6  0x0000000800cebc2d in zmq::reaper_t::in_event ()
   from /usr/Firewall/lib/libzmq.so.5
#7  0x0000000800cd8c58 in zmq::kqueue_t::loop ()
   from /usr/Firewall/lib/libzmq.so.5
#8  0x0000000800d063df in thread_routine () from /usr/Firewall/lib/libzmq.so.5
#9  0x0000000804a8ac06 in thread_start (curthread=0x80640f000)
    at /home/build/snsbsd/lib/libthr/thread/thr_create.c:289
#10 0x0000000000000000 in ?? ()

So this problem is not restricted to 4.2.3, nor to Windows.

@Steve-Read-Stormshield

There's a tangled set of conditions that can trigger this sort of crash on UNIX-like OSes. We eventually discovered that our assertion failures are triggered when two things are true:

  • The application, before creating a ZMQ context, closes FD 1 (stdout) and/or FD 2 (stderr) and does not replace it/them with other things (e.g. open( "/dev/null", ...)).
  • Somewhere after starting ZMQ (so that ZMQ gets FD 1 and/or FD 2 for one end of a socket-pair), some part of the application (e.g. a third-party library that wants to write an error message) writes to the previously closed FD. The other end of the socket pair in ZMQ will receive the text written and blow up on these assertions.

It isn't ZMQ's fault, but I thought it was worth throwing this up as something you should warn people about. Yes, of course, it's very bad practice to close these FDs without opening /dev/null to replace them, but it happens, and it can cause "explosions" inside ZMQ.
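
For anyone who wants to guard against the first condition, a defensive sketch: before creating the ZMQ context, make sure FDs 0 to 2 are occupied, pointing any closed ones at /dev/null. `ensure_std_fds_open` is a hypothetical helper name, and this assumes a POSIX system where new descriptors are allocated from the lowest free number.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Re-opens any closed standard descriptor (0, 1, 2) on /dev/null so that
// later socket or pipe creation cannot be handed those numbers.
// Returns the number of descriptors repaired, or -1 on failure.
int ensure_std_fds_open() {
    int repaired = 0;
    for (int fd = 0; fd <= 2; ++fd) {
        // F_GETFD fails with EBADF only when the descriptor is not open.
        if (fcntl(fd, F_GETFD) != -1 || errno != EBADF)
            continue;
        int null_fd = open("/dev/null", O_RDWR);
        if (null_fd < 0)
            return -1;
        if (null_fd != fd) {
            // Normally open() already returned fd itself; handle the rest.
            if (dup2(null_fd, fd) < 0) {
                close(null_fd);
                return -1;
            }
            close(null_fd);
        }
        ++repaired;
    }
    return repaired;
}
```

Calling this once at start-up, before `zmq_ctx_new()`, means a stray write to stdout or stderr from a third-party library lands in /dev/null rather than in one end of an internal socket pair.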

A similar thing can affect Windows as well, where the application has an ordinary Winsock2 socket, closes it, and then initiates something (else) through ZMQ, but forgets that it closed the "native" socket and tries to send something on it after ZMQ already got the same handle.

It's an application bug in both cases, of course, but it points a hostile finger at ZMQ because the explosion happens there.

@stale

stale bot commented Oct 4, 2020

This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 56 days. Thank you for your contributions.

@stale stale bot added the stale label Oct 4, 2020
@stale stale bot closed this as completed Dec 4, 2020