Random ZMQ 4.2.3 crash in libzmq.dll!zmq::mailbox_t::recv(...) #3104
Comments
Are you using a socket, or messages, from multiple threads?
@bluca Thanks! We are in the process of reviewing the implementation. Will report back soon.
Should a ZMQ socket be created and used only by the calling thread?
Yes, only the context is thread safe, as the doc says (and the new draft socket types, which are explicitly marked thread safe).
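Not from this thread, but a minimal sketch of the rule being stated, using arbitrary PUSH/PULL sockets over a hypothetical `inproc://work` endpoint: the context pointer may cross threads, while each socket is created, used, and closed on a single thread.

```c
#include <assert.h>
#include <pthread.h>
#include <zmq.h>

/* Each thread owns its own socket; only the context pointer is shared. */
static void *worker(void *ctx)
{
    void *pull = zmq_socket(ctx, ZMQ_PULL);   /* created by this thread */
    int rc = zmq_connect(pull, "inproc://work");
    assert(rc == 0);

    char buf[16];
    rc = zmq_recv(pull, buf, sizeof buf, 0);  /* used only by this thread */
    assert(rc == 5);

    zmq_close(pull);                          /* closed by this thread */
    return NULL;
}

int main(void)
{
    void *ctx = zmq_ctx_new();                /* the context IS thread safe */

    void *push = zmq_socket(ctx, ZMQ_PUSH);
    int rc = zmq_bind(push, "inproc://work"); /* bind before inproc connect */
    assert(rc == 0);

    pthread_t t;
    pthread_create(&t, NULL, worker, ctx);    /* pass the context, never a socket */

    rc = zmq_send(push, "hello", 5, 0);
    assert(rc == 5);

    pthread_join(t, NULL);
    zmq_close(push);
    zmq_ctx_term(ctx);
    return 0;
}
```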
We've completed the code review, and all ZMQ sockets are created and used in the calling thread only. In the application code, our try/catch block was not able to catch this; the crash happened inside ZMQ code.
We have a similar problem, and it seems to be related to the OS (we are using Windows) closing sockets under high load. It is also possible to get a crash in the reconnection loop, where zmq tries to send the hiccup command but fails in the send.
In any case, is raising an exception the only recourse in these cases? Is a proper reconnection impossible?
Yes. If the internal control mechanism is no longer reliable, any kind of undefined behaviour and broken protocol behaviour might happen. I'd recommend filing a bug with the supplier of the operating system that randomly kills userspace sockets, as that does not sound like something that should happen.
The other alternative is for someone to implement an alternative for the internal pipe that does not use TCP on Windows. There have been many proposals over the years but no implementation. The easiest one would probably be using the new AF_UNIX support. See #1808
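For illustration only, here is a sketch of how such an internal pipe could be built: since Winsock has no `socketpair()`, an AF_UNIX pair can be assembled with the same listen/connect/accept trick libzmq already uses for its TCP signaler. This assumes Windows 10 build 17063+ with `<afunix.h>`; `af_unix_socketpair` is a hypothetical helper, not libzmq API, and the path generation is deliberately simplified.

```c
#include <winsock2.h>
#include <afunix.h>    /* sockaddr_un; Windows 10 SDK 17063+ */
#include <stdio.h>
#include <string.h>

static int af_unix_socketpair(SOCKET sv[2])
{
    struct sockaddr_un addr;
    SOCKET listener = INVALID_SOCKET;
    sv[0] = sv[1] = INVALID_SOCKET;

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    /* Simplified unique path; real code must create this securely. */
    snprintf(addr.sun_path, sizeof addr.sun_path,
             "C:\\Temp\\signaler-%lu.sock", GetCurrentProcessId());
    DeleteFileA(addr.sun_path);

    listener = socket(AF_UNIX, SOCK_STREAM, 0);
    if (listener == INVALID_SOCKET)
        return -1;
    if (bind(listener, (struct sockaddr *)&addr, (int)sizeof addr) != 0 ||
        listen(listener, 1) != 0)
        goto fail;

    sv[0] = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sv[0] == INVALID_SOCKET ||
        connect(sv[0], (struct sockaddr *)&addr, (int)sizeof addr) != 0)
        goto fail;

    sv[1] = accept(listener, NULL, NULL);  /* peer of the connecting end */
    if (sv[1] == INVALID_SOCKET)
        goto fail;

    closesocket(listener);
    DeleteFileA(addr.sun_path);            /* path no longer needed */
    return 0;

fail:
    if (sv[0] != INVALID_SOCKET)
        closesocket(sv[0]);
    closesocket(listener);
    DeleteFileA(addr.sun_path);
    return -1;
}

int main(void)                             /* link with Ws2_32.lib */
{
    WSADATA wsa;
    SOCKET sv[2];
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return 1;
    int rc = af_unix_socketpair(sv);
    printf("socketpair: %s\n", rc == 0 ? "ok" : "failed");
    WSACleanup();
    return rc;
}
```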
AF_UNIX seems fine, but availability seems very limited, so I personally won't count on this to be the silver bullet.
For systems running a recent subversion of Windows 10, this might be an alternative, but I fear there will be plenty of pre-Win10 systems around for years. Also, AF_UNIX under Win10 does not support socketpair, which is used by libzmq; I am not sure whether this can easily be replaced by something else.

In addition, while I expect this to work at least with select (not sure about wepoll, since that uses several internal Windows system functions), I fear this will still perform badly on Windows compared to the use of IOCP. But since that is not trivial to implement and integrate with libzmq, I also think it is very much desirable to allow a process to recover from the situation described in the issue above. I think it is conceivable that in such a situation …
However, all of these options require implementing a kind of hard shutdown that does not use the internal signaller, and therefore need some other synchronization mechanism. Maybe the user could be required to close all sockets before terminating the context. This sounds rather complex, but I agree that it may be unacceptable to use libzmq in production when this situation leads to a forced process termination.
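For reference, the cooperative shutdown sequence that comment alludes to looks roughly like this. A sketch only: `shutdown_zmq` and the socket array are hypothetical, but `zmq_ctx_term()` really does block until every socket created in the context has been closed, which is why closing them first (with bounded linger) matters.

```c
#include <stddef.h>
#include <zmq.h>

/* Hypothetical helper: close every socket (with zero linger) before
 * terminating the context, so zmq_ctx_term() does not block waiting
 * for sockets that can no longer be signalled. */
void shutdown_zmq(void *ctx, void **sockets, size_t n)
{
    int linger = 0;                 /* drop unsent messages immediately */
    for (size_t i = 0; i < n; i++) {
        zmq_setsockopt(sockets[i], ZMQ_LINGER, &linger, sizeof linger);
        zmq_close(sockets[i]);
    }
    zmq_ctx_term(ctx);              /* returns once all sockets are closed */
}
```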
I'm seeing exactly this same crash on FreeBSD 11.2 with libzmq 4.3.2. My program only has one thread of its own, plus two threads owned by ZMQ. My thread is trying to disconnect a DEALER socket, and the reaper thread blows up.
So this problem is not restricted to 4.2.3, nor to Windows.
There's a tangly set of conditions that can trigger this sort of crash on UNIX-like OSes. We eventually discovered that our asserts occur when two particular conditions hold at the same time, both involving file descriptors being closed out from under libzmq.
It isn't ZMQ's fault, but I thought it was worth throwing this up as something you should warn people about. Yes, of course, it's very bad practice to close these FDs without opening /dev/null to replace them, but it happens, and it can cause "explosions" inside ZMQ.

A similar thing can affect Windows as well: the application has an ordinary Winsock2 socket, closes it, and then initiates something else through ZMQ, but forgets that it closed the "native" socket and tries to send on it after ZMQ has already been given the same handle. It's an application bug in both cases, of course, but it points a hostile finger at ZMQ because the explosion happens there.
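On UNIX-like systems, the standard guard against this hazard is to re-point descriptors 0-2 at /dev/null rather than leaving them closed, so no later `open()` or `socket()` call (including libzmq's internal ones) can be handed fd 0, 1, or 2. A sketch; `protect_std_fds` is a hypothetical name:

```c
#include <fcntl.h>
#include <unistd.h>

/* Keep fds 0-2 pointed at /dev/null after a daemon closes them, so that
 * libzmq (or anything else) can never receive descriptor 0, 1, or 2 and
 * later have it clobbered by code that still thinks it is stdout. */
void protect_std_fds(void)
{
    int fd = open("/dev/null", O_RDWR);
    if (fd < 0)
        return;
    dup2(fd, STDIN_FILENO);
    dup2(fd, STDOUT_FILENO);
    dup2(fd, STDERR_FILENO);
    if (fd > STDERR_FILENO)
        close(fd);
}
```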
Issue description
The crash happens randomly and can't be reproduced, but our client keeps reporting it.
We use the PUB/SUB model for two-way async communication between client and server (see the sketch below). We spent quite some time checking the application code for possible socket-handle race conditions but didn't notice any.
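A minimal sketch of the topology described, client side only, with hypothetical endpoints: each side owns a PUB socket for outbound traffic and a SUB socket for inbound, giving full two-way async messaging.

```c
#include <assert.h>
#include <zmq.h>

int main(void)
{
    void *ctx = zmq_ctx_new();
    int rc;

    void *pub = zmq_socket(ctx, ZMQ_PUB);        /* client -> server */
    rc = zmq_connect(pub, "tcp://server:5556");  /* hypothetical endpoint */
    assert(rc == 0);

    void *sub = zmq_socket(ctx, ZMQ_SUB);        /* server -> client */
    rc = zmq_connect(sub, "tcp://server:5557");  /* hypothetical endpoint */
    assert(rc == 0);
    rc = zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "", 0);  /* receive everything */
    assert(rc == 0);

    /* ... application loop: zmq_send on pub, zmq_recv on sub ... */

    zmq_close(sub);
    zmq_close(pub);
    zmq_ctx_term(ctx);
    return 0;
}
```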
Environment
Minimal test code / Steps to reproduce the issue
N/A
What's the actual result? (include assertion message & call stack if applicable)
KERNELBASE.dll!_RaiseException@16() Unknown
What's the expected result?
In the testing environment, we pumped the system with very high traffic and it worked as designed. Evidently we haven't found the right way to trigger the crash; that's the most frustrating part.