Surplus of errno_assert() leading to daemon crash #2334
Comments
This is all intentional and by design: http://zguide.zeromq.org/php:chapter2#toc17 |
Well, I do error-checking for each ZMQ call, but what I am trying to say is that this particular crash was not produced or induced by a ZMQ API call; something inside ZMQ itself made it happen. The app is a simple SUB app that does read_zmq() on a single ZMQ socket, and an inbound RST packet from a publisher crashed the app. |
Please read the links and the discussion again: these are all errors that intentionally cause a SIGABRT so that they are found immediately and fixed. |
@bluca, I got your point clear. |
The problem in this case is that simply ignoring the error won't do - as the inline comment suggests, these options are needed for the correct functioning of the library, so I'm not sure if there's an alternative |
I don't ask for simple ignoring :) |
And please take a look at the trivial scenario I made for your reference to show that something about setsockopt() should be changed: a ZMQ-based publisher that is open to the Internet, and a hacker. Compile the server and the client, run the server, and start the client. The server will stop with SIGABRT in less than a minute. |
Can confirm this client reliably crashes our 4.2.0 services, tested against both ROUTER and PUB sockets on a TCP endpoint (although reading the report I would not expect the ZMQ socket pattern to matter). |
Sadly, you have no chance to control it within your application. It's an easy DoS attack vector. |
Didn't something like this happen in version 2 or earlier, e.g. where it was too strict about how an internet-facing socket should react to external issues? It got resolved, AFAIK. |
@Asmod4n Yes, there's even an entry in the current FAQ about it:
|
While fail fast is the intended paradigm, it is meant to apply only to unrecoverable internal failures, which constitute either bugs or underlying platform and/or dependency instability. Data received over an externally-facing socket should never be able to bring down the process otherwise. |
TCP_NODELAY and the other options are used because they are necessary and cannot be just ignored. If there's a better solution to handle this critical failure, please send a PR to implement it. |
I'm looking into a possible PR, but there's some disagreement in the code about which error codes should be asserted or not. Here are the three sets I've found for *nix platforms: In
In
And in
When I crash an OS X server with the hammerclient, |
@sem-hub note that |
@jakecobb I'm not sure what your approach is, but aborts cannot generally be changed to handled errors based on the return code. The reason for an abort is contextual. In some cases an error code implies a program error; in other cases the same error code does not. For example, when two internal sockets are communicating with each other, it is often the case that all errors abort, since the socket is expected to always be available. For an externally-facing socket the behavior is different. The particular issue raised above is the |
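The point about context can be illustrated with a small sketch. Everything in it is hypothetical: libzmq has no `socket_scope` enum or `should_abort` function, and the list of peer-induced error codes is an illustrative, non-exhaustive assumption that would differ per platform. It only shows that the same errno value can lead to an abort for an internal socket and to a handled error for an externally-facing one.

```cpp
#include <cerrno>

// Hypothetical names for illustration only; not part of libzmq.
enum class socket_scope { internal, external };

// Illustrative, non-exhaustive set of errors that a remote peer or the
// network can cause without any bug in the local program.
bool is_peer_induced (int err)
{
    switch (err) {
        case ECONNRESET:
        case ECONNABORTED:
        case ETIMEDOUT:
        case EPIPE:
        case ENOTCONN:
        case EHOSTUNREACH:
        case ENETUNREACH:
            return true;
        default:
            return false;
    }
}

// Decide whether an error should abort the process.
// Internal sockets are expected to always be available, so any error there
// is treated as a program error. On external sockets only errors that the
// peer could not have caused are treated as fatal.
bool should_abort (int err, socket_scope scope)
{
    if (scope == socket_scope::internal)
        return true;
    return !is_peer_induced (err);
}
```

With this shape, `should_abort (ECONNRESET, socket_scope::internal)` is true while `should_abort (ECONNRESET, socket_scope::external)` is false, so the decision depends on the call site as well as on the error code.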
@lytboris thanks for reporting the issue.
I don't see any place where
    // Create the engine object for this connection.
    stream_engine_t *engine = new (std::nothrow)
        stream_engine_t (fd, options, endpoint);
    alloc_assert (engine);
    // Attach the engine to the corresponding session object.
    send_attach (session, engine);
    // Shut the connecter down.
    terminate ();
    socket->event_connected (endpoint, (int) fd); |
@evoskuil I believe the error codes are relevant if we want to exclude internal errors from those caused by external conditions. This is the justification in
This is called just before the tuning. So my approach will be to treat tuning failures the same as failures of |
Seems right to me, given that the tuning failures can occur due to the state of the socket through no fault of the code. |
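A minimal sketch of that approach, assuming POSIX sockets: the tuning step reports failure to its caller instead of asserting, and the caller treats it like a failed connect. The names `try_tune_tcp_socket` and `finish_connect` are hypothetical and are not libzmq functions. Only the Nagle option is shown.

```cpp
#include <cerrno>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper, not libzmq's actual function: disables Nagle's
// algorithm and reports failure to the caller instead of aborting.
// Returns 0 on success, -1 on failure with errno set by setsockopt().
int try_tune_tcp_socket (int fd)
{
    int nodelay = 1;
    return setsockopt (fd, IPPROTO_TCP, TCP_NODELAY, &nodelay,
                       sizeof nodelay);
}

// Hypothetical caller: a tuning failure is handled like a failed connect.
// The descriptor is closed, errno is preserved, and false tells the caller
// to schedule a reconnect rather than crash.
bool finish_connect (int fd)
{
    if (try_tune_tcp_socket (fd) != 0) {
        const int saved_errno = errno;
        close (fd);
        errno = saved_errno;
        return false;
    }
    return true; // the caller may now hand fd to the engine
}
```

The option is still required for correct operation. The difference is that a socket which can no longer be tuned is dropped and retried, so a reset from the peer costs one connection instead of the whole process.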
The predominant problem with signals is that they just don't work in a multi-threaded environment. Getting out of the kernel or out of the library with an error conveyed to the thread should not be a problem using conventional error returns. I am seeing a bind to an address and port that is already bound result in a SIGABRT. This is just not going to work well in any environment. Return failure on the bind. Because you went down this path, it can be hard to crawl out, but there is really no reason why the APIs should not all return failure codes instead of raising signals. |
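For comparison, this is roughly what "return failure on the bind" looks like at the plain POSIX level. It is a sketch with hypothetical helper names (`bind_loopback`, `bound_port`) and it does not use or describe the libzmq API. A second bind to a port that is already taken comes back as -1 with errno set to EADDRINUSE, and the calling thread decides what to do next.

```cpp
#include <arpa/inet.h>
#include <cerrno>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: binds a new TCP socket to 127.0.0.1:port.
// Returns the descriptor on success, or -1 with errno set on failure.
int bind_loopback (unsigned short port)
{
    const int fd = socket (AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    sockaddr_in addr {};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    addr.sin_port = htons (port);

    if (bind (fd, reinterpret_cast<sockaddr *> (&addr), sizeof addr) != 0) {
        // Preserve the bind() error across close().
        const int saved_errno = errno;
        close (fd);
        errno = saved_errno;
        return -1;
    }
    return fd;
}

// Hypothetical helper: returns the local port of a bound socket,
// or 0 if it cannot be determined.
unsigned short bound_port (int fd)
{
    sockaddr_in addr {};
    socklen_t len = sizeof addr;
    if (getsockname (fd, reinterpret_cast<sockaddr *> (&addr), &len) != 0)
        return 0;
    return ntohs (addr.sin_port);
}
```

Passing port 0 to `bind_loopback` lets the kernel pick a free port, which `bound_port` can then read back. Binding a second socket to that same port fails with an ordinary error return and no signal is involved.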
A daemon is a program that is designed to run forever, so every error that is not fatal should be handled and the show must go on. Currently ZMQ has 404 errno_assert calls - 404 ways to make a daemon crash with SIGABRT. Please consider this function from tcp.cpp:
When setsockopt() returns an error, your daemon will crash. And there is a trivial scenario, involving no error in the application, in which this can happen: the remote side can send a TCP Reset packet that immediately invalidates the socket, but instead of reconnecting, ZMQ crashes the whole app.
I was debugging my app that coredumped at this particular function:
Sure, I can rewrite this function to ignore failures to disable Nagle and delayed ACKs, but 402 errno_assert() calls will remain in the code. Am I missing something?