Surplus of errno_assert() leading to deamon crash #2334

lytboris · 2017-02-06T15:22:55Z

A daemon is a program that is designed to run forever so every single error that is not fatal should be handled and the show must go on. Currently ZMQ has 404 errno_assert calls - 404 ways to make a daemon crash with SIGABRT. Please consider this function from tcp.cpp:

void zmq::tune_tcp_socket (fd_t s_)
{
// Disable Nagle's algorithm. We are doing data batching on 0MQ level,
// so using Nagle wouldn't improve throughput in anyway, but it would
// hurt latency.
int nodelay = 1;
int rc = setsockopt (s_, IPPROTO_TCP, TCP_NODELAY, (char*) &nodelay,
sizeof (int));
#ifdef ZMQ_HAVE_WINDOWS
wsa_assert (rc != SOCKET_ERROR);
#else
errno_assert (rc == 0);
#endif

#ifdef ZMQ_HAVE_OPENVMS
// Disable delayed acknowledgements as they hurt latency significantly.
int nodelack = 1;
rc = setsockopt (s_, IPPROTO_TCP, TCP_NODELACK, (char*) &nodelack,
sizeof (int));
errno_assert (rc != SOCKET_ERROR);
#endif
}

When setsockopt() returns an error, your daemon would crash. And there is a trivial error-free scenario when this could happen - remote side can send TCP Reset packet that will immediately invalidate the socket but instead of reconnecting, ZMQ will crash whole app.

I was debugging my app that coredumped at this particular function:

Thread 1 (Thread 802007c00 (LWP 101563/firsthop-receiver)):
#0 0x0000000801896dcc in thr_kill () from /lib/libc.so.7
#1 0x000000080193d72b in abort () from /lib/libc.so.7
#2 0x0000000000415ac1 in zmq::zmq_abort (errmsg_=Could not find the frame base for "zmq::zmq_abort(char const*)".
) at src/err.cpp:84
#3 0x0000000000453a6e in zmq::tune_tcp_socket (s_=17) at src/tcp.cpp:60
#4 0x0000000000454524 in zmq::tcp_connecter_t::out_event (this=0x80285a600) at src/tcp_connecter.cpp:134
#5 0x0000000000416be6 in zmq::kqueue_t::loop (this=0x802051300) at src/kqueue.cpp:205
#6 0x0000000000416ce5 in zmq::kqueue_t::worker_routine (arg_=0x802051300) at src/kqueue.cpp:222
#7 0x0000000000434bd8 in thread_routine (arg_=0x802051380) at src/thread.cpp:96
#8 0x0000000801618e14 in pthread_getprio () from /lib/libthr.so.3
#9 0x0000000000000000 in ?? ()
(gdb) thread 1
[Switching to thread 1 (Thread 802007c00 (LWP 101563/firsthop-receiver))]#3 0x0000000000453a6e in zmq::tune_tcp_socket (s_=17)
at src/tcp.cpp:60
60 errno_assert (rc == 0);
(gdb) p errstr
$4 = 0x801b7b240 "Connection reset by peer"
(gdb)

Sure I can rewrite this function to ignore failure non-disabled Naggle and delayed-ACKs, but 402 of errno_assert()s will remain in code. Am I missing something?

The text was updated successfully, but these errors were encountered:

bluca · 2017-02-06T16:32:29Z

This is all intentional and by design:

http://zguide.zeromq.org/php:chapter2#toc17
http://grokbase.com/t/zeromq/zeromq-dev/08cf6qpw56/error-handling-assert
https://en.wikipedia.org/wiki/Fail-fast

lytboris · 2017-02-06T16:41:00Z

Well, I do error-checking for each ZMQ call but the thing I am trying to tell is that this particular crash was not produced or induced by ZMQ API call - it is something inside ZMQ itself made this crash happen. The app is a simple SUB app that does read_zmq() on single ZMQ socket and inbound RST packet from a publisher crashed the app.

bluca · 2017-02-06T16:45:13Z

Please read again the links and the discussion - these are all errors that are intentionally causing a sigabort so that they are found immediately and fixed

lytboris · 2017-02-06T19:49:55Z

@bluca, I got your point clear.
Sometimes setsockopt() calls may fail because of external cause - e.g. faulty WAN connection that can not be fixed and in these cases I do expect from ZMQ to behave as described in your first link: as robust as possible against external attacks and errors.
So I do agree that most of asserts are there for good but setsockopt() code should be altered to provide robust to external threats.

bluca · 2017-02-06T20:27:10Z

The problem in this case is that simply ignoring the error won't do - as the inline comment suggests, these options are needed for the correct functioning of the library, so I'm not sure if there's an alternative

lytboris · 2017-02-06T20:43:49Z

I don't ask for simple ignoring :)
One can treat errno == ECONNABORTED (and may be EINTR) as non-fatal thing (much alike connection timeout) - emit warning, close that broken socket and try again. How's about that?

lytboris · 2017-02-06T21:00:44Z

And please take a look on trivial scenario I made for your reference to prove that something with setsockopt() should be changed: a ZMQ-based publisher that is open to the Internet and a hacker.

Compile server and client, run server and start client. You will get server stopped with SIGABRT in less than a minute.

hammerclient.zip

jakecobb · 2017-02-07T17:04:21Z

Can confirm this client reliably crashes our 4.2.0 services, tested against both ROUTER and PUB sockets on a TCP endpoint (although reading the report I would not expect the ZMQ socket pattern to matter).

sem-hub · 2017-02-22T15:49:14Z

Sadly you have no chance to control it within your application. It's an easy DOS attack vector.
You can rebuild 0mq library with -DNDEBUG flag but you lost all errors checking this way.

Asmod4n · 2017-03-04T21:54:43Z

Didn't something like this happen in vs. 2 or earlier? E.g. where it was too strict how a internet facing socket should ack from external issues. It got resolved afaik.
Isn't it the zmq way to be as rebust as possible from external attacks and crash hard on internal errors?

jakecobb · 2017-04-03T20:16:02Z

@Asmod4n Yes, there's even an entry in the current FAQ about it:

Is it true that it is not safe to use ZeroMQ over the internet because it will crash?

Earlier versions of the ZeroMQ library (before 2.1) were not very resilient against "fuzzing" attacks. A malformed packet or garbage data could cause an old version of the library to assert and exit. Since the release of 2.1, all reported cases of assertions caused by bad data have been fixed. If your testing uncovers a problem in this area, please file a bug report.

evoskuil · 2017-04-03T23:04:21Z

While fail fast is the intended paradigm, it is meant to apply only to unrecoverable internal failures, which constitute either bugs or underlying platform and/or dependency instability. Data received over an externally-facing socket should never be able to bring down the process otherwise.

bluca · 2017-04-03T23:26:08Z

TCP_NODELAY and the other options are used because they are necessary and cannot be just ignored. If there's a better solution to handle this critical failure, please send a PR to implement it.

jakecobb · 2017-04-04T01:06:21Z

I'm looking into a possible PR, but there's some disagreement in the code about which error codes should be asserted or not. Here are the three sets I've found for *nix platforms:

In tcp_listener.cpp, these are non-asserting:

EAGAIN
EWOULDBLOCK
EINTR
ECONNABORTED
EPROTO
ENOBUFS
ENOMEM
EMFILE
ENFILE

In tcp_connector.cpp, only these assert:

EBADF
ENOPROTOOPT
ENOTSOCK
ENOBUFS

And in socks_connector.cpp, these are non-asserting:

ECONNREFUSED
ECONNRESET
ETIMEDOUT
EHOSTUNREACH
ENETUNREACH
ENETDOWN
EINVAL

When I crash an OS X server with the hammerclient, errno is EINVAL (22, "Invalid argument").

evoskuil · 2017-04-04T01:07:31Z

@sem-hub note that zmq_assert calls zmq_abort which in turn calls abort() (or RaiseException(EXCEPTION_NONCONTINUABLE) on Windows). These are not debug assertions and remain effective independent of NDEBUG declaration.

evoskuil · 2017-04-04T01:17:58Z

@jakecobb I'm not sure what is your approach, but aborts cannot be generally changed to handled errors based on the return code. The reason for an abort is contextual. In some cases an error code implies a program error in other cases the same error code does not.

For example, when two internal sockets are communicating with each other it is often the case that all errors abort, since the socket is expected to always be available. For an externally-facing socket the behavior is different.

The particular issue raised above is the setsockopt call in zmq::tune_tcp_socket. The resolution is more a question of how to propagate the setsockopt failure than what error codes to handle.

evoskuil · 2017-04-04T01:28:23Z

@lytboris thanks for reporting the issue.

to prove that something with setsockopt() should be changed

I don't see any place where setsockopt aborts, so the issue is really with the tcp_connecter_t::out_event () call of tune_tcp_socket (fd). These abort if setting the option fails. Since both are void, and presumably the setting is required, the resolution is not straightforward. Incidentally this issue applies to each of the tune_tcp_... methods in out_event, so the overall problem seems to be with the use of void/aborting tcp.cpp functions. It appears that they should all be at least Boolean and result in a failure to complete this subsequent section, by calling a socket_base_t failure method (although the aborts may be necessary in other contexts, which needs to be evaliated):

    //  Create the engine object for this connection.
    stream_engine_t *engine = new (std::nothrow)
        stream_engine_t (fd, options, endpoint);
    alloc_assert (engine);

    //  Attach the engine to the corresponding session object.
    send_attach (session, engine);

    //  Shut the connecter down.
    terminate ();

    socket->event_connected (endpoint, (int) fd);

jakecobb · 2017-04-04T01:53:01Z

@evoskuil I believe the error codes are relevant if we want to exclude internal errors from those caused by external conditions. This is the justification in tcp_connecter_t::connect which differentiates the error codes, stating:

    //  Assert if the error was caused by 0MQ bug.
    //  Networking problems are OK. No need to assert.

This is called just before the tuning. So my approach will be to treat tuning failures the same as failures of connect. My concern is just the difference in socket_connector_t::check_proxy_connection and tcp_connector_t::accept. It looks like copy+paste+modify because the comments and structure are the same but the error codes are different. All three of the places I mentioned above call the tuning functions in tcp.cpp, so presumably all three need to handle the RST case more gracefully.

evoskuil · 2017-04-04T02:03:36Z

So my approach will be to treat tuning failures the same as failures of connect

Seems right to me, given that the tuning failures can occur due to the state of the socket through no fault of the code.

greggwon · 2018-10-23T19:47:28Z

The predominant problem with signals is that it just doesn't work in a multi-threaded environment. Getting out of the kernel or out of the library with an error conveyed to the thread should not be a problem using conventional error returns.

I am seeing problems with bind to an address and port that is already bound sending a SIGABRT. This is just not going to work well in any environment. Return failure on the bind. Because you went down this path, it can be hard to crawl out, but there is really no reason that the APIs should not all be returned failure codes instead of signals.

lytboris mentioned this issue Feb 7, 2017

4.1.2: Assertion failed: Connection reset by peer (zeromq\src\signaler.cpp:298) #1808

Closed

jakecobb mentioned this issue Apr 4, 2017

Prevent DOS by asserts in TCP tuning #2492

Merged

bluca closed this as completed in #2492 Apr 4, 2017

This was referenced Apr 8, 2017

libzmq terminates process, when PUSH sockets frequently recreated #2016

Closed

tune_tcp_socket aborts the process with "Invalid argument" #2104

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Surplus of errno_assert() leading to deamon crash #2334

Surplus of errno_assert() leading to deamon crash #2334

lytboris commented Feb 6, 2017 •

edited

Loading

bluca commented Feb 6, 2017

lytboris commented Feb 6, 2017 •

edited

Loading

bluca commented Feb 6, 2017

lytboris commented Feb 6, 2017

bluca commented Feb 6, 2017

lytboris commented Feb 6, 2017

lytboris commented Feb 6, 2017 •

edited

Loading

jakecobb commented Feb 7, 2017

sem-hub commented Feb 22, 2017

Asmod4n commented Mar 4, 2017 •

edited

Loading

jakecobb commented Apr 3, 2017

evoskuil commented Apr 3, 2017 •

edited

Loading

bluca commented Apr 3, 2017

jakecobb commented Apr 4, 2017

evoskuil commented Apr 4, 2017

evoskuil commented Apr 4, 2017 •

edited

Loading

evoskuil commented Apr 4, 2017 •

edited

Loading

jakecobb commented Apr 4, 2017 •

edited

Loading

evoskuil commented Apr 4, 2017

greggwon commented Oct 23, 2018 •

edited

Loading

Surplus of errno_assert() leading to deamon crash #2334

Surplus of errno_assert() leading to deamon crash #2334

Comments

lytboris commented Feb 6, 2017 • edited Loading

bluca commented Feb 6, 2017

lytboris commented Feb 6, 2017 • edited Loading

bluca commented Feb 6, 2017

lytboris commented Feb 6, 2017

bluca commented Feb 6, 2017

lytboris commented Feb 6, 2017

lytboris commented Feb 6, 2017 • edited Loading

jakecobb commented Feb 7, 2017

sem-hub commented Feb 22, 2017

Asmod4n commented Mar 4, 2017 • edited Loading

jakecobb commented Apr 3, 2017

evoskuil commented Apr 3, 2017 • edited Loading

bluca commented Apr 3, 2017

jakecobb commented Apr 4, 2017

evoskuil commented Apr 4, 2017

evoskuil commented Apr 4, 2017 • edited Loading

evoskuil commented Apr 4, 2017 • edited Loading

jakecobb commented Apr 4, 2017 • edited Loading

evoskuil commented Apr 4, 2017

greggwon commented Oct 23, 2018 • edited Loading

lytboris commented Feb 6, 2017 •

edited

Loading

lytboris commented Feb 6, 2017 •

edited

Loading

lytboris commented Feb 6, 2017 •

edited

Loading

Asmod4n commented Mar 4, 2017 •

edited

Loading

evoskuil commented Apr 3, 2017 •

edited

Loading

evoskuil commented Apr 4, 2017 •

edited

Loading

evoskuil commented Apr 4, 2017 •

edited

Loading

jakecobb commented Apr 4, 2017 •

edited

Loading

greggwon commented Oct 23, 2018 •

edited

Loading