Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surplus of errno_assert() leading to deamon crash #2334

Closed
lytboris opened this issue Feb 6, 2017 · 20 comments
Closed

Surplus of errno_assert() leading to deamon crash #2334

lytboris opened this issue Feb 6, 2017 · 20 comments

Comments

@lytboris
Copy link

lytboris commented Feb 6, 2017

A daemon is a program that is designed to run forever so every single error that is not fatal should be handled and the show must go on. Currently ZMQ has 404 errno_assert calls - 404 ways to make a daemon crash with SIGABRT. Please consider this function from tcp.cpp:

void zmq::tune_tcp_socket (fd_t s_)
{
// Disable Nagle's algorithm. We are doing data batching on 0MQ level,
// so using Nagle wouldn't improve throughput in anyway, but it would
// hurt latency.
int nodelay = 1;
int rc = setsockopt (s_, IPPROTO_TCP, TCP_NODELAY, (char*) &nodelay,
sizeof (int));
#ifdef ZMQ_HAVE_WINDOWS
wsa_assert (rc != SOCKET_ERROR);
#else
errno_assert (rc == 0);
#endif

#ifdef ZMQ_HAVE_OPENVMS
// Disable delayed acknowledgements as they hurt latency significantly.
int nodelack = 1;
rc = setsockopt (s_, IPPROTO_TCP, TCP_NODELACK, (char*) &nodelack,
sizeof (int));
errno_assert (rc != SOCKET_ERROR);
#endif
}

When setsockopt() returns an error, your daemon would crash. And there is a trivial error-free scenario when this could happen - remote side can send TCP Reset packet that will immediately invalidate the socket but instead of reconnecting, ZMQ will crash whole app.

I was debugging my app that coredumped at this particular function:

Thread 1 (Thread 802007c00 (LWP 101563/firsthop-receiver)):
#0 0x0000000801896dcc in thr_kill () from /lib/libc.so.7
#1 0x000000080193d72b in abort () from /lib/libc.so.7
#2 0x0000000000415ac1 in zmq::zmq_abort (errmsg_=Could not find the frame base for "zmq::zmq_abort(char const*)".
) at src/err.cpp:84
#3 0x0000000000453a6e in zmq::tune_tcp_socket (s_=17) at src/tcp.cpp:60
#4 0x0000000000454524 in zmq::tcp_connecter_t::out_event (this=0x80285a600) at src/tcp_connecter.cpp:134
#5 0x0000000000416be6 in zmq::kqueue_t::loop (this=0x802051300) at src/kqueue.cpp:205
#6 0x0000000000416ce5 in zmq::kqueue_t::worker_routine (arg_=0x802051300) at src/kqueue.cpp:222
#7 0x0000000000434bd8 in thread_routine (arg_=0x802051380) at src/thread.cpp:96
#8 0x0000000801618e14 in pthread_getprio () from /lib/libthr.so.3
#9 0x0000000000000000 in ?? ()
(gdb) thread 1
[Switching to thread 1 (Thread 802007c00 (LWP 101563/firsthop-receiver))]#3 0x0000000000453a6e in zmq::tune_tcp_socket (s_=17)
at src/tcp.cpp:60
60 errno_assert (rc == 0);
(gdb) p errstr
$4 = 0x801b7b240 "Connection reset by peer"
(gdb)

Sure I can rewrite this function to ignore failure non-disabled Naggle and delayed-ACKs, but 402 of errno_assert()s will remain in code. Am I missing something?

@bluca
Copy link
Member

bluca commented Feb 6, 2017

@lytboris
Copy link
Author

lytboris commented Feb 6, 2017

Well, I do error-checking for each ZMQ call but the thing I am trying to tell is that this particular crash was not produced or induced by ZMQ API call - it is something inside ZMQ itself made this crash happen. The app is a simple SUB app that does read_zmq() on single ZMQ socket and inbound RST packet from a publisher crashed the app.

@bluca
Copy link
Member

bluca commented Feb 6, 2017

Please read again the links and the discussion - these are all errors that are intentionally causing a sigabort so that they are found immediately and fixed

@lytboris
Copy link
Author

lytboris commented Feb 6, 2017

@bluca, I got your point clear.
Sometimes setsockopt() calls may fail because of external cause - e.g. faulty WAN connection that can not be fixed and in these cases I do expect from ZMQ to behave as described in your first link: as robust as possible against external attacks and errors.
So I do agree that most of asserts are there for good but setsockopt() code should be altered to provide robust to external threats.

@bluca
Copy link
Member

bluca commented Feb 6, 2017

The problem in this case is that simply ignoring the error won't do - as the inline comment suggests, these options are needed for the correct functioning of the library, so I'm not sure if there's an alternative

@lytboris
Copy link
Author

lytboris commented Feb 6, 2017

I don't ask for simple ignoring :)
One can treat errno == ECONNABORTED (and may be EINTR) as non-fatal thing (much alike connection timeout) - emit warning, close that broken socket and try again. How's about that?

@lytboris
Copy link
Author

lytboris commented Feb 6, 2017

And please take a look on trivial scenario I made for your reference to prove that something with setsockopt() should be changed: a ZMQ-based publisher that is open to the Internet and a hacker.

Compile server and client, run server and start client. You will get server stopped with SIGABRT in less than a minute.

hammerclient.zip

@jakecobb
Copy link
Contributor

jakecobb commented Feb 7, 2017

Can confirm this client reliably crashes our 4.2.0 services, tested against both ROUTER and PUB sockets on a TCP endpoint (although reading the report I would not expect the ZMQ socket pattern to matter).

@sem-hub
Copy link

sem-hub commented Feb 22, 2017

Sadly you have no chance to control it within your application. It's an easy DOS attack vector.
You can rebuild 0mq library with -DNDEBUG flag but you lost all errors checking this way.

@Asmod4n
Copy link
Contributor

Asmod4n commented Mar 4, 2017

Didn't something like this happen in vs. 2 or earlier? E.g. where it was too strict how a internet facing socket should ack from external issues. It got resolved afaik.
Isn't it the zmq way to be as rebust as possible from external attacks and crash hard on internal errors?

@jakecobb
Copy link
Contributor

jakecobb commented Apr 3, 2017

@Asmod4n Yes, there's even an entry in the current FAQ about it:

Is it true that it is not safe to use ZeroMQ over the internet because it will crash?

Earlier versions of the ZeroMQ library (before 2.1) were not very resilient against "fuzzing" attacks. A malformed packet or garbage data could cause an old version of the library to assert and exit. Since the release of 2.1, all reported cases of assertions caused by bad data have been fixed. If your testing uncovers a problem in this area, please file a bug report.

@evoskuil
Copy link
Contributor

evoskuil commented Apr 3, 2017

While fail fast is the intended paradigm, it is meant to apply only to unrecoverable internal failures, which constitute either bugs or underlying platform and/or dependency instability. Data received over an externally-facing socket should never be able to bring down the process otherwise.

@bluca
Copy link
Member

bluca commented Apr 3, 2017

TCP_NODELAY and the other options are used because they are necessary and cannot be just ignored. If there's a better solution to handle this critical failure, please send a PR to implement it.

@jakecobb
Copy link
Contributor

jakecobb commented Apr 4, 2017

I'm looking into a possible PR, but there's some disagreement in the code about which error codes should be asserted or not. Here are the three sets I've found for *nix platforms:

In tcp_listener.cpp, these are non-asserting:

EAGAIN
EWOULDBLOCK
EINTR
ECONNABORTED
EPROTO
ENOBUFS
ENOMEM
EMFILE
ENFILE

In tcp_connector.cpp, only these assert:

EBADF
ENOPROTOOPT
ENOTSOCK
ENOBUFS

And in socks_connector.cpp, these are non-asserting:

ECONNREFUSED
ECONNRESET
ETIMEDOUT
EHOSTUNREACH
ENETUNREACH
ENETDOWN
EINVAL

When I crash an OS X server with the hammerclient, errno is EINVAL (22, "Invalid argument").

@evoskuil
Copy link
Contributor

evoskuil commented Apr 4, 2017

@sem-hub note that zmq_assert calls zmq_abort which in turn calls abort() (or RaiseException(EXCEPTION_NONCONTINUABLE) on Windows). These are not debug assertions and remain effective independent of NDEBUG declaration.

@evoskuil
Copy link
Contributor

evoskuil commented Apr 4, 2017

@jakecobb I'm not sure what is your approach, but aborts cannot be generally changed to handled errors based on the return code. The reason for an abort is contextual. In some cases an error code implies a program error in other cases the same error code does not.

For example, when two internal sockets are communicating with each other it is often the case that all errors abort, since the socket is expected to always be available. For an externally-facing socket the behavior is different.

The particular issue raised above is the setsockopt call in zmq::tune_tcp_socket. The resolution is more a question of how to propagate the setsockopt failure than what error codes to handle.

@evoskuil
Copy link
Contributor

evoskuil commented Apr 4, 2017

@lytboris thanks for reporting the issue.

to prove that something with setsockopt() should be changed

I don't see any place where setsockopt aborts, so the issue is really with the tcp_connecter_t::out_event () call of tune_tcp_socket (fd). These abort if setting the option fails. Since both are void, and presumably the setting is required, the resolution is not straightforward. Incidentally this issue applies to each of the tune_tcp_... methods in out_event, so the overall problem seems to be with the use of void/aborting tcp.cpp functions. It appears that they should all be at least Boolean and result in a failure to complete this subsequent section, by calling a socket_base_t failure method (although the aborts may be necessary in other contexts, which needs to be evaliated):

    //  Create the engine object for this connection.
    stream_engine_t *engine = new (std::nothrow)
        stream_engine_t (fd, options, endpoint);
    alloc_assert (engine);

    //  Attach the engine to the corresponding session object.
    send_attach (session, engine);

    //  Shut the connecter down.
    terminate ();

    socket->event_connected (endpoint, (int) fd);

@jakecobb
Copy link
Contributor

jakecobb commented Apr 4, 2017

@evoskuil I believe the error codes are relevant if we want to exclude internal errors from those caused by external conditions. This is the justification in tcp_connecter_t::connect which differentiates the error codes, stating:

    //  Assert if the error was caused by 0MQ bug.
    //  Networking problems are OK. No need to assert.

This is called just before the tuning. So my approach will be to treat tuning failures the same as failures of connect. My concern is just the difference in socket_connector_t::check_proxy_connection and tcp_connector_t::accept. It looks like copy+paste+modify because the comments and structure are the same but the error codes are different. All three of the places I mentioned above call the tuning functions in tcp.cpp, so presumably all three need to handle the RST case more gracefully.

@evoskuil
Copy link
Contributor

evoskuil commented Apr 4, 2017

So my approach will be to treat tuning failures the same as failures of connect

Seems right to me, given that the tuning failures can occur due to the state of the socket through no fault of the code.

@greggwon
Copy link

greggwon commented Oct 23, 2018

The predominant problem with signals is that it just doesn't work in a multi-threaded environment. Getting out of the kernel or out of the library with an error conveyed to the thread should not be a problem using conventional error returns.

I am seeing problems with bind to an address and port that is already bound sending a SIGABRT. This is just not going to work well in any environment. Return failure on the bind. Because you went down this path, it can be hard to crawl out, but there is really no reason that the APIs should not all be returned failure codes instead of signals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants