use-after-free reported in zmq::pipe_t::terminate #3245
This problem is being reported intermittently in the regression tests we run using ASAN. Interestingly, the full stack trace identifies the reaper thread as both the culprit and the victim -- in other words, this does not appear to be a thread synchronization issue, since the same thread is both freeing the memory and using it after the free. Curious if anyone has encountered this before? And maybe fixed it in a later version? TIA for any help! |
FWIW, the initial free is in |
There's a Travis job with ASAN, but I've never seen that. Do you have a test case for it?
Afraid not -- libzmq is at the bottom of the stack here, so a repro with pure libzmq is not available. The problem is also intermittent (the test suite contains dozens of tests, runs for 7 hours, 3x per day and these are the only four instances of the problem over the past two weeks). OTOH, the stack traces are pretty much pure libzmq, and are identical in all four cases. It's also not a threading issue since both the free and use-after-free occur on the reaper thread. I'm guessing here that |
I'm curious why there has been no response to this. Use-after-free is clearly a serious issue, so I would hope that the project maintainers take it seriously. Is the lack of response because I'm unable to provide a repro? Are people simply busy? What do I need to do to help move this along? Thanks! |
Hi @WallStProg! It is very difficult to do anything about this without a test case reproducing it. We don't even know which socket types, transports and socket options are used. However, even with this information, an executable test case would still be more or less required to investigate this further. Since you are using an older version, could you try updating to the latest master, and see if the problem persists? |
Hi @sigiesec: Thanks for the thoughtful reply.

The good news is that ZeroMQ works as expected 99+% of the time -- the bad news is that in order to run our production systems on ZeroMQ that percentage needs to be as close to 100% as we can make it. So, we "torture test" the software, and in the process we uncover "Heisenbugs" like this one. And when we do, we post here and ask for help.

I understand that having a consistent reproduction is essential for resolving these problems. Unfortunately, I can't even get a consistent repro in my own environment, much less in a "pure" libzmq environment. So what we're hoping for is some clue that will help us narrow things down to where we might be able to create a reliable reproduction. Maybe this looks similar to another problem that was reported by someone else. Or, maybe someone who is familiar with the shutdown sequence in ZeroMQ (and it's always the shutdown sequence that bites you) can give us an idea what to look for when the problem happens again.

That's all we're looking for, and any help will be much appreciated! After all, we're just trying to make ZeroMQ better, but we realize that others know much more than we do about it.

Best Regards, Bill Torpey

P.S. If it helps, these are PUB/SUB sockets over TCP. Socket options are pretty vanilla, but one thing that may be relevant is that we set ZMQ_LINGER to zero immediately prior to closing the socket(s) -- and in other tests we've observed a race condition where that setting doesn't necessarily take effect and so can cause
That was fixed by #2910
Thanks for the suggestion! Unfortunately, in our tests, we see the same behavior w/4.2.5, which suggests that the problem is not the one fixed by #2910, but something else, possibly related to when
Again, we're putting together a repro and writeup that we can post here soon.
Please see #3252 for a pure libzmq example of the linger problem mentioned above. (Note also that we discovered the linger problem while troubleshooting #3186, which turned out to be related to
As I mentioned above, my experience working with several different messaging middlewares is that it's the shutdown sequence where some of the gnarliest problems tend to happen, and libzmq is no different in that respect.

I suspect that the use-after-free problem is related to the linger problem in that both appear to be caused by the asynchronous nature of the communication between the application thread and the IO thread (and the reaper thread), which creates opportunities for race conditions. We're trying to understand that better so we can fix, avoid or work around some of the non-deterministic behavior that we're seeing in our tests.

But it's a tall order. We weren't expecting to have to do a "deep dive" into the ZeroMQ code (naive, perhaps), and the code comments are terse to say the least. So, any help at all will be greatly appreciated! Thanks!
Hi @sigiesec, I was running a test on the same codebase as @WallStProg and ASAN caught a race condition in the reaper thread where process_pipe_term_ack() is being called before term_endpoint() completes, causing a heap-use-after-free. In the "freed by thread T3 here" section, process_term_endpoint calls process_command and picks up a pipe_term_ack.
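The ordering described here can be modeled in a few lines. The following is a minimal sketch, not libzmq code: `ToySocket`, `Pipe`, `queue_term_ack` and `stale_handles_after_disconnect` are hypothetical names, and integer handles stand in for raw `pipe_t` pointers so that the sketch itself has no undefined behavior. It assumes, as the trace suggests, that pending commands are drained inside the disconnect path on the same thread.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a pipe; not libzmq's pipe_t.
struct Pipe {
    int id;
};

// Toy socket: owns pipes by id and keeps a separate list of endpoint handles.
struct ToySocket {
    std::map<int, std::unique_ptr<Pipe>> pipes; // pipes that are still alive
    std::vector<int> endpoints;                 // handles the socket still tracks
    std::queue<std::function<void()>> mailbox;  // pending commands

    void attach(int id) {
        pipes[id] = std::unique_ptr<Pipe>(new Pipe{id});
        endpoints.push_back(id);
    }

    // Models a term ack arriving: the pipe is freed, but the
    // endpoint list is NOT updated (the assumed root of the bug).
    void queue_term_ack(int id) {
        mailbox.push([this, id] { pipes.erase(id); });
    }

    void process_commands() {
        while (!mailbox.empty()) {
            std::function<void()> cmd = std::move(mailbox.front());
            mailbox.pop();
            cmd();
        }
    }

    // Returns how many tracked handles referred to an already-freed pipe.
    int term_endpoint() {
        process_commands(); // may free pipes on this same thread
        int stale = 0;
        for (int id : endpoints) {
            if (pipes.erase(id) == 0)
                ++stale; // with raw pointers this would be the use-after-free
        }
        endpoints.clear();
        return stale;
    }
};

// Runs one disconnect; ack_pending selects whether a term ack is queued first.
int stale_handles_after_disconnect(bool ack_pending) {
    ToySocket s;
    s.attach(1);
    s.attach(2);
    if (ack_pending)
        s.queue_term_ack(1);
    return s.term_endpoint();
}
```

Calling `stale_handles_after_disconnect(true)` reports one stale handle and `stale_handles_after_disconnect(false)` reports none, which matches the observation that the free and the later use happen on one thread with no second thread involved.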
This is with libzmq 4.2.3. Full ASAN output:
- add socket monitoring - rearrange order of socket close (close pubs before subs)
* 'master' of https://github.com/WallStProg/zmqtests: attempt to repro zeromq/libzmq#3245 (comment) - add socket monitoring - rearrange order of socket close (close pubs before subs)
Turns out that this issue can be repro'd w/pure libzmq using the code in the 3186 directory from zmqtests -- it just needs more peers running. The use-after-free occurs when a number of processes connect and disconnect from each other, and when the system is busy enough that some calls (e.g., to

Reproducing the problem
To reproduce the problem, do the following:

The Observed Errors
So far, I've seen three different causes for the error, but they all have similar stack traces for the error itself:
In all cases
So far, I've seen three different ways the pipe can get deleted. In all three I'm guessing that the zmq_disconnect call (above) gets interrupted, which allows the process_pipe_term_ack to get processed (or queued) before the zmq_disconnect/term_endpoint.

Conclusions
I believe the above shows that the error is potentially far more common, and more dangerous, than we've been thinking. That means it is no longer an edge case, but a significant bug in the library, and hopefully will get fixed. What prevents this bug from causing more problems is:

Related Issue
Note that #3152 is another instance of this problem -- the root cause is the same (term_endpoint being called after pipe_t has been deleted), but in this case it actually results in a SEGV.
P.S. To reproduce this problem requires building an instrumented version of libzmq using Address Sanitizer. I'm happy to help anyone who cares to do that (on Linux anyway) -- at a minimum something like the following needs to be added to CMakeLists.txt:
The application also needs to be built with ASAN, and reference the ASAN-instrumented build of libzmq. |
A couple of quick notes:
Hello @WallStProg .
with:
and retry your tests. Greetings. |
Thanks @bjovke ! Giving that a try now, on top of 4.2.5. FWIW, posted the full function w/patch here: https://gist.github.com/WallStProg/1e66727aa6baa7c7885ce524aa53874e |
@WallStProg Just one additional comment. |
Thanks again @bjovke ! FWIW, I've been running our torture test for about five hours now and so far it's all clean, so that's encouraging. (It usually takes about ten minutes to trigger the UAF). I would love to understand better what's going on here, and why this patch works -- if you can explain a bit I'd be very grateful. |
You have properly concluded that the free and use after free happen in the same thread. They even happen in the same function during one execution - The pipe deletes (frees) itself in
What is missing is to remove the deleted pipe from
So, the most logical solution would be to remove the pipe from
The problem with the code I sent you is that at first sight it does not look very optimal - there could be many iterations through
One of the solutions would be to have a
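The idea of dropping a pipe from the endpoint table when it terminates can be sketched as follows. This is a simplified model under stated assumptions, not the actual patch: `toy_socket_t`, `add_endpoint`, `demo_fixed_disconnect` and the address string are hypothetical, the table is assumed to be a `std::multimap` keyed by endpoint address, and the pipes are stack objects so nothing is really freed.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical simplified pipe; the real pipe_t is far richer.
struct pipe_t {
    bool terminated = false;
};

class toy_socket_t {
  public:
    void add_endpoint(const std::string &addr, pipe_t *pipe) {
        _endpoints.insert(std::make_pair(addr, pipe));
    }

    // Called when a pipe has finished terminating, before it goes away.
    // Drops every table entry that still points at that pipe.
    void pipe_terminated(pipe_t *pipe) {
        for (auto it = _endpoints.begin(); it != _endpoints.end();) {
            if (it->second == pipe)
                it = _endpoints.erase(it); // erase returns the next iterator
            else
                ++it;
        }
    }

    // Terminates every pipe still registered for addr and
    // returns how many pipes were visited.
    std::size_t term_endpoint(const std::string &addr) {
        auto range = _endpoints.equal_range(addr);
        std::size_t visited = 0;
        for (auto it = range.first; it != range.second; ++it) {
            it->second->terminated = true; // safe: stale entries were removed
            ++visited;
        }
        _endpoints.erase(range.first, range.second);
        return visited;
    }

  private:
    std::multimap<std::string, pipe_t *> _endpoints;
};

// Two pipes share one address; the first terminates before the disconnect.
std::size_t demo_fixed_disconnect() {
    toy_socket_t s;
    pipe_t a, b;
    const std::string addr = "tcp://127.0.0.1:5555"; // illustrative address
    s.add_endpoint(addr, &a);
    s.add_endpoint(addr, &b);
    s.pipe_terminated(&a);
    return s.term_endpoint(addr); // only b is still registered
}
```

The removal here is a linear scan over the whole table, which is the cost concern raised above; a reverse lookup from pipe to table entry would avoid the scan at the price of extra bookkeeping.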
Thanks @bjovke for your thoughtful, and detailed, reply. I need to understand the code (and especially the shutdown sequence) better, but your explanation is a big help. Also, ran clean with your patch for ~ 7 hours, then reverted the patch and got a UAF in three minutes, so I'd say the patch is looking good. Will continue to test and report back. |
@WallStProg Great! |
…t(). Issue zeromq#3245. Solution: When pipe_t is freed (terminated) remove it from _endpoints member of zmq::socket_base_t.
…t(). Issue zeromq#3245. Solution: When pipe_t is freed (terminated) remove it from _endpoints member of zmq::socket_base_t. Resolves issue zeromq#3245.
Hi @bjovke -- if you have time to answer a few questions, I'd appreciate it very much!
Not every time, but yes it does happen that way sometimes.
That's the part that's been confusing me -- it seems that a pipe represents either a ZMQ socket or an endpoint, and in the latter case the endpoint address could be stored in the From a data-modelling perspective there would be a many-to-many relationship between sockets and endpoints. In other words, each (ZMQ) socket can connect to many endpoints, and each endpoint can be connected to many (ZMQ) sockets.
That gets to another question I have, which is what is the cardinality of the relationship between At least for TCP, a single endpoint should represent a single OS-level socket/fd. Or am I missing something here?
Indeed -- I'd love to hear from the project maintainers on this. |
Which is exactly what your PR does, so I guess that part at least is correct. |
Problem: Use of pipe_t after free in zmq::socket_base_t::term_endpoint(). Issue #3245.
@WallStProg In this case We need this info inside Have you done some tests with the latest master branch? |
@bjovke -- I've tested with your PR, and after an hour-and-a-half got no UAFs. (I get a UAF in less than ten minutes with 4.2.5). I'm continuing to test and will post back here with more results, but it certainly looks good so far. Thanks!

And thanks also for the explanation of some of the internal workings of the library. While there's lots of documentation that discusses ZeroMQ from the outside, there is precious little that talks about how it works on the inside. Very helpful and much appreciated!
@WallStProg Nice to hear that it works now. |
Hi @bjovke -- I haven't tested the linger issue specifically with the PR, but it's a good idea to give it a try, and I will do that and report back. The linger issue is already being tracked separately as #3252, and we'll start banging on that now that the UAF appears to be resolved. (The UAF was higher priority because it could conceivably cause a crash, while the linger issue is at least avoidable by setting it at socket create time). Can't thank you enough for all your help! |
Fixed by #3285 |
Issue description
use-after-free reported in zmq::pipe_t::terminate
Environment
What's the actual result? (include assertion message & call stack if applicable)
Address Sanitizer reports use-after-free in zmq::pipe_t::terminate: