Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use-after-free reported in zmq::pipe_t::terminate #3245

Closed
WallStProg opened this issue Sep 7, 2018 · 29 comments
Closed

use-after-free reported in zmq::pipe_t::terminate #3245

WallStProg opened this issue Sep 7, 2018 · 29 comments

Comments

@WallStProg
Copy link
Contributor

Issue description

use-after-free reported in zmq::pipe_t::terminate

Environment

  • libzmq version (commit hash if unreleased): d175819
  • OS: RH 6.5

What's the actual result? (include assertion message & call stack if applicable)

Address Sanitizer reports use-after-free in zmq::pipe_t::terminate:

==tpsdaemon==18932==ERROR: AddressSanitizer: heap-use-after-free on address 0x6100000184dc at pc 0x7f6a09d2b159 bp 0x7f6a081618d0 sp 0x7f6a081618c8
WRITE of size 1 at 0x6100000184dc thread T3
    #0 0x7f6a09d2b158 in zmq::pipe_t::terminate(bool) /home/btorpey/work/libzmq/4.2.3/src/pipe.cpp:393:11
    #1 0x7f6a09d774d7 in zmq::socket_base_t::term_endpoint(char const*) /home/btorpey/work/libzmq/4.2.3/src/socket_base.cpp:1086:32
    #2 0x7f6a09d7a3af in zmq::socket_base_t::process_term_endpoint(std::string*) /home/btorpey/work/libzmq/4.2.3/src/socket_base.cpp:1419:5
    #3 0x7f6a09d08799 in zmq::object_t::process_command(zmq::command_t&) /home/btorpey/work/libzmq/4.2.3/src/object.cpp:138:9
    #4 0x7f6a09d6ce1a in zmq::socket_base_t::process_commands(int, bool) /home/btorpey/work/libzmq/4.2.3/src/socket_base.cpp:1367:26
    #5 0x7f6a09d7ab72 in zmq::socket_base_t::in_event() /home/btorpey/work/libzmq/4.2.3/src/socket_base.cpp:1515:5
    #6 0x7f6a09cd1ce4 in zmq::epoll_t::loop() /home/btorpey/work/libzmq/4.2.3/src/epoll.cpp:188:29
    #7 0x7f6a09cd1264 in zmq::epoll_t::worker_routine(void*) /home/btorpey/work/libzmq/4.2.3/src/epoll.cpp:203:24
    #8 0x7f6a09dbd938 in thread_routine(void*) /home/btorpey/work/libzmq/4.2.3/src/thread.cpp:106:9
    #9 0x3b5ee079d0 in start_thread (/lib64/libpthread.so.0+0x3b5ee079d0)
    #10 0x3b5eae8b6c in clone (/lib64/libc.so.6+0x3b5eae8b6c)
@WallStProg
Copy link
Contributor Author

This problem is being reported intermittently in the regression tests we run using ASAN.

Interestingly, the full stack trace identifies the reaper thread as both the culprit and the victim -- in other words, this does not appear to be a thread synchronization issue, since the same thread is both freeing the memory and using it after the free.

Curious if anyone has encountered this before? And maybe fixed it in a later version?

TIA for any help!

@WallStProg
Copy link
Contributor Author

@WallStProg
Copy link
Contributor Author

FWIW, the initial free is in zmq::pipe_t::process_pipe_term_ack, while the use-after-free is in zmq::socket_base_t::process_term_endpoint.

@bluca
Copy link
Member

bluca commented Sep 7, 2018

There's a Travis job with asan but I've never seen that, do you have a test case for it?

@WallStProg
Copy link
Contributor Author

WallStProg commented Sep 8, 2018

Afraid not -- libzmq is at the bottom of the stack here, so a repro with pure libzmq is not available. The problem is also intermittent (the test suite contains dozens of tests, runs for 7 hours, 3x per day and these are the only four instances of the problem over the past two weeks).

OTOH, the stack traces are pretty much pure libzmq, and are identical in all four cases. It's also not a threading issue since both the free and use-after-free occur on the reaper thread.

I'm guessing here that process_term_endpoint generally occurs before process_pipe_term_ack, and that works fine, but not the other way 'round?

@WallStProg
Copy link
Contributor Author

I'm curious why there has been no response to this. Use-after-free is clearly a serious issue, so I would hope that the project maintainers take it seriously.

Is the lack of response because I'm unable to provide a repro? Are people simply busy?

What do I need to do to help move this along?

Thanks!

@sigiesec
Copy link
Member

Hi @WallStProg!

It is very difficult to do anything about this without a test case reproducing it. We don't even know which socket types, transports and socket options are used. However, even with this information, an executable test case would still be more or less required to investigate this further.

Since you are using an older version, could you try updating to the latest master, and see if the problem persists?

@WallStProg
Copy link
Contributor Author

Hi @sigiesec:

Thanks for the thoughtful reply.

The good news is that ZeroMQ works as expected 99+% of the time -- the bad news is that in order to run our production systems on ZeroMQ that percentage needs to be as close to 100% as we can make it. So, we "torture test" the software, and in the process we uncover "Heisenbugs" like this one. And when we do, we post here and ask for help.

I understand that having a consistent reproduction is essential for resolving these problems. Unfortunately, I can't even get a consistent repro in my own environment, much less in a "pure" libzmq environment.

So what we're hoping for is some clue that will help us narrow things down to where we might be able to create a reliable reproduction. Maybe this looks similar to another problem that was reported by someone else. Or, maybe someone who is familiar with the shutdown sequence in ZeroMQ (and it's always the shutdown sequence that bites you) can give us an idea what to look for when the problem happens again.

That's all we're looking for, and any help will be much appreciated! After all, we're just trying to make ZeroMQ better, but we realize that others know much more than we do about it.

Best Regards,

Bill Torpey

P.S. If it helps, these are PUB/SUB sockets over TCP. Socket options are pretty vanilla, but one thing that may be relevant is that we set ZMQ_LINGER to zero immediately prior to closing the socket(s) -- and in other tests we've observed a race condition where that setting doesn't necessarily take effect and so can cause zmq_ctx_term to hang. We have been able to repro that using pure libzmq, and are putting together a writeup that we will post here as soon as it's ready.

@bluca
Copy link
Member

bluca commented Sep 14, 2018

P.S. If it helps, these are PUB/SUB sockets over TCP. Socket options are pretty vanilla, but one thing that may be relevant is that we set ZMQ_LINGER to zero immediately prior to closing the socket(s) -- and in other tests we've observed a race condition where that setting doesn't necessarily take effect and so can cause zmq_ctx_term to hang. We have been able to repro that using pure libzmq, and are putting together a writeup that we will post here as soon as it's ready.

That was fixed by #2910

@WallStProg
Copy link
Contributor Author

That was fixed by #2910

Thanks for the suggestion! Unfortunately, in our tests, we see the same behavior w/4.2.5, which suggests that the problem is not the one fixed by #2910, but something else, possibly related to when process_commands runs on the socket.

Again, we're putting together a repro and writeup that we can post here soon.

@WallStProg
Copy link
Contributor Author

Please see #3252 for a pure libzmq example of the linger problem mentioned above. (Note also that we discovered the linger problem while troubleshooting #3186, which turned out be related to process_commands not being called on the application thread).

As I mentioned above, my experience working with several different messaging middlewares is that it's the shutdown sequence where some of the gnarliest problems tend to happen, and libzmq is no different in that respect.

I suspect that the use-after-free problem is related to the linger problem in that both appear to be caused by the asynchronous nature of the communication between the application thread and the IO thread (and the reaper thread), which creates opportunities for race conditions. We're trying to understand that better so we can fix, avoid or work around some of the non-deterministic behavior that we're seeing in our tests.

But it's a tall order. We weren't expecting to have to do a "deep dive" into the ZeroMQ code (naive, perhaps), and the code comments are terse to say the least.

So, any help at all will be greatly appreciated! Thanks!

@teterycz
Copy link

Hi @sigiesec,

I was running a test on the same codebase as @WallStProg and ASAN caught a race condition in the reaper thread where process_pipe_term_ack() is being called before term_endpoint() completes causing a heap-use-after-free.

In the "freed by thread T3 here" section process_term_endpoint calls process_command and picks up a pipe_term_ack.

WRITE of size 1 at 0x610000019cdc thread T3
    #0 0x7f9aa7932dc8 in zmq::pipe_t::terminate(bool) /home/mteterycz/work/libzmq/nyfix/src/pipe.cpp:393:11
    #1 0x7f9aa797f317 in zmq::socket_base_t::term_endpoint(char const*) /home/mteterycz/work/libzmq/nyfix/src/socket_base.cpp:1086:32
    #2 0x7f9aa79821ef in zmq::socket_base_t::process_term_endpoint(std::string*) /home/mteterycz/work/libzmq/nyfix/src/socket_base.cpp:1419:5
    #3 0x7f9aa79102e9 in zmq::object_t::process_command(zmq::command_t&) /home/mteterycz/work/libzmq/nyfix/src/object.cpp:138:9
    #4 0x7f9aa7974c5a in zmq::socket_base_t::process_commands(int, bool) /home/mteterycz/work/libzmq/nyfix/src/socket_base.cpp:1367:26
    #5 0x7f9aa79829b2 in zmq::socket_base_t::in_event() /home/mteterycz/work/libzmq/nyfix/src/socket_base.cpp:1515:5
    #6 0x7f9aa78d95d4 in zmq::epoll_t::loop() /home/mteterycz/work/libzmq/nyfix/src/epoll.cpp:188:29
    #7 0x7f9aa78d8b54 in zmq::epoll_t::worker_routine(void*) /home/mteterycz/work/libzmq/nyfix/src/epoll.cpp:203:24
    #8 0x7f9aa79c5b08 in thread_routine(void*) /home/mteterycz/work/libzmq/nyfix/src/thread.cpp:106:9
    #9 0x7f9aad9e4e24 in start_thread (/lib64/libpthread.so.0+0x7e24)
    #10 0x7f9aac82cbac in __clone (/lib64/libc.so.6+0xfebac)

0x610000019cdc is located 156 bytes inside of 192-byte region [0x610000019c40,0x610000019d00)
freed by thread T3 here:
    #0 0x52b1f0 in operator delete(void*) /shared/buildtest/clang/7.0.0/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:167
    #1 0x7f9aa792eb39 in zmq::pipe_t::~pipe_t() /home/mteterycz/work/libzmq/nyfix/src/pipe.cpp:101:1
    #2 0x7f9aa79327b0 in zmq::pipe_t::process_pipe_term_ack() /home/mteterycz/work/libzmq/nyfix/src/pipe.cpp:377:5
    #3 0x7f9aa790fed6 in zmq::object_t::process_command(zmq::command_t&) /home/mteterycz/work/libzmq/nyfix/src/object.cpp:118:9
    #4 0x7f9aa7974c5a in zmq::socket_base_t::process_commands(int, bool) /home/mteterycz/work/libzmq/nyfix/src/socket_base.cpp:1367:26
    #5 0x7f9aa797e023 in zmq::socket_base_t::term_endpoint(char const*) /home/mteterycz/work/libzmq/nyfix/src/socket_base.cpp:1017:14
    #6 0x7f9aa79821ef in zmq::socket_base_t::process_term_endpoint(std::string*) /home/mteterycz/work/libzmq/nyfix/src/socket_base.cpp:1419:5
    #7 0x7f9aa79102e9 in zmq::object_t::process_command(zmq::command_t&) /home/mteterycz/work/libzmq/nyfix/src/object.cpp:138:9
    #8 0x7f9aa7974c5a in zmq::socket_base_t::process_commands(int, bool) /home/mteterycz/work/libzmq/nyfix/src/socket_base.cpp:1367:26
    #9 0x7f9aa79829b2 in zmq::socket_base_t::in_event() /home/mteterycz/work/libzmq/nyfix/src/socket_base.cpp:1515:5
    #10 0x7f9aa78d95d4 in zmq::epoll_t::loop() /home/mteterycz/work/libzmq/nyfix/src/epoll.cpp:188:29
    #11 0x7f9aa78d8b54 in zmq::epoll_t::worker_routine(void*) /home/mteterycz/work/libzmq/nyfix/src/epoll.cpp:203:24
    #12 0x7f9aa79c5b08 in thread_routine(void*) /home/mteterycz/work/libzmq/nyfix/src/thread.cpp:106:9
    #13 0x7f9aad9e4e24 in start_thread (/lib64/libpthread.so.0+0x7e24)
    #14 0x7f9aac82cbac in __clone (/lib64/libc.so.6+0xfebac)

This is with libzmq 4.2.3

Full ASAN output:
asan.listenTest.4582.txt

WallStProg added a commit to WallStProg/zmqtests that referenced this issue Oct 18, 2018
- add socket monitoring
- rearrange order of socket close (close pubs before subs)
WallStProg added a commit to WallStProg/zmqtests that referenced this issue Oct 18, 2018
- add socket monitoring
- rearrange order of socket close (close pubs before subs)
WallStProg added a commit to WallStProg/zmqtests that referenced this issue Oct 18, 2018
* 'master' of https://github.com/WallStProg/zmqtests:
  attempt to repro zeromq/libzmq#3245 (comment) - add socket monitoring - rearrange order of socket close (close pubs before subs)
@WallStProg
Copy link
Contributor Author

Turns out that this issue can be repro'd w/pure libzmq using the code in 3186 directory from zmqtests -- it just needs more peers running.

The use-after-free occurs when a number of processes connect and disconnect from each other, and when the system is busy enough that some calls (e.g., to zmq_disconnect) get preempted and resumed at a later time.

Reproducing the problem

To reproduce the problem, do the following:

  1. Edit run.sh and increase the number of peer processes (I used 99):
...
# start the peers that come and go
for i in `seq 1 99`; do
   rm -f peer${i}.out
   nohup ./repeat ./peer $@ > peer${i}.out 2>&1 &
done
  1. Run it like so:
./run.sh -seconds 2 -poll [-monitor]

The -monitor switch is not actually needed, but it seems that enabling socket monitoring causes the problem to happen more frequently.

Observed Errors

So far, I've seen three different causes for the error, but they all have similar stack traces for the error itself:

WRITE of size 1 at 0x61000006a2dc thread T0
    #0 0x7f2a6645e8e8 in zmq::pipe_t::terminate(bool) /home/btorpey/work/libzmq/4.2.3/src/pipe.cpp:393:11
    #1 0x7f2a664aae37 in zmq::socket_base_t::term_endpoint(char const*) /home/btorpey/work/libzmq/4.2.3/src/socket_base.cpp:1086:32
    #2 0x7f2a66517353 in zmq_disconnect /home/btorpey/work/libzmq/4.2.3/src/zmq.cpp:366:15
    #3 0x53bef8 in main /home/btorpey/work/zmqtests/3186/peer.cpp:169:16
    #4 0x33bfe1ed1c in __libc_start_main (/lib64/libc.so.6+0x33bfe1ed1c)
    #5 0x4254c8 in _start (/shared/work/zmqtests/3186/peer+0x4254c8)

In all cases term_endpoint is accessing a pipe_t structure after it has been deleted.

So far, I've seen three different ways the pipe can get deleted. In all three I'm guessing that the zmq_disconnect call (above) gets interrupted which allows the process_pipe_term_ack to get processed (or queued) before the zmq_disconnect/term_endpoint.

  1. This is similar to Marek's example, where process_commands is being called directly from term_endpoint, and by the time that happens the process_pipe_term_ack is already queued up on the socket, so term_endpoint's calling process_commands effectively makes the process_pipe_term_ack "jump the queue".
0x61000006a2dc is located 156 bytes inside of 192-byte region [0x61000006a240,0x61000006a300)
freed by thread T0 here:
    #0 0x535010 in operator delete(void*) /shared/buildtest/clang/7.0.0/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:167
    #1 0x7f2a6645a659 in zmq::pipe_t::~pipe_t() /home/btorpey/work/libzmq/4.2.3/src/pipe.cpp:101:1
    #2 0x7f2a6645e2d0 in zmq::pipe_t::process_pipe_term_ack() /home/btorpey/work/libzmq/4.2.3/src/pipe.cpp:377:5
    #3 0x7f2a6643b9f6 in zmq::object_t::process_command(zmq::command_t&) /home/btorpey/work/libzmq/4.2.3/src/object.cpp:118:9
    #4 0x7f2a664a077a in zmq::socket_base_t::process_commands(int, bool) /home/btorpey/work/libzmq/4.2.3/src/socket_base.cpp:1367:26
    #5 0x7f2a664a9b43 in zmq::socket_base_t::term_endpoint(char const*) /home/btorpey/work/libzmq/4.2.3/src/socket_base.cpp:1017:14
    #6 0x7f2a66517353 in zmq_disconnect /home/btorpey/work/libzmq/4.2.3/src/zmq.cpp:366:15
    #7 0x53bef8 in main /home/btorpey/work/zmqtests/3186/peer.cpp:169:16
    #8 0x33bfe1ed1c in __libc_start_main (/lib64/libc.so.6+0x33bfe1ed1c)
    #9 0x4254c8 in _start (/shared/work/zmqtests/3186/peer+0x4254c8)
  1. In this example, it's the application's call to zmq_poll that triggered the process_pipe_term_ack
0x61000003c2dc is located 156 bytes inside of 192-byte region [0x61000003c240,0x61000003c300)
freed by thread T0 here:
    #0 0x535010 in operator delete(void*) /shared/buildtest/clang/7.0.0/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:167
    #1 0x7f0f1c1af659 in zmq::pipe_t::~pipe_t() /home/btorpey/work/libzmq/4.2.3/src/pipe.cpp:101:1
    #2 0x7f0f1c1b32d0 in zmq::pipe_t::process_pipe_term_ack() /home/btorpey/work/libzmq/4.2.3/src/pipe.cpp:377:5
    #3 0x7f0f1c1909f6 in zmq::object_t::process_command(zmq::command_t&) /home/btorpey/work/libzmq/4.2.3/src/object.cpp:118:9
    #4 0x7f0f1c1f577a in zmq::socket_base_t::process_commands(int, bool) /home/btorpey/work/libzmq/4.2.3/src/socket_base.cpp:1367:26
    #5 0x7f0f1c1f48d9 in zmq::socket_base_t::getsockopt(int, void*, unsigned long*) /home/btorpey/work/libzmq/4.2.3/src/socket_base.cpp:414:18
    #6 0x7f0f1c2796b4 in zmq::socket_poller_t::wait(zmq::socket_poller_t::event_t*, int, long) /home/btorpey/work/libzmq/4.2.3/src/socket_poller.cpp:473:33
    #7 0x7f0f1c2707e2 in zmq_poller_wait_all /home/btorpey/work/libzmq/4.2.3/src/zmq.cpp:1371:48
    #8 0x7f0f1c271ff7 in zmq_poller_poll(zmq_pollitem_t*, int, long) /home/btorpey/work/libzmq/4.2.3/src/zmq.cpp:813:10
    #9 0x7f0f1c26f930 in zmq_poll /home/btorpey/work/libzmq/4.2.3/src/zmq.cpp:861:12
    #10 0x53af57 in main /home/btorpey/work/zmqtests/3186/peer.cpp:127:7
    #11 0x33bfe1ed1c in __libc_start_main (/lib64/libc.so.6+0x33bfe1ed1c)
    #12 0x4254c8 in _start (/shared/work/zmqtests/3186/peer+0x4254c8)
  1. And in this example it's a subsequent call to zmq_connect that triggers the process_pipe_term_ack.
0x6100000018dc is located 156 bytes inside of 192-byte region [0x610000001840,0x610000001900)
freed by thread T0 here:
    #0 0x535010 in operator delete(void*) /shared/buildtest/clang/7.0.0/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:167
    #1 0x7f9b20bbb659 in zmq::pipe_t::~pipe_t() /home/btorpey/work/libzmq/4.2.3/src/pipe.cpp:101:1
    #2 0x7f9b20bbf2d0 in zmq::pipe_t::process_pipe_term_ack() /home/btorpey/work/libzmq/4.2.3/src/pipe.cpp:377:5
    #3 0x7f9b20b9c9f6 in zmq::object_t::process_command(zmq::command_t&) /home/btorpey/work/libzmq/4.2.3/src/object.cpp:118:9
    #4 0x7f9b20c0177a in zmq::socket_base_t::process_commands(int, bool) /home/btorpey/work/libzmq/4.2.3/src/socket_base.cpp:1367:26
    #5 0x7f9b20c05314 in zmq::socket_base_t::connect(char const*) /home/btorpey/work/libzmq/4.2.3/src/socket_base.cpp:698:14
    #6 0x7f9b20c78173 in zmq_connect /home/btorpey/work/libzmq/4.2.3/src/zmq.cpp:345:21
    #7 0x53bb5d in main /home/btorpey/work/zmqtests/3186/peer.cpp:157:16
    #8 0x33bfe1ed1c in __libc_start_main (/lib64/libc.so.6+0x33bfe1ed1c)
    #9 0x4254c8 in _start (/shared/work/zmqtests/3186/peer+0x4254c8)

Conclusions

I believe the above shows that the error is potentially far more common, and more dangerous, than we've been thinking. That means it is no longer an edge case, but a significant bug in the library, and hopefully will get fixed.

What prevents this bug from causing more problems is:

  1. In some cases the application is terminating anyway, and so the term_endpoint command returns early because of its check for ctx_terminated. (Note that this is also racy -- ctx_terminated can change while term_endpoint is running).

  2. It is likely that the memory for the deleted pipe_t structure has not been reallocated or written to, so term_endpoint can usually access the memory with no ill effects. In most cases, the error is only identified by a tool like Address Sanitizer that can detect use-after-free errors.

  3. Last but not least, the problem only occurs under a specific set of conditions -- i.e., a number of processes connect to and disconnect from each other at the same time, and the system is busy enough that some of those processes get preempted in the middle of the racy code.

Related Issue

Note that #3152 is another instance of this problem -- the root cause is the same (term_endpoint being called after pipe_t has been deleted), but in this case it actually results in a SEGV.

@WallStProg
Copy link
Contributor Author

P.S. To reproduce this problem requires building an instrumented version of libzmq using Address Sanitizer. I'm happy to help anyone who cares to do that (on Linux anyway) -- at a minimum something like the following needs to be added to CMakeLists.txt:

# use address sanitizer? (same params for gcc/clang)
if("$ENV{BUILD_TYPE}" STREQUAL "asan")
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fsanitize=address -fsanitize-address-use-after-scope")
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=address -fsanitize-address-use-after-scope")
  set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -fsanitize=address -fsanitize-address-use-after-scope")
  set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fsanitize=address -fsanitize-address-use-after-scope")
endif()

The application also needs to be built with ASAN, and reference the ASAN-instrumented build of libzmq.

@WallStProg
Copy link
Contributor Author

WallStProg commented Oct 27, 2018

A couple of quick notes:

  • I can confirm that this problem (use-after-free) exists in 4.2.5 and master, in addition to 4.2.3.

  • Issue intermittent memory leak for req/rep send/recv. #2602 looks like a dead ringer for this one, but since the use-after-free still occurs on both 4.2.5 and master, the fix for that issue would seem to not cover all cases.

WallStProg added a commit to WallStProg/zmqtests that referenced this issue Oct 27, 2018
@bjovke
Copy link
Contributor

bjovke commented Oct 29, 2018

Hello @WallStProg .
Can you replace this code in zmq::socket_base_t::term_endpoint (socket_base.cpp):

    for (endpoints_t::iterator it = range.first; it != range.second; ++it) {
        //  If we have an associated pipe, terminate it.
        if (it->second.second != NULL)
            it->second.second->terminate (false);
        term_child (it->second.first);
    }

with:

std::vector<pipe_t *> pipes_to_terminate(std::distance(range.first, range.second));

for (endpoints_t::iterator it = range.first; it != range.second; ++it) {
    //  If we have an associated pipe, terminate it.
    if (it->second.second != NULL) {
        pipes_to_terminate.push_back(it->second.second);
    }
}
for (size_t i = 0, found_pipes = 0, pipes_size = pipes.size(); i < pipes_size; ++i) {
    if (std::find(pipes_to_terminate.begin(), pipes_to_terminate.end(), pipes[i]) != pipes_to_terminate.end()) {
        pipes[i]->terminate(false);
        ++found_pipes;
        if (found_pipes == pipes_to_terminate.size()) {
            break;
        }
    }
}
for (endpoints_t::iterator it = range.first; it != range.second; ++it) {
    term_child(it->second.first);
}

and retry your tests.
I know, this can be done in a better way, formatting is not OK and so on....
I would just like to know if I'm on the right track or not.

Greetings.

@WallStProg
Copy link
Contributor Author

Thanks @bjovke ! Giving that a try now, on top of 4.2.5.

FWIW, posted the full function w/patch here: https://gist.github.com/WallStProg/1e66727aa6baa7c7885ce524aa53874e

@bjovke
Copy link
Contributor

bjovke commented Oct 29, 2018

@WallStProg Just one additional comment. std::distance and std::find require C++11. I was lazy, sorry. If you have any issues they can be each easily replaced with for loop.
But this is just an idea to try. The final commits (if any) will be different.

@WallStProg
Copy link
Contributor Author

Thanks again @bjovke !

FWIW, I've been running our torture test for about five hours now and so far it's all clean, so that's encouraging. (It usually takes about ten minutes to trigger the UAF).

I would love to understand better what's going on here, and why this patch works -- if you can explain a bit I'd be very grateful.

@bjovke
Copy link
Contributor

bjovke commented Oct 29, 2018

int rc = process_commands (0, false); line in zmq::socket_base_t::term_endpoint() might terminate the pipe (and free it).
This can be clearly seen in your stack trace:

    #1 0x7f9b20bbb659 in zmq::pipe_t::~pipe_t() /home/btorpey/work/libzmq/4.2.3/src/pipe.cpp:101:1
    #2 0x7f9b20bbf2d0 in zmq::pipe_t::process_pipe_term_ack() /home/btorpey/work/libzmq/4.2.3/src/pipe.cpp:377:5
    #3 0x7f9b20b9c9f6 in zmq::object_t::process_command(zmq::command_t&) /home/btorpey/work/libzmq/4.2.3/src/object.cpp:118:9

You have properly concluded that the free and use after free happen in the same thread. They even happen in the same function during one execution - zmq::socket_base_t::term_endpoint().

The pipe deletes (frees) itself in zmq::pipe_t::process_pipe_term_ack (). You will notice that in this function there's a call sink->pipe_terminated (this); which leads to calling zmq::socket_base_t::pipe_terminated (). There the pointer to pipe which was just deleted is properly removed from array pipes member of zmq::socket_base_t.
But, the pointer to deleted pipe remains in multimap endpoints member of zmq::socket_base_t and it's being used just a little bit later here:

    for (endpoints_t::iterator it = range.first; it != range.second; ++it) {
        //  If we have an associated pipe, terminate it.
        if (it->second.second != NULL)
            it->second.second->terminate (false);
        term_child (it->second.first);
    }

it->second.second is the pointer to freed pipe_t and points to freed memory. it->second.second->terminate (false); is where the use after free happens.

What is missing is to remove the deleted pipe from endpoints of socket.

So, the most logical solution would be to remove the pipe from endpoints in zmq::socket_base_t::pipe_terminated (), set it to NULL actually.
But, zmq::socket_base_t::pipe_terminated () doesn't know to which endpoint this pipe belongs to so it would have to iterate through all socket's endpoints, which can be expensive.
That's why I've put the code into zmq::socket_base_t::term_endpoint(). This function knows to which endpoint pipe belongs because it receives the endpoint string as a parameter and here we can narrow the amount of endpoints to be searched (range).

The problem with the code I sent you is that at first sight it looks not very optimal - there could be many iterations through pipes array and endpoints multimap. Compared to this, for example, removing a pipe from pipes has O(1) complexity due to optimized array.
These iterations are done only to find the pipe_t entry in endpoints multimap.

One of the solutions would be to have a std:map<pipe_t, std::string> to inexpensively find pipe's associated endpoint name.
So this code I sent you might have performance impact in case of large amount of sockets and frequent disconnects.
Also, there might be other side effects so it would be nice if more people join the discussion ( @sigiesec , @teterycz , @bluca ).

@WallStProg
Copy link
Contributor Author

Thanks @bjovke for your thoughtful, and detailed, reply. I need to understand the code (and especially the shutdown sequence) better, but your explanation is a big help.

Also, ran clean with your patch for ~ 7 hours, then reverted the patch and got a UAF in three minutes, so I'd say the patch is looking good. Will continue to test and report back.

WallStProg added a commit to WallStProg/zmqtests that referenced this issue Oct 30, 2018
@bjovke
Copy link
Contributor

bjovke commented Oct 30, 2018

@WallStProg Great!
Now what is left to do is to make this patch a little better. Mostly regarding performance.
In the meantime I guess you can use this patch if you wish.

bjovke pushed a commit to bjovke/libzmq that referenced this issue Oct 30, 2018
…t(). Issue zeromq#3245.

Solution: When pipe_t is freed (terminated) remove it from _endpoints member of zmq::socket_base_t.
bjovke pushed a commit to bjovke/libzmq that referenced this issue Oct 30, 2018
…t(). Issue zeromq#3245.

Solution: When pipe_t is freed (terminated) remove it from _endpoints member of zmq::socket_base_t. Resolves issue zeromq#3245.
bjovke pushed a commit to bjovke/libzmq that referenced this issue Oct 30, 2018
…t(). Issue zeromq#3245.

Solution: When pipe_t is freed (terminated) remove it from _endpoints member of zmq::socket_base_t. Resolves issue zeromq#3245.
@WallStProg
Copy link
Contributor Author

Hi @bjovke -- if you have time to answer a few questions, I'd appreciate it very much!

You have properly concluded that the free and use after free happen in the same thread. They even happen in the same function during one execution - zmq::socket_base_t::term_endpoint().

Not every time, but yes it does happen that way sometimes.

...
But, zmq::socket_base_t::pipe_terminated () doesn't know to which endpoint this pipe belongs to so it would have to iterate through all socket's endpoints, which can be expensive.

That's the part that's been confusing me -- it seems that a pipe represents either a ZMQ socket or an endpoint, and in the latter case the endpoint address could be stored in the pipe_t structure itself, which would make this easy.

From a data-modelling perspective there would be a many-to-many relationship between sockets and endpoints. In other words, each (ZMQ) socket can connect to many endpoints, and each endpoint can be connected to many (ZMQ) sockets.

One of the solutions would be to have a std:map<pipe_t, std::string> to inexpensively find pipe's associated endpoint name.

That gets to another question I have, which is what is the cardinality of the relationship between pipe_ts and endpoints?

At least for TCP, a single endpoint should represent a single OS-level socket/fd.

Or am I missing something here?

Also, there might be other side effects so it would be nice if more people join the discussion ( @sigiesec , @teterycz , @bluca ).

Indeed -- I'd love to hear from the project maintainers on this.

@WallStProg
Copy link
Contributor Author

That's the part that's been confusing me -- it seems that a pipe represents either a ZMQ socket or an endpoint, and in the latter case the endpoint address could be stored in the pipe_t structure itself, which would make this easy.

Which is exactly what your PR does, so I guess that part at least is correct.

bluca added a commit that referenced this issue Oct 30, 2018
Problem: Use of pipe_t after free in zmq::socket_base_t::term_endpoint(). Issue #3245.
@bjovke
Copy link
Contributor

bjovke commented Oct 30, 2018

@WallStProg
The pipe_t is an interface to the two way thread safe message queue. For each socket's endpoint there's one pipe_t so socket can send/receive data for that endpoint. For network sockets on the other side of this queue is an I/O thread which does all the actual network communication.

In this case pipe_t is used by a socket to communicate with endpoint. But generally this might not be the case. It can be used for any kind of communication between two threads (or thread with itself). So pipe_t might not have anything to do with socket's endpoints. In this case endpoint uri stored in pipe_t is an empty string.

We need this info inside pipe_t only to efficiently mark the pipe as NULL in _endpoints multimap inside socket_base_t after it has been freed. This is because _endpoints multimap's key is uri string, not pipe, so searching for specific pipe in it is inefficient.

Have you done some tests with the latest master branch?
The PR is slightly different than the previous fix (but does the same thing) so it's a good idea to try it out.

@WallStProg
Copy link
Contributor Author

@bjovke -- I've tested with your PR, and after an hour-and-a-half got no UAFs. (I get an UAF in less than ten minutes with 4.2.5). I'm continuing to test and will post back here with more results, but it certainly looks good so far. Thanks!

And thanks also for the explanation of some of the internal workings of the library. While there's lots of documentation that discusses ZeroMQ from the outside, there is precious little that talks about how it works on the inside. Very helpful and much appreciated!

@bjovke
Copy link
Contributor

bjovke commented Oct 31, 2018

@WallStProg Nice to hear that it works now.
I'm wondering now what's going on with the ZMQ_LINGER problem? Did this PR help that in any way?
If this is still happening it seems to me that it would be more appropriate to create another issue for that instead of mixing it with "use after free" because these things might not be related.

@WallStProg
Copy link
Contributor Author

Hi @bjovke -- I haven't tested the linger issue specifically with the PR, but it's a good idea to give it a try, and I will do that and report back.

The linger issue is already being tracked separately as #3252, and we'll start banging on that now that the UAF appears to be resolved. (The UAF was higher priority because it could conceivably cause a crash, while the linger issue is at least avoidable by setting it at socket create time).

Can't thank you enough for all your help!

@bluca
Copy link
Member

bluca commented Nov 1, 2018

Fixed by #3285

@bluca bluca closed this as completed Nov 1, 2018
MohammadAlTurany pushed a commit to FairRootGroup/libzmq that referenced this issue Jan 28, 2019
…t(). Issue zeromq#3245.

Solution: When pipe_t is freed (terminated) remove it from _endpoints member of zmq::socket_base_t. Resolves issue zeromq#3245.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants