zmq_disconnect doesn't release resources associated with terminated endpoints #3186
I've been trying to track down this issue for a while, since the memory is not recognized as a leak by memcheck et al. Nevertheless, when running a process for a long time, while other processes repeatedly start and stop (and connect to and disconnect from the long-running process), I see memory grow from a few hundred MB to several GB.

Using massif I was able to generate statistics on memory usage over the life of the program, and from those reports it appears that the culprit is zmq_poller_poll. After taking a quick look at the code, I'm guessing that the issue has to do with socket_poller_t::items being added to but never deleted from.

I'm happy to try to create a reproduction, but before I go down that rabbit hole, I thought I would ask here to see if this sounds reasonable, or if I'm missing something obvious. Thanks!
Can you provide a program that reproduces the issue? zmq_poller_poll is only called when you call zmq_poll from your code. I can't imagine that this happens here, since zmq_poller_poll exits and cleans up after every poll event.

However, in the title you refer to zmq_poller_add, which you would use together with zmq_poller_wait(_all). In that case, if you forget to remove obsolete sockets again with zmq_poller_remove, you will see unconstrained memory growth. Or are you referring to some of the other polling mechanisms that are used internally?
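For reference, a minimal sketch of how zmq_poller_add is meant to be paired with zmq_poller_remove. This assumes libzmq is built with the draft API enabled; the socket handle and timeout here are illustrative:

```c
#define ZMQ_BUILD_DRAFT_API
#include <zmq.h>

// Poll one socket once via the zmq_poller draft API, then unregister it.
void poll_once (void *socket_to_watch)
{
    void *poller = zmq_poller_new ();

    // Register the socket for readability events.
    zmq_poller_add (poller, socket_to_watch, NULL, ZMQ_POLLIN);

    zmq_poller_event_t event;
    if (zmq_poller_wait (poller, &event, 1000) == 0) {
        // event.socket is readable here; receive from it as usual.
    }

    // Sockets that are no longer needed must be removed again, otherwise
    // the poller's internal item list keeps growing.
    zmq_poller_remove (poller, socket_to_watch);
    zmq_poller_destroy (&poller);
}
```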
Thanks Simon! Unfortunately, creating a simple repro is not going to be ... simple. We've implemented a discovery mechanism on a separate channel using zmq_proxy to support point-to-point connections. This creates a fully-connected network where every node's SUB socket connects to every node's PUB socket.

When run in steady state, memory use is constant. But when other processes connect and disconnect rapidly, any long-running process's memory utilization grows and grows. The memory is not leaking per se, since valgrind reports no leaks at shutdown. I was initially suspicious of the push_back into socket_poller_t::items mentioned above.

I'll post back here when I have more information. Thanks again!
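As background, zmq_proxy is typically used for this kind of forwarding channel with an XSUB/XPUB pair. A minimal sketch of such a proxy follows; the endpoints and ports are assumptions for illustration, not taken from the application described above:

```c
#include <zmq.h>

int main (void)
{
    void *ctx = zmq_ctx_new ();

    // Publishers connect their PUB sockets here.
    void *frontend = zmq_socket (ctx, ZMQ_XSUB);
    zmq_bind (frontend, "tcp://*:5570");

    // Subscribers connect their SUB sockets here; zmq_proxy forwards
    // messages (and subscriptions) between the two sides.
    void *backend = zmq_socket (ctx, ZMQ_XPUB);
    zmq_bind (backend, "tcp://*:5571");

    // Blocks until the context is terminated.
    zmq_proxy (frontend, backend, NULL);

    zmq_close (frontend);
    zmq_close (backend);
    zmq_ctx_term (ctx);
    return 0;
}
```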
Unconstrained memory growth is unfortunately somewhat harder to analyze than memory leaks, and I am not familiar with the Linux tooling for this. Under Windows, I would use e.g. Intel Inspector to analyze memory growth. You can checkpoint the current memory during execution and then show the differences between checkpoints, along with the allocation call stacks, just as valgrind would do for a leak.
Hi @sigiesec (et al.): I was finally able to put together a repro using plain old ZeroMQ, without any of the "special" bits from my application. (I guess that's what vacations are for ;-) The code is at https://github.com/WallStProg/zmqtests/tree/master/3186

Anyway, the pure ZeroMQ code exhibits the same unconstrained memory growth as my application, and running valgrind memcheck/massif on it gives similar results. Not sure how helpful this will be for you, as it's all pretty Linux-specific, but it's the best I can do. (FWIW, I'm running CentOS 6.9.) As before, it may just be that I'm doing something stupid, but if so I don't see it. Any suggestions, hints, or tips are greatly appreciated!
I was able to capture memory use with valgrind by terminating the program with SIGTERM, which skips the shutdown code (zmq_close, etc.). This shows a number of "still-reachable" blocks that I suspect may point at the culprit. Specifically, the leaks with a block count of 141 are interesting, since that is exactly equal to the number of connects:
I'm guessing that ZeroMQ is hanging on to some data about the endpoints even after they are disconnected. That memory gets cleaned up when the socket is closed (so it doesn't normally show up as a "leak"), but until then it can grow without bound.
To reproduce:
valgrind output referenced above:
Hi @sigiesec, I am seeing the leak as described by @WallStProg by attaching to the process with Intel VTune Amplifier, which is part of Intel System Studio. You can get a free 3-month license and download a copy from https://software.intel.com/en-us/system-studio/choose-download#technical

Select the "Memory Consumption" analysis type and attach to the process using its PID after the process has started. I've attached some preliminary results below. Please look at the attached leak.xlsx file for additional details.
Additional details from Intel VTune Amplifier
Looks like both VTune and valgrind agree on the top consumers of memory:
Hi @sigiesec, this seems to be related to an open issue dating back to 2014: #1256. One of the comments there mentions pipes not being destroyed but only marked as inactive, eventually being destroyed when the context is destroyed. The same Intel VTune run I posted above shows leaks in pipe.cpp; the attached pipe.xlsx shows the details. My results are based on the test program provided by @WallStProg. Any insight or guidance on this issue would be much appreciated.
A pipe is destroyed when a pipe_term_ack command is received, which is sent by the peer using send_pipe_term_ack, which is called in a number of situations. Unfortunately, I am not very familiar with this part of the libzmq architecture, so I cannot really tell whether there is a situation where send_pipe_term_ack should be called but isn't.

Just one thing to consider: I am not sure at what rate the test connects/disconnects. It might take some time to clear up resources; zmq_disconnect (and also zmq_close) do not work synchronously.
I don't think the description of this issue is accurate. This isn't related to zmq_poller_add at all, as far as I understand it. The function that doesn't behave as you expect is zmq_disconnect, right?
Is there someone you would suggest might be better able to help? (FWIW, I've already posted on the dev list to solicit commercial support, which is very much on the table to help us "get to the finish line").
From the top, massif, VTune, etc. results, as well as from the related issues, it appears that the resources don't get cleaned up no matter how long you wait (in our case, minutes to hours). They only get cleaned up when the socket is closed and/or the context is terminated.
True enough -- that was my best guess at the time. Perhaps a better description would be "zmq_disconnect doesn't release resources associated with terminated endpoints"?
I tried something out today: when just calling zmq_disconnect, as in the test_reqrep_tcp test cases, the resources are not released until the zmq_ctx_term call. When calling something that involves process_commands (e.g. calling zmq_getsockopt for ZMQ_EVENTS), they are released earlier.

I remember that there were some other issues raised in the past that were related to commands not being processed on an unused socket. However, I think in your example there are periodic calls to zmq_* functions that also call process_commands, so this does not seem to be the issue in your case :(
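For illustration, a minimal sketch of the ZMQ_EVENTS technique mentioned above: querying the option forces the socket to process pending commands without sending or receiving anything. The helper name is hypothetical:

```c
#include <stddef.h>
#include <zmq.h>

// Nudge a socket into running its internal command processing
// (which is where terminated pipes get reaped).
static void pump_commands (void *socket)
{
    int events = 0;
    size_t len = sizeof (events);
    zmq_getsockopt (socket, ZMQ_EVENTS, &events, &len);
}
```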
Thanks @sigiesec. I've already tried the ZMQ_EVENTS hack mentioned in #1256 -- no joy. I've got a couple of other (ineffective) work-arounds as well, which I'll commit to the repo once I've cleared up some of the #ifdef mess (probably by making them command-line params).

I'm guessing some kind of race condition, perhaps associated with the endpoint being moved to inactive status, and/or the other side of the connection being dropped while the disconnect is "in flight". At this point, I think my best option is to trace through the zmq_connect/zmq_disconnect code in the debugger, but any hints on what to look for would be welcome! (It's a lot to learn, and complicated by the fact that the "tricky bits" are happening on the IO thread, I think.)
I added a dummy send in the `else if (msg.command == 'D')` block, which seems to force the cleanup:

`else if (msg.command == 'D') ... // send dummy message to PUB socket to force cleanup`
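A hedged sketch of what such a dummy send might look like; the dataPub handle, the command handling, and the message contents are assumptions for illustration, not the actual repo code:

```c
#include <zmq.h>

// On a 'D' (disconnect) command, publish a throw-away message so the PUB
// socket runs its command processing and can release the departed peer's
// resources.
static void handle_command (void *dataPub, char command)
{
    if (command == 'D') {
        /* ... existing disconnect handling ... */
        zmq_send (dataPub, "dummy", 5, ZMQ_DONTWAIT);
    }
}
```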
Well, I think we've figured out the problem, and the solution is at the same time simple but potentially pretty complicated. The root cause of the leak appears to be that when we disconnect the dataSub socket from the dataPub socket, somebody needs to trigger process_commands on the dataPub socket so that the terminated endpoint can actually be cleaned up. The latest peer.cpp in the repo works around this, and we've also implemented a similar hack in our internal test program that triggered this discussion -- that program simply subscribes to a single topic and throws away messages.

That code only starts to leak when other subscribers repeatedly connect and disconnect, and this is because the program only subscribes, but never publishes, data. The hack that fixes the leak is to publish a dummy message with a dummy topic (with no subscribers) every "n" messages (see the sketch at the end of this comment). That gives the dataPub socket an opportunity to clean up disconnected endpoints, and as with the peer.cpp test program that eliminates the "leak". There are some problems with this work-around, however.
I would very much appreciate any suggestions that anyone has on whether there is an official/canonical approach to dealing with the problem of triggering process_commands in this situation.
Thanks in advance for your ideas!
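A minimal sketch of the periodic dummy-publish work-around described above, assuming a PUB socket handle named dataPub; the interval, topic, and helper name are illustrative rather than the actual code:

```c
#include <stddef.h>
#include <zmq.h>

#define DUMMY_EVERY_N 1000  /* illustrative interval */

// Publish a real message, and every N messages also publish a throw-away
// message on a topic with no subscribers. The dummy message is dropped by
// the PUB socket, but sending it drives the socket's internal command
// processing so disconnected endpoints can be cleaned up.
static void publish_with_cleanup (void *dataPub, const void *buf, size_t len)
{
    static unsigned count = 0;

    zmq_send (dataPub, buf, len, 0);

    if (++count % DUMMY_EVERY_N == 0)
        zmq_send (dataPub, "__dummy_topic__", 15, 0);
}
```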
This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 56 days. Thank you for your contributions.
Issue description
A long-running process which is connected to and disconnected from many times consumes ever-greater amounts of memory.
This does not show up as a leak (e.g., w/valgrind), since the memory is apparently released when the long-running process shuts down.
Environment