rosmaster leaves sockets in CLOSE_WAIT state #610
Comments
Please try to provide a reproducible example. Without that I simply can't change core logic like this. Such code changes need to be "validated" to show that they actually address the problem. Also, if you propose code changes, do not paste the diff in a ticket. Please use a PR for that instead. But your current patch would likely not be accepted - hard-coding a fixed upper limit of 100 proxies looks very bogus to me.
I have a similar problem to the one described in #325
@dirk-thomas I understand the situation you're in. So far I've spent about a week trying to reproduce this reliably in laboratory conditions, and all I've managed to do is crash my computer twice by launching thousands of instances by accident :) Having said that, I think the data does point to a leak in ServerProxy caching. The diff in question wasn't intended for merging as such, it was more "this is what we tried and it helps for us". For example, the value N=100 was selected after careful analysis by pulling a nice round number out of a hat. My pet theory for the root cause is that items in _proxies only get dropped if an API gets bumped. So a node comes up, rosmaster sends it some sort of update and then the node shuts down, leaving the ServerProxy alive and the socket in CLOSE_WAIT state... and then the master never needs to send any more updates, so it never notices the other end has gone cold. Thus having a limited number of proxies helps; if one takes that to the limit, then disabling caching entirely helps too. For example, there's another xmlrpcapi() function in rospy/core.py which doesn't do any caching. I'll keep up my attempts to get a lab-reproducible set of nodes; I'm open to any suggestions on how to tease out the bug more easily.
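To make the theory concrete, here is a rough sketch of the caching pattern it blames; this is an illustration, not the actual ros_comm code:

```python
# Minimal sketch of the caching pattern the theory above blames; not the real
# rosmaster code. One ServerProxy is kept per node URI and never evicted, so
# the TCP connection behind a proxy for a dead node can sit in CLOSE_WAIT.
try:
    from xmlrpc.client import ServerProxy  # Python 3
except ImportError:
    from xmlrpclib import ServerProxy      # Python 2

_proxies = {}  # uri -> ServerProxy, grows for the lifetime of the master

def xmlrpcapi(uri):
    """Return a cached XML-RPC proxy for a node URI (illustration only)."""
    if uri not in _proxies:
        _proxies[uri] = ServerProxy(uri)
    return _proxies[uri]
```

As noted above, the rospy/core.py variant simply builds a fresh proxy on each call, which is the "no caching" extreme.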
We have a similar problem on Baxter too...
@aginika please let me know the symptoms of your issue on Baxter
I did some experiments, and it's easy to show that ServerProxy objects pile up. I ran these two loops:
combined with a debug log entry in util.py that prints out the size of the proxy cache.
Contrary to what I've said earlier, rospy vs roscpp might actually be significant. The Python HTTP server responds with HTTP/1.0, so there's no keep-alive and the XML-RPC client in rosmaster knows this. I'll write a simple C++ node when I next have time to dedicate to this issue. Might be a week or two, though, so I don't mind if someone beats me to it :)
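For what it's worth, the HTTP/1.0 claim is easy to check from a Python prompt; the handler class that rosgraph subclasses inherits its protocol version from BaseHTTPRequestHandler:

```python
# Quick check of the HTTP/1.0 claim above: the XML-RPC request handler that
# rosgraph subclasses inherits protocol_version from BaseHTTPRequestHandler,
# which defaults to HTTP/1.0 (so no keep-alive unless it is overridden).
try:
    from xmlrpc.server import SimpleXMLRPCRequestHandler        # Python 3
except ImportError:
    from SimpleXMLRPCServer import SimpleXMLRPCRequestHandler   # Python 2

print(SimpleXMLRPCRequestHandler.protocol_version)  # -> 'HTTP/1.0'
```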
@lrasinen I can't reproduce a problem with your two command line invocations. Regarding the increasing number of sockets in CLOSE_WAIT state, I think I found an easy way to reproduce it. I did the following:
When doing the same with the
Gah, sorry for omitting the version details earlier. Anyway, this is on Ubuntu Trusty and the rosmaster package is "1.11.10-0trusty-20141229-2145-+0000". As far as I can tell, that's the latest. If not, I'll have a chat with our IT... My exact change was this oneliner:
(Also at https://github.com/lrasinen/ros_comm/tree/proxy-debug for merging convenience; but that code just has the patch applied to it and hasn't been tested as such.) You're right, the shell examples don't have the CLOSE_WAIT problem, but with the above change you can see the slow growth in the number of proxies. That's what led me to reconsider the difference between roscpp and rospy. I had a look at the talker / listener launch and that looks very similar to the conditions we're experiencing (where the role of the Ctrl-C is taken by a camera node crashing). I'm willing to say it's a good reproduction of the problem.
I don't think that the increasing number of cached proxies is the primary problem. The number also increases when running the
While looking into the problem a bit more I found the following weird behavior:
In all cases the launch file was started from the same environment (devel space) and the sources in the workspace had the same version as the Debian packages.
My theory about the role of server proxies goes like this.
Anyway, to test this I wrote even more debugging at https://github.com/lrasinen/ros_comm/tree/moalv (Mother of all layering violations). My afternoon schedule looks busy, so I can't test the roscpp part of the above theory today; I'll try to get back to this later this week, though.
I was able to get a few CLOSE_WAIT sockets by running 100 copies of the launch. In the ServerProxy debug output we can see file descriptor 7, and its counterpart in the lsof output. But most of the proxies (currently about 400) don't have any attached sockets, so there might be some other factor still at play. Writing to a closed socket will eventually fail and close the connection, so perhaps there's some list that gets pruned before that occurs.
I disabled the ServerProxy caching entirely:
We observed the same problem on our robots. I finally debugged it down to a single line of code that has to be added in rosgraph (mgrrx@4f06033):

```diff
diff --git a/tools/rosgraph/src/rosgraph/xmlrpc.py b/tools/rosgraph/src/rosgraph/xmlrpc.py
index 7d9aad8..a9c3d52 100644
--- a/tools/rosgraph/src/rosgraph/xmlrpc.py
+++ b/tools/rosgraph/src/rosgraph/xmlrpc.py
@@ -76,6 +76,8 @@ def isstring(s):
     return isinstance(s, str)
 
 class SilenceableXMLRPCRequestHandler(SimpleXMLRPCRequestHandler):
+    protocol_version = 'HTTP/1.1'
+
     def log_message(self, format, *args):
         if 0:
             SimpleXMLRPCRequestHandler.log_message(self, format, *args)
```

Caching the ServerProxies in the rosmaster only makes sense if HTTP/1.1 is used. With HTTP/1.0, a connection to the rosmaster remains in CLOSE_WAIT until the entry in the _proxies dictionary gets deleted, or until the socket is closed explicitly, e.g. by calling _proxies[uri]._ServerProxy__close() after a timeout or something similar.

Update: Unfortunately this fix didn't solve all the problems we observed. I have now replaced the httplib backend of xmlrpclib with python-requests. In combination with the change above I could reduce the number of CLOSE_WAIT connections to 1 on our system (~80 nodes). Previously, we had more than 100 CLOSE_WAIT connections when starting the robot. Can you please verify whether that fixes the issue? I'd like to open a pull request as soon as possible.
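To illustrate the timeout idea mentioned above, a rough sketch (not code from ros_comm; the constant and function name are made up) could look like this:

```python
# Hedged sketch of the pruning idea mentioned above: periodically drop cached
# proxies that have been idle too long and close their connection explicitly
# so the socket doesn't linger in CLOSE_WAIT.
import time

_proxies = {}               # uri -> (ServerProxy, last_used_timestamp)
PROXY_IDLE_TIMEOUT = 300.0  # seconds; an arbitrary illustrative value

def prune_stale_proxies(now=None):
    now = time.time() if now is None else now
    for uri, (proxy, last_used) in list(_proxies.items()):
        if now - last_used > PROXY_IDLE_TIMEOUT:
            # Name-mangled private close on xmlrpclib.ServerProxy, as
            # mentioned in the comment above; closes the cached connection.
            proxy._ServerProxy__close()
            del _proxies[uri]
```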
Can you please clarify what exact change you are proposing?
My first attempt was to make sure HTTP/1.1 is used for every connection. mgrrx/ros_comm@4f06033 fixed this and eliminated all CLOSE_WAITs in my dummy world (playing around with talker/listener nodes from the tutorial packages). However, after testing on the real robot, I found out that it is still not working and connections remain in the CLOSE_WAIT state. After a bit of googling and playing around with xmlrpclib and the underlying httplib, I figured out that:
If I understand the httplib code correctly, connections are only closed if you are sending the header "connection: close". Although a client might close the connection, the socket then remains in the CLOSE_WAIT state. I found some other projects that are using the requests library as the http backend for xmlrpclib. I implemented that in rosgraph and updated other packages that are using xmlrpclib (mgrrx@f1f3340). The subsequent commits update the tests and docstrings. So my proposed solution is to use the requests library for http connections and to make sure HTTP/1.1 is used everywhere.
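For reference, a requests-backed transport along these lines could look like the sketch below; this is an illustration of the approach, the actual implementation lives in the commits referenced above:

```python
# Hedged sketch of a requests-backed transport for xmlrpclib; illustrative
# only, not the code from the referenced commits.
import requests

try:
    import xmlrpc.client as xmlrpclib  # Python 3
except ImportError:
    import xmlrpclib                   # Python 2

class RequestsTransport(xmlrpclib.Transport):
    """Send XML-RPC requests over a requests.Session (HTTP/1.1, keep-alive)."""

    def __init__(self, use_datetime=0):
        xmlrpclib.Transport.__init__(self, use_datetime)
        self.session = requests.Session()

    def request(self, host, handler, request_body, verbose=0):
        url = 'http://%s%s' % (host, handler)
        response = self.session.post(
            url, data=request_body,
            headers={'Content-Type': 'text/xml'})
        response.raise_for_status()
        # Reuse xmlrpclib's parser to turn the XML payload into Python values.
        parser, unmarshaller = self.getparser()
        parser.feed(response.content)
        parser.close()
        return unmarshaller.close()

# Usage: proxy = xmlrpclib.ServerProxy(uri, transport=RequestsTransport())
```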
Please see #371 for a related patch. If you could provide PRs for these changes which pass all the cases that were blocking the previous PR, that would be great. But such a change will require extensive testing and might only be considered for future ROS releases (maybe still for Kinetic, but I am not sure about that).
My implementation passes the test cases mentioned in #371 without blocking any terminal. I totally agree with you that this needs extensive testing, but I'll keep testing the modified core on our robots anyway before opening the PR.
The referenced PRs are currently actively being tested. As soon as we have made sure that the changes don't introduce regressions they will be merged and released into Kinetic. Afterwards they will be considered for being backported to Jade / Indigo. You could try the patches from the referenced PRs on your robot and report back in them about your experience. It always helps to know if a patch does or doesn't work for more users.
Thank you @dirk-thomas for the answer. So, to make it clear, what you suggest is to compile the patched version ourselves. On our Baxter machine we have ROS Indigo installed system-wide from the Debian packages (yes, I don't like this either, but I inherited the robot from other people), so we would need to do the following:
It's not a problem doing that, I just wanted to be sure that we do all the right steps in order to avoid further problems down the road.
Plus, @rethink-imcmahon correct me if I'm wrong, but we do not have any way to recompile the patched version of ROS manually on the Baxter machine. Am I right?
Since the PR only touches Python code you don't need to compile anything. Simply clone the branch into your workspace (or create an empty one), build the workspace, and source the result. That will put all the Python code on the PYTHONPATH in front of the system packages.
* rospy: switched from poll to epoll for tcpros. Fixes #107 for me and is related to #610 as connections remained in CLOSE_WAIT. Added a check for CLOSE_WAIT connections in the RegManager loop
* using hasattr instead of dir
* Tests for #107, currently failing of course
* Simplified the code for selecting the Poller backend. Added a fallback to select.poll if select.epoll is not available in Linux. OS X will still use kqueue and Windows the noop operator.
* Removed copyrights from tests; removed whitespace
* Rewrote the check for closed connections. The crucial part is now guarded but the lock is only acquired when the connection list needs to be modified. This avoids unnecessary locking.
* Remove copyright
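A minimal sketch of the poller-backend fallback described in the changelog above (roughly: epoll on Linux, kqueue on OS X, poll as a fallback, and a no-op where none exist); this is illustrative, not the exact rospy code:

```python
# Hedged sketch of selecting a poller backend with hasattr checks, mirroring
# the behavior described in the changelog above only roughly.
import select

def choose_poller_backend():
    if hasattr(select, 'epoll'):
        return 'epoll'    # Linux
    if hasattr(select, 'kqueue'):
        return 'kqueue'   # OS X / BSD
    if hasattr(select, 'poll'):
        return 'poll'     # generic Unix fallback
    return 'noop'         # e.g. Windows

print(choose_poller_backend())
```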
* Check CLOSE_WAIT sockets and close them whenever xmlrpcapi is called
* Check and close CLOSE_WAIT socket whenever xmlrpcapi is called
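As an aside, one way to detect that the remote end has already closed a connection (the situation that leaves our side in CLOSE_WAIT) is sketched below; this is not necessarily how the referenced PR implements its check:

```python
# Hedged sketch: detect a peer-closed TCP connection without consuming data.
import select
import socket

def peer_has_closed(sock):
    """Return True if the peer closed the connection on this TCP socket."""
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        return False  # no pending data and no EOF signalled yet
    # A readable socket that peeks zero bytes means the peer sent FIN (EOF).
    return sock.recv(1, socket.MSG_PEEK) == b''
```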
With #1104 being merged, can you please double-check that all cases from this ticket are covered and that this can be closed.
Assuming that the referenced PR addressed the problem, I will close this. Please feel free to comment and this can be reopened.
Somehow related to #495, but not reproducible as easily.
In some scenarios, likely involving a node (or nodes) restarting, rosmaster can leave sockets in a CLOSE_WAIT state, eventually exhausting the limit on open file descriptors and becoming unresponsive.
Our system has about 30 nodes and 150 topics, and when the system is run without restarting, there is a steady upward creep in the number of CLOSE_WAIT sockets in system-wide monitoring. Detailed examination assigns most of these to the rosmaster process. See picture.
The sudden jump in the image has not been fully explained but it's likely associated with a hardware problem that caused parts of the system to restart repeatedly. Still, the trend is obvious.
After diagnosing the problem we've been trying to reproduce it with a simpler setup and/or collect logs, and the results are inconclusive. After comparing the rosmaster logs with lsof output, the leaks do appear to be related to ServerProxy objects.
It does not seem to matter whether the node uses roscpp or rospy, or whether it's a publisher-only or a subscriber-only. There are even instances of CLOSE_WAIT sockets being associated with nodes that do not publish or subscribe anything (they just read some parameters on startup).
We've run a belt-and-suspenders approach since hitting the limit, which involves a) nightly restarts and b) keeping the ServerProxy cache trimmed (patch below).
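The patch itself is not reproduced here; purely as an illustration of the trimming idea (using the N=100 cap discussed in the comments), a sketch could look like this:

```python
# Hedged illustration of a bounded proxy cache, not the actual patch: once the
# cache exceeds a fixed cap, the oldest entry is evicted and its connection
# closed instead of being left in CLOSE_WAIT.
from collections import OrderedDict

try:
    from xmlrpc.client import ServerProxy  # Python 3
except ImportError:
    from xmlrpclib import ServerProxy      # Python 2

MAX_PROXIES = 100  # the round number discussed in the comments

_proxies = OrderedDict()  # uri -> ServerProxy, oldest first

def xmlrpcapi(uri):
    proxy = _proxies.pop(uri, None)
    if proxy is None:
        proxy = ServerProxy(uri)
    _proxies[uri] = proxy  # re-insert so the most recently used entry is last
    while len(_proxies) > MAX_PROXIES:
        _, old_proxy = _proxies.popitem(last=False)
        old_proxy._ServerProxy__close()  # drop the stale connection
    return proxy
```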
The nightly restarts have kept us from hitting the limit, but we're planning to run without them for a while to see if the ServerProxy patch helps. The default daily cycle already looks different:
The spikes are due to rosbag restarting every hour so we get bags to manageable size.
When there is a leak, it seems to be contained:
(These are system-wide numbers, rosmaster has 90 CLOSE_WAIT sockets)