Sporadic service call failures #1976
Comments
I believe I am seeing the same issue, but it seems only on armhf (Raspbian) and not on amd64. Did you get anywhere with this? Edit: Seen on ROS Noetic on Raspbian Buster armhf. Can't seem to repro it on Ubuntu 20.04 Noetic amd64, though.
I'm experiencing the same issue on Ubuntu 18.04 (melodic) and 20.04 (noetic) running on an Intel Core i7-3820 processor. @dirk-thomas Any idea what the problem is here?
@stevegolton Unfortunately, I still have no idea how to resolve this issue. Did you run the modified beginner_tutorials nodes on your amd64 machine, or was it your own code? How long did you run it? Sometimes it takes several hours until the first service call failure occurs. Additionally, I have the impression that failures occur more frequently when there is a higher load on the rosmaster, e.g. when running multiple instances of the test nodes in parallel. We have seen the issue on Intel Xeon E3-1505M and Intel Core i7-6700K CPUs, using Ubuntu 18.04 and ROS melodic.
Ah ok, I did run the beginner tutorials on Intel, but only for about 10 minutes. I was running my own code on Raspbian, which pretty much does the same thing - calls a trivial service in a tight loop - but there the issue would usually occur within the first 10 seconds or so.
It seems the buffer overrun error itself is thrown from the stream-overrun check in roscpp's deserialization code.

Indeed, with debug symbols included, I ran an example under GDB and set a breakpoint on this line. When the issue was hit, the values of the SerializedMessage handed to the deserializer looked wrong. So it seems that somewhere an empty SerializedMessage is being passed along: message.message_start is NULL, probably never initialized.

The constructor for SerializedMessage looks like this:
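(Reproduced from memory of roscpp's serialized_message.h, so treat it as a sketch rather than a verbatim quote; the relevant point is that the default constructor leaves message_start at 0.)

```cpp
#include <boost/shared_array.hpp>
#include <stdint.h>
#include <stddef.h>

// Sketch of ros::SerializedMessage (roscpp_serialization,
// include/ros/serialized_message.h), paraphrased from memory.
class SerializedMessage
{
public:
  boost::shared_array<uint8_t> buf;  // owning buffer
  size_t num_bytes;                  // total size of buf
  uint8_t* message_start;            // start of the payload inside buf

  SerializedMessage()
  : buf(boost::shared_array<uint8_t>())
  , num_bytes(0)
  , message_start(0)  // stays NULL until someone fills in a buffer
  {}

  SerializedMessage(boost::shared_array<uint8_t> buf, size_t num_bytes)
  : buf(buf)
  , num_bytes(num_bytes)
  , message_start(buf ? buf.get() : 0)
  {}
};
```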
So I guess looking deeper into this will require some more poking around in the internals of roscpp.
After further debugging this issue, I have found at least one possible execution behavior which causes service call failures. The mechanism seems to be that the TCP connection is closed via the POLLRDHUP event before the client can receive the response from the server (for details, see below). When I remove the POLLRDHUP case from TransportTCP::socketUpdate (line 707 in melodic-devel), I can no longer reproduce these service call failures.

The listener for the POLLRDHUP event was introduced by PR #1704, which has not yet been applied to the kinetic release. This probably explains why we have observed different behavior on melodic and kinetic.

@cwecht @dirk-thomas Would it perhaps make sense to register the POLLRDHUP event listener only for topic connections, but not for (non-persistent) service connections? Or do you see a better way to fix this?

Detailed analysis

Typical execution sequence of a failing service call (line numbers referring to melodic-devel):

Of course, in most cases the timing is such that Connection::write() terminates early enough that the callbacks ServiceServerLink::onRequestWritten() and ServiceServerLink::onResponseOkAndLength() are ready to receive the response correctly. Only in very rare cases, depending on system load etc., does Connection::write() take so long that the response is missed, as in the sequence above. I believe this might not be the only way a service call can fail due to race conditions or similar issues; at least the buffer overrun error that @stevegolton has investigated seems to have a different cause.
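For reference, the hangup handling in question sits in TransportTCP::socketUpdate and looks roughly like this (paraphrased from memory of melodic-devel, so exact guards and logging may differ):

```cpp
// Sketch of the poll-event handling in TransportTCP::socketUpdate()
// (ros_comm melodic-devel), paraphrased -- not a verbatim copy.
// PR #1704 added the POLLRDHUP case, so the connection is torn down as
// soon as the peer shuts down its write side, even if the client has
// not yet read the pending service response from the socket.
void TransportTCP::socketUpdate(int events)
{
  // ... read/write event handling elided ...

  if ((events & POLLERR) ||
      (events & POLLHUP) ||
#ifdef POLLRDHUP  // POLLRDHUP is Linux-specific, not POSIX
      (events & POLLRDHUP) ||
#endif
      (events & POLLNVAL))
  {
    // (debug logging elided)
    close();  // drops the connection, aborting any pending service response
  }
}
```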
Could the "shutdown" be delayed at any of the stages mentioned in the investigation? Either by locks/mutexes or by ensuring all received data has been handled? I'm really not keen on adding retry loops around every service call ... we have seen this after 9M service calls on an otherwise idle pc or more often when stressed. |
We also encountered this sporadically. As a workaround we are now using persistent service connections...
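For anyone wanting to try the same workaround: in roscpp a persistent connection is requested via the second argument of NodeHandle::serviceClient. A minimal sketch, assuming the AddTwoInts tutorial service (the reconnect handling is only illustrative):

```cpp
#include <ros/ros.h>
#include <beginner_tutorials/AddTwoInts.h>

int main(int argc, char** argv)
{
  ros::init(argc, argv, "add_two_ints_persistent_client");
  ros::NodeHandle nh;

  // persistent = true keeps the TCP connection to the server open across
  // calls instead of setting up a new connection for every request.
  ros::ServiceClient client =
      nh.serviceClient<beginner_tutorials::AddTwoInts>("add_two_ints", /*persistent=*/true);

  beginner_tutorials::AddTwoInts srv;
  srv.request.a = 1;
  srv.request.b = 2;

  while (ros::ok())
  {
    // A persistent client becomes invalid if the server goes away,
    // so re-create it before calling if necessary.
    if (!client.isValid())
    {
      client = nh.serviceClient<beginner_tutorials::AddTwoInts>("add_two_ints", true);
    }
    if (!client.call(srv))
    {
      ROS_ERROR("Service call failed");
    }
  }
  return 0;
}
```

Note that a persistent connection ties the client to one particular server instance, so the reconnect check above matters if the server can restart.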
Summary
Sporadic service call failures, i.e. ServiceClient::call returns false even though the service is available (and always returns true in its callback function).
Setup Information
Detailed Description
We experience sporadic service call failures, i.e. ServiceClient::call returns false even though the service is available (and always returns true in its callback function). In order to create a minimal working example, we have modified the AddTwoInts example from beginner_tutorials so that the service is repeatedly called in an infinite loop. About once in every 10,000 to 100,000 calls, the service call fails.
Additionally, the error message
Exception thrown while while deserializing service call: Buffer Overrun
is sometimes printed on the console. This message originates from line 186 of ros/service_client.h (btw, note the typo in the message text). We do not know whether the two issues are related or not.

By contrast, the example runs without any errors on ROS kinetic.
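For context, this is roughly the code around that line in ServiceClient::call (quoted from memory of melodic-devel, so details may differ); it is where a failed deserialization turns into a false return value, and where the duplicated "while" comes from:

```cpp
// Sketch of the response handling in ros::ServiceClient::call()
// (ros/service_client.h) -- paraphrased from memory, not a verbatim copy.
bool ok = impl_->server_link_->call(ser_req, ser_resp);

if (ok)
{
  try
  {
    // Throws a StreamOverrunException ("Buffer Overrun") if ser_resp
    // points at an empty or invalid buffer.
    ser::deserializeMessage(ser_resp, resp);
  }
  catch (std::exception& e)
  {
    // Note the duplicated "while" in the message text.
    ROS_ERROR("Exception thrown while while deserializing service call: %s", e.what());
    ok = false;
  }
}
return ok;
```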
How should such service call failures be handled? Is it the responsibility of the client to try again a few times? Or is it an issue within ROS which should be fixed? And why does the behavior of ROS melodic differ from kinetic? I do not see any relevant changes in the ros_comm source code.
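Just to make the "try again a few times" option concrete, a client-side workaround would look something like the sketch below (callWithRetry is a hypothetical helper, not part of roscpp, and not what we would consider a proper fix):

```cpp
#include <ros/ros.h>

// Hypothetical retry wrapper, shown only to illustrate what client-side
// retries would mean for every service call site.
template <class Srv>
bool callWithRetry(ros::ServiceClient& client, Srv& srv, int max_attempts = 3)
{
  for (int attempt = 1; attempt <= max_attempts; ++attempt)
  {
    if (client.call(srv))
    {
      return true;
    }
    ROS_WARN("Service call failed (attempt %d of %d), retrying", attempt, max_attempts);
  }
  return false;
}
```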
Minimal example
See beginner_tutorials_modified.tar.gz for the example package.
Source files:
add_two_ints_client.cpp
add_two_ints_server.cpp (unchanged from https://github.com/ros/catkin_tutorials/tree/master/create_package_srvclient/catkin_ws/src/beginner_tutorials/src)
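The modified client is essentially the stock tutorial client with the call moved into an infinite loop; a minimal sketch of the idea (not the exact file from the archive above):

```cpp
#include <ros/ros.h>
#include <beginner_tutorials/AddTwoInts.h>

int main(int argc, char** argv)
{
  ros::init(argc, argv, "add_two_ints_client");
  ros::NodeHandle nh;
  ros::ServiceClient client =
      nh.serviceClient<beginner_tutorials::AddTwoInts>("add_two_ints");

  beginner_tutorials::AddTwoInts srv;
  srv.request.a = 1;
  srv.request.b = 2;

  unsigned long count = 0;
  while (ros::ok())
  {
    ++count;
    if (!client.call(srv))
    {
      // Hit sporadically, roughly once in 10,000 to 100,000 calls.
      ROS_ERROR("Service call failed (call #%lu)", count);
    }
  }
  return 0;
}
```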
Execution log
Typical output of the example client:
Background information
For those who are interested, some background on why we need so many service calls that these issues become relevant: we have an application that performs extensive precomputation of kinematic reachability and therefore frequently calls an inverse kinematics service via the MoveIt interface. When porting this application from kinetic to melodic, segmentation faults occurred after some computing time. These have since been fixed by applying PR #1950. While debugging the application further, we noticed the service call failures described above.