Sporadic service call failures #1976

Open
chris20623 opened this issue Jun 23, 2020 · 8 comments

@chris20623

Summary

Sporadic service call failures, i.e. ServiceClient::call returns false even though the service is available (and its callback function always returns true).

Setup Information

  • ROS version: melodic (rosversion: 1.14.6)
  • OS version: Ubuntu 18.04
  • CMake Version: 3.10.2
  • GCC Version: 7.5.0

Detailed Description

We experience sporadic service call failures, i.e. ServiceClient::call returns false even though the service is available (and its callback function always returns true). To create a minimal working example, we modified the AddTwoInts example from beginner_tutorials so that the service is called repeatedly in an infinite loop. Roughly once in 10,000 to 100,000 calls, the service call fails.

Additionally, the error message Exception thrown while while deserializing service call: Buffer Overrun is sometimes printed on the console. This message originates from line 186 of ros/service_client.h (note, by the way, the duplicated "while" in the message text). We do not know whether the two issues are related.

By contrast, the example runs without any errors on ROS kinetic.

How should such service call failures be handled? Is it the responsibility of the client to retry a few times? Or is it an issue within ROS that should be fixed? And why does the behavior of ROS melodic differ from kinetic? I do not see any relevant changes in the ros_comm source code.
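For illustration, a client-side retry would amount to a small wrapper like the following (a minimal sketch only, not a recommended fix; the retry count and pause are arbitrary, and callWithRetry is a hypothetical helper, not part of roscpp):

// Hypothetical helper (uses the same includes as the example client below):
// retry a failed service call a few times before giving up.
bool callWithRetry(ros::ServiceClient& client,
                   beginner_tutorials::AddTwoInts& srv,
                   int max_attempts = 3)
{
  for (int attempt = 1; attempt <= max_attempts; ++attempt)
  {
    if (client.call(srv))
    {
      return true;
    }
    ROS_WARN("Service call failed (attempt %d of %d), retrying", attempt, max_attempts);
    ros::Duration(0.01).sleep();  // short pause before the next attempt
  }
  return false;
}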

Minimal example

See beginner_tutorials_modified.tar.gz for the example package.

Source files:

add_two_ints_client.cpp

#include "ros/ros.h"
#include "beginner_tutorials/AddTwoInts.h"
#include <cstdlib>

int main(int argc, char **argv)
{
  ros::init(argc, argv, "add_two_ints_client");

  ros::NodeHandle n;
  ros::ServiceClient client = n.serviceClient<beginner_tutorials::AddTwoInts>("add_two_ints");
  beginner_tutorials::AddTwoInts srv;
  std::size_t i = 0;
  while (ros::ok())
  {
    srv.request.a = rand();
    srv.request.b = rand();
    ++i;
    if (client.call(srv))
    {
      //ROS_INFO("Sum: %ld", (long int)srv.response.sum);
    }
    else
    {
      ROS_ERROR("Failed to call service add_two_ints in attempt %ld", i);
    }
  }

  return 0;
}

add_two_ints_server.cpp (unchanged from https://github.com/ros/catkin_tutorials/tree/master/create_package_srvclient/catkin_ws/src/beginner_tutorials/src)

#include "ros/ros.h"
#include "beginner_tutorials/AddTwoInts.h"

bool add(beginner_tutorials::AddTwoInts::Request  &req,
         beginner_tutorials::AddTwoInts::Response &res)
{
  res.sum = req.a + req.b;
  ROS_INFO("request: x=%ld, y=%ld", (long int)req.a, (long int)req.b);
  ROS_INFO("sending back response: [%ld]", (long int)res.sum);
  return true;
}

int main(int argc, char **argv)
{
  ros::init(argc, argv, "add_two_ints_server");
  ros::NodeHandle n;

  ros::ServiceServer service = n.advertiseService("add_two_ints", add);
  ROS_INFO("Ready to add two ints.");
  ros::spin();

  return 0;
}

Execution log

Typical output of the example client:

$ rosrun beginner_tutorials add_two_ints_client 
[ERROR] [1592908525.665423362]: Failed to call service add_two_ints in attempt 18221
[ERROR] [1592908525.688159153]: Failed to call service add_two_ints in attempt 18233
[ERROR] [1592908713.978057353]: Failed to call service add_two_ints in attempt 176349
[ERROR] [1592908714.084723845]: Failed to call service add_two_ints in attempt 176430
[ERROR] [1592908841.292468750]: Failed to call service add_two_ints in attempt 282092
[ERROR] [1592908917.783882743]: Failed to call service add_two_ints in attempt 347690
[ERROR] [1592908917.923579974]: Exception thrown while while deserializing service call: Buffer Overrun
[ERROR] [1592908917.923624772]: Failed to call service add_two_ints in attempt 347803
[ERROR] [1592909524.440756998]: Failed to call service add_two_ints in attempt 851196
…

Background information

For those who are interested, some background on why we need so many service calls that these issues become relevant: we have an application that performs extensive precomputation of kinematic reachability and therefore frequently calls an inverse kinematics service via the MoveIt interface. When porting this application from kinetic to melodic, segmentation faults occurred after some computing time; these have since been fixed by applying PR #1950. While debugging the application further, we noticed the service call failures described above.

@stevegolton

stevegolton commented Jul 16, 2020

I believe I am seeing the same issue, but it seems to occur only on armhf (Raspbian) and not on amd64.

Did you get anywhere with this?

Edit: Seen on ROS Noetic on Raspbian Buster armhf. Can't seem to reproduce it on Ubuntu 20.04 Noetic amd64.

@astero-a11y

I'm experiencing the same issue on Ubuntu 18.04 (melodic) and 20.04 (noetic) running on an Intel Core i7-3820 processor.

@dirk-thomas Any idea what's the problem here?

@chris20623
Author

@stevegolton Unfortunately, I still have no idea how to resolve this issue.

Did you run the modified beginner_tutorials nodes on your amd64 machine, or was it your own code? How long did you run it? Sometimes it takes several hours until the first service call failure occurs. Additionally, I have the impression that failures occur more frequently when there is higher load on the rosmaster, e.g. when running multiple instances of the test nodes in parallel.
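For example, several instances of the example client can be started in parallel roughly like this (assuming the modified package has been built and the workspace sourced):

for i in 1 2 3 4; do
  rosrun beginner_tutorials add_two_ints_client &
done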

We have seen the issue on Intel Xeon E3-1505M and Intel Core i7-6700K CPUs, using Ubuntu 18.04 and ROS melodic.

@stevegolton

stevegolton commented Jul 17, 2020

Ah ok, I did run the beginner tutorials on Intel, but only for about 10 minutes. I was running my own code on Raspbian, which does pretty much the same thing - it calls a trivial service in a tight loop - but there the issue would usually occur within the first 10 seconds or so.

@stevegolton

It seems the buffer overrun error originates from:
roscpp_core/roscpp_serialization/include/ros/serialization.h:681

  ROS_FORCE_INLINE uint8_t* advance(uint32_t len)
  {
    uint8_t* old_data = data_;
    data_ += len;
    if (data_ > end_)
    {
      // Throwing directly here causes a significant speed hit due to the extra code generated
      // for the throw statement
      throwStreamOverrun(); // <--- HERE
    }
    return old_data;
  }

So the data_ > end_ check is failing somewhere.

Indeed, with debug symbols included, I ran an example under GDB and set a breakpoint on this line. When the issue was hit, end_ was NULL, data_ had just been advanced from NULL by len, and len was 1.
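For reference, the session can be set up roughly like this (a sketch only; the binary path assumes a standard catkin devel layout, and up is used because the breakpoint triggers inside throwStreamOverrun, one frame below advance):

$ gdb --args devel/lib/beginner_tutorials/add_two_ints_client
(gdb) break throwStreamOverrun
(gdb) run
(gdb) up   # move up into advance() to inspect data_, end_ and len

Inspecting the stream members then gives: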

(gdb) print data_
$1 = (uint8_t *) 0x1 <error: Cannot access memory at address 0x1>
(gdb) print end_
$2 = (uint8_t *) 0x0
(gdb) print len
$3 = 1

So it seems somewhere an IStream object is being created with NULL values. Here is where the IStream is created:
roscpp_core/roscpp_serialization/include/ros/serialization.h:858

template<typename M>
inline void deserializeMessage(const SerializedMessage& m, M& message)
{
  IStream s(m.message_start, static_cast<uint32_t>(m.num_bytes - (m.message_start - m.buf.get())));
  deserialize(s, message);
}

So it looks like m.message_start is NULL and was probably never initialized; with num_bytes and buf also at their defaults, the IStream is constructed with NULL data and zero length, which matches the GDB values above.
Looking further up the stack, the issue seems to start here:
ros_comm/roscpp/include/ros/service_client.h:111

template<typename MReq, typename MRes>
  bool call(const MReq& req, MRes& resp, const std::string& service_md5sum)
  {
    namespace ser = serialization;
    SerializedMessage ser_req = ser::serializeMessage(req);
    SerializedMessage ser_resp;
    bool ok = call(ser_req, ser_resp, service_md5sum);
    if (!ok)
    {
      return false;
    }

    try
    {
      ser::deserializeMessage(ser_resp, resp);
    }
    catch (std::exception& e)
    {
      deserializeFailed(e);
      return false;
    }

    return true;
  }

The constructor for SerializedMessage looks like this:

SerializedMessage()
  : buf(boost::shared_array<uint8_t>())
  , num_bytes(0)
  , message_start(0)
  , type_info(0)
  {}

So I guess message_start was never initialized in the call method. My guess is that the "Buffer Overrun" error appears in the cases where call returns true but never fills in the ser_resp message, and the other failures are the cases where call simply returns false without printing any error message internally.
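To make that failure mode explicit: with an untouched, default-constructed ser_resp, deserializeMessage builds an IStream with NULL data and zero length, which is consistent with the GDB values above. A guard along the following lines in the call template, placed after the ok check and before the deserialization (a sketch only, not a proposed patch), would turn that case into an ordinary failure instead of a Buffer Overrun exception:

    // Sketch only (not the actual roscpp code): reject a response that the
    // lower layer reported as OK but never actually filled in.
    if (!ser_resp.message_start)
    {
      return false;
    }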

Looking deeper into this will require some more poking around in the internals of service_client and service_client_link, which I am struggling to wrap my head around.

@chris20623
Author

After further debugging this issue, I have found at least one possible execution behavior which causes service call failures. The mechanism seems to be that the TCP connection is closed via the POLLRDHUP event before the client can receive the response from the server (for details, see below).

When I remove the POLLRDHUP case from TransportTCP::socketUpdate (line 707 in melodic-devel), I can no longer reproduce these service call failures.

The listener to the POLLRDHUP event has been introduced by PR #1704, which has not yet been applied to the kinetic release. This probably explains why we have observed different behavior on melodic and kinetic.

@cwecht @dirk-thomas Would it perhaps make sense to add the POLLRDHUP event listener only for topic connections, but not for (non-persistent) service connections? Or do you see a better way to fix this?

Detailed analysis

Typical execution sequence of a failing service call (line numbers referring to melodic-devel):

1. Client: invoke ServiceClient::call() from add_two_ints_client.cpp:22
2. Client: invoke ServiceServerLink::call() from service_client.cpp:143
3. Client: invoke ServiceServerLink::processNextCall() from service_server_link.cpp:362
4. Client: invoke Connection::write() from service_server_link.cpp:325
5. Server: add() in add_two_ints_server.cpp, from service_publication.cpp:125
6. Server: ServiceClientLink::processResponse() from service_publication.cpp:128
7. Server: Connection::write() from service_client_link.cpp:240
8. Server: ServiceClientLink::onResponseWritten()
9. Server: Connection::drop() from service_client_link.cpp:233
10. Server: TransportTCP::close() from connection.cpp:340
11. Client: TransportTCP::socketUpdate() with event 8196 (POLLRDHUP | POLLOUT)
12. Client: TransportTCP::close() from transport_tcp:725
13. Client: Connection::drop() with reason 0 (TransportDisconnect)
14. Client: ServiceServerLink::onConnectionDropped()
15. Client: ServiceServerLink::clearCalls() from service_server_link.cpp:175
16. Client: ServiceServerLink::cancelCall()
17. Client: Connection::write() returns (this is the call started in step 4)
18. Client: ServiceServerLink::call() returns false in service_server_link.cpp:381 (since the call has been canceled before the response has been received)
19. Client: ServiceClient::call() returns false in service_client.cpp:152
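As a side note on the event value in step 11: 8196 is simply POLLRDHUP | POLLOUT with the Linux <poll.h> definitions (0x2000 | 0x0004), which can be verified with a few lines of code:

// Verifies that 8196 == POLLRDHUP | POLLOUT on Linux.
// POLLRDHUP requires _GNU_SOURCE (defined by default by g++).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <poll.h>
#include <cstdio>

int main()
{
  std::printf("POLLRDHUP | POLLOUT = %d\n", POLLRDHUP | POLLOUT);  // prints 8196
  return 0;
}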

Of course, in most cases the timing is such that Connection::write() returns early enough that the callbacks ServiceServerLink::onRequestWritten() and ServiceServerLink::onResponseOkAndLength() are ready to receive the response correctly. Only in rare cases, depending on system load etc., does Connection::write() take so long that the response is missed, as in the sequence above.

I believe this might not be the only way in which a service call can fail due to race conditions or similar issues. At least the buffer overrun error that @stevegolton has investigated seems to have a different cause.

@simonschmeisser

Could the "shutdown" be delayed at any of the stages mentioned in the investigation? Either by locks/mutexes or by ensuring all received data has been handled? I'm really not keen on adding retry loops around every service call ... we have seen this after 9M service calls on an otherwise idle pc or more often when stressed.

@flixr
Contributor

flixr commented Mar 25, 2021

We also encountered this sporadically. As a workaround we are now using persistent service connections...
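For anyone looking for that workaround: a persistent connection is requested via the second argument of NodeHandle::serviceClient, roughly like this for the example client above (a sketch; note that a persistent client becomes invalid if the connection drops and then has to be re-created):

  // Persistent connection: the TCP connection is kept open across calls instead
  // of being set up and torn down for every call.
  ros::ServiceClient client =
      n.serviceClient<beginner_tutorials::AddTwoInts>("add_two_ints", true);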

k-okada added a commit to k-okada/jsk_roseus that referenced this issue Dec 10, 2021
k-okada added a commit to k-okada/jsk_roseus that referenced this issue Dec 10, 2021