Sporadic service call failures #1976

Open
chris20623 opened this issue Jun 23, 2020 · 8 comments

@chris20623

Summary

Sporadic service call failures, i.e. ServiceClient::call returns false even though the service is available (and its callback function always returns true).

Setup Information

  • ROS version: melodic (rosversion: 1.14.6)
  • OS version: Ubuntu 18.04
  • CMake Version: 3.10.2
  • GCC Version: 7.5.0

Detailed Description

We experience sporadic service call failures, i.e. ServiceClient::call returns false even though the service is available (and its callback function always returns true). To create a minimal working example, we modified the AddTwoInts example from beginner_tutorials so that the service is called repeatedly in an infinite loop. Roughly once in 10,000 to 100,000 calls, the service call fails.

Additionally, the error message Exception thrown while while deserializing service call: Buffer Overrun is sometimes printed on the console. This message originates from line 186 of ros/service_client.h (note, by the way, the duplicated "while" in the message text). We do not know whether the two issues are related.

By contrast, the example runs without any errors on ROS kinetic.

How should such service call failures be handled? Is it the responsibility of the client to retry a few times? Or is it an issue within ROS that should be fixed? And why does the behavior of ROS melodic differ from kinetic? I do not see any relevant changes in the ros_comm source code.
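For illustration, a client-side retry would amount to a small wrapper like the following (a minimal sketch only, not a recommended fix; the retry count and pause are arbitrary, and callWithRetry is a hypothetical helper, not part of roscpp):

// Hypothetical helper (uses the same includes as the example client below):
// retry a failed service call a few times before giving up.
bool callWithRetry(ros::ServiceClient& client,
                   beginner_tutorials::AddTwoInts& srv,
                   int max_attempts = 3)
{
  for (int attempt = 1; attempt <= max_attempts; ++attempt)
  {
    if (client.call(srv))
    {
      return true;
    }
    ROS_WARN("Service call failed (attempt %d of %d), retrying", attempt, max_attempts);
    ros::Duration(0.01).sleep();  // short pause before the next attempt
  }
  return false;
}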

Minimal example

See beginner_tutorials_modified.tar.gz for the example package.

Source files:

add_two_ints_client.cpp

#include "ros/ros.h"
#include "beginner_tutorials/AddTwoInts.h"
#include <cstdlib>

int main(int argc, char **argv)
{
  ros::init(argc, argv, "add_two_ints_client");

  ros::NodeHandle n;
  ros::ServiceClient client = n.serviceClient<beginner_tutorials::AddTwoInts>("add_two_ints");
  beginner_tutorials::AddTwoInts srv;
  std::size_t i = 0;
  while (ros::ok())
  {
    srv.request.a = rand();
    srv.request.b = rand();
    ++i;
    if (client.call(srv))
    {
      //ROS_INFO("Sum: %ld", (long int)srv.response.sum);
    }
    else
    {
      ROS_ERROR("Failed to call service add_two_ints in attempt %ld", i);
    }
  }

  return 0;
}

add_two_ints_server.cpp (unchanged from https://github.com/ros/catkin_tutorials/tree/master/create_package_srvclient/catkin_ws/src/beginner_tutorials/src)

#include "ros/ros.h"
#include "beginner_tutorials/AddTwoInts.h"

bool add(beginner_tutorials::AddTwoInts::Request  &req,
         beginner_tutorials::AddTwoInts::Response &res)
{
  res.sum = req.a + req.b;
  ROS_INFO("request: x=%ld, y=%ld", (long int)req.a, (long int)req.b);
  ROS_INFO("sending back response: [%ld]", (long int)res.sum);
  return true;
}

int main(int argc, char **argv)
{
  ros::init(argc, argv, "add_two_ints_server");
  ros::NodeHandle n;

  ros::ServiceServer service = n.advertiseService("add_two_ints", add);
  ROS_INFO("Ready to add two ints.");
  ros::spin();

  return 0;
}

Execution log

Typical output of the example client:

$ rosrun beginner_tutorials add_two_ints_client 
[ERROR] [1592908525.665423362]: Failed to call service add_two_ints in attempt 18221
[ERROR] [1592908525.688159153]: Failed to call service add_two_ints in attempt 18233
[ERROR] [1592908713.978057353]: Failed to call service add_two_ints in attempt 176349
[ERROR] [1592908714.084723845]: Failed to call service add_two_ints in attempt 176430
[ERROR] [1592908841.292468750]: Failed to call service add_two_ints in attempt 282092
[ERROR] [1592908917.783882743]: Failed to call service add_two_ints in attempt 347690
[ERROR] [1592908917.923579974]: Exception thrown while while deserializing service call: Buffer Overrun
[ERROR] [1592908917.923624772]: Failed to call service add_two_ints in attempt 347803
[ERROR] [1592909524.440756998]: Failed to call service add_two_ints in attempt 851196
…

Background information

For those who are interested, some background on why we need so many service calls that these issues become relevant: we have an application that performs extensive precomputation of kinematic reachability and therefore frequently calls an inverse kinematics service via the MoveIt interface. When porting this application from kinetic to melodic, segmentation faults occurred after some computing time; these have since been fixed by applying PR #1950. While debugging the application further, we noticed the service call failures described above.

@stevegolton

stevegolton commented Jul 16, 2020

I believe I am seeing the same issue, but it seems to occur only on armhf (Raspbian) and not on amd64.

Did you get anywhere with this?

Edit: Seen on ROS Noetic on Raspbian Buster armhf. Can't seem to reproduce it on Ubuntu 20.04 Noetic amd64.

@astero-a11y

I'm experiencing the same issue on Ubuntu 18.04 (melodic) and 20.04 (noetic) running on an Intel Core i7-3820 processor.

@dirk-thomas Any idea what's the problem here?

@chris20623
Author

@stevegolton Unfortunately, I still have no idea how to resolve this issue.

Did you run the modified beginner_tutorials nodes on your amd64 machine, or was it your own code? How long did you run it? Sometimes it takes several hours until the first service call failure occurs. Additionally, I have the impression that failures occur more frequently when there is higher load on the rosmaster, e.g. when running multiple instances of the test nodes in parallel.
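For example, several instances of the example client can be started in parallel roughly like this (assuming the modified package has been built and the workspace sourced):

for i in 1 2 3 4; do
  rosrun beginner_tutorials add_two_ints_client &
done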

We have seen the issue on Intel Xeon E3-1505M and Intel Core i7-6700K CPUs, using Ubuntu 18.04 and ROS melodic.

@stevegolton

stevegolton commented Jul 17, 2020

Ah ok, I did run the beginner tutorials on Intel, but only for about 10 minutes. I was running my own code on Raspbian, which does pretty much the same thing - it calls a trivial service in a tight loop - but there the issue would usually occur within the first 10 seconds or so.

@stevegolton

It seems the buffer overrun error originates from:
roscpp_core/roscpp_serialization/include/ros/serialization.h:681

  ROS_FORCE_INLINE uint8_t* advance(uint32_t len)
  {
    uint8_t* old_data = data_;
    data_ += len;
    if (data_ > end_)
    {
      // Throwing directly here causes a significant speed hit due to the extra code generated
      // for the throw statement
      throwStreamOverrun(); // <--- HERE
    }
    return old_data;
  }

So the data_ > end_ check is failing somewhere.

Indeed, with debug symbols included, I ran an example under GDB and set a breakpoint on this line. When the issue was hit, end_ was NULL, data_ had just been advanced from NULL by len, and len was 1.
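For reference, the session can be set up roughly like this (a sketch only; the binary path assumes a standard catkin devel layout, and up is used because the breakpoint triggers inside throwStreamOverrun, one frame below advance):

$ gdb --args devel/lib/beginner_tutorials/add_two_ints_client
(gdb) break throwStreamOverrun
(gdb) run
(gdb) up   # move up into advance() to inspect data_, end_ and len

Inspecting the stream members then gives: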

(gdb) print data_
$1 = (uint8_t *) 0x1 <error: Cannot access memory at address 0x1>
(gdb) print end_
$2 = (uint8_t *) 0x0
(gdb) print len
$3 = 1

So it seems somewhere an IStream object is being created with NULL values. Here is where the IStream is created:
roscpp_core/roscpp_serialization/include/ros/serialization.h:858

template<typename M>
inline void deserializeMessage(const SerializedMessage& m, M& message)
{
  IStream s(m.message_start, static_cast<uint32_t>(m.num_bytes - (m.message_start - m.buf.get())));
  deserialize(s, message);
}

So it looks like m.message_start is NULL and was probably never initialized; with num_bytes and buf also at their defaults, the IStream is constructed with NULL data and zero length, which matches the GDB values above.
Looking further up the stack, the issue seems to start here:
ros_comm/roscpp/include/ros/service_client.h:111

template<typename MReq, typename MRes>
  bool call(const MReq& req, MRes& resp, const std::string& service_md5sum)
  {
    namespace ser = serialization;
    SerializedMessage ser_req = ser::serializeMessage(req);
    SerializedMessage ser_resp;
    bool ok = call(ser_req, ser_resp, service_md5sum);
    if (!ok)
    {
      return false;
    }

    try
    {
      ser::deserializeMessage(ser_resp, resp);
    }
    catch (std::exception& e)
    {
      deserializeFailed(e);
      return false;
    }

    return true;
  }

The constructor for SerializedMessage looks like this:

SerializedMessage()
  : buf(boost::shared_array<uint8_t>())
  , num_bytes(0)
  , message_start(0)
  , type_info(0)
  {}

So I guess message_start was never initialized in the call method. My guess is that the "Buffer Overrun" error appears in the cases where call returns true but never fills in the ser_resp message, and the other failures are the cases where call simply returns false without printing any error message internally.
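To make that failure mode explicit: with an untouched, default-constructed ser_resp, deserializeMessage builds an IStream with NULL data and zero length, which is consistent with the GDB values above. A guard along the following lines in the call template, placed after the ok check and before the deserialization (a sketch only, not a proposed patch), would turn that case into an ordinary failure instead of a Buffer Overrun exception:

    // Sketch only (not the actual roscpp code): reject a response that the
    // lower layer reported as OK but never actually filled in.
    if (!ser_resp.message_start)
    {
      return false;
    }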

Looking deeper into this will require some more poking around in the internals of service_client and service_client_link, which I am struggling to wrap my head around.

@chris20623
Author

After further debugging this issue, I have found at least one possible execution behavior which causes service call failures. The mechanism seems to be that the TCP connection is closed via the POLLRDHUP event before the client can receive the response from the server (for details, see below).

When I remove the POLLRDHUP case from TransportTCP::socketUpdate (line 707 in melodic-devel), I can no longer reproduce these service call failures.

The listener to the POLLRDHUP event has been introduced by PR #1704, which has not yet been applied to the kinetic release. This probably explains why we have observed different behavior on melodic and kinetic.

@cwecht @dirk-thomas Would it perhaps make sense to add the POLLRDHUP event listener only for topic connections, but not for (non-persistent) service connections? Or do you see a better way to fix this?

Detailed analysis

Typical execution sequence of a failing service call (line numbers referring to melodic-devel):

1. Client: invoke ServiceClient::call() from add_two_ints_client.cpp:22
2. Client: invoke ServiceServerLink::call() from service_client.cpp:143
3. Client: invoke ServiceServerLink::processNextCall() from service_server_link.cpp:362
4. Client: invoke Connection::write() from service_server_link.cpp:325
5. Server: add() in add_two_ints_server.cpp, from service_publication.cpp:125
6. Server: ServiceClientLink::processResponse() from service_publication.cpp:128
7. Server: Connection::write() from service_client_link.cpp:240
8. Server: ServiceClientLink::onResponseWritten()
9. Server: Connection::drop() from service_client_link.cpp:233
10. Server: TransportTCP::close() from connection.cpp:340
11. Client: TransportTCP::socketUpdate() with event 8196 (POLLRDHUP | POLLOUT)
12. Client: TransportTCP::close() from transport_tcp:725
13. Client: Connection::drop() with reason 0 (TransportDisconnect)
14. Client: ServiceServerLink::onConnectionDropped()
15. Client: ServiceServerLink::clearCalls() from service_server_link.cpp:175
16. Client: ServiceServerLink::cancelCall()
17. Client: Connection::write() returns (this is the call started in step 4)
18. Client: ServiceServerLink::call() returns false in service_server_link.cpp:381 (since the call has been canceled before the response has been received)
19. Client: ServiceClient::call() returns false in service_client.cpp:152
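As a side note on the event value in step 11: 8196 is simply POLLRDHUP | POLLOUT with the Linux <poll.h> definitions (0x2000 | 0x0004), which can be verified with a few lines of code:

// Verifies that 8196 == POLLRDHUP | POLLOUT on Linux.
// POLLRDHUP requires _GNU_SOURCE (defined by default by g++).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <poll.h>
#include <cstdio>

int main()
{
  std::printf("POLLRDHUP | POLLOUT = %d\n", POLLRDHUP | POLLOUT);  // prints 8196
  return 0;
}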

Of course, in most cases the timing is such that Connection::write() returns early enough that the callbacks ServiceServerLink::onRequestWritten() and ServiceServerLink::onResponseOkAndLength() are ready to receive the response correctly. Only in rare cases, depending on system load etc., does Connection::write() take so long that the response is missed, as in the sequence above.

I believe this might not be the only way in which a service call can fail due to race conditions or similar issues. At least the buffer overrun error that @stevegolton has investigated seems to have a different cause.

@simonschmeisser

Could the "shutdown" be delayed at any of the stages mentioned in the investigation? Either by locks/mutexes or by ensuring all received data has been handled? I'm really not keen on adding retry loops around every service call ... we have seen this after 9M service calls on an otherwise idle pc or more often when stressed.

@flixr
Contributor

flixr commented Mar 25, 2021

We also encountered this sporadically. As a workaround we are now using persistent service connections...
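For anyone looking for that workaround: a persistent connection is requested via the second argument of NodeHandle::serviceClient, roughly like this for the example client above (a sketch; note that a persistent client becomes invalid if the connection drops and then has to be re-created):

  // Persistent connection: the TCP connection is kept open across calls instead
  // of being set up and torn down for every call.
  ros::ServiceClient client =
      n.serviceClient<beginner_tutorials::AddTwoInts>("add_two_ints", true);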

k-okada added a commit to k-okada/jsk_roseus that referenced this issue Dec 10, 2021
k-okada added a commit to k-okada/jsk_roseus that referenced this issue Dec 10, 2021