Resolves #577 by reworking onRetryTimer Sub logic #670

rethink-imcmahon · 2015-09-23T21:08:04Z

This issue manifested with random coredumps in the onRetryTimer()
callback in transport_publisher_link.cpp. These crashes were infrequent,
and extremely difficult to reproduce. They appeared to happen when
many Subscriptions were being destroyed, which made it likely to be
a destruction scheduling anomaly.

There were multiple potential causes this issue, and after investigating
several, it was determined that the onRetryTimer function was attempting
to access its parent Subscription instance during CallbackQueue execution.
This could lead to a scenario where invalid memory is accessed if all of
the other Subscription resources had been destroyed in another thread.

In that rare circumstance, the partial execution of the onRetryTimer
callback attempted to access the destroyed Subscription parent_, causing
an invalid memory access, and coredump. This issue only occurs if
the Subscription is destroyed while the callback_queue is blocked by
attempting to lock the boost::shared_lock<boost::shared_mutex> rw_lock()
inside the CallbackQueue::callOneCB function, which makes this a
time-dependant thread scheduling issue.

To solve this issue, some of the logic in onRetryTimer had to be reworked
to avoid accessing the parent Subscription in case it has been destroyed.
Thought was given to mutex locking the parent as it was accessed, but this
resulted in the Subscription class being destroyed while its member mutex
was locked in onRetryTimer inside the CallbackQueue thread, causing a
pthread mutex error. This occurs since mutexes cannot be destroyed if they
are locked. Instead, the need for the parent Subscription access in
onRetryTimer was removed entirely.

The function onRetryTimer() accesses its parent Subscription to attempt
to get the name of the topic that needs to be reconnected in the event
of a disconnect and add that name to a log message. We simply removed
the topic name from the log messages. The second problematic parent
instance access occurred to determine if the Transport type is TCPROS.
If it was anything else (say UDP), the publisher link is dropped.
Checking the type of Transport connection was moved into the main
Subscription thread via the onConnectionDropped() method, so the check
can happen even before the onRetryTimer method is bound to the
CallbackQueue. UDP retry connection dropping should function exactly
the same as before this patch.

The end result of this investigation and patch is less of a likelihood
for invalid pieces of memory to be accessed during normal ROS callback
execution.

This issue manifested with random coredumps in the onRetryTimer() callback in transport_publisher_link.cpp. These crashes were infrequent, and extremely difficult to reproduce. They appeared to happen when many Subscriptions were being destroyed, which made it likely to be a destruction scheduling anomaly. There were multiple potential causes this issue, and after investigating several, it was determined that the onRetryTimer function was attempting to access its parent Subscription instance during CallbackQueue execution. This could lead to a scenario where invalid memory is accessed if all of the other Subscription resources had been destroyed in another thread. In that rare circumstance, the partial execution of the onRetryTimer callback attempted to access the destroyed Subscription parent_, causing an invalid memory access, and coredump. This issue only occurs if the Subscription is destroyed while the callback_queue is blocked by attempting to lock the boost::shared_lock<boost::shared_mutex> rw_lock() inside the CallbackQueue::callOneCB function, which makes this a time-dependant thread scheduling issue. To solve this issue, some of the logic in onRetryTimer had to be reworked to avoid accessing the parent Subscription in case it has been destroyed. Thought was given to mutex locking the parent as it was accessed, but this resulted in the Subscription class being destroyed while its member mutex was locked in onRetryTimer inside the CallbackQueue thread, causing a pthread mutex error. This occurs since mutexes cannot be destroyed if they are locked. Instead, the need for the parent Subscription access in onRetryTimer was removed entirely. The function onRetryTimer() accesses its parent Subscription to attempt to get the name of the topic that needs to be reconnected in the event of a disconnect and add that name to a log message. We simply removed the topic name from the log messages. The second problematic parent instance access occurred to determine if the Transport type is TCPROS. If it was anything else (say UDP), the publisher link is dropped. Checking the type of Transport connection was moved into the main Subscription thread via the onConnectionDropped() method, so the check can happen even before the onRetryTimer method is bound to the CallbackQueue. UDP retry connection dropping should function exactly the same as before this patch. The end result of this investigation and patch is less of a likelihood for invalid pieces of memory to be accessed during normal ROS callback execution.

ros-pull-request-builder · 2015-09-23T21:08:05Z

Can one of the admins verify this patch?

rethink-imcmahon · 2015-09-23T21:09:21Z

Supersedes both #654 and #660.

rethink-imcmahon · 2015-09-23T21:12:46Z

This patch has been tested for the last 24 hours against all of our virtual robots with no coredumps or obvious errors in testing.

rethink-imcmahon · 2015-09-23T21:16:51Z

Huge thanks to @jeffadamsc for sitting down with me for several hours to devise this solution. Mind giving this a quick once over, Jeff?

dirk-thomas · 2015-09-23T21:19:58Z

@ros-pull-request-builder

rethink-imcmahon · 2015-09-23T21:26:35Z

Couple quick thoughts on this patch:

I think this is likely to pass the pull request builder, since its basically the same patch as Possible solution to #577 by reworking onRetryTimer() logic #660 .
It may make sense to get a few people familiar with ros_comm to review the code, since this is a logic rework to avoid the bug, rather than a strict bug fix.
I will give an update on our nightly runs status with this patch by the end of the week.

ros-pull-request-builder · 2015-09-23T21:48:33Z

Test passed.
Refer to this link for build results: http://jenkins.ros.org/job/_pull_request-indigo-ros_comm/394/

rethink-imcmahon · 2015-09-24T19:04:31Z

Nope. Three coredumps from last night:
1)

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xe47918a9 in __gxx_personality_v0 ()
   from /usr/lib/i386-linux-gnu/libstdc++.so.6
(gdb) bt
#0  0xe47918a9 in __gxx_personality_v0 ()
   from /usr/lib/i386-linux-gnu/libstdc++.so.6
#1  0xe09ff080 in ?? ()
#2  0xe497906d in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#3  0xe4979f64 in ros::CallbackQueue::callAvailable(ros::WallDuration) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#4  0xe49acbdb in ros::shutdownCallback(XmlRpc::XmlRpcValue&, XmlRpc::XmlRpcValue&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#5  0xe49af78d in boost::detail::function::function_invoker2<bool (*)(roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&), bool, roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&>::invoke(boost::detail::function::function_buffer&, roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#6  0x00000000 in ?? ()

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xe49b3923 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
(gdb) bt
#0  0xe49b3923 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#1  0xe49b4919 in ros::TransportPublisherLink::onConnectionDropped(boost::shared_ptr<ros::Connection> const&, ros::Connection::DropReason) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#2  0xe498a06d in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#3  0xe498af64 in ros::CallbackQueue::callAvailable(ros::WallDuration) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#4  0xe49bdbdb in ros::shutdownCallback(XmlRpc::XmlRpcValue&, XmlRpc::XmlRpcValue&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#5  0xe49c078d in boost::detail::function::function_invoker2<bool (*)(roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&), bool, roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&>::invoke(boost::detail::function::function_buffer&, roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xe4a2d923 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
(gdb) bt
#0  0xe4a2d923 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#1  0xe4a2e919 in ros::TransportPublisherLink::onConnectionDropped(boost::shared_ptr<ros::Connection> const&, ros::Connection::DropReason) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#2  0xe4a0406d in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#3  0xe4a04f64 in ros::CallbackQueue::callAvailable(ros::WallDuration) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#4  0xe4a37bdb in ros::shutdownCallback(XmlRpc::XmlRpcValue&, XmlRpc::XmlRpcValue&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#5  0xe4a3a78d in boost::detail::function::function_invoker2<bool (*)(roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&), bool, roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&>::invoke(boost::detail::function::function_buffer&, roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so

This is leading us down a different path. I'll give an update after we have a new patch.

rethink-imcmahon · 2015-10-07T16:14:27Z

Superseded by #677

This was referenced Sep 23, 2015

Prevent Invalid callback execution. Resolves #577 #654

Closed

Possible solution to #577 by reworking onRetryTimer() logic #660

Closed

rethink-imcmahon closed this Oct 7, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resolves #577 by reworking onRetryTimer Sub logic #670

Resolves #577 by reworking onRetryTimer Sub logic #670

rethink-imcmahon commented Sep 23, 2015

ros-pull-request-builder commented Sep 23, 2015

rethink-imcmahon commented Sep 23, 2015

rethink-imcmahon commented Sep 23, 2015

rethink-imcmahon commented Sep 23, 2015

dirk-thomas commented Sep 23, 2015

rethink-imcmahon commented Sep 23, 2015

ros-pull-request-builder commented Sep 23, 2015

rethink-imcmahon commented Sep 24, 2015

rethink-imcmahon commented Oct 7, 2015

Resolves #577 by reworking onRetryTimer Sub logic #670

Resolves #577 by reworking onRetryTimer Sub logic #670

Conversation

rethink-imcmahon commented Sep 23, 2015

ros-pull-request-builder commented Sep 23, 2015

rethink-imcmahon commented Sep 23, 2015

rethink-imcmahon commented Sep 23, 2015

rethink-imcmahon commented Sep 23, 2015

dirk-thomas commented Sep 23, 2015

rethink-imcmahon commented Sep 23, 2015

ros-pull-request-builder commented Sep 23, 2015

rethink-imcmahon commented Sep 24, 2015

rethink-imcmahon commented Oct 7, 2015