Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolves #577 by reworking onRetryTimer Sub logic #670

Closed

Conversation

rethink-imcmahon
Copy link

This issue manifested with random coredumps in the onRetryTimer()
callback in transport_publisher_link.cpp. These crashes were infrequent,
and extremely difficult to reproduce. They appeared to happen when
many Subscriptions were being destroyed, which made it likely to be
a destruction scheduling anomaly.

There were multiple potential causes this issue, and after investigating
several, it was determined that the onRetryTimer function was attempting
to access its parent Subscription instance during CallbackQueue execution.
This could lead to a scenario where invalid memory is accessed if all of
the other Subscription resources had been destroyed in another thread.

In that rare circumstance, the partial execution of the onRetryTimer
callback attempted to access the destroyed Subscription parent_, causing
an invalid memory access, and coredump. This issue only occurs if
the Subscription is destroyed while the callback_queue is blocked by
attempting to lock the boost::shared_lock<boost::shared_mutex> rw_lock()
inside the CallbackQueue::callOneCB function, which makes this a
time-dependant thread scheduling issue.

To solve this issue, some of the logic in onRetryTimer had to be reworked
to avoid accessing the parent Subscription in case it has been destroyed.
Thought was given to mutex locking the parent as it was accessed, but this
resulted in the Subscription class being destroyed while its member mutex
was locked in onRetryTimer inside the CallbackQueue thread, causing a
pthread mutex error. This occurs since mutexes cannot be destroyed if they
are locked. Instead, the need for the parent Subscription access in
onRetryTimer was removed entirely.

The function onRetryTimer() accesses its parent Subscription to attempt
to get the name of the topic that needs to be reconnected in the event
of a disconnect and add that name to a log message. We simply removed
the topic name from the log messages. The second problematic parent
instance access occurred to determine if the Transport type is TCPROS.
If it was anything else (say UDP), the publisher link is dropped.
Checking the type of Transport connection was moved into the main
Subscription thread via the onConnectionDropped() method, so the check
can happen even before the onRetryTimer method is bound to the
CallbackQueue. UDP retry connection dropping should function exactly
the same as before this patch.

The end result of this investigation and patch is less of a likelihood
for invalid pieces of memory to be accessed during normal ROS callback
execution.

This issue manifested with random coredumps in the onRetryTimer()
callback in transport_publisher_link.cpp. These crashes were infrequent,
and extremely difficult to reproduce. They appeared to happen when
many Subscriptions were being destroyed, which made it likely to be
a destruction scheduling anomaly.

There were multiple potential causes this issue, and after investigating
several, it was determined that the onRetryTimer function was attempting
to access its parent Subscription instance during CallbackQueue execution.
This could lead to a scenario where invalid memory is accessed if all of
the other Subscription resources had been destroyed in another thread.

In that rare circumstance, the partial execution of the onRetryTimer
callback attempted to access the destroyed Subscription parent_, causing
an invalid memory access, and coredump. This issue only occurs if
the Subscription is destroyed while the callback_queue is blocked by
attempting to lock the boost::shared_lock<boost::shared_mutex> rw_lock()
inside the CallbackQueue::callOneCB function, which makes this a
time-dependant thread scheduling issue.

To solve this issue, some of the logic in onRetryTimer had to be reworked
to avoid accessing the parent Subscription in case it has been destroyed.
Thought was given to mutex locking the parent as it was accessed, but this
resulted in the Subscription class being destroyed while its member mutex
was locked in onRetryTimer inside the CallbackQueue thread, causing a
pthread mutex error. This occurs since mutexes cannot be destroyed if they
are locked. Instead, the need for the parent Subscription access in
onRetryTimer was removed entirely.

The function onRetryTimer() accesses its parent Subscription to attempt
to get the name of the topic that needs to be reconnected in the event
of a disconnect and add that name to a log message. We simply removed
the topic name from the log messages. The second problematic parent
instance access occurred to determine if the Transport type is TCPROS.
If it was anything else (say UDP), the publisher link is dropped.
Checking the type of Transport connection was moved into the main
Subscription thread via the onConnectionDropped() method, so the check
can happen even before the onRetryTimer method is bound to the
CallbackQueue. UDP retry connection dropping should function exactly
the same as before this patch.

The end result of this investigation and patch is less of a likelihood
for invalid pieces of memory to be accessed during normal ROS callback
execution.
@ros-pull-request-builder
Copy link
Member

Can one of the admins verify this patch?

@rethink-imcmahon
Copy link
Author

Supersedes both #654 and #660.

@rethink-imcmahon
Copy link
Author

This patch has been tested for the last 24 hours against all of our virtual robots with no coredumps or obvious errors in testing.

@rethink-imcmahon
Copy link
Author

Huge thanks to @jeffadamsc for sitting down with me for several hours to devise this solution. Mind giving this a quick once over, Jeff?

@dirk-thomas
Copy link
Member

@rethink-imcmahon
Copy link
Author

Couple quick thoughts on this patch:

  • I think this is likely to pass the pull request builder, since its basically the same patch as Possible solution to #577 by reworking onRetryTimer() logic #660 .
  • It may make sense to get a few people familiar with ros_comm to review the code, since this is a logic rework to avoid the bug, rather than a strict bug fix.
  • I will give an update on our nightly runs status with this patch by the end of the week.

@ros-pull-request-builder
Copy link
Member

Test passed.
Refer to this link for build results: http://jenkins.ros.org/job/_pull_request-indigo-ros_comm/394/

@rethink-imcmahon
Copy link
Author

Nope. Three coredumps from last night:
1)

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xe47918a9 in __gxx_personality_v0 ()
   from /usr/lib/i386-linux-gnu/libstdc++.so.6
(gdb) bt
#0  0xe47918a9 in __gxx_personality_v0 ()
   from /usr/lib/i386-linux-gnu/libstdc++.so.6
#1  0xe09ff080 in ?? ()
#2  0xe497906d in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#3  0xe4979f64 in ros::CallbackQueue::callAvailable(ros::WallDuration) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#4  0xe49acbdb in ros::shutdownCallback(XmlRpc::XmlRpcValue&, XmlRpc::XmlRpcValue&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#5  0xe49af78d in boost::detail::function::function_invoker2<bool (*)(roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&), bool, roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&>::invoke(boost::detail::function::function_buffer&, roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#6  0x00000000 in ?? ()
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xe49b3923 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
(gdb) bt
#0  0xe49b3923 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#1  0xe49b4919 in ros::TransportPublisherLink::onConnectionDropped(boost::shared_ptr<ros::Connection> const&, ros::Connection::DropReason) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#2  0xe498a06d in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#3  0xe498af64 in ros::CallbackQueue::callAvailable(ros::WallDuration) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#4  0xe49bdbdb in ros::shutdownCallback(XmlRpc::XmlRpcValue&, XmlRpc::XmlRpcValue&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#5  0xe49c078d in boost::detail::function::function_invoker2<bool (*)(roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&), bool, roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&>::invoke(boost::detail::function::function_buffer&, roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xe4a2d923 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
(gdb) bt
#0  0xe4a2d923 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#1  0xe4a2e919 in ros::TransportPublisherLink::onConnectionDropped(boost::shared_ptr<ros::Connection> const&, ros::Connection::DropReason) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#2  0xe4a0406d in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#3  0xe4a04f64 in ros::CallbackQueue::callAvailable(ros::WallDuration) ()
   from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#4  0xe4a37bdb in ros::shutdownCallback(XmlRpc::XmlRpcValue&, XmlRpc::XmlRpcValue&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so
#5  0xe4a3a78d in boost::detail::function::function_invoker2<bool (*)(roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&), bool, roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&>::invoke(boost::detail::function::function_buffer&, roscpp::EmptyRequest_<std::allocator<void> >&, roscpp::EmptyResponse_<std::allocator<void> >&) () from /opt/ros/indigo/lib/i386-linux-gnu/libroscpp.so

This is leading us down a different path. I'll give an update after we have a new patch.

@rethink-imcmahon
Copy link
Author

Superseded by #677

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants