
Prevent Invalid callback execution. Resolves #577 #654

Closed

Conversation

rethink-imcmahon

This issue manifested with periodic coredumps in the onRetryTimer()
callback in transport_publisher_link.cpp. These crashes were infrequent,
and extremely difficult to reproduce. They appeared to happen when
many Subscriptions were being destroyed, which made it likely to be
a destruction scheduling anomaly.

The underlying issue was a copy of CallbackInfo being created in
callback_queue.cpp. By creating a copy of this object, its shared
pointer reference count for CallbackInterface was incremented,
preventing destruction of the CallbackInterface by an external thread,
even if all of the other Subscription resources had been destroyed.
This led to partial execution of the onRetryTimer callback, until
it attempted to access the destroyed Subscription parent_, causing an
invalid memory access and a coredump. This issue only occurs if
the Subscription is destroyed while the callback_queue is blocked
attempting to lock the boost::shared_lock<boost::shared_mutex> rw_lock(),
which makes this a time-dependent thread scheduling issue.

To remedy the invalid callback, a reference to the CallbackInfo
object in Thread Local Storage (tls) is made rather than a copy. This
prevents the CallbackInterface shared pointer reference count from being
incremented (since only the reference is taken). To access the callback
for execution, a weak pointer is made from the CallbackInterface. Then
after the mutex lock is acquired, the weak pointer is used to create a
shared pointer for callback execution. If the weak pointer fails to
return a shared pointer, the CallbackInterface has been destroyed.
In that case, the callback is Invalid and execution is prevented.

@ros-pull-request-builder
Member

Can one of the admins verify this patch?

@rethink-imcmahon
Author

I just wanted to get this up here for all to review. I tested this patch with my listener script spawning / destroying 100 subscribers every 0.5 seconds for 16 hours with no coredump.

We will apply this patch to ros_comm on our virtual robots internally, and report back if this definitively solves the issue after a few days.

@rethink-imcmahon
Author

@jeffadamsc @chris-smith if you could take a quick peek at my logic here, it'd be much appreciated :)

@rethink-imcmahon
Author

Also, I believe the reason that #615 failed was that it did not have the second half of the solution: once you take a reference to CallbackInfo instead of a copy, taking a reference to info.callback no longer increments the CallbackInterface shared pointer reference count. That object is then destroyed when the Subscription is destroyed, causing a segfault when attempting to access
cb->call(). The weak_ptr works around that limitation by checking that the CallbackInterface is still a valid shared_ptr before executing the callback.

@jeffadamsc

This logic looks sound to me and I can see how it would fix this bug. That 16 hour test used to fail about every 60 seconds, so that's pretty convincing.

@@ -378,8 +384,11 @@ CallbackQueue::CallOneResult CallbackQueue::callOneCB(TLS* tls)
}
else
{
tls->cb_it = tls->callbacks.erase(tls->cb_it);
Contributor

I would think we should keep this erase in here... at least in the case where we can get a lock on the weak ptr.

Author

Oh, wow! Good catch on that typo.

Author

This was an artifact that I introduced when cleaning this patch up with comments today. Last night's test included this line.

Contributor

👍

@dirk-thomas
Member

The logic of the patch looks sound to me too. I can imagine that this case is difficult to reproduce.

@dirk-thomas
Member

@@ -72,6 +73,7 @@ class ROSCPP_DECL CallbackInterface
virtual bool ready() { return true; }
};
typedef boost::shared_ptr<CallbackInterface> CallbackInterfacePtr;
typedef boost::weak_ptr<CallbackInterface> CallbackInterfaceWPtr;
Member

Can you please rename the typedef to be a little bit more verbose:

CallbackInterfaceWeakPtr

@rethink-imcmahon
Author

@dirk-thomas sure. I have no preference what name the typedef has, but I was attempting to follow the convention in the other header files of ros_comm. Let me know if you'd still prefer it as WeakPtr and I'll push up the change.

@ros-pull-request-builder
Member

Test failed.
Refer to this link for build results: http://jenkins.ros.org/job/_pull_request-indigo-ros_comm/296/

@dirk-thomas
Member

@rethink-imcmahon I am probably used to the names we use in ROS 2 currently. With all the references you provided, just keep it as is (WPtr).

@dirk-thomas
Member

We have to figure out if the failing tests are just flaky or an actual problem. I will trigger the job a few times...

@ros-pull-request-builder
Member

Test failed.
Refer to this link for build results: http://jenkins.ros.org/job/_pull_request-indigo-ros_comm/297/

@rethink-imcmahon
Author

I looked into these test failures a bit today, meaning that I dove into the rather complicated swath of code that is message_filters. Test results from my local workstation included some successes and a failure. Surprisingly, only one of the tests that failed in the Jenkins jobs also failed on my workstation:

$ catkin_make run_tests_message_filters_rostest
...
...
[ROSTEST]-----------------------------------------------------------------------

[message_filters.rosunit-test_subscriber/simple][passed]
[message_filters.rosunit-test_subscriber/subUnsubSub][passed]
[message_filters.rosunit-test_subscriber/subInChain][passed]
[message_filters.rosunit-test_subscriber/singleNonConstCallback][passed]
[message_filters.rosunit-test_subscriber/multipleNonConstCallbacksFilterSubscriber][passed]
[message_filters.rosunit-test_subscriber/multipleCallbacksSomeFilterSomeDirect][passed]

SUMMARY
 * RESULT: SUCCESS
 * TESTS: 6
 * ERRORS: 0
 * FAILURES: 0

rostest log file is in /data/users/imcmahon/.ros/log/rostest-Threepio-32131.log
-- run_tests.py: verify result "/data/users/imcmahon/dev/ros_core_ws/build/test_results/message_filters/rostest-test_test_subscriber.xml"
Built target run_tests_message_filters_rostest_test_test_subscriber.xml
testtime_sequencer ... ERROR!
ERROR: test max time allotted
  File "/usr/lib/python2.7/unittest/case.py", line 331, in run
    testMethod()
  File "/data/users/imcmahon/dev/ros_core_ws/src/ros_comm/tools/rostest/src/rostest/runner.py", line 144, in fn
    self.test_parent.run_test(test)
  File "/data/users/imcmahon/dev/ros_core_ws/src/ros_comm/tools/rostest/src/rostest/rostest_parent.py", line 101, in run_test
    return self.runner.run_test(test)
  File "/data/users/imcmahon/dev/ros_core_ws/src/ros_comm/tools/roslaunch/src/roslaunch/launch.py", line 671, in run_test
    raise RLTestTimeoutException("test max time allotted")
--------------------------------------------------------------------------------

[ROSTEST]-----------------------------------------------------------------------


SUMMARY
 * RESULT: FAIL
 * TESTS: 0
 * ERRORS: 1
 * FAILURES: 0

The timeout failure implies that the issue occurs in a while loop, and eventInEventOut is a good contender.

It almost seems like time_sequencer.h is an alternate implementation of the callback_queue in roscpp. Looking into the underlying simple_filter.h made my head spin a bit. Any insight as to what this test is attempting to accomplish would be greatly appreciated.

@rethink-imcmahon
Author

I've been working with this patch in my workspace for the last several days without a coredump. However, today I realized my robot state publisher was for some reason refusing to publish fixed joint frames. I managed to narrow the issue down to this callback not being executed:

  timer_ = n_tilde.createTimer(publish_interval_, &JointStateListener::callbackFixedJoint, this);

Which is not great. After removing the ros_comm with this patch from my workspace, everything ran fine. I need to investigate this further, as it seems this patch has some unintended consequences with respect to ROS Timers.

@rethink-imcmahon
Author

I looked into the broken Timer callbacks today. Fortunately the issue is highly reproducible. When the patch is applied, 0% of the Timer callbacks are executed, and without this invalid callback patch, the Timer callbacks are invoked 100% of the time. Here are the gdb execution traces.
Functioning Timer callbacks without the patch:
https://gist.github.com/rethink-imcmahon/26dc7ad7bcaf768a8af6#file-timer_callback_functioning-cpp-L119-L143

362   CallbackInfo info = *tls->cb_it;
(gdb) n
363   CallbackInterfacePtr& cb = info.callback;
(gdb) print info
$2 = {callback = {px = 0x7fffe0000ae0, pn = {pi_ = 0x7fffe0000b30}}, removal_id = 6405040, marked_for_removal = false}
(gdb) n
365   IDInfoPtr id_info = getIDInfo(info.removal_id);
(gdb) n
366   if (id_info)
(gdb) n
368     boost::shared_lock<boost::shared_mutex> rw_lock(id_info->calling_rw_mutex);
(gdb) n
370     uint64_t last_calling = tls->calling_in_this_thread;
(gdb) n
371     tls->calling_in_this_thread = id_info->id;
(gdb) n
373     CallbackInterface::CallResult result = CallbackInterface::Invalid;
(gdb) n
375       if (info.marked_for_removal)
(gdb) n
381         tls->cb_it = tls->callbacks.erase(tls->cb_it);
(gdb) n
382         result = cb->call();
(gdb) n
[ INFO] [1441056798.935178020]: Callback 2 triggered

Aaaand the broken callback execution (with patch 654):
https://gist.github.com/rethink-imcmahon/d3ef40c66b1b1fcdf2df#file-timer_callback_broken-cpp-L94-L118

363   CallbackInfo& info = *tls->cb_it;
(gdb) n
369   CallbackInterfaceWPtr weak_cb = info.callback;
(gdb) n
371   IDInfoPtr id_info = getIDInfo(info.removal_id);
(gdb) n
372   if (id_info)
(gdb) n
374     boost::shared_lock<boost::shared_mutex> rw_lock(id_info->calling_rw_mutex);
(gdb) n
376     uint64_t last_calling = tls->calling_in_this_thread;
(gdb) n
377     tls->calling_in_this_thread = id_info->id;
(gdb) n
379     CallbackInterface::CallResult result = CallbackInterface::Invalid;
(gdb) n
381       if (info.marked_for_removal)
(gdb) n
387         tls->cb_it = tls->callbacks.erase(tls->cb_it);
(gdb) n
389         if (CallbackInterfacePtr cb = weak_cb.lock())
(gdb) n
392         }
(gdb) print weak_cb.lock()
$1 = {px = 0x0, pn = {pi_ = 0x0}}

So weak_cb is not able to construct a shared_ptr of the CallbackInterface. Very interesting! This implies that the interface is being destroyed in between making the reference and executing the callback. That would also mean that prior to this patch, the only reference to the CallbackInterface shared pointer was held here, created by the copy constructor in 362 CallbackInfo info = *tls->cb_it;... eep! I am not clear on why this should be the only reference, and that will require a bit more digging.

@rethink-imcmahon
Author

I have modified the weak_ptr to be a shared_ptr, but not a reference to a shared_ptr, which was the issue with #615. You can find it here: rethink-imcmahon#2. Currently testing this on my workstation with the crazy subscriber and rostimer roscpp_tutorials running at the same time. As of right now, the timers are working and the subscribers are not coredumping. I will test overnight, then apply to our virtual machines and report back.

@rethink-imcmahon
Author

@dirk-thomas could you please kick off another pull request builder? This patch appears to resolve both the Timer issue and the Callback issue on my local workstation (at least I have no coredumps in the last 24 hours). Thanks!

@dirk-thomas
Member

Sure, thank you for investigating the problem!

@ros-pull-request-builder
Member

Test failed.
Refer to this link for build results: http://jenkins.ros.org/job/_pull_request-indigo-ros_comm/318/

@rethink-imcmahon
Author

Welp. 3 failures is better than 8 failures, I suppose... I'll get digging

@dirk-thomas
Member

That looks better than "just better". I think these three tests currently fail on "master" the same way and so are not related (http://jenkins.ros.org/job/devel-indigo-ros_comm/191/testReport/). I think you can consider this state "stable".

@dirk-thomas
Member

Yeah, references of smart pointers are really bad since they undermine the reference counting. I can perfectly imagine how the current patch is fixing the problem.

@rethink-imcmahon
Author

Well, that's good news. I will have this patch applied to our virtual servers. That's a fairly time consuming process, so I appreciate the test run here. I'll report back after a few days of rigorous testing.

@ros-pull-request-builder
Member

Test failed.
Refer to this link for build results: http://jenkins.ros.org/job/_pull_request-indigo-ros_comm/319/

@rethink-imcmahon
Author

The QA team got back to me with a new coredump today. Looks like my most recent patch (above) put us back to square one. The backtrace is nearly identical to that in the initial #577 issue:

Program terminated with signal SIGSEGV, Segmentation fault.
(gdb) bt full
#0  atomic_conditional_increment (pw=0x31223d72) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:92
        rv = <optimized out>
#1  add_ref_lock (this=0x31223d6e) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:139
No locals.
#2  shared_count (r=..., this=<synthetic pointer>) at /usr/include/boost/smart_ptr/detail/shared_count.hpp:589
No locals.
#3  shared_ptr<ros::Subscription> (r=..., this=<synthetic pointer>) at /usr/include/boost/smart_ptr/shared_ptr.hpp:414
No locals.
#4  lock (this=0xe091dabc) at /usr/include/boost/smart_ptr/weak_ptr.hpp:178
No locals.
#5  ros::TransportPublisherLink::onRetryTimer (this=0xe091dab0)
    at /build/ros-indigo-roscpp-cbwEiP/ros-indigo-roscpp-1.11.8/src/libros/transport_publisher_link.cpp:204
        parent = {px = 0x0, pn = {pi_ = 0x31223d6e}}
        __PRETTY_FUNCTION__ = "void ros::TransportPublisherLink::onRetryTimer(const ros::WallTimerEvent&)"
#6  0xe4a9ffc9 in operator() (a1=..., p=<optimized out>, this=0xe0b3c870)
    at /usr/include/boost/bind/mem_fn_template.hpp:165
No locals.
#7  operator()<boost::_mfi::mf1<void, ros::TransportPublisherLink, const ros::WallTimerEvent&>, boost::_bi::list1<const ros::WallTimerEvent&> > (a=<synthetic pointer>, f=..., this=0xe0b3c878) at /usr/include/boost/bind/bind.hpp:313
No locals.
#8  operator()<ros::WallTimerEvent> (a1=..., this=0xe0b3c870) at /usr/include/boost/bind/bind_template.hpp:47
No locals.
#9  boost::detail::function::void_function_obj_invoker1<boost::_bi::bind_t<void, boost::_mfi::mf1<void, ros::TransportPublisherLink, ros::WallTimerEvent const&>, boost::_bi::list2<boost::_bi::value<ros::TransportPublisherLink*>, boost::arg<1> > >, void, ros::WallTimerEvent const&>::invoke (function_obj_ptr=..., a0=...)
    at /usr/include/boost/function/function_template.hpp:153
        f = 0xe0b3c870
#10 0xe4a475b0 in operator() (a0=..., this=<optimized out>) at /usr/include/boost/function/function_template.hpp:767
No locals.
#11 ros::TimerManager<ros::WallTime, ros::WallDuration, ros::WallTimerEvent>::TimerQueueCallback::call (
    this=0xdf500788) at /build/ros-indigo-roscpp-cbwEiP/ros-indigo-roscpp-1.11.8/include/ros/timer_manager.h:184
        tracked = {px = <optimized out>, pn = {pi_ = 0x0}}
        event = {last_expected = {<ros::TimeBase<ros::WallTime, ros::WallDuration>> = {sec = 1442302324, 
              nsec = 909113830}, <No data fields>}, last_real = {<ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
              sec = 1442302324, nsec = 910173803}, <No data fields>}, 
          current_expected = {<ros::TimeBase<ros::WallTime, ros::WallDuration>> = {sec = 1442302325, 
              nsec = 9113830}, <No data fields>}, 
          current_real = {<ros::TimeBase<ros::WallTime, ros::WallDuration>> = {sec = 1442302325, 
              nsec = 9394871}, <No data fields>}, profile = {
            last_duration = {<ros::DurationBase<ros::WallDuration>> = {sec = 0, nsec = 614}, <No data fields>}}}
        cb_start = {<ros::TimeBase<ros::WallTime, ros::WallDuration>> = {sec = 1442302325, 
            nsec = 9395058}, <No data fields>}
        cb_end = {<ros::TimeBase<ros::WallTime, ros::WallDuration>> = {sec = 3836862464, 
            nsec = 3767529904}, <No data fields>}
        info = {px = 0xe0b3c860, pn = {pi_ = 0xe0b228a0}}
#12 0xe4a751c0 in ros::CallbackQueue::callOneCB (this=this@entry=0x9835438, tls=tls@entry=0xe0a00468)
    at /build/ros-indigo-roscpp-cbwEiP/ros-indigo-roscpp-1.11.8/src/libros/callback_queue.cpp:386
        rw_lock = {m = 0xdf5004b8, is_locked = true}
        last_calling = 18446744073709551615
        result = ros::CallbackInterface::Invalid
        __PRETTY_FUNCTION__ = "ros::CallbackQueue::CallOneResult ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*)"
        info = @0xe0a44ab0: {callback = {px = 0xdf500788, pn = {pi_ = 0xdf500628}}, 
          removal_id = 18446744073184462944, marked_for_removal = false}
        cb = {px = 0xdf500788, pn = {pi_ = 0xdf500628}}
        id_info = {px = 0xdf5004b0, pn = {pi_ = 0xdf500f18}}
#13 0xe4a76104 in ros::CallbackQueue::callAvailable (this=0x9835438, timeout=...)
    at /build/ros-indigo-roscpp-cbwEiP/ros-indigo-roscpp-1.11.8/src/libros/callback_queue.cpp:333
        tls = 0xe0a00468
        called = 0
#14 0xe4aa928b in ros::internalCallbackQueueThreadFunc ()
    at /build/ros-indigo-roscpp-cbwEiP/ros-indigo-roscpp-1.11.8/src/libros/init.cpp:282
        queue = {px = 0x9835438, pn = {pi_ = 0x98391c0}}
#15 0xe4aabe3d in boost::detail::thread_data<void (*)()>::run (this=0x9839ab8)
    at /usr/include/boost/thread/detail/thread.hpp:117
No locals.
#16 0xe47f1681 in ?? () from /usr/lib/i386-linux-gnu/libboost_thread.so.1.54.0
No symbol table info available.
#17 0xf7621f70 in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
No symbol table info available.
#18 0xf754abee in clone () from /lib/i386-linux-gnu/libc.so.6
No symbol table info available.

@rethink-imcmahon
Author

Superseded by #670
