-
Notifications
You must be signed in to change notification settings - Fork 913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent Invalid callback execution. Resolves #577 #654
Conversation
Can one of the admins verify this patch? |
I just wanted to get this up here for all to review. I tested this patch with my We will apply this patch to |
@jeffadamsc @chris-smith if you could take a quick peek at my logic here, it'd be much appreciated :) |
Also, I believe the reason that #615 failed was that it did not have the second half of the solution: once you're making reference of |
This logic looks sound to me and I can see how it would fix this bug. That 16 hour test used to fail about every 60 seconds, so that's pretty convincing. |
@@ -378,8 +384,11 @@ CallbackQueue::CallOneResult CallbackQueue::callOneCB(TLS* tls) | |||
} | |||
else | |||
{ | |||
tls->cb_it = tls->callbacks.erase(tls->cb_it); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think we should keep this erase in here... at least in the case where we can get a lock on the weak ptr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, wow! Good catch on that typo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was an artifact that I introduced when cleaning this patch up with comments today. Last night's test included this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
This issue manifested with periodic coredumps in the onRetryTimer() callback in transport_publisher_link.cpp. These crashes were infrequent, and extremely difficult to reproduce. They appeared to happen when many Subscriptions were being destroyed, which made it likely to be a destruction scheduling anomaly. The underlying issue was a copy of CallbackInfo being created in callback_queue.cpp. By creating a copy of this object, its shared pointer reference count for CallbackInterface was incremented, preventing destruction of the CallbackInterface by an external thread, even if all of the other Subscription resources had been destroyed. This lead to partial execution of the onRetryTimer callback, until it attempted to access the destroyed Subscription parent, causing an invalid memory access, and coredump. This issue only occurs if the Subscription is destroyed while the callback_queue is blocked by attempting to lock the boost::shared_lock<boost::shared_mutex> rw_lock() which would make this a time-dependant thread scheduling issue. To remedy the the invalid callback, a reference to the CallbackInfo object in Thread Local Storage (tls) is made rather than a copy. This prevents the CallbackInterface shared pointer reference count from being incremented (since only the reference is taken). To access the callback for execution, a weak pointer is made from the CallbackInterface. Then after the mutex lock is acquired, the weak pointer is used to create a shared pointer for callback execution. If the weak pointer fails to return a shared pointer, the CallbackInterface has been destroyed. In that case, the callback is Invalid and execution is prevented.
ca550dd
to
cceb61a
Compare
The logic of the patch looks sound to me too. I can imagine that this case is difficult to reproduce. |
@@ -72,6 +73,7 @@ class ROSCPP_DECL CallbackInterface | |||
virtual bool ready() { return true; } | |||
}; | |||
typedef boost::shared_ptr<CallbackInterface> CallbackInterfacePtr; | |||
typedef boost::weak_ptr<CallbackInterface> CallbackInterfaceWPtr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please rename the typedef to be a little bit more verbose:
CallbackInterfaceWeakPtr
@dirk-thomas sure. I have no preference what name the |
Test failed. |
@rethink-imcmahon I am probably used to the names we use in ROS 2 currently. With all the references you provided just keep it as is ( |
We have to figure out if the failing tests are just flaky or an actual problem. I will trigger the job a few times... |
Test failed. |
I looked into these test failures a bit today, meaning that I dove into the rather complicated swath of code that is
The timeout failure implies that the issue occurs in a while loop, and It almost seems like |
I've been working with this patch in my workspace for the last several days without coredump. However, today I realized my robot state publisher was for some reason refusing to publish fixed joint frames. I managed to narrow the issue down to this callback not being executed
Which is not great. After removing the |
I looked into the broken Timer callbacks today. Fortunately the issue is highly reproducible. When the patch is applied 0% of the Timer callbacks are executed, and without this invalid callback patch, the callback timers are invoked 100% of the time. Here's the
Aaaand the broken callback execution (with patch 654):
So the |
I have modified the |
@dirk-thomas could you please kick off another pull request builder? This patch appears to resolve both the Timer issue and the Callback issue on my local workstation (at least I have no coredumps in the last 24 hours). Thanks! |
Sure, thank you for investigating the problem! |
Test failed. |
Welp. 3 failures is better than 8 failures, I suppose... I'll get digging |
That looks better then "just better". I think these three tests currently fail on "master" the same way so are not related (http://jenkins.ros.org/job/devel-indigo-ros_comm/191/testReport/). I think you can consider this state "stable". |
Yeah, references of smart pointers are really bad since they undermine the reference counting. I can perfectly imagine how the current patch is fixing the problem. |
Well, that's good news. I will have this patch applied to our virtual servers. That's a fairly time consuming process, so I appreciate the test run here. I'll report back after a few days of rigorous testing. |
Test failed. |
The QA team got back to me with a new coredump today. Looks like my most recent patch (above) put us back to square one. The backtrace is nearly identical to that in the initial #577 issue:
|
Superseded by #670 |
This issue manifested with periodic coredumps in the
onRetryTimer()
callback in
transport_publisher_link.cpp
. These crashes were infrequent,and extremely difficult to reproduce. They appeared to happen when
many
Subscription
s were being destroyed, which made it likely to bea destruction scheduling anomaly.
The underlying issue was a copy of
CallbackInfo
being created incallback_queue.cpp
. By creating a copy of this object, its sharedpointer reference count for
CallbackInterface
was incremented,preventing destruction of the
CallbackInterface
by an external thread,even if all of the other
Subscription
resources had been destroyed.This lead to partial execution of the
onRetryTimer
callback, untilit attempted to access the destroyed
Subscription
parent_
, causing aninvalid memory access, and coredump. This issue only occurs if
the Subscription is destroyed while the
callback_queue
is blocked byattempting to lock the
boost::shared_lock<boost::shared_mutex> rw_lock()
which would make this a time-dependant thread scheduling issue.
To remedy the the invalid callback, a reference to the
CallbackInfo
object in Thread Local Storage (
tls
) is made rather than a copy. Thisprevents the
CallbackInterface
shared pointer reference count from beingincremented (since only the reference is taken). To access the callback
for execution, a weak pointer is made from the
CallbackInterface
. Thenafter the mutex lock is acquired, the weak pointer is used to create a
shared pointer for callback execution. If the weak pointer fails to
return a shared pointer, the
CallbackInterface
has been destroyed.In that case, the callback is Invalid and execution is prevented.