-
Notifications
You must be signed in to change notification settings - Fork 913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolves #577 by reworking onRetryTimer Sub logic #670
Resolves #577 by reworking onRetryTimer Sub logic #670
Conversation
This issue manifested with random coredumps in the onRetryTimer() callback in transport_publisher_link.cpp. These crashes were infrequent, and extremely difficult to reproduce. They appeared to happen when many Subscriptions were being destroyed, which made it likely to be a destruction scheduling anomaly. There were multiple potential causes this issue, and after investigating several, it was determined that the onRetryTimer function was attempting to access its parent Subscription instance during CallbackQueue execution. This could lead to a scenario where invalid memory is accessed if all of the other Subscription resources had been destroyed in another thread. In that rare circumstance, the partial execution of the onRetryTimer callback attempted to access the destroyed Subscription parent_, causing an invalid memory access, and coredump. This issue only occurs if the Subscription is destroyed while the callback_queue is blocked by attempting to lock the boost::shared_lock<boost::shared_mutex> rw_lock() inside the CallbackQueue::callOneCB function, which makes this a time-dependant thread scheduling issue. To solve this issue, some of the logic in onRetryTimer had to be reworked to avoid accessing the parent Subscription in case it has been destroyed. Thought was given to mutex locking the parent as it was accessed, but this resulted in the Subscription class being destroyed while its member mutex was locked in onRetryTimer inside the CallbackQueue thread, causing a pthread mutex error. This occurs since mutexes cannot be destroyed if they are locked. Instead, the need for the parent Subscription access in onRetryTimer was removed entirely. The function onRetryTimer() accesses its parent Subscription to attempt to get the name of the topic that needs to be reconnected in the event of a disconnect and add that name to a log message. We simply removed the topic name from the log messages. The second problematic parent instance access occurred to determine if the Transport type is TCPROS. If it was anything else (say UDP), the publisher link is dropped. Checking the type of Transport connection was moved into the main Subscription thread via the onConnectionDropped() method, so the check can happen even before the onRetryTimer method is bound to the CallbackQueue. UDP retry connection dropping should function exactly the same as before this patch. The end result of this investigation and patch is less of a likelihood for invalid pieces of memory to be accessed during normal ROS callback execution.
Can one of the admins verify this patch? |
This patch has been tested for the last 24 hours against all of our virtual robots with no coredumps or obvious errors in testing. |
Huge thanks to @jeffadamsc for sitting down with me for several hours to devise this solution. Mind giving this a quick once over, Jeff? |
Couple quick thoughts on this patch:
|
Test passed. |
Nope. Three coredumps from last night:
This is leading us down a different path. I'll give an update after we have a new patch. |
Superseded by #677 |
This issue manifested with random coredumps in the
onRetryTimer()
callback in
transport_publisher_link.cpp
. These crashes were infrequent,and extremely difficult to reproduce. They appeared to happen when
many Subscriptions were being destroyed, which made it likely to be
a destruction scheduling anomaly.
There were multiple potential causes this issue, and after investigating
several, it was determined that the
onRetryTimer
function was attemptingto access its parent Subscription instance during
CallbackQueue
execution.This could lead to a scenario where invalid memory is accessed if all of
the other Subscription resources had been destroyed in another thread.
In that rare circumstance, the partial execution of the
onRetryTimer
callback attempted to access the destroyed Subscription
parent_
, causingan invalid memory access, and coredump. This issue only occurs if
the Subscription is destroyed while the callback_queue is blocked by
attempting to lock the
boost::shared_lock<boost::shared_mutex> rw_lock()
inside the
CallbackQueue::callOneCB
function, which makes this atime-dependant thread scheduling issue.
To solve this issue, some of the logic in
onRetryTimer
had to be reworkedto avoid accessing the parent Subscription in case it has been destroyed.
Thought was given to mutex locking the parent as it was accessed, but this
resulted in the Subscription class being destroyed while its member mutex
was locked in
onRetryTimer
inside theCallbackQueue
thread, causing apthread mutex
error. This occurs since mutexes cannot be destroyed if theyare locked. Instead, the need for the parent Subscription access in
onRetryTimer
was removed entirely.The function
onRetryTimer()
accesses its parent Subscription to attemptto get the name of the topic that needs to be reconnected in the event
of a disconnect and add that name to a log message. We simply removed
the topic name from the log messages. The second problematic parent
instance access occurred to determine if the Transport type is
TCPROS
.If it was anything else (say
UDP
), the publisher link is dropped.Checking the type of Transport connection was moved into the main
Subscription thread via the
onConnectionDropped()
method, so the checkcan happen even before the
onRetryTimer
method is bound to theCallbackQueue
.UDP
retry connection dropping should function exactlythe same as before this patch.
The end result of this investigation and patch is less of a likelihood
for invalid pieces of memory to be accessed during normal ROS callback
execution.