roscpp: crash in ros::TransportPublisherLink::onRetryTimer() #577
+1 on this issue. I'm investigating a core with a very similar stack trace. It's incredibly difficult to reproduce, but I've managed to get two instances of the dump now:
The only useful piece of information that I can provide in reproducing the dump is that it happens (sometimes) as I kill a node that published / subscribed to the same topics that the dumping node subscribed / published to respectively. Poking around in frame 8 (died @
Should / how would this callback be marked for removal?
@dirk-thomas & @wjwwood, have you seen any instances of this segfault at OSRF? I now have several instances from various ROS nodes, mostly on disconnect, but the core dumps are frustratingly hard to debug past a certain point (all with traces similar to the one shown above). I'm trying to distill this crash down into a simple, reproducible example, so any thoughts or tips would be greatly appreciated.
@rethink-imcmahon I have not. I'll try to spend some time today or tomorrow looking at it.
Ok, I still don't have a nice reproducible example, but I think I have uncovered the issue. Inside
What seems odd here is that we are not referencing the original
Furthermore, if you look at another very similar block of code in the same file, you can see the reference is properly executed for a
Copy vs. reference? It really shouldn't matter, should it? Well, in the L362-383 block it does, since there is a lock acquired in between. I'm wide open for critiques of this theory & ideas on a reliable method for reproducing this crash...
I have created PR #615 to add the missing reference. Even if it is unclear whether it helps solve this problem, not copying the callback info at that point is the "right" thing to do. (Update: as trivial as the patch looks, it doesn't seem to pass all unit tests.) Can you please share some information about the scenario where you get the crashes with the roscpp_tutorial talker / listener? Then others might be able to reproduce it the same way.
I want to summarize my thoughts from reading through the code; maybe it helps others trace the flow faster. The callback iterators in the thread-local storage (TLS) are the ones already popped off the queue, to be called within a specific thread without requiring locks on the queue itself. There are four lines which use
The second one checks the flag for callbacks in the queue and therefore does so with the queue mutex locked. The third and fourth check the flag for a callback in the TLS and therefore do so while holding a lock on the calling read-write mutex. The first one actually marks all pending callbacks in the TLS for removal. But since the TLS contains only thread-local callbacks, the block changing the flags (ros_comm/clients/roscpp/src/libros/callback_queue.cpp, lines 193 to 204 in ea9ca68)
Ok, thank you for the clarifications Dirk. I'm still trying to wrap my head around some of it, but the explanation is certainly helpful. Just brainstorming ways to debug this, I was thinking of adding an artificial wait around the lock to simulate a starved thread. I'll let you know if this line of thought turns up anything. As for the talker / listener, I'll commit them to my fork and put a link here. They did segfault overnight but in another piece of initialization code. I'll look into that later, but for now this is still my focus.
What are the unit tests checking for exactly?
Are you referring to the unit tests failing for #615? I don't know what they are checking exactly. I will have to look closer into them, but it seems they are failing due to the added reference.
@dirk-thomas or @stwirth, do you have any updates on this bug? I'd be curious to know why the #615 unit tests were failing, and how we can fix up this patch.
@rethink-imcmahon I don't have updates. The bug never appeared again, so I haven't spent time on it.
Also no updates from me. I haven't spent any more time on this since the last comment. I don't expect to have time to look further into it, and therefore tried to summarize my investigation as well as possible to enable others to continue looking into it.
I've observed this crash but I don't have anything meaningful to add:
We have many virtual machines running a wide variety of ROS tasks nightly, so this issue is causing a handful of coredumps every night. In trying to see why the fix I proposed in May didn't work, I added the "reference-instead-of-copy" patch locally. Eventually, I managed to make it core dump:
This coredump was quite hard to produce, but after hundreds of subscribers were spun up and SIGINT'd, I managed to get what you see here. So it seems that before, when we were copying CallbackInfo rather than grabbing a reference to it, we were seeing partial execution of the callback (ros_comm/clients/roscpp/src/libros/subscription.cpp, lines 158 to 176 in ace5b9c
Despite the new core dump, I think I still prefer having the reference for CallbackInfo rather than the direct copy, since the copy led us into a weird assembly-code state, and the reference gives us a null
The head of our QA department sent me some stats on this bug today. We run an extensive suite of tests on a nightly basis with many virtual robots in the cloud. In just the virtual robots (not counting the instances of this bug that we find on our real hardware) he finds in the past month:
It seems every ROS subscriber is susceptible to the crash, and the nodes that create and destroy subscriptions regularly are the worst offenders. I am still actively hunting this bug down, but I would welcome any help from others in the community. @wjwwood @dirk-thomas @jacquelinekay any thoughts on how to raise awareness of this issue in the community?
@rethink-imcmahon I'd like to have a crack at it, but realistically I'm going to be too busy for a few weeks. Maybe you can just send an email to ros-users@ explaining that several people have taken some time to look at the problem but it eludes us. Someone might take it as a challenge and find a problem that we have overlooked.
@rethink-imcmahon here are a few ideas for creating a test to reliably reproduce this crash (since it sounds like it happens randomly). The fact that it's difficult to reproduce but happens more frequently when there are many subscribers makes me wonder if it happens when there's a scheduling anomaly. Can you try adding a sleep right before line 382 of callback_queue.cpp? Or, to confirm or deny your theory about the subscriptions dropping, you could make a test where one thread spins doing some pub/sub work, and another thread causes all subscriptions to drop.
I made a little progress on this today. @jacquelinekay, thanks for the tip. I added the line "usleep(100);" before line 382 of callback_queue.cpp and that did help to reproduce the bug more often. I also modified a simple test by @rethink-imcmahon as follows and I can reproduce the bug about every minute:
I've started working on solutions to the problem, but no big revelations yet. |
This issue manifested with periodic coredumps in the onRetryTimer() callback in transport_publisher_link.cpp. These crashes were infrequent and extremely difficult to reproduce. They appeared to happen when many Subscriptions were being destroyed, which made it likely to be a destruction scheduling anomaly.

The underlying issue was a copy of CallbackInfo being created in callback_queue.cpp. By creating a copy of this object, its shared pointer reference count for CallbackInterface was incremented, preventing destruction of the CallbackInterface by an external thread, even if all of the other Subscription resources had been destroyed. This led to partial execution of the onRetryTimer callback, until it attempted to access the destroyed Subscription parent, causing an invalid memory access and coredump. This issue only occurs if the Subscription is destroyed while the callback queue is blocked attempting to lock the boost::shared_lock<boost::shared_mutex> rw_lock(), which makes this a time-dependent thread scheduling issue.

To remedy the invalid callback, a reference to the CallbackInfo object in thread-local storage (TLS) is taken rather than a copy. This prevents the CallbackInterface shared pointer reference count from being incremented (since only the reference is taken). To access the callback for execution, a weak pointer is made from the CallbackInterface. Then, after the mutex lock is acquired, the weak pointer is used to create a shared pointer for callback execution. If the weak pointer fails to return a shared pointer, the CallbackInterface has been destroyed; in that case, the callback is invalid and execution is prevented.
This issue manifested with random coredumps in the onRetryTimer() callback in transport_publisher_link.cpp. These crashes were infrequent and extremely difficult to reproduce. They appeared to happen when many Subscriptions were being destroyed, which made it likely to be a destruction scheduling anomaly. There were multiple potential causes of this issue, and after investigating several, it was determined that the onRetryTimer function was attempting to access its parent Subscription instance during CallbackQueue execution. This could lead to a scenario where invalid memory is accessed if all of the other Subscription resources had been destroyed in another thread. In that rare circumstance, the partial execution of the onRetryTimer callback attempted to access the destroyed Subscription parent_, causing an invalid memory access and coredump. This issue only occurs if the Subscription is destroyed while the callback queue is blocked attempting to lock the boost::shared_lock<boost::shared_mutex> rw_lock() inside the CallbackQueue::callOneCB function, which makes this a time-dependent thread scheduling issue.

To solve this issue, some of the logic in onRetryTimer had to be reworked to avoid accessing the parent Subscription in case it has been destroyed. Thought was given to mutex-locking the parent as it was accessed, but this resulted in the Subscription class being destroyed while its member mutex was locked in onRetryTimer inside the CallbackQueue thread, causing a pthread mutex error, since mutexes cannot be destroyed while they are locked. Instead, the need for the parent Subscription access in onRetryTimer was removed entirely. onRetryTimer() accesses its parent Subscription to get the name of the topic that needs to be reconnected in the event of a disconnect and add that name to a log message; we simply removed the topic name from the log messages.

The second problematic parent instance access occurred when determining whether the Transport type is TCPROS. If it was anything else (say, UDP), the publisher link is dropped. Checking the type of Transport connection was moved into the main Subscription thread via the onConnectionDropped() method, so the check can happen even before the onRetryTimer method is bound to the CallbackQueue. UDP retry connection dropping should function exactly the same as before this patch. The end result of this investigation and patch is less likelihood of invalid memory being accessed during normal ROS callback execution.
This issue manifested with random coredumps in the onRetryTimer() callback in transport_publisher_link.cpp. These crashes were infrequent and extremely difficult to reproduce. They appeared to happen when many Subscriptions were being destroyed, which made it likely to be a destruction scheduling anomaly. It turned out the issue was caused by a few functions that were attached to boost::signals2 signals. These functions did not have shared_ptrs to the classes that they were bound to, causing a coredump if the original classes were destructed. This was particularly problematic if the class was destroyed while the signal function was in the middle of execution. Due to a mutex-locking strategy in CallbackQueue::callOneCB() causing semi-random thread execution timing, this issue was only seen on occasion.

The solution to this bug was to rework how drop signals were implemented. Additional function parameters for tracked objects were added for dropping connections, in order to allow an object's shared_from_this shared pointer to be passed into the function that was boost::bind'ed to a signals2::signal. This prevented the destruction of a Subscriber from destroying the objects which these drop functions were acting upon.

Reproduction of this issue was incredibly difficult, since it required the timing of the class destruction to perfectly coincide with the execution of the callback function using that class. We found this could only be reliably reproduced when running overnight simulations of hundreds of Subscribers by the Rethink Robotics automated QA team. This bug was producing roughly 1 to 5 coredumps per night in that testing environment. Since this patch was applied, zero coredumps relating to this issue have been discovered over the course of a week of testing. Major credit to Jeff Adams (@jeffadamsc) for pursuing the shared_from_this callback connection strategy, and to QA for allowing us to run these tests on their nightly runs.
This issue manifested with random coredumps in the onRetryTimer() callback in transport_publisher_link.cpp. These crashes were infrequent and extremely difficult to reproduce. They appeared to happen when many Subscriptions were being destroyed, which made it likely to be a destruction scheduling anomaly. It turned out there were two issues at play: the onRetryTimer function did not have a valid reference to the TransportPublisherLink class, and the onConnectionDropped function was not registered to the boost::signals2::signal in a way that tracked the existence of the underlying ROS Subscriber connection.

The first issue (which was uncovered in ros#577) was resolved by calling getInternalTimerManager()::add with shared_from_this instead of an empty shared pointer. Testing this fix then revealed the second issue. Initially, onConnectionDropped seemed to be fixed by keeping track of the TransportPublisherLink class with a shared_from_this pointer, though it did not pass all of the ros_comm tests. However, changing anything this deep in ROS runs the risk of unintended consequences, and upon further testing, we discovered that ros_controls/realtime_tools RealTimePublishers were not being properly notified when connections were dropped, and were entering deadlock conditions.

Taking inspiration from the boost::signals2 automatic connection management tutorial [1], we then used the signals2::slot_type and track() mechanisms in order to maintain a reference to the TransportPublisherLink class during execution of the onConnectionDropped() code. This was functionally similar to the previous approach, with the added benefit that boost::signals2 handles all shared_from_this() reference counting under the hood, without preventing destruction of the parent class (which we believe to be the source of the unintended RealTimePublisher deadlocking). Upon the tracked object's destruction, the signals2 library automatically disconnects the signal.
Reproduction of this issue was incredibly difficult, since it required the timing of the class destruction to perfectly coincide with the execution of the callback function using that class. We found this could only be reliably reproduced when running overnight simulations of hundreds of Subscribers by the Rethink Robotics automated QA team. This bug was producing roughly 1 to 5 coredumps per night in that testing environment. Since this patch was applied, zero coredumps and RealTimePublisher deadlocks have been seen over the course of several days of testing. Major credit to Jeff Adams (@jeffadamsc) for pursuing the shared_from_this callback connection strategy, and to QA for allowing us to run these tests on their nightly runs.

[1] http://www.boost.org/doc/libs/1_54_0/doc/html/signals2/tutorial.html#signals2.tutorial.connection-management
Re #577: Signal slot & onRetryTimer callback fix
@dirk-thomas I think it's safe to close this issue now that #677 is merged. Thanks!
Awesome @rethink-imcmahon! Thanks for all the effort!
Is there an ETA on when this fix will transition to the public debs? We're having lots of problems with this on the VMs here at ClearPath. |
The changes in shadow-fixed were announced six days ago (http://lists.ros.org/pipermail/ros-release/2015-October/004708.html) and we are waiting for people to test them thoroughly during that soak period. It will probably be synced to public within a week from now.
We've put the shadow build of ros_comm 1.11.15-0 on our internal apt servers here at ClearPath and are running it without incident. |
Thanks for reporting back. We will sync it once the shadow-fixed repo has no regressions. Currently 1.11.16 is being built for a different reason. |
We witnessed a crash of diagnostic_aggregator/aggregator_node on one of our robots; this is what I could find. Unfortunately that happened only once, on one robot. No way to reproduce it. Has anybody seen something like that, or can anyone give me a hint on how to debug this further?