Segfault when transitioning from state with a high frequency callback client behaviour #86
Comments
Hello yassierzar, thank you for bringing this issue to our attention and providing an example. Pointer safety in high-speed applications is important to us, and we appreciate your help in identifying potential race conditions. I'm not sure I fully understand the issue yet; I believed such situations were already considered and prevented in our current code. I will reproduce and debug the example you provided to learn more about it. Before making any decisions about breaking changes in the code, let's carefully analyze and discuss this matter in more detail. Thank you again; we will find a solution for this case.
I am reproducing your example from your fork repo and branch.
Yes, it's a weird issue and it's quite difficult to explain clearly in text. I think it's a case where a ROS callback executing at high frequency will sometimes dereference a class member pointer to a client while the owning class (e.g. a client behaviour) is being destroyed. With low-frequency callbacks, I think it won't be as much of a problem because your signalling system will prevent callbacks from being triggered while the CB is being destroyed. I don't mind setting up a call with you and Brett to explain this if it's not clear.
Of course, I was mostly thinking out loud and making notes for future discussions (I often forget about ideas). However, I think there is a discussion to be had (perhaps at another venue) around using naked pointers as opposed to smart pointers that could perhaps be more robust against these kinds of errors.
Interesting, I just ran it and it crashes roughly 5 seconds after start as soon as the state transitions (see log below). Are you running the
Indeed, I was on the wrong branch; now I am able to reproduce the error. During my debugging the segfault happens here: But why shouldn't that raise a segmentation fault? Where is the requiresClient call?
That's a good point and a silly mistake on my part. I've edited the example to add the missing
We've been running into this problem more consistently now, and it seems to be a problem with any transition, not only high-frequency ones. The problem seems centred around long-lived objects (components, clients), where pointers to them are dereferenced after they have been made to point at invalid memory during state transitions. For example, I print out the address of a particularly troublesome component after a
We can see the component at Components and clients are long-lived objects and I can see that their memory is still valid. However, the CB's member variables, including the pointers to long-lived objects, seem to become invalid or change after/during transitions, which causes the segfaults we've been experiencing. The pointers also don't reset to
Could a potential workaround for checking CB deallocation be to see if It also looks like the CBs'
I have been looking at the code, and you are right: there is a race condition in multithreaded applications that could cause a callback to run on a destroyed object. Hence, this is a bug that is showing up in your application. By design, callbacks of objects with a limited lifetime must be disconnected just before those objects are disposed. This is the first solution I propose:
I've come across what might be a bug.
In my system, I have a CB subscribed to a high-frequency teleop topic with the standard SMACC signal system, which then publishes messages onto another topic via another SMACC client. However, I've experienced segfaults occurring during state transitions when publishing a message from within the subscriber CB callback. GDB confirmed my suspicion that I'm referencing a `nullptr` when trying to publish a message via the client pointer. I suspect the CB's `onExit()` has been called and all the class members have been deallocated before the client pointer is dereferenced and the message is published, triggering a segfault. The event timeline looks something like this:
Does this make sense? I've added an example node here that hopefully makes things clearer. The issue seems to be that there's no way to tell, inside a CB callback, when a pointer to a client has been deallocated, except for maybe checking for `nullptr` before attempting to dereference it, but this is effectively a race condition (the pointer can still be deallocated between these instructions). This issue only seems to become apparent with high-frequency callback triggers; lower-frequency CBs manage to destroy themselves more predictably.
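To make the check-then-use window concrete, here is a minimal sketch. It does not use SMACC's API: `FakeClient` and `GuardedClient` are hypothetical names invented for illustration. It shows the unsafe shape as a comment and one generic way to close the window, namely holding a single lock across both the check and the use so that teardown cannot run in between.

```cpp
#include <memory>
#include <mutex>

// Hypothetical stand-in for a SMACC client; not the real API.
struct FakeClient {
  int published = 0;
  void publish() { ++published; }
};

// Hypothetical holder whose client may be released by another thread
// (for example, during a state transition).
class GuardedClient {
public:
  GuardedClient() : client_(std::make_unique<FakeClient>()) {}

  // Unsafe shape, shown only as a comment:
  //   if (client_ != nullptr) {   // check
  //     /* teardown may run here on another thread */
  //     client_->publish();       // use: may touch freed memory
  //   }

  // Check and use happen under one lock, so teardown() cannot interleave.
  // Returns false if the client has already been released.
  bool publishIfAlive() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!client_) {
      return false;
    }
    client_->publish();
    return true;
  }

  // Releases the client; blocks until any in-flight publishIfAlive() returns.
  void teardown() {
    std::lock_guard<std::mutex> lock(mutex_);
    client_.reset();
  }

private:
  std::mutex mutex_;
  std::unique_ptr<FakeClient> client_;
};
```

With this shape, a call to `publishIfAlive()` made before `teardown()` publishes and returns `true`, and a call made afterwards returns `false` instead of dereferencing a released client. Whether a lock is acceptable on a high-frequency callback path is a separate design question.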
The obvious solution to me would be to replace the naked pointers with `std::shared_ptr<TClient>` and return `std::weak_ptr<TClient>` to the CBs via the `requiresClient()` calls. In other words, a typical CB would go from holding a raw client pointer to holding a `std::weak_ptr<TClient>`.
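The original before/after snippets are not reproduced here, so the following is only a rough sketch of the idea under stated assumptions. `FakeClient`, `FakeOwner`, and `FakeBehavior` are hypothetical, and the `requiresClient()` shown is an illustrative member function, not SMACC's actual signature. The owner keeps a `std::shared_ptr`, hands out a `std::weak_ptr`, and the callback locks it before each use.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-in for a SMACC client; not the real API.
struct FakeClient {
  int published = 0;
  void publish() { ++published; }
};

// Hypothetical owner of the client; requiresClient() here is illustrative only.
class FakeOwner {
public:
  FakeOwner() : client_(std::make_shared<FakeClient>()) {}

  // Hands out a non-owning handle instead of a raw pointer.
  std::weak_ptr<FakeClient> requiresClient() const { return client_; }

  // Simulates the owner releasing the client (e.g. on a state transition).
  void destroyClient() { client_.reset(); }

private:
  std::shared_ptr<FakeClient> client_;
};

// Hypothetical client behaviour that publishes from a subscriber callback.
class FakeBehavior {
public:
  explicit FakeBehavior(std::weak_ptr<FakeClient> client)
      : client_(std::move(client)) {}

  // lock() either yields a shared_ptr that keeps the client alive for the
  // duration of this call, or an empty one if the client is already gone.
  bool onMessage() {
    if (auto client = client_.lock()) {
      client->publish();
      return true;
    }
    return false;
  }

private:
  std::weak_ptr<FakeClient> client_;
};
```

Here `onMessage()` returns `true` while the owner still holds the client and `false` after `destroyClient()`. Note that this only protects the client's lifetime: if the behaviour object itself is destroyed while its callback is running, its `client_` member is still invalid, so the callback disconnection order matters as well.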
In my opinion, this has significant safety benefits and will prevent silly errors like the one I described above. I'm happy to look at this and open a PR. However, this refactor will need significant changes to SMACC's API. Thoughts?