Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault when transitioning from state with a high frequency callback client behaviour #86

Open
yassiezar opened this issue Feb 14, 2023 · 8 comments

Comments

@yassiezar
Copy link
Contributor

yassiezar commented Feb 14, 2023

I've come across what might be a bug.

In my system, I have a CB subscribed to a high-frequency teleop topic with the standard SMACC signal system, which then publishes messages onto another topic via another SMACC client. However, I've experienced segfaults occurring during state transitions when publishing a message from within the subscriber CB callback. GDB confirmed my suspicion that I'm referencing a nullptr when trying to publish a message via the client pointer. I suspect the CB's onExit() has been called and all the class members have been deallocated before the client pointer is dereferenced and the message is published, triggering a segfault.

The event timeline looks something like this:

- topic data published  -> onMessageReceived() triggered -> dereference class pointer to client -> SEGFAULT
-        transition triggered -> call onExit() -> destroy CB

Does this make sense? I've added an example node here that hopefully make things clearer. The issue seems to be that there's no way to tell when a pointer to a client has been deallocated inside a CB callback, except for maybe doing a check for nullptr before attempting to deallocate e.g.

if(cl_pub_ == nullptr) return;
cl_pub_->publish(msg)

but this is effectively a race condition (the pointer can still be deallocated between these instructions). This issue only seems to become apparent with high frequency callback triggers - lower frequency CB's manage to destroy themselves more predictably

The obvious solution to me would be to replace the naked pointers with std::shared_ptr<TClient> instead and return std::weak_ptr<TClient> to the CBs via the requiresClient() calls. On other words, a typical CB would go from

class CbExample : public smacc::ISmaccClientBehavior {
  public:
    void onEntry() override {
      if(!this->requiresClient(cl_pub_)) {
        ROS_ERROR("Client doesn't exist");
        return;
      }
      std_msgs::Empty msg;
      cl_pub_->publish(msg);
    }
  private:
    ClPub* cl_pub_;
};

to

class CbExample : public smacc::ISmaccClientBehavior {
  public:
    void onEntry() override {
      cl_pub_ = this->requiresClient();
      if(auto lock = cl_pub_.lock()) {
        std_msgs::Empty msg;
        lock->publish(msg);
      }
      else {
        ROS_WARN("ClPub went out of scope");
      }
    }
  private:
    std::weak_ptr<ClPub> cl_pub_;
};

In my opinion, this has significant safety benefits and will prevent silly errors like the one I described above. I'm happy to look at this and open a PR. However, this refactor will need significant changes to SMACC's API. Thoughts?

┆Issue is synchronized with this Jira Task by Unito

@pabloinigoblasco
Copy link
Collaborator

pabloinigoblasco commented Feb 16, 2023

Hello yassierzar,

Thank you for bringing this issue to our attention and providing an example. Ensuring pointer safety in high-speed applications is a something important for us, and we appreciate your help in identifying any potential race conditions.

Not sure if I understand totally the issue. I believe that such situations should have been considered and prevented already in our current code. I will reproduce and debug the example you provided to learn more about it.

Before making any decisions about breaking changes in the code, let's carefully analyze and discuss this matter in more detail. Thank you again for bringing this to our attention, we will find a solution for this case.

@pabloinigoblasco
Copy link
Collaborator

I am reproducing your example from your fork repo and branch.
It is not crashing for me, Do you think there is anything I could do to make it crash?

@yassiezar
Copy link
Contributor Author

yassiezar commented Feb 17, 2023

Not sure if I understand totally the issue. I believe that such situations should have been considered and prevented already in our current code. I will reproduce and debug the example you provided to learn more about it.

Yes, its a weird issue and its quite difficult to explain it clearly in text. I think its a case of a ROS callback executed on a high frequency will sometimes reference a class member pointer to a client while the owning class (e.g. a client behaviour) is being destroyed. With low frequency callbacks, I think it won't be as much of a problem because your signalling system will prevent callbacks from being triggered while the CB is being destroyed. I don't mind setting up a call with you and Brett to explain this if its not clear.

Before making any decisions about breaking changes in the code, let's carefully analyze and discuss this matter in more detail. Thank you again for bringing this to our attention, we will find a solution for this case.

Of course, I was mostly thinking out loud and making notes for future discussions (I often forget about ideas). However, I think there is a discussion to be had (perhaps at another venue) around using naked pointers as opposed to smart pointers that could perhaps be more robust against these kinds of errors.

I am reproducing your example from your fork repo and branch.
It is not crashing for me, Do you think there is anything I could do to make it crash?

Interesting, I just ran it and it crashes roughly 5 seconds after start as soon as the state transitions (see log below). Are you running the sm_atomic_early_exit node? I.e. rosrun sm_atomic_early_exit sm_atomic_early_exit_node

[ INFO] [1676629749.311695474]: [SmaccStateMachine] initiate_impl
[ WARN] [1676629749.311827340]: State machine base creation:SmAtomicEarlyExit
[ INFO] [1676629749.313078562]: [SmaccStateMachine] Initializing ROS communication mechanisms
[ INFO] [1676629749.313092726]: [SmaccStateMachine] Initializing state machine
[ WARN] [1676629749.313100908]: [sm_atomic_early_exit::State1] creating 
[ INFO] [1676629749.313140385]: [sm_atomic_early_exit::State1] Creating ros NodeHandle for this state: SmAtomicEarlyExit
[ INFO] [1676629749.313171077]: [sm_atomic_early_exit::State1] State object created. Initializating...
[ INFO] [1676629749.313688321]: [sm_atomic_early_exit::State1] -- STATIC STATE DESCRIPTION --
[ INFO] [1676629749.313699753]: [sm_atomic_early_exit::State1] Creating static client behavior: cl_ros_timer::CbTimerCountdownLoop
[ INFO] [1676629749.313707296]: [sm_atomic_early_exit::State1] Configuring orthogonal: sm_atomic_early_exit::OrTimer
[ INFO] [1676629749.313720177]: [Orthogonal sm_atomic_early_exit::OrTimer] adding client behavior: cl_ros_timer::CbTimerCountdownLoop
[ INFO] [1676629749.313726979]: [sm_atomic_early_exit::State1] Creating static client behavior: cl_ros_timer::CbTimerCountdownOnce
[ INFO] [1676629749.313734835]: [sm_atomic_early_exit::State1] Configuring orthogonal: sm_atomic_early_exit::OrTimer
[ INFO] [1676629749.313742767]: [Orthogonal sm_atomic_early_exit::OrTimer] adding client behavior: cl_ros_timer::CbTimerCountdownOnce
[ INFO] [1676629749.313750481]: [sm_atomic_early_exit::State1] ---- END STATIC DESCRIPTION
[ INFO] [1676629749.313757522]: [sm_atomic_early_exit::State1] State runtime configuration
[ INFO] [1676629749.313765311]: [Orthogonal sm_atomic_early_exit::OrTimer] runtimeConfigure, current Behavior: cl_ros_timer::CbTimerCountdownLoop
[ INFO] [1676629749.313772765]: [Orthogonal sm_atomic_early_exit::OrTimer] runtimeConfigure, current Behavior: cl_ros_timer::CbTimerCountdownOnce
[ WARN] [1676629749.313780869]: [StateMachine] setting state active : sm_atomic_early_exit::State1
[ INFO] [1676629749.313793460]: [sm_atomic_early_exit::State1] State OnEntry
[ INFO] [1676629749.313799884]: On Entry!
[ INFO] [1676629749.313806883]: [sm_atomic_early_exit::State1] State OnEntry code finished
[ INFO] [1676629749.313813699]: [Orthogonal sm_atomic_early_exit::OrExample] OnEntry
[ INFO] [1676629749.313820767]: [Orthogonal sm_atomic_early_exit::OrTimer] OnEntry, current Behavior: cl_ros_timer::CbTimerCountdownLoop
[ INFO] [1676629749.313838542]: [StateMachine] life-time constrained smacc signal subscription created. Subscriber is cl_ros_timer::CbTimerCountdownLoop
[ INFO] [1676629749.313846669]: [Orthogonal sm_atomic_early_exit::OrTimer] OnEntry, current Behavior: cl_ros_timer::CbTimerCountdownOnce
[ INFO] [1676629749.313855695]: [StateMachine] life-time constrained smacc signal subscription created. Subscriber is cl_ros_timer::CbTimerCountdownOnce
[ WARN] [1676629749.313862700]: [StateMachine] setting state active : sm_atomic_early_exit::State1
[ WARN] [1676629749.357850471]: Signal detector frequency (ros param ~signal_detector_loop_freq): 20.000000
[ INFO] [1676629749.359349223]: [SignalDetector] loop rate hz:20
[ INFO] [1676629749.359424795]: [SignalDetector] running in single threaded mode
[ INFO] [1676629749.359476609]: [SignalDetector] heartbeat
[ INFO] [1676629754.309856895]: NOTIFY TRANSITION: smacc::Transition<cl_ros_timer::EvTimer<cl_ros_timer::CbTimerCountdownOnce, sm_atomic_early_exit::OrTimer>, sm_atomic_early_exit::State2, smacc::default_transition_tags::SUCCESS, boost::statechart::detail::no_context<cl_ros_timer::EvTimer<cl_ros_timer::CbTimerCountdownOnce, sm_atomic_early_exit::OrTimer> >, &boost::statechart::detail::no_context<cl_ros_timer::EvTimer<cl_ros_timer::CbTimerCountdownOnce, sm_atomic_early_exit::OrTimer> >::no_function>
[ WARN] [1676629754.309993305]: exiting state: sm_atomic_early_exit::State1
[ INFO] [1676629754.310076794]: Notification State Exit: leaving state0x7f9b500088f0
[ INFO] [1676629754.310190692]: [Orthogonal sm_atomic_early_exit::OrExample] OnExit
[ INFO] [1676629754.310248891]: [Orthogonal sm_atomic_early_exit::OrTimer] OnExit, current Behavior: cl_ros_timer::CbTimerCountdownLoop
[ INFO] [1676629754.310293966]: [Orthogonal sm_atomic_early_exit::OrTimer] OnExit, current Behavior: cl_ros_timer::CbTimerCountdownOnce
[ WARN] [1676629754.310369701]: Client behavior deallocated.
[ WARN] [1676629754.310438975]: Client behavior deallocated.
[ WARN] [1676629754.310496246]: [StateMachine] Disconnecting scoped-lifetime SmaccSignal subscription
[ WARN] [1676629754.310545275]: [StateMachine] Disconnecting scoped-lifetime SmaccSignal subscription
[ INFO] [1676629754.310585711]: On Exit!
[ WARN] [1676629754.310625609]: state exit: sm_atomic_early_exit::State1
[ WARN] [1676629754.310674146]: [sm_atomic_early_exit::State2] creating 
[ INFO] [1676629754.310771453]: [sm_atomic_early_exit::State2] Creating ros NodeHandle for this state: SmAtomicEarlyExit
[ INFO] [1676629754.310833681]: [sm_atomic_early_exit::State2] State object created. Initializating...
[ INFO] [1676629754.319295376]: [sm_atomic_early_exit::State2] -- STATIC STATE DESCRIPTION --
[ INFO] [1676629754.319378369]: [sm_atomic_early_exit::State2] Creating static client behavior: cl_ros_timer::CbTimerCountdownOnce
[ INFO] [1676629754.319425234]: [sm_atomic_early_exit::State2] Configuring orthogonal: sm_atomic_early_exit::OrTimer
[ INFO] [1676629754.319477522]: [Orthogonal sm_atomic_early_exit::OrTimer] adding client behavior: cl_ros_timer::CbTimerCountdownOnce
[ INFO] [1676629754.319534497]: [sm_atomic_early_exit::State2] Creating static client behavior: sm_atomic_early_exit::CbExamplePub
[ INFO] [1676629754.319596191]: [sm_atomic_early_exit::State2] Configuring orthogonal: sm_atomic_early_exit::OrExample
[ INFO] [1676629754.319667641]: [Orthogonal sm_atomic_early_exit::OrExample] adding client behavior: sm_atomic_early_exit::CbExamplePub
[ INFO] [1676629754.319720762]: [sm_atomic_early_exit::State2] Creating static client behavior: sm_atomic_early_exit::CbExampleSub
[ INFO] [1676629754.319770833]: [sm_atomic_early_exit::State2] Configuring orthogonal: sm_atomic_early_exit::OrExample
[ INFO] [1676629754.319823136]: [Orthogonal sm_atomic_early_exit::OrExample] adding client behavior: sm_atomic_early_exit::CbExampleSub
[ INFO] [1676629754.319873761]: [sm_atomic_early_exit::State2] ---- END STATIC DESCRIPTION
[ INFO] [1676629754.319925516]: [sm_atomic_early_exit::State2] State runtime configuration
[ INFO] [1676629754.319961619]: Entering State2
[ INFO] [1676629754.319997934]: [Orthogonal sm_atomic_early_exit::OrExample] runtimeConfigure, current Behavior: sm_atomic_early_exit::CbExamplePub
[ INFO] [1676629754.320039657]: [Orthogonal sm_atomic_early_exit::OrExample] runtimeConfigure, current Behavior: sm_atomic_early_exit::CbExampleSub
[ INFO] [1676629754.320078436]: [Orthogonal sm_atomic_early_exit::OrTimer] runtimeConfigure, current Behavior: cl_ros_timer::CbTimerCountdownOnce
[ WARN] [1676629754.320116627]: [StateMachine] setting state active : sm_atomic_early_exit::State2
[ INFO] [1676629754.320167367]: [sm_atomic_early_exit::State2] State OnEntry
[ INFO] [1676629754.320203706]: On Entry!
[ INFO] [1676629754.320239907]: [sm_atomic_early_exit::State2] State OnEntry code finished
[ INFO] [1676629754.320279298]: [Orthogonal sm_atomic_early_exit::OrExample] OnEntry, current Behavior: sm_atomic_early_exit::CbExamplePub
[ INFO] [1676629754.320320714]: [Orthogonal sm_atomic_early_exit::OrExample] OnEntry, current Behavior: sm_atomic_early_exit::CbExampleSub
[ INFO] [1676629754.320377293]: [StateMachine] life-time constrained smacc signal subscription created. Subscriber is smacc::CbSubscriptionCallbackBase<std_msgs::Empty_<std::allocator<void> > >
[ INFO] [1676629754.320421849]: [Orthogonal sm_atomic_early_exit::OrTimer] OnEntry, current Behavior: cl_ros_timer::CbTimerCountdownOnce
[ INFO] [1676629754.320472790]: [StateMachine] life-time constrained smacc signal subscription created. Subscriber is cl_ros_timer::CbTimerCountdownOnce
[ WARN] [1676629754.320511390]: [StateMachine] setting state active : sm_atomic_early_exit::State2
fish: “rosrun sm_atomic_early_exit sm_…” terminated by signal SIGSEGV (Address boundary error)

@pabloinigoblasco
Copy link
Collaborator

Indeed, I was in some incorrect branch, now I was able to reproduce the error.
I tested, and debugged.

During my debugging the segfault happens here:
https://github.com/yassiezar/SMACC/blob/19dec7998e55c48b87337343af1d92cc080b7841/smacc_sm_reference_library/sm_atomic_early_exit/include/sm_atomic_early_exit/clients/client_behaviors/cb_example_sub.h#L9

But, why shouldnt that raise a segmentation fault?
cl_pub is nullptr.

Where is the requiresClient call?

@yassiezar
Copy link
Contributor Author

yassiezar commented Feb 20, 2023

That's a good point and a silly mistake on my part. I've edited the example to add the missing requresClient() call and now it doesn't crash. I'll do some more digging to try and diagnose

@yassiezar
Copy link
Contributor Author

yassiezar commented Mar 23, 2023

We've been running into this problem more consistently now and seems to be problem with any transition, not only high-frequency ones. The problem seems centred around long-lived objects (components, clients), where their pointers are being dereferenced or made to point at invalid memory upon state transitions.

For example, I print out the address of a particularly troublesome component after a requiresComponent call and get the address 0xfffe7e193010. When the SM crashes on a segfault during a transition however, the address the component points at is 0xffff00000000, i.e. its seems to not be pointing at the right memory anymore. However, inspecting the memory in GDB gives

0xfffe7e193010:

p *(fox_rt::CpTeleop*)0xfffe7e193010
$3 = {<smacc::ISmaccComponent> = {_vptr.ISmaccComponent = 0xffff00000000, stateMachine_ = 0x141, 
    owner_ = 0xfffe7fa03900}, kMaxLinearVelScale = 1.390639178345914e-309, kMaxAngularVelScale = 0, 
  kNormalLinearVelScale = 0, kNormalAngularVelScale = 0, button_map_ = std::map with 0 elements, 
  mapping_enabled_ = {_M_base = {static _S_alignment = 1, _M_i = false}}, 
  mtx_ = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, 
          __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
        __size = '\000' <repeats 40 times>, "\001\000\000\000\001\000\000", 
        __align = 0}}, <No data fields>}}

0xffff00000000:

p *(fox_rt::CpTeleop*)0xffff00000000
$4 = {<smacc::ISmaccComponent> = {_vptr.ISmaccComponent = 0xffff00000020, stateMachine_ = 0x0, 
    owner_ = 0x21000}, kMaxLinearVelScale = 6.6781865217069613e-319, 
  kMaxAngularVelScale = 4.2439915819305446e-314, kNormalLinearVelScale = 4.9406564584124654e-324, 
  kNormalAngularVelScale = 0, 
  button_map_ = std::map with 281470681751280 elements<error reading variable: Cannot access memory at address 0x18>, mapping_enabled_ = {_M_base = {static _S_alignment = 1, _M_i = false}}, 
  mtx_ = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, 
          __nusers = 0, __kind = 9072, __spins = 65535, __list = {__prev = 0x0, 
            __next = 0xffff00000080}}, 
        __size = '\000' <repeats 16 times>, "p#\000\000\377\377\000\000\000\000\000\000\000\000\000\000\200\000\000\000\377\377\000\000\200\000\000\000\377\377\000", __align = 0}}, <No data fields>}}

We can see the component at 0xfffe7e193010 is valid and contains the data we'd expect, while 0xffff00000000 contains nonsense. This happens elsewhere with clients as well, but this case above happens most often.

Components and clients are long-lived objects and I can see that their memory is still valid. However, the CB's member variables, including the pointers to long-lived objects, seem to become invalid or change after/during transitions, which causes the segfaults we've been experiencing. The pointers also don't reset to nullptr so I there's no way to determine whether they're invalid during runtime. I'm at a loss of what might be happening or how I can prevent/detect this pointer reallocation. Any ideas?

@yassiezar
Copy link
Contributor Author

yassiezar commented Mar 23, 2023

Could a potential workaround for checking CB deallocation be to see if this == nullptr, like its set here? Also, why is only the zero'th CB being set to nullptr here?

It also looks like the CBs' onExit() calls are done before the signals are disconnected here. Would it make sense to move the disconnection step before the onExit() calls? onExit implies that the CB is done and ready to be destroyed, but this isn't the case because the signal lives on and a CB's callback can still be triggered, at which point it will call deallocated member variables and cause segfaults.

@pabloinigoblasco
Copy link
Collaborator

I have been having a look to the code, it is right there is a race condition for multithreaded application that could cause a callback on a destroyed object. Hence, that is a bug that it is being shown in your application.

Disconnection of callbacks for life-time objects must be (by design) just before they are disposed.

This is the first solution I propose:
#91

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants