Segmentation fault in bt_navigator when passing a goal #1439

maxlein · 2019-12-16T15:39:51Z

Bug report

I was playing around with navigation stack with default, out of the box planners, behavior trees, etc. and I got a seg fault when passing a goal while shuttle was already executing one.

Required Info:

Operating System:
- Ubuntu 18.04
Version or commit hash:
- latest source
DDS implementation:
- CycloneDDS

Steps to reproduce issue

Don't know it exactly anymore, but it was something like this:

Let shuttle move to a goal.
While executing pass it another goal.
Then the error may occurred during recovery

Expected behavior

No seg fault, some error output

Actual behavior

Segmentation fault

Additional information

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f582d35f515 in BT::CoroActionNode::executeTick() () from /opt/ros/eloquent/lib/libbehaviortree_cpp_v3.so
[Current thread is 1 (Thread 0x7f5819826700 (LWP 15978))]
(gdb) bt
#0  0x00007f582d35f515 in BT::CoroActionNode::executeTick() () from /opt/ros/eloquent/lib/libbehaviortree_cpp_v3.so
#1  0x00007f5819c39dd0 in nav2_behavior_tree::RecoveryNode::tick (this=0x557d05ba2a10) at /home/kin/playground_ws/src/navigation2/nav2_behavior_tree/plugins/control/recovery_node.cpp:72
#2  0x00007f582d374f5f in BT::TreeNode::executeTick() () from /opt/ros/eloquent/lib/libbehaviortree_cpp_v3.so
#3  0x00007f5819a30e3b in nav2_behavior_tree::PipelineSequence::tick (this=0x557d05b7c210) at /home/kin/playground_ws/src/navigation2/nav2_behavior_tree/plugins/control/pipeline_sequence.cpp:82
#4  0x00007f582d374f5f in BT::TreeNode::executeTick() () from /opt/ros/eloquent/lib/libbehaviortree_cpp_v3.so
#5  0x00007f5819c39dd0 in nav2_behavior_tree::RecoveryNode::tick (this=0x557d05b6a250) at /home/kin/playground_ws/src/navigation2/nav2_behavior_tree/plugins/control/recovery_node.cpp:72
#6  0x00007f582d374f5f in BT::TreeNode::executeTick() () from /opt/ros/eloquent/lib/libbehaviortree_cpp_v3.so
#7  0x00007f5829e99689 in nav2_behavior_tree::BehaviorTreeEngine::run(std::unique_ptr<BT::Tree, std::default_delete<BT::Tree> >&, std::function<void ()>, std::function<bool ()>, std::chrono::duration<long, std::ratio<1l, 1000l> >) (this=<optimized out>, tree=std::unique_ptr<BT::Tree> = {...}, onLoop=..., cancelRequested=..., loopTimeout=..., loopTimeout@entry=...)
    at /home/kin/playground_ws/src/navigation2/nav2_behavior_tree/src/behavior_tree_engine.cpp:56
#8  0x00007f582c949a3b in nav2_bt_navigator::BtNavigator::navigateToPose (this=0x557d059ba7c0) at /home/kin/playground_ws/src/navigation2/nav2_bt_navigator/src/bt_navigator.cpp:219
#9  0x00007f582c95ca06 in std::function<void ()>::operator()() const (this=0x557d05b6a460) at /usr/include/c++/7/bits/std_function.h:706
#10 nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::work (this=0x557d05b6a400) at /home/kin/playground_ws/install/nav2_util/include/nav2_util/simple_action_server.hpp:143
#11 0x00007f582c95d08c in nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}::operator()() const (__closure=<optimized out>) at /home/kin/playground_ws/install/nav2_util/include/nav2_util/simple_action_server.hpp:134
#12 std::__invoke_impl<void, nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}>(std::__invoke_other, nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}&&) (__f=...) at /usr/include/c++/7/bits/invoke.h:60
#13 std::__invoke<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}>(std::__invoke_result&&, (nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}&&)...) (__fn=...) at /usr/include/c++/7/bits/invoke.h:95
#14 std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=<optimized out>) at /usr/include/c++/7/thread:234
#15 std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}> >::operator()() (this=<optimized out>) at /usr/include/c++/7/thread:243
#16 std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}> >, void>::operator()() const (this=0x7f5819824e40)
    at /usr/include/c++/7/future:1362
#17 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}> >, void> >::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/7/bits/std_function.h:302
#18 0x00007f582d354a79 in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) ()
   from /opt/ros/eloquent/lib/libbehaviortree_cpp_v3.so
#19 0x00007f582bd5a827 in __pthread_once_slow (once_control=0x7f57fc0172b8, init_routine=0x7f582c62f830 <__once_proxy>) at pthread_once.c:116
#20 0x00007f582c950c1c in __gthread_once (__func=<optimized out>, __once=0x7f57fc0172b8) at /usr/include/x86_64-linux-gnu/c++/7/bits/gthr-default.h:699
#21 std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&) (__f=
    @0x7f5819824e10: (void (std::__future_base::_State_baseV2::*)(std::__future_base::_State_baseV2 * const, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter>()> *, bool *)) 0x7f582d354a50 <std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)>, __once=...)
    at /usr/include/c++/7/mutex:684
#22 std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) (__ignore_failure=false, __res=..., 
    this=0x7f57fc0172a0) at /usr/include/c++/7/future:401
#23 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}> >, void>::_Async_state_impl(std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}>&&)::{lambda()#1}::operator()() const (__closure=0x557d05b019b8) at /usr/include/c++/7/future:1667
#24 std::__invoke_impl<void, std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}> >, void>::_Async_state_impl(std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}>&&)::{lambda()#1}>(std::__invoke_other, std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}> >, void>::_Async_state_impl(std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}>&&)::{lambda()#1}&&) (
    __f=...) at /usr/include/c++/7/bits/invoke.h:60

The text was updated successfully, but these errors were encountered:

yathartha3 · 2020-01-22T06:10:42Z

Hi @maxlein, I tried reproducing this issue. I gave a goal and repeatedly kept giving a new goal while it was already executing one. However, I do not see the problem (segmentation fault, or any other error) that you have mentioned.
Do you still have this issue?

Jconn · 2020-02-07T04:32:01Z

I am hitting this issue. I can reproduce it by setting a goal in rviz, and then cancelling the goal in rviz. It usually takes 4 repetitions for this crash to occur.

The line that causes the segfault is here in the behavior tree lib. i am running the head of the ros2-devel branch, I didn't check if that is what I should be running

I am surprised by this:
thread_local static Ordinator ordinator;
I'm seeing that most of these behavior tree nodes, like ComputePathToPose, are being touched by multiple threads in my 4x start-stop test, I wouldn't expect wanting each of those threads to have their own state.

SteveMacenski · 2020-02-07T04:38:11Z

@Jconn we run 3.1.1: https://github.com/ros/rosdistro/blob/master/eloquent/distribution.yaml#L208 from ros2-devel, at the moment.

Any suggestion of how to work around this or fix it? I think the big reason it hasn't been addressed yet is a lack of insight into how to resolve it.

It may be worth also filing a ticket in BT.CPP and cross linking this one.

Jconn · 2020-02-07T04:47:45Z

I'm planning on digging into this more but I'll be afk until monday. Adding what I know in the meantime incase someone else is digging into this too

…

On Thu, Feb 6, 2020, 10:38 PM Steven Macenski ***@***.***> wrote: @Jconn <https://github.com/Jconn> we run 3.1.1: https://github.com/ros/rosdistro/blob/master/eloquent/distribution.yaml#L208 from ros2-devel, at the moment. Any suggestion of how to work around this or fix it? I think the big reason it hasn't been addressed yet is a lack of insight into how to resolve it. It may be worth also filing a ticket in BT.CPP and cross linking this one. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1439>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABXLVP5EEPVO2PVHFR7BS3LRBTQTJANCNFSM4J3L2GRA> .

SteveMacenski · 2020-02-07T04:58:35Z

I appreciate it. Crashes are something I take very seriously and want to work to resolve them swiftly.

Its not clear to me from this if its our fault in not being thread safe or if its the BT.CPP project's issue. Once you dig more, circle back and let me know.

Edit: wait 4 threads? 4 navigation threads or things from the cpp library? My impression was the BT.CPP was supposed to be single threaded, that's the whole point of BTs 😕

SteveMacenski · 2020-02-07T05:02:33Z

I don't know if this is directly related or not, but seems to be on topic to make you aware of: #1285

Jconn · 2020-02-07T11:04:23Z

Edit: wait 4 threads? 4 navigation threads or things from the cpp library? My impression was the BT.CPP was supposed to be single threaded, that's the whole point of BTs

~4 navigation threads ticking things like the ComputePathToPose over the span of setting and cancelling 4 NavigateToPose goals

Jconn · 2020-02-08T17:40:19Z

Each time we cancel a NavigateToPose goal, we call the behavior tree reset function , which sets the state of every node in the behavior tree to idle.

The nav2 BtActionNode inherits from BT::CoroActionNode

Even though we set the CoroActionNode to status idle, that doesn’t reset the state of:
_p->coro or _p->pending_destroy in that object.
But those two variable states are only valid on the current thread, because they dictate how the node interacts with: thread_local static Ordinator ordinator;

Because of the thread_local specifier, every thread has a unique ordinator object.

The next time we set a NavigateToPose goal, a different thread is spawned to handle the request. That thread starts running the behavior tree object that had its NavigateToPose request cancelled.

This is how the CoroActionNode gets in a bad state, because _p->coro and _p->pending_destroy have not been reset, but this new thread has a new ordinator

I tested with a handful of rviz NavigateToPose requests, then rviz cancels, and I’m no longer able to reproduce when I clean up state at the end of a NavigateToPose request. I’m doing this by reinitializing the whole tree:

BT::Tree temp_tree = bt_->buildTreeFromText(xml_string_, blackboard_);
  tree_ = std::make_unique<BT::Tree>();
  tree_->root_node = temp_tree.root_node;
  tree_->nodes = std::move(temp_tree.nodes);
  temp_tree.root_node = nullptr;

I don't really want to reset the whole dang tree, but I couldn't find any kind of BT::CoroActionNode::reset that did what I wanted.

Jconn · 2020-02-08T17:53:03Z

This may explain the issue you linked too, #1285

In that scenario, the NavigateToPose request occurs, the action server spawns a thread to handle the request, the request is cancelled and the thread dies.

Then, a separate thread resets the stack. When it resets the stack, the tree_ object must be destroyed. The coroutine destructor tries to clean up the ordinator state when that happens, but since we are destroying the object from a different thread, we have a different ordinator

I don't understand how the action server makes threads for handling requests, the "new request, new thread" that I'm claiming is only verified through print statements

SteveMacenski · 2020-02-10T04:26:23Z

Thank you for your debugging efforts, that is a very good summary that closes the loop on this.

I'm curious from what you've seen - how long does it take to create a new BT? Is that creation something problematic (>100ms?)

I'd merge a PR with that fix if the creation time is reasonable.

Jconn · 2020-02-10T16:05:02Z

I haven't characterized that beyond just some terminal staring. I was happy to just make some progress lol

I'll submit something later today or tomorrow once I do some timing and more testing

SteveMacenski · 2020-02-10T22:05:46Z

Sounds good, I moved these tickets over to "In progress"

Thanks for taking lead on this. I look forward to no more crashes :-)

Jconn · 2020-02-10T22:34:10Z

well the tree recreation seems to fix the issue for me, but when I use the standard "navigate_w_replanning_and_recovery" behavior tree the recreation takes anywhere between 200 and 300 milliseconds.

I was checking open PRs and I think @mjeronimo probably knew all this Coro node stuff when he made #1322, and there was some hesitation on there about making this kind of change.

@bpwilcox I see you were tagged in that PR to look into alternative solutions. Did you find anything?

SteveMacenski · 2020-02-10T22:40:39Z

200-300ms is a little high for me to be comfortable with that long term, but if it gets rid of crashes, I think its not a terrible idea to start with. Any other suggestions that would avoid that overhead? I think in that thread I suggest a BT pointer-swap thread where we, on bringup, initialize 2 trees, execute 1. Then the next time around we use the cached tree and then spin up a thread to kill and reinitialize the first tree (and repeat). This way we don't block execution for spinning up the tree.

Also, I wouldn't expect a response from them anytime soon, Intel's moved folks around.

Jconn · 2020-02-11T06:35:21Z

I couldn't think of any other ways unfortunately..I think the ping-pong strategy you mention is good.

I cherry-picked #1322 onto master and I'm testing that.

I will file an issue in the behavior tree repo to get some input on if there's a lighter fix. I would prefer to get the slow creation merged in and then iterate on a more efficient solution

SteveMacenski · 2020-02-11T14:05:25Z

Makes sense, link that ticket here and keep me posted :-)

facontidavide · 2020-03-18T22:42:00Z

I am joining this thread much later than I should...

First of all, I see a big problem with BehaviorTreeEngine::resetTree()

As pointed out by @Jconn , this does not reset correctly the running nodes.

I think the correct implementation is:

void resetTree(BT::TreeNode * root_node)
{
   root_node->halt();
   root_node->setStatus(BT::NodeStatus::IDLE);
}

The only correct way to do it is to let the halt() callback propagate through the tree, from the root.

About the coroutine, I have two comments:

This sentence seems VERY surprising to me: "The next time we set a NavigateToPose goal, a different thread is spawned to handle the request." Why do you need to spawn different threads every time?
The Coroutine library I am using is stackless and probably not as robust as other implementations might be. An option might be to try if boost::coroutine2 is affected by the same problem or not... this is on me.
I have the feeling that we should avoid CoroAction completely.

At this point, I understand that the only thing I can do is to join the development effort, but I don't know when I will be able to commit.

It would be VERY helpful if we can create an example, separated from ROS and Navigation2, to reproduce the issue.

SteveMacenski added the 0 - Critical Critical to project, highest priority label Feb 7, 2020

This was referenced Feb 11, 2020

cherry-picking mjeronimo:fix_bt_crash_after_reset #1515

Merged

How to reset CoroAction Node to a clean state? BehaviorTree/BehaviorTree.CPP#146

Closed

SteveMacenski closed this as completed Feb 11, 2020

This was referenced Mar 18, 2020

Problem with BehaviorTreeEngine reset and halt #1600

Closed

BehaviorTree refactoring #1606

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Segmentation fault in bt_navigator when passing a goal #1439

Segmentation fault in bt_navigator when passing a goal #1439

maxlein commented Dec 16, 2019

yathartha3 commented Jan 22, 2020

Jconn commented Feb 7, 2020 •

edited

Loading

SteveMacenski commented Feb 7, 2020

Jconn commented Feb 7, 2020 via email

SteveMacenski commented Feb 7, 2020 •

edited

Loading

SteveMacenski commented Feb 7, 2020

Jconn commented Feb 7, 2020

Jconn commented Feb 8, 2020

Jconn commented Feb 8, 2020

SteveMacenski commented Feb 10, 2020

Jconn commented Feb 10, 2020

SteveMacenski commented Feb 10, 2020

Jconn commented Feb 10, 2020

SteveMacenski commented Feb 10, 2020

Jconn commented Feb 11, 2020

SteveMacenski commented Feb 11, 2020

facontidavide commented Mar 18, 2020 •

edited

Loading

Segmentation fault in bt_navigator when passing a goal #1439

Segmentation fault in bt_navigator when passing a goal #1439

Comments

maxlein commented Dec 16, 2019

Bug report

Steps to reproduce issue

Expected behavior

Actual behavior

Additional information

yathartha3 commented Jan 22, 2020

Jconn commented Feb 7, 2020 • edited Loading

SteveMacenski commented Feb 7, 2020

Jconn commented Feb 7, 2020 via email

SteveMacenski commented Feb 7, 2020 • edited Loading

SteveMacenski commented Feb 7, 2020

Jconn commented Feb 7, 2020

Jconn commented Feb 8, 2020

Jconn commented Feb 8, 2020

SteveMacenski commented Feb 10, 2020

Jconn commented Feb 10, 2020

SteveMacenski commented Feb 10, 2020

Jconn commented Feb 10, 2020

SteveMacenski commented Feb 10, 2020

Jconn commented Feb 11, 2020

SteveMacenski commented Feb 11, 2020

facontidavide commented Mar 18, 2020 • edited Loading

Jconn commented Feb 7, 2020 •

edited

Loading

SteveMacenski commented Feb 7, 2020 •

edited

Loading

facontidavide commented Mar 18, 2020 •

edited

Loading