Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault in bt_navigator when passing a goal #1439

Closed
maxlein opened this issue Dec 16, 2019 · 17 comments
Closed

Segmentation fault in bt_navigator when passing a goal #1439

maxlein opened this issue Dec 16, 2019 · 17 comments
Labels
0 - Critical Critical to project, highest priority

Comments

@maxlein
Copy link
Contributor

maxlein commented Dec 16, 2019

Bug report

I was playing around with navigation stack with default, out of the box planners, behavior trees, etc. and I got a seg fault when passing a goal while shuttle was already executing one.

Required Info:

  • Operating System:
    • Ubuntu 18.04
  • Version or commit hash:
  • DDS implementation:
    • CycloneDDS

Steps to reproduce issue

Don't know it exactly anymore, but it was something like this:

  • Let shuttle move to a goal.
  • While executing pass it another goal.
  • Then the error may occurred during recovery

Expected behavior

No seg fault, some error output

Actual behavior

Segmentation fault

Additional information

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f582d35f515 in BT::CoroActionNode::executeTick() () from /opt/ros/eloquent/lib/libbehaviortree_cpp_v3.so
[Current thread is 1 (Thread 0x7f5819826700 (LWP 15978))]
(gdb) bt
#0  0x00007f582d35f515 in BT::CoroActionNode::executeTick() () from /opt/ros/eloquent/lib/libbehaviortree_cpp_v3.so
#1  0x00007f5819c39dd0 in nav2_behavior_tree::RecoveryNode::tick (this=0x557d05ba2a10) at /home/kin/playground_ws/src/navigation2/nav2_behavior_tree/plugins/control/recovery_node.cpp:72
#2  0x00007f582d374f5f in BT::TreeNode::executeTick() () from /opt/ros/eloquent/lib/libbehaviortree_cpp_v3.so
#3  0x00007f5819a30e3b in nav2_behavior_tree::PipelineSequence::tick (this=0x557d05b7c210) at /home/kin/playground_ws/src/navigation2/nav2_behavior_tree/plugins/control/pipeline_sequence.cpp:82
#4  0x00007f582d374f5f in BT::TreeNode::executeTick() () from /opt/ros/eloquent/lib/libbehaviortree_cpp_v3.so
#5  0x00007f5819c39dd0 in nav2_behavior_tree::RecoveryNode::tick (this=0x557d05b6a250) at /home/kin/playground_ws/src/navigation2/nav2_behavior_tree/plugins/control/recovery_node.cpp:72
#6  0x00007f582d374f5f in BT::TreeNode::executeTick() () from /opt/ros/eloquent/lib/libbehaviortree_cpp_v3.so
#7  0x00007f5829e99689 in nav2_behavior_tree::BehaviorTreeEngine::run(std::unique_ptr<BT::Tree, std::default_delete<BT::Tree> >&, std::function<void ()>, std::function<bool ()>, std::chrono::duration<long, std::ratio<1l, 1000l> >) (this=<optimized out>, tree=std::unique_ptr<BT::Tree> = {...}, onLoop=..., cancelRequested=..., loopTimeout=..., loopTimeout@entry=...)
    at /home/kin/playground_ws/src/navigation2/nav2_behavior_tree/src/behavior_tree_engine.cpp:56
#8  0x00007f582c949a3b in nav2_bt_navigator::BtNavigator::navigateToPose (this=0x557d059ba7c0) at /home/kin/playground_ws/src/navigation2/nav2_bt_navigator/src/bt_navigator.cpp:219
#9  0x00007f582c95ca06 in std::function<void ()>::operator()() const (this=0x557d05b6a460) at /usr/include/c++/7/bits/std_function.h:706
#10 nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::work (this=0x557d05b6a400) at /home/kin/playground_ws/install/nav2_util/include/nav2_util/simple_action_server.hpp:143
#11 0x00007f582c95d08c in nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}::operator()() const (__closure=<optimized out>) at /home/kin/playground_ws/install/nav2_util/include/nav2_util/simple_action_server.hpp:134
#12 std::__invoke_impl<void, nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}>(std::__invoke_other, nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}&&) (__f=...) at /usr/include/c++/7/bits/invoke.h:60
#13 std::__invoke<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}>(std::__invoke_result&&, (nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}&&)...) (__fn=...) at /usr/include/c++/7/bits/invoke.h:95
#14 std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=<optimized out>) at /usr/include/c++/7/thread:234
#15 std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}> >::operator()() (this=<optimized out>) at /usr/include/c++/7/thread:243
#16 std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}> >, void>::operator()() const (this=0x7f5819824e40)
    at /usr/include/c++/7/future:1362
#17 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}> >, void> >::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/7/bits/std_function.h:302
#18 0x00007f582d354a79 in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) ()
   from /opt/ros/eloquent/lib/libbehaviortree_cpp_v3.so
#19 0x00007f582bd5a827 in __pthread_once_slow (once_control=0x7f57fc0172b8, init_routine=0x7f582c62f830 <__once_proxy>) at pthread_once.c:116
#20 0x00007f582c950c1c in __gthread_once (__func=<optimized out>, __once=0x7f57fc0172b8) at /usr/include/x86_64-linux-gnu/c++/7/bits/gthr-default.h:699
#21 std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&) (__f=
    @0x7f5819824e10: (void (std::__future_base::_State_baseV2::*)(std::__future_base::_State_baseV2 * const, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter>()> *, bool *)) 0x7f582d354a50 <std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)>, __once=...)
    at /usr/include/c++/7/mutex:684
#22 std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) (__ignore_failure=false, __res=..., 
    this=0x7f57fc0172a0) at /usr/include/c++/7/future:401
#23 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}> >, void>::_Async_state_impl(std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}>&&)::{lambda()#1}::operator()() const (__closure=0x557d05b019b8) at /usr/include/c++/7/future:1667
#24 std::__invoke_impl<void, std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}> >, void>::_Async_state_impl(std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}>&&)::{lambda()#1}>(std::__invoke_other, std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}> >, void>::_Async_state_impl(std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::NavigateToPose, rclcpp::Node>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::NavigateToPose> >)::{lambda()#1}>&&)::{lambda()#1}&&) (
    __f=...) at /usr/include/c++/7/bits/invoke.h:60

@yathartha3
Copy link
Contributor

Hi @maxlein, I tried reproducing this issue. I gave a goal and repeatedly kept giving a new goal while it was already executing one. However, I do not see the problem (segmentation fault, or any other error) that you have mentioned.
Do you still have this issue?

@Jconn
Copy link
Contributor

Jconn commented Feb 7, 2020

I am hitting this issue. I can reproduce it by setting a goal in rviz, and then cancelling the goal in rviz. It usually takes 4 repetitions for this crash to occur.

The line that causes the segfault is here in the behavior tree lib. i am running the head of the ros2-devel branch, I didn't check if that is what I should be running

I am surprised by this:
thread_local static Ordinator ordinator;
I'm seeing that most of these behavior tree nodes, like ComputePathToPose, are being touched by multiple threads in my 4x start-stop test, I wouldn't expect wanting each of those threads to have their own state.

@SteveMacenski
Copy link
Member

@Jconn we run 3.1.1: https://github.com/ros/rosdistro/blob/master/eloquent/distribution.yaml#L208 from ros2-devel, at the moment.

Any suggestion of how to work around this or fix it? I think the big reason it hasn't been addressed yet is a lack of insight into how to resolve it.

It may be worth also filing a ticket in BT.CPP and cross linking this one.

@Jconn
Copy link
Contributor

Jconn commented Feb 7, 2020 via email

@SteveMacenski
Copy link
Member

SteveMacenski commented Feb 7, 2020

I appreciate it. Crashes are something I take very seriously and want to work to resolve them swiftly.

Its not clear to me from this if its our fault in not being thread safe or if its the BT.CPP project's issue. Once you dig more, circle back and let me know.

Edit: wait 4 threads? 4 navigation threads or things from the cpp library? My impression was the BT.CPP was supposed to be single threaded, that's the whole point of BTs 😕

@SteveMacenski SteveMacenski added the 0 - Critical Critical to project, highest priority label Feb 7, 2020
@SteveMacenski
Copy link
Member

I don't know if this is directly related or not, but seems to be on topic to make you aware of: #1285

@Jconn
Copy link
Contributor

Jconn commented Feb 7, 2020

Edit: wait 4 threads? 4 navigation threads or things from the cpp library? My impression was the BT.CPP was supposed to be single threaded, that's the whole point of BTs

~4 navigation threads ticking things like the ComputePathToPose over the span of setting and cancelling 4 NavigateToPose goals

@Jconn
Copy link
Contributor

Jconn commented Feb 8, 2020

Each time we cancel a NavigateToPose goal, we call the behavior tree reset function , which sets the state of every node in the behavior tree to idle.

The nav2 BtActionNode inherits from BT::CoroActionNode

Even though we set the CoroActionNode to status idle, that doesn’t reset the state of:
_p->coro or _p->pending_destroy in that object.
But those two variable states are only valid on the current thread, because they dictate how the node interacts with: thread_local static Ordinator ordinator;

Because of the thread_local specifier, every thread has a unique ordinator object.

The next time we set a NavigateToPose goal, a different thread is spawned to handle the request. That thread starts running the behavior tree object that had its NavigateToPose request cancelled.

This is how the CoroActionNode gets in a bad state, because _p->coro and _p->pending_destroy have not been reset, but this new thread has a new ordinator

I tested with a handful of rviz NavigateToPose requests, then rviz cancels, and I’m no longer able to reproduce when I clean up state at the end of a NavigateToPose request. I’m doing this by reinitializing the whole tree:

BT::Tree temp_tree = bt_->buildTreeFromText(xml_string_, blackboard_);
  tree_ = std::make_unique<BT::Tree>();
  tree_->root_node = temp_tree.root_node;
  tree_->nodes = std::move(temp_tree.nodes);
  temp_tree.root_node = nullptr;

I don't really want to reset the whole dang tree, but I couldn't find any kind of BT::CoroActionNode::reset that did what I wanted.

@Jconn
Copy link
Contributor

Jconn commented Feb 8, 2020

This may explain the issue you linked too, #1285

In that scenario, the NavigateToPose request occurs, the action server spawns a thread to handle the request, the request is cancelled and the thread dies.

Then, a separate thread resets the stack. When it resets the stack, the tree_ object must be destroyed. The coroutine destructor tries to clean up the ordinator state when that happens, but since we are destroying the object from a different thread, we have a different ordinator

I don't understand how the action server makes threads for handling requests, the "new request, new thread" that I'm claiming is only verified through print statements

@SteveMacenski
Copy link
Member

Thank you for your debugging efforts, that is a very good summary that closes the loop on this.

I'm curious from what you've seen - how long does it take to create a new BT? Is that creation something problematic (>100ms?)

I'd merge a PR with that fix if the creation time is reasonable.

@Jconn
Copy link
Contributor

Jconn commented Feb 10, 2020

I haven't characterized that beyond just some terminal staring. I was happy to just make some progress lol

I'll submit something later today or tomorrow once I do some timing and more testing

@SteveMacenski
Copy link
Member

Sounds good, I moved these tickets over to "In progress"

Thanks for taking lead on this. I look forward to no more crashes :-)

@Jconn
Copy link
Contributor

Jconn commented Feb 10, 2020

well the tree recreation seems to fix the issue for me, but when I use the standard "navigate_w_replanning_and_recovery" behavior tree the recreation takes anywhere between 200 and 300 milliseconds.

I was checking open PRs and I think @mjeronimo probably knew all this Coro node stuff when he made #1322, and there was some hesitation on there about making this kind of change.

@bpwilcox I see you were tagged in that PR to look into alternative solutions. Did you find anything?

@SteveMacenski
Copy link
Member

200-300ms is a little high for me to be comfortable with that long term, but if it gets rid of crashes, I think its not a terrible idea to start with. Any other suggestions that would avoid that overhead? I think in that thread I suggest a BT pointer-swap thread where we, on bringup, initialize 2 trees, execute 1. Then the next time around we use the cached tree and then spin up a thread to kill and reinitialize the first tree (and repeat). This way we don't block execution for spinning up the tree.

Also, I wouldn't expect a response from them anytime soon, Intel's moved folks around.

@Jconn
Copy link
Contributor

Jconn commented Feb 11, 2020

I couldn't think of any other ways unfortunately..I think the ping-pong strategy you mention is good.

I cherry-picked #1322 onto master and I'm testing that.

I will file an issue in the behavior tree repo to get some input on if there's a lighter fix. I would prefer to get the slow creation merged in and then iterate on a more efficient solution

@SteveMacenski
Copy link
Member

Makes sense, link that ticket here and keep me posted :-)

@facontidavide
Copy link
Contributor

facontidavide commented Mar 18, 2020

I am joining this thread much later than I should...

First of all, I see a big problem with BehaviorTreeEngine::resetTree()

As pointed out by @Jconn , this does not reset correctly the running nodes.

I think the correct implementation is:

void resetTree(BT::TreeNode * root_node)
{
   root_node->halt();
   root_node->setStatus(BT::NodeStatus::IDLE);
}

The only correct way to do it is to let the halt() callback propagate through the tree, from the root.

About the coroutine, I have two comments:

  1. This sentence seems VERY surprising to me: "The next time we set a NavigateToPose goal, a different thread is spawned to handle the request." Why do you need to spawn different threads every time?

  2. The Coroutine library I am using is stackless and probably not as robust as other implementations might be. An option might be to try if boost::coroutine2 is affected by the same problem or not... this is on me.

  3. I have the feeling that we should avoid CoroAction completely.

At this point, I understand that the only thing I can do is to join the development effort, but I don't know when I will be able to commit.

It would be VERY helpful if we can create an example, separated from ROS and Navigation2, to reproduce the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Critical Critical to project, highest priority
Projects
None yet
Development

No branches or pull requests

5 participants