Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BT Action Nodes crash bt_navigator node upon action server call timeout #2376

Closed
paucarre opened this issue May 29, 2021 · 8 comments
Closed

Comments

@paucarre
Copy link

paucarre commented May 29, 2021

Bug report

  • Operating System: Ubuntu 20.04
  • ROS2 Version: foxy
  • Version or commit hash: 2e18dd1
  • DDS implementation: Fast-RTPS

Steps to reproduce issue

  • Use any Behavior Tree Action Node ( BT::ActionNodeBase )
  • Call an action with lower-than-needed timeout.

Expected behavior

  • The action node itself should return a BT::NodeStatus::FAILURE whenever a problem occurs.
  • Whenever there is an error, clear and detailed error logs should be provided.
  • The ros2 node holding the BT engine ( like bt_navigator ) should not die
  • The BT engine should keep running the tree (for instance, running a recovery node )

Actual behavior

  • The whole ros2 node holding the BT engine crashes ( bt_navigator ), thus no recovery node is execute but rather the whole BT execution is suddenly halted
  • No meaningful logs provided ( only "send_goal failed" )
  • ros2 daemon needs to restart to see the bt_navigator node again ( after the bt_navigator node is re-launched after it crashes ).

Feature request

@SteveMacenski
Copy link
Member

SteveMacenski commented Jun 1, 2021

I think a better solution is to handle the exception, not remove support for exceptions. Other Bt nodes will throw so we should instead make sure that there isn't crashing when exceptions are thrown in the behavior tree.

It sounds like something got messed up, we have the BT engine in the BT navigator wrapped with a try catch when ticking nodes that should be catching exceptions. I agree if the current node exception behavior isnt allowing recovery behaviors to trigger, that is a problem. A middleware timeout issue should have a soft re-attempt possible

have you looked at main branch if any of these concerns are resolved? Much has changed since foxy on that front that isn’t backportable that may have outright resolved many of your concerns.

In your PR you mention constructed bt nodes failing — you also missed that some BT nodes may be getting input values in the constructors themselves. In that case, a crash of a required field would be a reasonable outcome.

@paucarre
Copy link
Author

paucarre commented Jun 1, 2021

Thanks for feedback @SteveMacenski

I totally agree with you that preventing exceptions from causing major damage ( ROS2 node dying) is something necessary. I will take a look at master branch and see if such checks are in place and/or an easy fix can be made. I'm a bit busy but I will try to get that this week...

I'd say though that throwing exceptions, specifically for BTNodes , is to me not the most suitable mechanism to notify errors.
These are the reasons why:

  • First and foremost, allowing exceptions as a mechanism of error notification leads existing nodes in the BehaviorTree.CPP to misbehave. Specifically, there is no expectation of exceptions being thrown (no try-catch) and no fallback mechanism is triggered upon exception. This means that navigation2 will not be able to use BehaviorTree.CPP as the nodes within the framework do not expect exceptions and thus fallback and other recovery mechanisms will not be triggered. See executeTick() method implementation does not account for exceptions
  • Second, and at the same level of the first point, currently nodes in the master branch of navigation2 do call tick() method without making any check for exceptions. Thus, accepting exceptions in nodes as a mechanism of error notification will also break the contract in navigation2 nodes.
  • If we have both Failure and throwing exceptions as an accepted mechanism for error notification, then we'll need to increase the code wrapping with try-catch clauses and check both exceptions and Failure state. I think this can very potentially make code much less readable and highly convoluted for no good reason. Besides, developers will wonder whether there are situations where Exceptions are the right way to notify problems, and others where NodeStatus::Failure is. I see it leading to confusion for the end developers as well. This is very specially true for fallback/recovery nodes where it won't be clear what the developer shall expect.
  • Errors in BT nodes are expected and acceptable, and their mechanism of notification is setting NodeStatus::Failure as their status as framework requirement. Notifying Failure guarantees code execution is predictable ( like Either/Result in FP frameworks ) .
  • BT Nodes are stateful. As I commented in "Future work that may be required" section of PR, the status of a node can be Failure or not depending on the state it is in (e.g. it has the right inputs). If we throw exceptions when we find it to "fail" in the constructor, then we don't allow nodes to become ready. If we enforce more the idea to use Failure state as much as possible, these cases will be drastically reduced.

Regarding your comment: "you also missed that some BT nodes may be getting input values in the constructors themselves. In that case, a crash of a required field would be a reasonable outcome.". While this can be an exceptional case, and thus an exception to alert might be the right way to do it, the cases that I've seen, the nodes crash within the constructor for no good reason.

For instance, say the example of TransformAvailableCondition. This node does not need a tf at construction time. This tf might be set in the blackboard at runtime. There are good reasons why something like that could take place. For instance, if <TransformAvailable child="{target_tf}" parent="{robot_tf}"> is part of a reusable SubTree that moves to robot to a target tf if such tf exists and otherwise move elsewhere, then one of the inputs that will be published during runtime on the blackboard (which, as it's reusable, will be different for each "call" of the sub-tree ). And finally, if we want recovery actions to take place, like "move elsewhere", then on the tick() method, as per current BT framework implementation will be required to return a Failure instead of an exception.

Just as an dummy example to visualize the use case:

<SetBlackboard output_key="target_tf" value="my_target_1" />
<SetBlackboard output_key="robot_tf" value="base_link" />
<SubTree ID="MoveToTargetOrGoHome" __shared_blackboard="true"/>

...

<SetBlackboard output_key="target_tf" value="my_target_2" />
<SetBlackboard output_key="robot_tf" value="base_link" />
<SubTree ID="MoveToTargetOrGoHome" __shared_blackboard="true"/>

...

<BehaviorTree ID="MoveToTargetOrGoHome">
	<RecoveryNode number_of_retries="3">
		<TransformAvailable child="{target_tf}" parent="{robot_tf}">
                 ...
		</TransformAvailable>
		<SubTree ID="GoHomeRecovery" __shared_blackboard="true"/>
	</RecoveryNode>
</BehaviorTree>

As said, it might be the case of throwing an exception in the constructor, but to me that's something that I think should be highly discouraged as in many cases, compulsory inputs are a runtime requirement. It just makes nodes not usable for general use-cases ( like reusable sub-trees )

@SteveMacenski
Copy link
Member

SteveMacenski commented Jun 1, 2021

There are certain types of failures that shouldn't just be hidden and should probably throw a notice to the behavior tree application that something has clearly gone wrong. If an invalid status / code comes by, that would be a good example of evidence that something is acting up in a way that it shouldn't and we should throw something. I agree that the robot should be able to recover from transient networking outages though -- the ability to have recovery actions or retries for network timeouts seem reasonable.

The BT has exception handling here: https://github.com/ros-planning/navigation2/blob/main/nav2_behavior_tree/src/behavior_tree_engine.cpp#L47. To make a statement that BT.CPP is somehow "incorrect" to use exceptions is invalid. This is just C++ code, you can throw as long as you have a mechanism to catch, which we do. Throwing is an important tool in our developer's toolchest and we will not outright remove it. There are failure conditions where throwing is the most reasonable thing so the behavior tree navigator or an application can handle a critical failure.

While this can be an exceptional case, and thus an exception to alert might be the right way to do it, the cases that I've seen, the nodes crash within the constructor for no good reason.

This is not an exceptional case, we do this often in Nav2. If that's not good formatting for BT.CPP, I'd be happy to field a PR to move all of the get inputs from the constructors to elsewhere so that is no longer an issue for you.

I see your PR, but I think your actual issues are:

  • Some BT nodes are getting inputs in constructors and crashing because you didn't give them to it yet
  • Bt nodes are throwing exceptions for things that it should be able to recover from

With those 2 changes, I think that fixes your mentioned concerns. Those are 2 ideas I can get behind as well, but removing exceptions from BT use overall would not be something I would agree with.


@naiveHobo how do you feel about this? I think this brings up a good point that a networking failure should probably fail the BT node, not throw an exception. When an exception is thrown, its caught by the BT engine (either navigator or application using BT) for handling, exiting the existing task. For many cases, I think its the right answer (e.g. if the node state is invalid or node couldn't be locked signaling a serious issue or mis-using API). Though for the networking failures, I think that should probably return a failure code so that it may be retried, I don't think that should probably kill the entire application task. Thoughts?

Here are all of the places exceptions are used in the BT:

plugins/control/round_robin_node.cpp:          throw BT::LogicError("Invalid status return from BT node");
plugins/control/pipeline_sequence.cpp:        throw std::runtime_error(error_msg.str());
plugins/control/recovery_node.cpp:    throw BT::BehaviorTreeException("Recovery Node '" + name() + "' must only have 2 children.");
plugins/control/recovery_node.cpp:            throw BT::LogicError("A child node must never return IDLE");
plugins/control/recovery_node.cpp:            throw BT::LogicError("A child node must never return IDLE");
plugins/decorator/speed_controller.cpp:    throw BT::BehaviorTreeException(err_msg);
include/nav2_behavior_tree/bt_conversions.hpp:    throw std::runtime_error("invalid number of fields for point attribute)");
include/nav2_behavior_tree/bt_conversions.hpp:    throw std::runtime_error("invalid number of fields for orientation attribute)");
include/nav2_behavior_tree/bt_conversions.hpp:    throw std::runtime_error("invalid number of fields for PoseStamped attribute)");
include/nav2_behavior_tree/bt_action_node.hpp:        throw std::logic_error("BtActionNode::Tick: invalid status value");
include/nav2_behavior_tree/bt_action_node.hpp:      throw std::runtime_error("send_goal failed");
include/nav2_behavior_tree/bt_action_node.hpp:      throw std::runtime_error("Goal was rejected by the action server");
include/nav2_behavior_tree/bt_action_server_impl.hpp:    throw std::runtime_error{"Failed to lock node"};

The only 2 I have potential objections to are the send goal failure and rejected in bt_action_node, the others look OK to me.

@SteveMacenski
Copy link
Member

Any movement here?

@philison
Copy link
Contributor

Were the changes that were part of the merge into the foxy-devel branch reverted? I would love to see the changes in galactic as well, since a simple rejection by an action server will cause the entire BT to fail. There is no other clean way to fix this, right?

@SteveMacenski
Copy link
Member

This was never merged into any branch, the author of the PR has not been responsive

@SteveMacenski
Copy link
Member

SteveMacenski commented Aug 25, 2021

Actions for this ticket

  • Make getting ports outside of the constructors of the Bt nodes
  • Evaluate the exceptions thrown in the BT and see which should be caught in the BT-node-level to return a certain type of code (FAILURE) rather than bring down the entire BT task.

@philison I believe these 2 below are good candidates for catching the exception in the node itself and returning FAILURE of the task. I'd merge a PR which implemented that. The other exceptions I highlighted in the comment above seem like real BT-terminating error cases that shouldn't be hidden from the application and should probably be caught and end the tree execution. What do you think? These are the 2 networking related exceptions that could be plausibly found transiently, the others are more systemic issues that need to be addressed by an application developer that probably messed something critical up.

include/nav2_behavior_tree/bt_action_node.hpp:      throw std::runtime_error("send_goal failed");
include/nav2_behavior_tree/bt_action_node.hpp:      throw std::runtime_error("Goal was rejected by the action server");

@SteveMacenski
Copy link
Member

Merging shortly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants