Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: Feedback for Navigation Failures #3043

Closed
karl-schulz opened this issue Jun 25, 2022 · 24 comments · Fixed by #3251 or #3324
Closed

Feature Request: Feedback for Navigation Failures #3043

karl-schulz opened this issue Jun 25, 2022 · 24 comments · Fixed by #3251 or #3324
Labels
enhancement New feature or request in progress

Comments

@karl-schulz
Copy link
Contributor

karl-schulz commented Jun 25, 2022

Summary

When reacting to navigation failures, it is essential to know why it failed to be able to react to it in a reasonable way.

Explanation

  • Even inside the stack, planner's ComputePathToPose and controller's FollowPath provide no additional information than success/failure and a std_msgs/Empty.
  • From the outside of navigation, there is also no way of knowing the overall reason for a navigation failure (e.g. Planner or Controller failure, no progress, ...). Calling the NavigateToPose action, the resulting response is just a std_msgs/Empty.

Example where feedback could be useful: In the behavior trees, we often see recovery actions for planner/controller failures. This makes sense if the computation failed because the robot is stuck somehow and turning/waiting can resolve the issue. But if the planner failed because the goal is occupied, it makes no sense that the robot backs up or turns around as soon as it sees it. Different reactions depending on the cause for failure are needed.

What can we do about it?

First of all: Do you agree that this is problematic?

I am not sure what the best way of dealing with this issue is and wanted to openly ask what you think about it.

Some ideas:

  • Planner and controllers could loosely publish failure reasons on separate topics for anyone to react to it
  • Alternatively, they could use the blackboard for this
  • The action interfaces could have error fields

Still, I don't see a clean way of dealing with failure reasons in the behavior trees, the architecture doesn't seem to allow for it.

@vinnnyr
Copy link
Contributor

vinnnyr commented Jun 27, 2022

In terms of implementation, I wonder if some of this is a good application of diagnostics.

I think that is essentially the same idea as

Planner and controllers could loosely publish failure reasons on separate topics for anyone to react to it

@mikeferguson
Copy link
Contributor

Just a note, ROS diagnostics are usually meant to be "human readable" - the format is a bunch of strings, so it's not usually the best system if you want to autonomously make decisions based on the data.

@mikeferguson
Copy link
Contributor

I will also note - this feature (different recovery behaviors depending on the actual cause of failure) was a frequent request in ROS1 as well - we never actually got there.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jun 27, 2022

But if the planner failed because the goal is occupied, it makes no sense that the robot backs up or turns around as soon as it sees it. Different reactions depending on the cause for failure are needed.

That is all logic to be specified and customized in your behavior tree. Everyone's application is different so there's only so much I can do in the default BTs so that they work for most everyone's use-cases using generalized BT nodes. You could, and should, in that situation create some custom BT nodes for checking if the goal is occupied to take different corrective action if you don't believe a given default recovery is rational. Ex. the flow would be something like 'if can't create path, check if goal is occupied, if so, do XYZ` and you can have a variety of condition nodes for specifying actions.

With that said, there is on-going work to improve the default BTs and I'm definitely open to suggestions!

But thinking about this a bit more generally, we could give users some notion of what failed (e.g. which component crapped out: planner or controller) as a result type to return. Then that information of the action result type could be available in the BT blackboard for other condition nodes to access so that they don't need to derive that context themselves, when readily available from the planner itself. That's already 'something' but not a great deal of context -- although in practical terms planning only fails if (1) there's no valid path (2) the goal or starting points are occupied. Controllers fail when (1) its stuck (2) its in collision already.

If we want more information than just what component failed, you're then reliant on planner plugins to providing you with that proper context about failure - which requires updating the planners to throw standard exceptions / failure codes that can be caught by the planner server and returned via the result interface. This is certainly possible, but would involve a bit of leg work. We would also want to update documentation to make it clear to planner plugin authors that they are expected to return specific codes / throw exceptions in these particular situations (but that's just updating a tutorial + some doxygen in nav2_core).

Diagnostics are not the right answer. We could always improve the Action result fields to contain more information, that's not much of a problem. What's more of a problem is deciding how to what / to populate it with and how so. We need to come up with an exhaustive (but later extendable) list of return codes we'd want to send on failures so that they're API-documentation-able and usable in the BT within their contexts.

But it would be totally reasonable to pass through the error code from the planner/controller/etc to the NavigateToPose action task so that the autonomy system can know what failed and why, that seems totally reasonable / easy once the planner/controller failure codes are handled.

Still, I don't see a clean way of dealing with failure reasons in the behavior trees, the architecture doesn't seem to allow for it.

I disagree with that - we've highlighted now 2 perfectly reasonable ways of doing this with BTs 😉

@vinnnyr
Copy link
Contributor

vinnnyr commented Jun 27, 2022

ok I think I totally misunderstood this ticket to be for user knowledge, rather than programmatic recovery :)

@karl-schulz
Copy link
Contributor Author

Thanks for the thoughts, that makes sense!

I disagree with that - we've highlighted now 2 perfectly reasonable ways of doing this with BTs wink

Ok, for example the "Goal is occupied" is then meant to be caught in a separate node in the tree. It probably makes sense to only have one preferred way of doing things and this would be it. Then, a good recoveries tree should e.g. only turn if the goal is not occupied but fail straight away if it is.

Still, the caller of NavigateToPose should ideally have a way of getting this information without having to implement their own logic. Considering your explanations this is probably the more important use case for this kind of feedback.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jun 28, 2022

I think in that case what I mentioned before is a good option. We need to come up with a set of error codes for each major plugin class (probably just Planners and Controllers) that can be sent in the action API response. There are a few ways to communicate that, but exceptions seems like the path of least resistance. For each return code, we'd make a trivial exception deriving from runtime_error that can be caught and set the return code for failure modes. That can then be put in the BT blackboard for use.

Though, I'd argue that having some of those primitive BT nodes like checking if a pose is occupied would be generally useful regardless. That's here nor there I suppose.

The steps are pretty straight forward

  • Come up with a relatively complete list of the specific types of errors you want to be made aware of (starting or goal is occupied, could not find any valid path, could not find a path within timeout, etc)
  • Create exceptions in nav2_core/Exceptions.hpp for each of those (really just needs to derive from runtime_error and that's it
  • Update the plugins to use those specific exceptions in those specific situations. I think most already do checks for start/goal validity so its just a trivial change out to which exception is used
  • Update the planner/controller server to catch each of the exceptions separately to set the return code for the message
  • Update the planner/controller BT nodes to put that information into the blackboard for use
  • Ideally, having a BT XML proposal that uses that blackboard information as an example or perhaps even a new default if that information helps solve some general problem!

Its not surprisingly not a huge amount of work.

@SteveMacenski
Copy link
Member

Thoughts?

@karl-schulz
Copy link
Contributor Author

Sorry, we are have a bit too much on our plates at the moment :)

One detail question: If I understand you correctly, you would put the error codes/exceptions on the blackboard for planner and controller, probably clearing them when a planner/controller iteration succeeds or a new goal is given. This would mean also putting both of these exceptions in the action result, right?

I generally agree with your points. I also think that it would not be a huge amount of work. However, I think we currently don't have the capacity to implement this (and the required tests etc.) and will resort to the existing BT condition nodes for the moment.

@SteveMacenski
Copy link
Member

Yes, I think the action results should have some return code field that remaps to specific failure modes that can be used on the blackboard by other nodes to know why a failure was triggered. Each time we get a new result from an algorithm that return code would be updated.

@SteveMacenski SteveMacenski added the enhancement New feature or request label Jul 7, 2022
@jwallace42
Copy link
Contributor

I will take a crack at this tomorrow. It should be pretty straight forward.

@madgrizzle
Copy link

This solution will only work with rolling, correct? i.e., they can't be implemented in galactic and humble because it changes message definitions?

@SteveMacenski
Copy link
Member

Correct

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 23, 2022

Which servers here would we want to call this complete? Planner/Controller? Also Smoother, Waypoint, Behavior, BT Navigator?

The BT Navigator / Waypoint would be for returning total statuses beyond just "failure" to what specifically failed to an autonomy stack. So it could be that these are just pass-throughs from the task servers with a flag as to which caused it with a given code.

The smoother / behavior server are the 2 other real task server candidates. I'm not sure we can specifically enumerate the failure modes of a behavior server with such variants as spin to assisted teleop. Behavior Server is more an aggregating server to share resources more than they are really related to any particular task.

Smoothers are pretty specific. Timeout, general failure, collision, could be some options, though none are particularly insightful.

So I'm not sure it would be worthwhile to add them to smoother/behavior, but I'd like to hear some feedback from others.

@SteveMacenski
Copy link
Member

Planner codes merged!

@jwallace42 jwallace42 mentioned this issue Oct 2, 2022
7 tasks
@jwallace42 jwallace42 mentioned this issue Oct 12, 2022
7 tasks
@jwallace42
Copy link
Contributor

jwallace42 commented Oct 14, 2022

Which servers here would we want to call this complete? Planner/Controller? Also Smoother, Waypoint, Behavior, BT Navigator?

The BT Navigator / Waypoint would be for returning total statuses beyond just "failure" to what specifically failed to an autonomy stack. So it could be that these are just pass-throughs from the task servers with a flag as to which caused it with a given code.

The smoother / behavior server are the 2 other real task server candidates. I'm not sure we can specifically enumerate the failure modes of a behavior server with such variants as spin to assisted teleop. Behavior Server is more an aggregating server to share resources more than they are really related to any particular task.

Smoothers are pretty specific. Timeout, general failure, collision, could be some options, though none are particularly insightful.

So I'm not sure it would be worthwhile to add them to smoother/behavior, but I'd like to hear some feedback from others.

I think it might be worth while to come up with a mock list of errors for each server. My other question is do we care about errors we can't act on?

For the smoother server, I think adding error codes is dependent on the answer to my question above.

For the BT navigator server/Waypoint, coming up with a mock list might be useful. Do you have any ideas of the top of your head?

A think that is doesn't really make sense to add error codes for the behavior server because they don't fail under a common umbrella of errors. If we create more recoveries with some relation between them then this could be useful but even then it would become messy quickly.

@SteveMacenski
Copy link
Member

SteveMacenski commented Oct 14, 2022

I think that's a valuable exercise.

Spin: Invalid inputs or collision detected or TF

Backup / Drive Forward: Invalid inputs or collision detected or TF

AssistedTeleop: TF, perhaps add in progress checkers and such to see if anything is happening successfully? Otherwise, this is really just a pass through.

Smooth: invalid algorithm, Path cannot be smoothed, smoothed path resulted in collision or off of map, smoother time exceeded, TF, invalid inputs(path too short, costmap not updated, etc), invalid algorithm (doesn't exist from request). This is probably meaningful set of error codes to explore.

Wait: None

I think the WPF server would just be a pass through of the error codes from the planners / controller / smoother servers and maybe a few (or 1) WP specific ones like invalid request. I think BT Nav would be similar.

Also, it just came to mind that we don't have an error code yet in the planners / controllers for invalid algorithm request (e.g. FollowPath2 doesn't exist). That might be another good one to add to the core servers in a follow up PR

@jwallace42
Copy link
Contributor

jwallace42 commented Oct 20, 2022

It looks like writing error codes for the smoother could be the next step.

I think the WPF server would just be a pass through of the error codes from the planners / controller / smoother servers and maybe a few (or 1) WP specific ones like invalid request. I think BT Nav would be similar.

I agree. We just need to pipe the error codes up.

Also, it just came to mind that we don't have an error code yet in the planners / controllers for invalid algorithm request (e.g. FollowPath2 doesn't exist). That might be another good one to add to the core servers in a follow up PR

Yep!

@jwallace42 jwallace42 mentioned this issue Oct 20, 2022
7 tasks
@SteveMacenski
Copy link
Member

SteveMacenski commented Oct 31, 2022

So remaining:

  • Add invalid algorithm selection exception / error code
  • WPF / NavToPose / NavThroughPose ways to indicate error code and offending server
  • Do AssistedTeleop / spin / backup / forward? Questionable.
  • Add unit tests - all of this exception catching code is now largely not covered by current tests. Some dummy algorithms that throw an exception on calling the main method could be useful and then check the error codes of the server response that handled properly.
  • how to for future servers Add New Task Server Tutorial
  • smoother server (+ invalid algorithm + unit tests as above
  • Doc updates about this change / announcement
  • Update some BTs to use this work?

@SteveMacenski
Copy link
Member

SteveMacenski commented Nov 7, 2022

@jwallace42 want to do the smoother after the Pipe error one so that we can button up all the major task servers? That would be enough for us to announce the new feature while buttoning up the other small things (tests, invalid algorithm error code, docs - though you started that one)

@jwallace42
Copy link
Contributor

Sure. We have a rough list of the error codes already so it should be straight forward to get that in.

@jwallace42 jwallace42 mentioned this issue Nov 11, 2022
7 tasks
@SteveMacenski
Copy link
Member

Imminently completed in PR #3324 !

@annamannucci
Copy link
Contributor

annamannucci commented Sep 8, 2023

Hi @jwallace42, @SteveMacenski and all,

I'd like to rise a discussion over the exceptions of current planning plugins.
In the current implementation of the navfn_planner, StartOccupied and GoalOccupied are raised only when the start or the goal pixel are in a lethal obstacle. Though, it would make sense in my opinion to raise this error also when the start or the goal are actually in an INSCRIBED_INFLATED_OBSTACLE.

I already implemented a small change towards this direction here: boschresearch@f626a75

Though, I think it's valid to have this discussion in larger audience also because the proposed change is actually changing the semantics of the two exceptions. Thus, this choice should affect all the others nav2 planners.

What do you think about it?

Best regards,
Anna Mannucci

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 8, 2023

I think this is a good point. Can you open a PR and we can discuss there? The biggest items I'd ask is making sure you fix all instances of checking the start/goal in all the task servers that have that as an error code option. And probably break out that else statement region search into another function for readability.

CC @jwallace42 should probably review that PR as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request in progress
Projects
None yet
7 participants