Flaky Planner Tests (part 3) #1733
First news: this is not a flakiness problem and is not related to the closed #1585 ticket. The problem is a test false-positive uncovered by commit ee719c9, in which the planner tolerance was decreased from 2.0 to 0.5. Using
After decreasing the tolerance to 0.5, the planner started to produce a different path:
Which causes messages like:
and failures of test cases. I think the produced paths are correct in both cases, since the goal lies within its tolerance (<= 0.5), but according to
It looks like the conditions in
Fixed the problem in #1734.
So my thought on this is that this test is supposed to stress the planner, so I think exact, zero-tolerance planning is what we want here. A 0.5 m deviation for a 0.05 m resolution costmap search is really, really high. Typically that tolerance is set so that if you send the robot to a goal, let's say in a warehouse, and there's a dynamic obstacle in the way like a forklift, you want the robot to get to that area for its job, but clearly the intention isn't to wait 20 minutes to succeed because of the forklift.

So for these tests, and in particular the costmap test, this should pass 100% of the time without any help from a tolerance, since the costmap tests are hand-designed (I believe) to be solvable and just make sure that the search is working properly. The random tests could result in failures, but we have the planner test set at some really low bar to pass (something like 10% with 3 attempts) on reaching 100 random goals. Does that make sense? Am I missing something?

I think that this test actually makes things a little less reliable. I think the only tolerance that would be appropriate is the costmap resolution (e.g. 1 cell), for the case where the planner, after reaching its goal, doesn't add an extra waypoint at the literal goal location, so its last waypoint is just one neighboring cell over. However, given that things were working at one point with 0 tolerance, I believe that should be possible.
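The one-cell-tolerance argument above comes down to world-to-grid rounding. As a rough sketch (illustrative Python, not Nav2 code; the helper names are made up), here is the worst-case aliasing a 0.05 m/cell costmap introduces between a requested world goal and the nearest cell center:

```python
# Illustrative sketch (not Nav2 code): world->grid rounding at a
# 0.05 m/cell costmap and the worst-case aliasing it introduces.
RESOLUTION = 0.05  # meters per cell, as in the tests discussed above

def world_to_cell(x, origin=0.0, resolution=RESOLUTION):
    """Map a world coordinate to the index of the containing cell."""
    return int((x - origin) / resolution)

def cell_center(i, origin=0.0, resolution=RESOLUTION):
    """World coordinate of a cell's center."""
    return origin + (i + 0.5) * resolution

goal_x = 1.337                      # arbitrary goal, not on a cell center
i = world_to_cell(goal_x)           # containing cell index
err = abs(cell_center(i) - goal_x)  # discretization error, < resolution/2

print(i, round(err, 4))  # → 26 0.012
```

A 0.5 m tolerance spans 0.5 / 0.05 = 10 cells per axis, while a 1-cell tolerance (0.05 m) already covers the worst-case rounding error, which is why anything beyond one cell points at a test problem rather than aliasing.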
No, that sounds very reasonable. If the test is intended to reach the exact goal, does it mean that the path planner fails or lacks accuracy? |
I'm not sure, you may want to generate a visualization. I can say that, from using it for my own purposes, I have never had an issue with the planner generating an exact path when one is available. So this feels like a test issue. I'd start with the costmap tests, since those are exactly hardcoded and small, so they should always be exact matches, except for any costmaps that are specifically designed to fail. |
After some digging in the code, it appears that the problem may be in the NavFn planner, inside
Update: The point
Does it make sense to have a tolerance less than the resolution of the map? When I make the following change, the test runs fine. You can suggest a better place to make the change, but maybe we can set the tolerance equal to the resolution of the selected map.
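The suggestion above, making the tolerance equal to the map resolution, could be sketched as a clamp (a hypothetical helper; the name and placement are illustrative, not the actual Nav2 API):

```python
# Hypothetical helper (illustrative, not the actual Nav2 API): clamp the
# requested planner tolerance so the goal search never asks for
# finer-than-cell accuracy.
def effective_tolerance(requested, map_resolution):
    """Return the tolerance actually used by the goal search."""
    return max(requested, map_resolution)

print(effective_tolerance(0.0, 0.05))  # → 0.05 (exact-goal request -> one cell)
print(effective_tolerance(0.5, 0.05))  # → 0.5  (larger requests unchanged)
```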
Also, can you take a look at this change? initial_pose is needed for the tests to run, so there is no point in having separate retry/timeout logic for it. I don't have access to the CI system, but I am not getting any timeout issues after this. Let me know what you think.
I'd like to agree with you that the
Having the tolerance less than the resolution means that you want an exact match only, which might be required when running in a true grid-world, like one might have to create when working with hundreds of robots in a single facility. None of this really explains how this was working beforehand and became flaky afterwards. I'm OK with adding exactly a 1-cell resolution tolerance for aliasing effects, but any more than that still indicates an issue with the test. |
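The "less than resolution means exact match only" point can be seen in a toy 1-D view (illustrative Python, not Nav2 code): the nearest neighboring cell is one resolution away, so any tolerance below that admits no cell but the goal's own.

```python
# Toy 1-D view (illustrative, not Nav2 code): how many grid steps away
# from the goal cell still fall inside a given tolerance.
def cells_within_tolerance(tolerance, resolution=0.05):
    """Number of whole cells covered by the tolerance on one side."""
    return int(tolerance / resolution)

print(cells_within_tolerance(0.03))  # → 0: sub-resolution = exact match only
print(cells_within_tolerance(0.05))  # → 1: one cell of aliasing slack
```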
I cannot find any clue that NavFn currently uses
From the outside: probably I've missed something, or did you mean that another planner plugin is being called before the NavFn planner?
Yes, there is the place where
Ok, got it. This should not change the situation fundamentally, I think.
I've made some explanation in the PR for this ticket, in #1734 (comment).
I think this will just hide the problem, not resolve it. |
Let's look at this block of code. What I see happening is:
This means that there could only be exact matches if the tolerance is in increments of the resolution, but in all cases you should hit every cell along the way, whether that's center, corner, etc. So while you might not get the exact world coordinate, you should get within the same cell.

It also means that if you make the tolerance hilariously large, it will go over all of the cells in a hilariously large window, which is unexpected. Good to know. But it also means that the reason 2.0 meters worked is not because it was large, but because it was a multiple of the costmap resolution used in these experiments.

One way to solve this is to just use another tolerance so that's true here as well. This test is to make sure that things are working and we can get exact paths, and that there's nothing wacky going on, so I think that's an OK workaround. Another option is to base the test on grid-world rather than world-world coordinates, which is less ideal. A third option would be to fix the planner so that exact paths are respected, which is what I think you did.

OK, if that's what you did and that description is in line with your thinking, that sounds like a good workaround; I can't think of a reason not to do that. I think the only thing would be to look at |
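The "exact match only at multiples of the resolution" behavior described above can be checked with a small stand-in (illustrative Python, not the NavFn source): generate the offsets a window scan visits when it starts at -tolerance and advances by one resolution per step, and see when the goal offset 0.0 is among them.

```python
# Illustrative stand-in (not the NavFn source): a goal search that scans
# candidate offsets from -tolerance to +tolerance in steps of one cell.
def candidate_offsets(tolerance, resolution):
    """Offsets visited around the goal by a window scan that starts at
    -tolerance and advances by one resolution per step."""
    offsets = []
    o = -tolerance
    while o <= tolerance + 1e-9:       # small epsilon for float drift
        offsets.append(round(o, 6))
        o += resolution
    return offsets

# The goal itself (offset 0.0) is only visited when the tolerance is a
# multiple of the resolution -- consistent with why 2.0 m "worked" with
# a 0.05 m costmap while other values need not.
print(0.0 in candidate_offsets(2.0, 0.05))   # → True
print(0.0 in candidate_offsets(0.07, 0.05))  # → False
```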
Both
So, from perusal, it looks like the changing
The waypoint_follower test seems to be failing for me now. Have you seen this issue? There's no error message; it just starts shutting down after activating.
@siddhya I have noticed that on just release builds, but not debug builds, in the last couple of PRs. That is super weird. There's a bunch of logging in the tester node that your snippet doesn't show either (https://github.com/ros-planning/navigation2/blob/master/nav2_system_tests/src/waypoint_follower/tester.py), as if that's not even being run or something. |
This is interesting behavior. Unfortunately, I did not get the same problem, only another one I mentioned in PR #1759 that causes waypoint_follower timeouts.
Is there a CircleCI job that shows it? Can you give us a link?
I do not see the issue @AlexeyMerzlyakov is seeing. For me it was again related to the setup and retry logic, this time in waypoint_follower/tester.py. Gazebo takes forever to start up on my machine. Once I increased both of the values below, all tests passed.
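The kind of setup wait being tuned above can be sketched as follows (illustrative Python; the function and parameter names are made up, not the actual tester.py variables): both the per-attempt timeout and the retry count are the knobs to raise on machines where Gazebo starts slowly.

```python
import time

# Illustrative sketch (names are made up, not the actual tester.py
# variables): a startup wait where both the per-attempt timeout and the
# retry count can be raised for machines where Gazebo starts slowly.
def wait_for_startup(is_ready, attempt_timeout=10.0, retries=3, poll=0.5):
    """Poll is_ready() until it returns True, with `retries` attempts of
    `attempt_timeout` seconds each; return False if it never comes up."""
    for _ in range(retries):
        deadline = time.monotonic() + attempt_timeout
        while time.monotonic() < deadline:
            if is_ready():
                return True
            time.sleep(poll)
    return False
```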
CI would also agree, I think this can be considered closed unless it comes back again. Thanks for your help! |
@AlexeyMerzlyakov @siddhya looks like this continues to be a problem.
I'd love it if you guys worked together on fixing this one since it seems to escape us individually.