move_base_interruptable_server seems to be causing move_base to crash #87

Merged: 5 commits, Aug 8, 2017

Contributor

@warnellg warnellg commented Jul 19, 2017

Seems to happen when move_base_interruptable_server tries to call clear_costmap_service, but only sometimes.

I'm in the process of investigating why this happens.

@warnellg warnellg self-assigned this Jul 19, 2017
@warnellg
Contributor Author

warnellg commented Aug 1, 2017

I did a little more investigating this morning, including capturing a core dump (thanks for the tip, @jack-oquin!).

First, I'm not actually sure that this is a problem with move_base_interruptable_server. I'll leave the title of this PR as-is for now, but it may need to be updated going forward. I say this because I rebuilt after commenting out a large portion of move_base_interruptable_server.cpp and the problem didn't go away.

I'm now thinking that this could possibly be a bug within the ROS navigation stack itself. I ran our move_base node in GDB in a separate terminal, and, after a long while (I'm still unsure as to how to reliably reproduce this bug, which is worrisome), found Leela motionless at a door with the following output in the move_base terminal (entire core dump is available on Leela at /home/users/warnellg/Desktop/core.3472):
[Screenshot: movebasesegfault2]
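
For anyone who wants to pull a backtrace out of the core file mentioned above without an interactive GDB session, here is a minimal sketch. It only assembles the command line; `gdb_backtrace_cmd` is a hypothetical helper name, and the move_base binary path is an assumption (a typical `/opt/ros/...` install, with `indigo` picked purely for illustration), so adjust it to the actual install.

```python
import shlex


def gdb_backtrace_cmd(binary, core):
    """Build a non-interactive GDB command that prints every thread's backtrace."""
    # -batch exits after running the -ex commands; "thread apply all bt"
    # prints a backtrace for each thread in the core file.
    return ["gdb", "-batch", "-ex", "thread apply all bt", binary, core]


if __name__ == "__main__":
    # Assumed binary location (not stated in this thread).
    binary = "/opt/ros/indigo/lib/move_base/move_base"
    # Core file path as quoted in the comment above.
    core = "/home/users/warnellg/Desktop/core.3472"
    print(" ".join(shlex.quote(arg) for arg in gdb_backtrace_cmd(binary, core)))
```

Printing the command rather than running it keeps the sketch safe to try on a machine without GDB or the core file; passing the list to `subprocess.run` would execute it.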

Obviously, the backtrace here wouldn't be able to implicate any of our BWI code even if it were at fault, but the fact that the error occurs so many levels deep is why I wonder if this is some kind of bug in the ROS navigation stack. Specifically, I wonder about the global_planner code, which is implicated in the lowest-level frames.

I'm not exactly sure what the path forward is at this point, so I'm open to suggestions. Perhaps next I'll try switching our move_base's base_global_planner from global_planner/GlobalPlanner back to the default navfn/NavfnROS to see if that makes a difference.
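
As a rough illustration of the planner swap described above, the sketch below builds a `rosrun` command line that overrides move_base's `base_global_planner` private parameter. The helper name `move_base_cmd` is hypothetical, and this assumes a bare `rosrun` invocation; in the BWI setup the parameter is presumably set through a launch file or YAML file instead, which this thread does not show.

```python
# Planner plugin names as they appear in this thread.
GLOBAL_PLANNER = "global_planner/GlobalPlanner"
NAVFN = "navfn/NavfnROS"


def move_base_cmd(planner):
    """Return a rosrun command that starts move_base with the given global planner."""
    # In ROS 1, "_name:=value" on the command line sets a private parameter
    # of the node being started.
    return ["rosrun", "move_base", "move_base", "_base_global_planner:=" + planner]


if __name__ == "__main__":
    # Print both variants so the two runs can be compared side by side.
    for planner in (GLOBAL_PLANNER, NAVFN):
        print(" ".join(move_base_cmd(planner)))
```

Keeping the planner name as the only varying argument makes it easy to run the same test route under each planner and compare crash behavior.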

@jack-oquin
Member

At first glance, there seems to be enough information to open an issue in https://github.com/ros-planning/navigation/issues

It's probably worth checking whether ros-planning/navigation#584 is related.

@warnellg
Contributor Author

warnellg commented Aug 2, 2017

@jack-oquin, that does seem like it could be very related! Though we don't seem to get that exact warning message. I suppose to test this, we'd have to compile our own navigation stack from source and use that.

Another theory I've had is that this is somehow related to the way we (BWI) deal with the map. Specifically, when monitoring Leela during visit_door_list, I've noticed that the map seems to refresh after every completed goal. Further, these crashes are only occurring at the goal locations themselves (doors). Is there perhaps some kind of weird race condition cropping up here where the planner is trying to use a costmap that has temporarily been deleted because it's being replaced by some other BWI process?

@piyushk, are you able to comment on this?
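
To make the suspected race concrete, here is a toy sketch of the pattern in question: one thread replaces a shared map while another reads it to plan. Everything in it (`SharedCostmap`, `run_demo`, the 2x2 grids) is hypothetical and unrelated to the actual costmap_2d or BWI code. Python cannot reproduce a segfault from a freed costmap, so this only illustrates the locking discipline that would rule out such a race, not the bug itself.

```python
import threading


class SharedCostmap:
    """Toy stand-in for a costmap that one thread replaces while another plans."""

    def __init__(self, grid):
        self._grid = grid
        self._lock = threading.Lock()

    def replace(self, new_grid):
        # Swap in the new grid under the lock so a reader never sees a
        # half-replaced or missing map.
        with self._lock:
            self._grid = new_grid

    def plan_cost(self):
        # Hold the lock for the whole read, as a planner would need to for
        # one full planning cycle.
        with self._lock:
            return sum(sum(row) for row in self._grid)


def run_demo(iterations=1000):
    """Run a map updater and a planner concurrently; return the planner's readings."""
    costmap = SharedCostmap([[1, 1], [1, 1]])
    results = []

    def updater():
        # Alternate between an all-zero and an all-one grid.
        for i in range(iterations):
            costmap.replace([[i % 2] * 2, [i % 2] * 2])

    def planner():
        for _ in range(iterations):
            results.append(costmap.plan_cost())

    threads = [threading.Thread(target=updater), threading.Thread(target=planner)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    readings = run_demo()
    # Every reading should come from one complete grid: total cost 0 or 4.
    print(sorted(set(readings)))
```

If the real planner reads the costmap without this kind of guard while another process swaps the map, a crash that shows up only at goal locations, right when the map refreshes, would be consistent with the theory above.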

@warnellg
Contributor Author

warnellg commented Aug 3, 2017

OK, I tried out switching the global planner to navfn/NavfnROS today, and Leela ran without issue until the battery gave out.

Assuming this passes further testing, this would seem to indicate that our issue might be specific to the global_planner/GlobalPlanner code, or at least the particular way we interact with it.

I want to test this out a few more times in order to make sure this has really resolved our issue.

I also want the extra test runs in order to make sure Leela actually navigates smoothly in the hallways: I already found, for instance, that I needed to remove the obstacle_layer/footprint_clearing_enabled=false line in costmap_common_params.yaml in order for it to not get "stuck" sometimes.

@warnellg
Contributor Author

warnellg commented Aug 8, 2017

Tested this for a few more hours today and all seems well: Leela seemed to run smoothly and without crashing.

So I'm going to go ahead and merge this into master.

I'm also going to open another pull request, perhaps destined to last a very long time, to figure out how we can switch back to global_planner/GlobalPlanner. There, we could, for example, try building the latest navigation stack from source and see if recent changes have fixed this or if we can track down the bug ourselves.

@warnellg warnellg merged commit a82dae3 into master Aug 8, 2017
@warnellg warnellg deleted the fix-move-base-crash branch August 8, 2017 19:24