Segmentation fault in bt_navigator when passing a goal #1439
Comments
Hi @maxlein, I tried reproducing this issue. I gave a goal and repeatedly kept giving a new goal while it was already executing one. However, I do not see the problem (segmentation fault, or any other error) that you have mentioned. |
I am hitting this issue. I can reproduce it by setting a goal in rviz and then cancelling that goal in rviz. It usually takes 4 repetitions for the crash to occur. The line that causes the segfault is here in the behavior tree lib. I am running the head of the I am surprised by this: |
@Jconn we run 3.1.1: https://github.com/ros/rosdistro/blob/master/eloquent/distribution.yaml#L208 from ros2-devel, at the moment. Any suggestion of how to work around this or fix it? I think the big reason it hasn't been addressed yet is a lack of insight into how to resolve it. It may be worth also filing a ticket in BT.CPP and cross linking this one. |
I'm planning on digging into this more, but I'll be afk until Monday. Adding
what I know in the meantime in case someone else is digging into this too.
|
I appreciate it. Crashes are something I take very seriously, and I want to resolve them swiftly. It's not clear to me from this whether it's our fault for not being thread safe or whether it's the BT.CPP project's issue. Once you dig more, circle back and let me know. Edit: wait, 4 threads? 4 navigation threads or things from the cpp library? My impression was that BT.CPP was supposed to be single threaded, that's the whole point of BTs 😕 |
I don't know if this is directly related or not, but seems to be on topic to make you aware of: #1285 |
~4 navigation threads ticking things like the |
Each time we cancel a NavigateToPose goal, we call the behavior tree reset function , which sets the state of every node in the behavior tree to idle. The nav2 Even though we set the Because of the The next time we set a NavigateToPose goal, a different thread is spawned to handle the request. That thread starts running the behavior tree object that had its NavigateToPose request cancelled. This is how the I tested with a handful of rviz NavigateToPose requests, then rviz cancels, and I’m no longer able to reproduce when I clean up state at the end of a NavigateToPose request. I’m doing this by reinitializing the whole tree:
I don't really want to reset the whole dang tree, but I couldn't find any kind of |
This may explain the issue you linked too, #1285. In that scenario, the NavigateToPose request occurs, the action server spawns a thread to handle the request, the request is cancelled and the thread dies. Then, a separate thread resets the stack. When it resets the stack, the I don't understand how the action server makes threads for handling requests; the "new request, new thread" behavior that I'm claiming is only verified through print statements. |
Thank you for your debugging efforts, that is a very good summary that closes the loop on this. I'm curious from what you've seen - how long does it take to create a new BT? Is that creation something problematic (>100ms?) I'd merge a PR with that fix if the creation time is reasonable. |
I haven't characterized that beyond just some terminal staring. I was happy to just make some progress lol. I'll submit something later today or tomorrow once I do some timing and more testing. |
Sounds good, I moved these tickets over to "In progress". Thanks for taking the lead on this. I look forward to no more crashes :-) |
Well, the tree recreation seems to fix the issue for me, but when I use the standard "navigate_w_replanning_and_recovery" behavior tree, the recreation takes anywhere between 200 and 300 milliseconds. I was checking open PRs and I think @mjeronimo probably knew all this Coro node stuff when he made #1322, and there was some hesitation there about making this kind of change. @bpwilcox I see you were tagged in that PR to look into alternative solutions. Did you find anything? |
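For anyone who wants to repeat the timing, here is a minimal sketch of how the creation cost could be measured. It assumes nothing about BT.CPP or nav2: `measureMs`, `fakeBuildTree`, and `withinBudget` are hypothetical names, and `fakeBuildTree` only sleeps to stand in for the real tree construction call. The 100 ms budget in the usage note is the threshold asked about earlier in this thread.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Returns the wall-clock time, in milliseconds, that one call to `build` takes.
double measureMs(const std::function<void()>& build) {
  const auto start = std::chrono::steady_clock::now();
  build();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical stand-in for tree construction; swap in the real factory call.
void fakeBuildTree() {
  std::this_thread::sleep_for(std::chrono::milliseconds(20));
}

// True when the measured time fits inside the budget we are willing to block for.
bool withinBudget(double elapsed_ms, double budget_ms) {
  return elapsed_ms <= budget_ms;
}
```

Calling `withinBudget(measureMs(fakeBuildTree), 100.0)` would then answer the ">100ms?" question for whatever builder is passed in. A single measurement is noisy, so averaging several runs is probably worthwhile.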
200-300 ms is a little high for me to be comfortable with long term, but if it gets rid of crashes, I think it's not a terrible idea to start with. Any other suggestions that would avoid that overhead? I think in that thread I suggested a BT pointer swap where we, on bringup, initialize 2 trees and execute 1. The next time around we use the cached tree and spin up a thread to kill and reinitialize the first tree (and repeat). This way we don't block execution while spinning up the tree. Also, I wouldn't expect a response from them anytime soon, Intel's moved folks around. |
I couldn't think of any other ways, unfortunately. I think the ping-pong strategy you mention is good. I cherry-picked #1322 onto master and I'm testing that. I will file an issue in the behavior tree repo to get some input on whether there's a lighter fix. I would prefer to get the slow creation merged in and then iterate on a more efficient solution. |
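The pointer-swap idea above could look roughly like the sketch below. This is an illustration under stated assumptions, not nav2 or BT.CPP code: `Tree`, `TreeSwapper`, and `makeCountingFactory` are hypothetical names, and the `Tree` struct is a placeholder for whatever the real factory builds. One tree is active, a spare is built in the background with `std::async`, and `recycle()` swaps them and rebuilds the used one off the calling thread.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <future>
#include <memory>
#include <utility>

// Placeholder for a behavior tree; NOT a BT.CPP type.
struct Tree {
  int id = 0;
  bool dirty = false;  // stands in for stale state left by a cancelled goal
};

class TreeSwapper {
public:
  using Factory = std::function<std::unique_ptr<Tree>()>;

  // Builds the active tree right away and starts building the spare in the background.
  explicit TreeSwapper(Factory factory)
  : factory_(std::move(factory)),
    active_(factory_()),
    spare_(std::async(std::launch::async, factory_)) {}

  Tree& active() { return *active_; }

  // Call when a goal finishes or is cancelled.
  void recycle() {
    // Blocks only if the background rebuild has not finished yet.
    std::unique_ptr<Tree> fresh = spare_.get();
    std::unique_ptr<Tree> used = std::move(active_);
    active_ = std::move(fresh);
    // Destroy the used tree and build its replacement on another thread.
    spare_ = std::async(std::launch::async,
      [f = factory_, old = std::move(used)]() mutable {
        old.reset();
        return f();
      });
  }

private:
  Factory factory_;
  std::unique_ptr<Tree> active_;
  std::future<std::unique_ptr<Tree>> spare_;
};

// Demo factory: hands out trees with increasing ids so swaps are observable.
TreeSwapper::Factory makeCountingFactory() {
  auto counter = std::make_shared<std::atomic<int>>(0);
  return [counter] {
    auto tree = std::make_unique<Tree>();
    tree->id = counter->fetch_add(1);
    return tree;
  };
}
```

The trade-off is memory for two trees plus the requirement that the factory is safe to call from a second thread. If goals arrive faster than a rebuild completes, `recycle()` still blocks for the remainder of the build.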
Makes sense, link that ticket here and keep me posted :-) |
I am joining this thread much later than I should... First of all, I see a big problem with As pointed out by @Jconn, this does not correctly reset the running nodes. I think the correct implementation is:
void resetTree(BT::TreeNode * root_node)
{
  root_node->halt();
  root_node->setStatus(BT::NodeStatus::IDLE);
}
The only correct way to do it is to let the halt() callback propagate through the tree, from the root. About the coroutine, I have two comments:
At this point, I understand that the only thing I can do is to join the development effort, but I don't know when I will be able to commit. It would be VERY helpful if we can create an example, separated from ROS and Navigation2, to reproduce the issue. |
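As a small step toward an example separated from ROS and Navigation2, here is a toy sketch of why halting from the root differs from overwriting node statuses. It does not use BT.CPP and does not reproduce the segfault: `Node`, `naiveReset`, `resetFromRoot`, and the `resource_held` flag are all hypothetical, with the flag standing in for whatever a running node owns (a coroutine, a pending action, and so on).

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy stand-ins for illustration only; these are NOT the BT.CPP classes.
enum class Status { Idle, Running, Success, Failure };

struct Node {
  Status status = Status::Idle;
  bool resource_held = false;  // stands in for a coroutine or pending action
  std::vector<std::unique_ptr<Node>> children;

  // Begin "work": the node becomes running and acquires its resource.
  void start() {
    status = Status::Running;
    resource_held = true;
  }

  // Halt releases what this node holds after halting its children.
  void halt() {
    for (auto& child : children) {
      child->halt();
    }
    resource_held = false;
    status = Status::Idle;
  }
};

// Naive reset: overwrites every status but never runs the halt logic.
void naiveReset(Node& node) {
  node.status = Status::Idle;
  for (auto& child : node.children) {
    naiveReset(*child);
  }
}

// Reset by letting halt() propagate down from the root.
void resetFromRoot(Node& root) {
  root.halt();
}

// True if any node in the subtree still holds its resource.
bool anyResourceHeld(const Node& node) {
  if (node.resource_held) {
    return true;
  }
  for (const auto& child : node.children) {
    if (anyResourceHeld(*child)) {
      return true;
    }
  }
  return false;
}

// Builds a root with one child that is in the middle of running.
Node makeRunningTree() {
  Node root;
  root.status = Status::Running;
  root.children.push_back(std::make_unique<Node>());
  root.children[0]->start();
  return root;
}
```

After `naiveReset`, every node reports idle but the child still holds its resource, which is the kind of stale state a later tick on another thread could trip over. After `resetFromRoot`, nothing is held.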
Bug report
I was playing around with navigation stack with default, out of the box planners, behavior trees, etc. and I got a seg fault when passing a goal while shuttle was already executing one.
Required Info:
Steps to reproduce issue
Don't know it exactly anymore, but it was something like this:
Expected behavior
No seg fault, some error output
Actual behavior
Segmentation fault
Additional information