Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix (#3109) deadlock when costmap receives new map #3145

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion nav2_costmap_2d/include/nav2_costmap_2d/static_layer.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -183,8 +183,8 @@ class StaticLayer : public CostmapLayer
unsigned char unknown_cost_value_;
bool trinary_costmap_;
bool map_received_{false};
bool map_received_in_update_bounds_{false};
tf2::Duration transform_tolerance_;
std::atomic<bool> update_in_progress_;
nav_msgs::msg::OccupancyGrid::SharedPtr map_buffer_;
// Dynamic parameters handler
rclcpp::node_interfaces::OnSetParametersCallbackHandle::SharedPtr dyn_params_handler_;
Expand Down
23 changes: 8 additions & 15 deletions nav2_costmap_2d/plugins/static_layer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -161,7 +161,7 @@ StaticLayer::getParameters()
// Enforce bounds
lethal_threshold_ = std::max(std::min(temp_lethal_threshold, 100), 0);
map_received_ = false;
update_in_progress_.store(false);
map_received_in_update_bounds_ = false;

transform_tolerance_ = tf2::durationFromSec(temp_tf_tol);

Expand Down Expand Up @@ -277,17 +277,13 @@ StaticLayer::interpretValue(unsigned char value)
void
StaticLayer::incomingMap(const nav_msgs::msg::OccupancyGrid::SharedPtr new_map)
{
std::lock_guard<Costmap2D::mutex_t> guard(*getMutex());
if (!map_received_) {
map_received_ = true;
processMap(*new_map);
}
if (update_in_progress_.load()) {
map_buffer_ = new_map;
} else {
processMap(*new_map);
Copy link
Contributor

@gimait gimait Aug 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will probably never be a problem, but please note that part of the processMap method is not thread-safe if we don't lock the Costmap2D mutex here as well. There is also a second call to this mutex within processMap, which might be able to produce the same deadlock that we are fixing.
The best way to go would probably find a way to move processMap to the main thread completely.
Again, this will probably never be a problem, and I might be wrong about this, but just in case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I removed it once, but I got some errors in the test.

[133.395s] The following tests FAILED:
[133.395s] 	 15 - test_collision_checker (Timeout)
[133.395s] 	 17 - inflation_tests (Timeout)

These tests use LayeredCostmap and StaticLayer, but do not call updateMap and wait for the costmap to be ready. So they got a timeout.

So I needed to keep this call here and keep it mutex free because the mutex can cause a deadlock at the very beginning.

Do you want me to remove the call and fix the tests too?

Copy link
Contributor

@gimait gimait Aug 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you can simply remove it. If we want to remove it from here, we will need to do some other changes. There might be several places where the main loop might attempt to make use of the first map when was received, but not yet processed. We need to ensure that all places where the processed map is used are aware of the possibility of the map not being yet initialised.

Then, in my opinion, it is good that that test fails, it should, because you are changing the behaviour of this plugin. However, the tests failing are not a reason to not do the full change to processing in the main loop. If we want to move that last call to processMap, then we should fix these tests and/or write new ones that fit the current implementation.

So I needed to keep this call here and keep it mutex free because the mutex can cause a deadlock at the very beginning.

I understand that you are focused on fixing the deadlock, but that doesn't mean you can simply leave unprotected code behind.
There are three points I want to do here, which I'm not sure have been considered yet:

First, in line 225 of the static layer we are locking the same mutex. This means you might not have fixed the lock entirely, only you don't see it anymore because it only happens with subsequent calls to incomingMap.

Second, all this code is retrieving data and performing operations on the layered costmap without protection, which to me, sounds like a very bad idea.

Lastly, setting map_received_ without protection might bring undesired behaviour when checking it from the main loop. For example, in the main thread, you could have just called updateBounds with map_received_ set to false, followed by a updateCosts with map_received_ set to false.

As a rule of thumb, never leave a variable shared between threads unprotected. I am sorry, I know I am being a bit bipolar about this. I sometimes feel like it's okay to break that rule, but things here look a bit too complex to leave the code unprotected. I think I wasn't thorough enough about the threading issues before my previous review, I apologize.


To sum up, you should move back the lock to the beginning of this method. I don't think you will see the deadlock if you do so. Then it's up to you to remove the last processMap call here and fix the tests.

Any thoughts @SteveMacenski? I think you know better if it's best to remove that last call to processMap or keep it.

Copy link
Member

@SteveMacenski SteveMacenski Aug 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand the issue. In steady state, processMap is only called in the updateBounds function, so we're really only talking about the very first map when the current costmap is not set to any particular valid size/content.

In the situation we're talking about with the first map, the map_received_ boolean at the start of both Update functions will make them immediately return without trying to do anything (if false) - which is before any map is processed.

So, if no map is processed before updateBounds its OK. If its processed before that point in the map callback, also OK.

If no map is processed before updateCosts and updateBounds, OK. If its processed after updateBounds but before updateCosts, that's where it could potentially be not OK. I think my comment just below lays out a solution to that (maintain the state of the "updatedness" from updateBounds to also use in updateCosts).

But in terms of leaving things unprotected, this seems like it does that fine? Because of the special case that map_received_ immediately returns the functions without doing anything.

map_buffer_ = nullptr;
map_received_ = true;
return;
}
std::lock_guard<Costmap2D::mutex_t> guard(*getMutex());
map_buffer_ = new_map;
}

void
Expand Down Expand Up @@ -338,11 +334,12 @@ StaticLayer::updateBounds(
double * max_y)
{
if (!map_received_) {
map_received_in_update_bounds_ = false;
return;
}
map_received_in_update_bounds_ = true;

std::lock_guard<Costmap2D::mutex_t> guard(*getMutex());
update_in_progress_.store(true);

// If there is a new available map, load it.
if (map_buffer_) {
Expand Down Expand Up @@ -378,17 +375,15 @@ StaticLayer::updateCosts(
{
std::lock_guard<Costmap2D::mutex_t> guard(*getMutex());
if (!enabled_) {
update_in_progress_.store(false);
return;
}
if (!map_received_) {
if (!map_received_in_update_bounds_) {
static int count = 0;
// throttle warning down to only 1/10 message rate
if (++count == 10) {
RCLCPP_WARN(logger_, "Can't update static costmap layer, no map received");
count = 0;
}
update_in_progress_.store(false);
return;
}

Expand All @@ -411,7 +406,6 @@ StaticLayer::updateCosts(
transform_tolerance_);
} catch (tf2::TransformException & ex) {
RCLCPP_ERROR(logger_, "StaticLayer: %s", ex.what());
update_in_progress_.store(false);
return;
}
// Copy map data given proper transformations
Expand All @@ -436,7 +430,6 @@ StaticLayer::updateCosts(
}
}
}
update_in_progress_.store(false);
current_ = true;
}

Expand Down