Fix a race condition in rmw_wait. #160

Closed
wants to merge 4 commits into from

Conversation

Yadunund
Member

This PR cherry-picks the fix to rmw_wait from #153.

To very briefly explain, rmw_wait:

  1. Checks to see if any of the entities (subscriptions, clients, etc) have data ready to go.
  2. If they have data ready to go, then we skip attaching the condition variable and waiting.
  3. If they do not have data ready to go, then we attach the condition variable to all entities, take the condition variable lock, and call wait_for/wait on the condition variable.
  4. Regardless of whether we did 2 or 3, we check every entity to see if there is data ready, and mark that as appropriate in the wait set.

There is a race in all of this, however. If data comes in after we've checked the entity (1), but before we've attached the condition variable (3), then we will never be woken up. In most cases, this means that we'll wait the full timeout for the wait_for, which is not what we want.

Fix this by adding another step to step 3: after we've locked the condition variable mutex, check the entities again. Since the entities are also changed to take that lock before notifying, no change made by an entity can be lost.
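
For illustration, here is a minimal sketch of the fixed flow. This is not the actual rmw_zenoh_cpp code: Entity, has_data(), attach_condition(), and detach_condition() are placeholder names, and the entities are assumed to lock the same mutex before notifying.

#include <chrono>
#include <condition_variable>
#include <mutex>
#include <vector>

// Placeholder interface standing in for subscriptions, clients, services, etc.
struct Entity
{
  virtual ~Entity() = default;
  virtual bool has_data() const = 0;
  virtual void attach_condition(std::condition_variable * cv, std::mutex * m) = 0;
  virtual void detach_condition() = 0;
};

// Returns true if any entity has data ready by the time we return.
bool wait_for_data(std::vector<Entity *> & entities, std::chrono::nanoseconds timeout)
{
  std::condition_variable cv;
  std::mutex cv_mutex;

  auto any_ready = [&entities]() {
      for (const Entity * e : entities) {
        if (e->has_data()) {
          return true;
        }
      }
      return false;
    };

  // Steps 1 and 2: if something is already ready, skip waiting entirely.
  if (!any_ready()) {
    // Step 3: attach the condition variable so that incoming data notifies us.
    // The entities are assumed to lock cv_mutex before calling cv.notify_one().
    for (Entity * e : entities) {
      e->attach_condition(&cv, &cv_mutex);
    }

    std::unique_lock<std::mutex> lock(cv_mutex);
    // The extra step added by this PR: re-check while holding the lock.  Data
    // that arrived between the first check and attaching the condition variable
    // is caught here instead of silently waiting out the full timeout.
    if (!any_ready()) {
      cv.wait_for(lock, timeout, any_ready);
    }
    lock.unlock();

    for (Entity * e : entities) {
      e->detach_condition();
    }
  }

  // Step 4: report whether any entity has data ready.
  return any_ready();
}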

clalancette and others added 2 commits April 18, 2024 14:41
Signed-off-by: Chris Lalancette <clalancette@gmail.com>
Signed-off-by: Yadunund <yadunund@gmail.com>
@MichaelOrlov

@clalancette A friendly ping to handle this issue.
At the waffle meeting, we decided to assign it to you, but I don't have permission to do so.

@clalancette self-assigned this Apr 26, 2024
{
  static_cast<void>(events);

  if (guard_conditions) {
    for (size_t i = 0; i < guard_conditions->guard_condition_count; ++i) {
      GuardCondition * gc = static_cast<GuardCondition *>(guard_conditions->guard_conditions[i]);
      if (gc != nullptr) {
-       if (gc->get_and_reset_trigger()) {
+       if (gc->get_trigger()) {
Member

Just for posterity. What I think was happening here is that checking for triggered conditions wasn't idempotent and was actually mutating the underlying conditions. Now we are only reading the condition (and not resetting the trigger) each time we are checking.

Collaborator

> Just for posterity. What I think was happening here is that checking for triggered conditions wasn't idempotent and was actually mutating the underlying conditions. Now we are only reading the condition (and not resetting the trigger) each time we are checking.

Yep, you are exactly right.
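
To make that concrete, here is a minimal, hypothetical sketch of the two accessors; the member names are illustrative and are not taken from the actual GuardCondition in rmw_zenoh_cpp.

#include <mutex>

class GuardCondition
{
public:
  void trigger()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    has_triggered_ = true;
  }

  // Non-idempotent: observing the trigger also clears it, so a second check in
  // the same rmw_wait pass would see "not triggered" and could lose the event.
  bool get_and_reset_trigger()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    bool value = has_triggered_;
    has_triggered_ = false;
    return value;
  }

  // Idempotent: safe to call repeatedly while deciding whether to wait; the
  // trigger is reset separately, once the wait set has been finalized.
  bool get_trigger() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return has_triggered_;
  }

private:
  mutable std::mutex mutex_;
  bool has_triggered_ = false;
};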

Collaborator

Just as an FYI, I've come up with a completely different approach to implementing rmw_wait on https://github.com/ros2/rmw_zenoh/tree/clalancette/rewrite-rmw-wait . It needs more work, but I think it has more promise than this PR to be much more performant. I haven't opened a PR yet since it is still WIP.

@Yadunund
Member Author

Superseded by #203
