Fix a race condition in rmw_wait. #160
Conversation
To very briefly explain, rmw_wait:

1. Checks to see if any of the entities (subscriptions, clients, etc.) have data ready to go.
2. If they have data ready to go, then we skip attaching the condition variable and waiting.
3. If they do not have data ready to go, then we attach the condition variable to all entities, take the condition variable lock, and call wait_for/wait on the condition variable.
4. Regardless of whether we did 2 or 3, we check every entity to see if there is data ready, and mark that as appropriate in the wait set.

There is a race in all of this, however. If data comes in after we've checked the entity (1), but before we've attached the condition variable (3), then we will never be woken up. In most cases, this means that we'll wait the full timeout for the wait_for, which is not what we want.

Fix this by adding another step to 3. In particular, after we've locked the condition variable mutex, check the entities again. Since we change the entities to *also* take the lock before we notify, this ensures that the entities cannot make changes that get lost.

Signed-off-by: Chris Lalancette <clalancette@gmail.com>
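To make the race and the fix concrete, here is a minimal, self-contained sketch of the pattern described above. This is not the actual rmw_zenoh code: `WaitSet`, `WaitEntity`, `has_data()`, `notify_data()`, and `wait_with_recheck()` are illustrative names, and the attach/detach bookkeeping for the condition variable is omitted for brevity.

```cpp
// Hedged sketch only; names and structure are simplified, not rmw_zenoh's API.
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <vector>

struct WaitSet
{
  std::mutex condition_mutex;
  std::condition_variable condition_variable;
};

struct WaitEntity
{
  std::atomic<bool> data_ready{false};

  bool has_data() const {return data_ready.load();}

  // Data-arrival path (e.g. a transport callback).  The other half of the fix:
  // take the wait set's mutex *before* publishing the flag and notifying, so
  // the update cannot slip in between the waiter's locked re-check and its
  // call to wait_for().
  void notify_data(WaitSet & wait_set)
  {
    std::lock_guard<std::mutex> lock(wait_set.condition_mutex);
    data_ready.store(true);
    wait_set.condition_variable.notify_one();
  }
};

// Waiting path, mirroring steps 1-4 above plus the added re-check under the lock.
bool wait_with_recheck(
  std::vector<WaitEntity *> & entities, WaitSet & wait_set,
  std::chrono::nanoseconds timeout)
{
  auto any_ready = [&entities]() {
      for (const WaitEntity * e : entities) {
        if (e->has_data()) {
          return true;
        }
      }
      return false;
    };

  // Steps 1-2: if something is already ready, skip waiting entirely.
  if (!any_ready()) {
    // Step 3: take the condition variable lock ...
    std::unique_lock<std::mutex> lock(wait_set.condition_mutex);
    // ... and (the fix) check again while holding it.  Data that arrived
    // between the first check and this point is caught here; data that
    // arrives later must go through notify_data(), which blocks on the same
    // mutex until we are inside wait_for() and can be woken up.
    if (!any_ready()) {
      wait_set.condition_variable.wait_for(lock, timeout);
    }
  }

  // Step 4: report whether any entity ended up with data ready.
  return any_ready();
}
```

The key property is that the data-arrival path and the waiter's re-check serialize on the same mutex, so either the re-check observes the new data or the notification is delivered while the waiter is inside wait_for().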
@clalancette A friendly ping to handle this issue.
Signed-off-by: Yadunund <yadunund@openrobotics.org>
{
  static_cast<void>(events);

  if (guard_conditions) {
    for (size_t i = 0; i < guard_conditions->guard_condition_count; ++i) {
      GuardCondition * gc = static_cast<GuardCondition *>(guard_conditions->guard_conditions[i]);
      if (gc != nullptr) {
-       if (gc->get_and_reset_trigger()) {
+       if (gc->get_trigger()) {
Just for posterity. What I think was happening here is that checking for triggered conditions wasn't idempotent and was actually mutating the underlying conditions. Now we are only reading the condition (and not resetting the trigger) each time we are checking.
Yep, you are exactly right.
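To spell that out, here is a hedged sketch of the distinction (not the actual rmw_zenoh `GuardCondition` implementation, just an illustration of the two method shapes that appear in the diff above):

```cpp
// Illustrative only; the real GuardCondition internals are assumed here.
#include <mutex>

class GuardCondition
{
public:
  void trigger()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    triggered_ = true;
  }

  // Read-only check: safe to call any number of times (e.g. once before
  // waiting and once again after), which is what the re-check pattern needs.
  bool get_trigger() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return triggered_;
  }

  // Destructive check: returns the trigger state and clears it.  Using this
  // inside the wait path meant the first check consumed the trigger, so any
  // later check in the same rmw_wait call would report "not triggered".
  bool get_and_reset_trigger()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    bool was_triggered = triggered_;
    triggered_ = false;
    return was_triggered;
  }

private:
  mutable std::mutex mutex_;
  bool triggered_ = false;
};
```

With the read-only `get_trigger()`, the initial check, the re-check under the lock, and the final collection pass can all observe the same triggered state; the destructive `get_and_reset_trigger()` would have consumed it on the first read.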
Just as an FYI, I've come up with a completely different approach to implementing rmw_wait on https://github.com/ros2/rmw_zenoh/tree/clalancette/rewrite-rmw-wait . It needs more work, but I also think it has the potential to be much more performant than this PR. I haven't opened a PR yet since it is still WIP.
Superseded by #203
This PR cherry picks the fix to rmw_wait from #153