Fix a race condition in rmw_wait. #160

Closed
wants to merge 4 commits into from

Conversation

Yadunund
Member

This PR cherry-picks the fix to rmw_wait from #153.

To very briefly explain, rmw_wait:

  1. Checks to see if any of the entities (subscriptions, clients, etc) have data ready to go.
  2. If they have data ready to go, then we skip attaching the condition variable and waiting.
  3. If they do not have data ready to go, then we attach the condition variable to all entities, take the condition variable lock, and call wait_for/wait on the condition variable.
  4. Regardless of whether we did 2 or 3, we check every entity to see if there is data ready, and mark that as appropriate in the wait set.

There is a race in all of this, however. If data comes in after we've checked the entity (1), but before we've attached the condition variable (3), then we will never be woken up. In most cases, this means that we'll wait the full timeout for the wait_for, which is not what we want.

Fix this by adding another step to step 3: after we've locked the condition variable mutex, check the entities again. Since the entities are also changed to take that lock before notifying, no change made by an entity can be lost.
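
For illustration, here is a minimal sketch of the fixed flow. This is not the actual rmw_zenoh_cpp code: Entity, has_data(), attach_condition(), and detach_condition() are placeholder names, and the entities are assumed to lock the same mutex before notifying.

#include <chrono>
#include <condition_variable>
#include <mutex>
#include <vector>

// Placeholder interface standing in for subscriptions, clients, services, etc.
struct Entity
{
  virtual ~Entity() = default;
  virtual bool has_data() const = 0;
  virtual void attach_condition(std::condition_variable * cv, std::mutex * m) = 0;
  virtual void detach_condition() = 0;
};

// Returns true if any entity has data ready by the time we return.
bool wait_for_data(std::vector<Entity *> & entities, std::chrono::nanoseconds timeout)
{
  std::condition_variable cv;
  std::mutex cv_mutex;

  auto any_ready = [&entities]() {
      for (const Entity * e : entities) {
        if (e->has_data()) {
          return true;
        }
      }
      return false;
    };

  // Steps 1 and 2: if something is already ready, skip waiting entirely.
  if (!any_ready()) {
    // Step 3: attach the condition variable so that incoming data notifies us.
    // The entities are assumed to lock cv_mutex before calling cv.notify_one().
    for (Entity * e : entities) {
      e->attach_condition(&cv, &cv_mutex);
    }

    std::unique_lock<std::mutex> lock(cv_mutex);
    // The extra step added by this PR: re-check while holding the lock.  Data
    // that arrived between the first check and attaching the condition variable
    // is caught here instead of silently waiting out the full timeout.
    if (!any_ready()) {
      cv.wait_for(lock, timeout, any_ready);
    }
    lock.unlock();

    for (Entity * e : entities) {
      e->detach_condition();
    }
  }

  // Step 4: report whether any entity has data ready.
  return any_ready();
}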

clalancette and others added 2 commits April 18, 2024 14:41
Signed-off-by: Chris Lalancette <clalancette@gmail.com>
Signed-off-by: Yadunund <yadunund@gmail.com>
@MichaelOrlov

@clalancette A friendly ping to handle this issue.
At the waffle meeting, we decided to assign it to you, but I don't have permission to do so.

@clalancette self-assigned this Apr 26, 2024
{
  static_cast<void>(events);

  if (guard_conditions) {
    for (size_t i = 0; i < guard_conditions->guard_condition_count; ++i) {
      GuardCondition * gc = static_cast<GuardCondition *>(guard_conditions->guard_conditions[i]);
      if (gc != nullptr) {
-       if (gc->get_and_reset_trigger()) {
+       if (gc->get_trigger()) {
Member

Just for posterity. What I think was happening here is that checking for triggered conditions wasn't idempotent and was actually mutating the underlying conditions. Now we are only reading the condition (and not resetting the trigger) each time we are checking.

Collaborator

> Just for posterity. What I think was happening here is that checking for triggered conditions wasn't idempotent and was actually mutating the underlying conditions. Now we are only reading the condition (and not resetting the trigger) each time we are checking.

Yep, you are exactly right.
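
To make that concrete, here is a minimal, hypothetical sketch of the two accessors; the member names are illustrative and are not taken from the actual GuardCondition in rmw_zenoh_cpp.

#include <mutex>

class GuardCondition
{
public:
  void trigger()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    has_triggered_ = true;
  }

  // Non-idempotent: observing the trigger also clears it, so a second check in
  // the same rmw_wait pass would see "not triggered" and could lose the event.
  bool get_and_reset_trigger()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    bool value = has_triggered_;
    has_triggered_ = false;
    return value;
  }

  // Idempotent: safe to call repeatedly while deciding whether to wait; the
  // trigger is reset separately, once the wait set has been finalized.
  bool get_trigger() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return has_triggered_;
  }

private:
  mutable std::mutex mutex_;
  bool has_triggered_ = false;
};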

Collaborator

Just as an FYI, I've come up with a completely different approach to implementing rmw_wait on https://github.com/ros2/rmw_zenoh/tree/clalancette/rewrite-rmw-wait . It needs more work, but I think it has more promise than this PR to be much more performant. I haven't opened a PR yet since it is still WIP.

@Yadunund
Member Author

Superseded by #203
