Flaky gtest_multithreaded__rmw_{opensplice,fastrtps}_cpp #180
Comments
I did some debugging into the problem today. First of all, I'm able to reproduce it much more readily on machines with more cores (it happened far more often on a 12-core machine than on a 4-core one). Putting additional load on the machine also made the problem happen more readily. This points to a race condition somewhere. I haven't found the root cause, but I'll write down some notes about what I have found.

In the failing cases, several threads in the MultiThreadedExecutor got hung up trying to acquire the wait_mutex_. This happens because one of the threads previously acquired the lock, but then got stuck down in get_next_executable -> wait_for_work, which ends up in the rmw implementation and hence waiting for work in the DDS implementation.

This suggests to me that we have already missed the event we are about to wait for, and that we somehow failed to check for it before waiting for more work to do. This needs to be investigated more, as it is a real problem (although somewhat rare).
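To make the suspected failure mode concrete, here is a minimal sketch of a lost wakeup using only the C++ standard library. It is not the rclcpp or rmw code: `WorkSignal`, `notify`, `wait_unchecked` and `wait_checked` are hypothetical names made up for illustration, and whether the real hang in wait_for_work follows this pattern is still an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for "something to wait on"; NOT the rclcpp/rmw API.
class WorkSignal {
public:
  // Record that work is available, then wake one waiter.
  void notify() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pending_ = true;
    }
    cv_.notify_one();
  }

  // Lost-wakeup-prone: blocks without looking at pending_ first. If notify()
  // ran before this call, the notification is already gone, so the caller
  // sleeps until the timeout (or a spurious wakeup).
  bool wait_unchecked(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_.wait_for(lock, timeout) == std::cv_status::no_timeout;
  }

  // Safer: the predicate is evaluated under the lock before blocking and
  // after every wakeup, so work signalled earlier is still seen.
  bool wait_checked(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    const bool got_work = cv_.wait_for(lock, timeout, [this] { return pending_; });
    if (got_work) {
      pending_ = false;  // consume the event
    }
    return got_work;
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool pending_ = false;
};
```

If `notify()` is called before `wait_checked()`, the wait returns true immediately. The same ordering with `wait_unchecked()` blocks for the full timeout, which is roughly what a thread holding wait_mutex_ while stuck in wait_for_work would look like to the other executor threads.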
Potential fix is in ros2/rclcpp#703
Did this patch change anything related to this issue?
No, it seems to happen as frequently as before.
The last time that this happened was on August 28, in https://ci.ros2.org/view/nightly/job/nightly_linux_extra_rmw_release/448/. While I don't know what fixed it, it hasn't shown up in a month. I'm going to close this out for now and we can reopen if the problem reappears.
The gtest_multithreaded tests seem to be flaky. None of the direct links seem to point to actual test data, so here is a link to last night's full console output: https://ci.ros2.org/view/nightly/job/nightly_linux_extra_rmw_release/318/consoleFull (search for "Start 12: gtest_multithreaded__rmw_fastrtps_cpp" for Fast-RTPS and "Start 62: gtest_multithreaded__rmw_opensplice_cpp" for OpenSplice). The tests seem to fail every couple of nights: builds 318 and 316 failed, while 317 succeeded.