This repository has been archived by the owner on Feb 4, 2021. It is now read-only.

Flaky gtest_multithreaded__rmw_{opensplice,fastrtps}_cpp #180

Closed
clalancette opened this issue Apr 19, 2019 · 5 comments

Comments

@clalancette

The gtest_multithreaded tests seem to be flaky. None of the direct links seem to point to actual test data, so here's a link to last night's full console output: https://ci.ros2.org/view/nightly/job/nightly_linux_extra_rmw_release/318/consoleFull (search for "Start 12: gtest_multithreaded__rmw_fastrtps_cpp" for Fast-RTPS and "Start 62: gtest_multithreaded__rmw_opensplice_cpp" for OpenSplice). The tests seem to fail every couple of nights: builds 318 and 316 failed, while 317 succeeded.

@clalancette
Author

I did some debugging into the problem today. First of all, I'm able to make it happen much more readily on machines with more cores (it failed a lot more regularly with 12 cores than with 4). Putting additional load on the machine also made the problem happen more readily. Both observations point to a race condition somewhere.

While I haven't found the root cause, I'll write down some notes about what I have found.

In the failing cases, several threads in the MultiThreadedExecutor got hung up trying to acquire the wait_mutex_. This happens because one of the threads had previously acquired the lock, but then got stuck in get_next_executable -> wait_for_work, which calls down into the rmw implementation and hence ends up waiting for work in the DDS implementation.

This suggests to me that we have already missed the event we are about to wait for, and that we somehow fail to check for it before waiting for more work. This needs more investigation, as it is a real problem (although a somewhat rare one).
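
To illustrate the kind of missed-event race I have in mind, here is a minimal sketch using a plain std::condition_variable. It is not the rclcpp or rmw code, and I am not claiming this is where the bug is; WorkSignal and its methods are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for "work is available". Not an rclcpp or rmw type.
class WorkSignal
{
public:
  // Producer side: record the event under the lock, then wake one waiter.
  void notify()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pending_ = true;
    }
    cv_.notify_one();
  }

  // Racy consumer: blocks without looking at pending_ first. A notify()
  // that ran before this call is not seen, so the caller normally sits
  // here until the timeout while still holding up everyone behind it.
  bool wait_without_check(std::chrono::milliseconds timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_.wait_for(lock, timeout) == std::cv_status::no_timeout;
  }

  // Safer consumer: the predicate is evaluated before blocking and after
  // every wakeup, so an event recorded earlier is picked up immediately.
  bool wait_with_check(std::chrono::milliseconds timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    const bool got_work = cv_.wait_for(lock, timeout, [this] {return pending_;});
    assert(lock.owns_lock());  // wait_for always returns with the lock held
    if (got_work) {
      pending_ = false;  // consume the event
    }
    return got_work;
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool pending_ = false;
};
```

If notify() runs before the consumer starts waiting, wait_with_check() returns true right away, while wait_without_check() has nothing to wake it up. That second behavior is the shape of the hang I'm describing, assuming the real problem is of this kind.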

@clalancette
Author

Potential fix is in ros2/rclcpp#703

@dirk-thomas
Member

> Potential fix is in ros2/rclcpp#703

Did this patch change anything related to this issue?

@clalancette
Author

No, it seems to happen as frequently as before.

@clalancette
Author

The last time that this happened was on August 28, in https://ci.ros2.org/view/nightly/job/nightly_linux_extra_rmw_release/448/ . While I don't know what fixed it, it hasn't shown up in a month. I'm going to close this out for now and we can reopen if the problem reappears.
