
Data race fixes #2494

Status: Closed · wants to merge 7 commits
Conversation

jmachowinski (Contributor):

@mjcarroll @wjwwood

Make the executors more stable with our stack.

mjcarroll (Member):

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

mjcarroll (Member):

All test failures here are related to ros2/rosidl_typesupport_fastrtps#117

wjwwood (Member), Apr 9, 2024:

Can you explain the changes?

jmachowinski (Contributor, Author):

@wjwwood did you look at the commit messages of the individual commits?

jmachowinski (Contributor, Author):

7d0b532

Any rebuild request coming from a second thread may get lost if it is issued while the execution thread is in Executor::collect_entities(). This change makes the clearing of the request an atomic operation to fix that.
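
A minimal sketch of the race and the atomic fix (all names here are hypothetical; the actual rclcpp members differ):

```cpp
#include <atomic>

// Hypothetical sketch; not the actual rclcpp implementation.
class ExecutorSketch
{
public:
  // Called from any thread when the set of entities changes.
  void request_rebuild() {needs_rebuild_.store(true);}

  // Called from the execution thread (cf. Executor::collect_entities()).
  void collect_entities()
  {
    // exchange() reads and clears the flag in a single atomic step.
    // A non-atomic "if (needs_rebuild_) { needs_rebuild_ = false; }"
    // would lose a request issued by a second thread between the load
    // and the store, which is exactly the race described above.
    if (needs_rebuild_.exchange(false)) {
      // ... rebuild the wait set ...
    }
  }

private:
  std::atomic<bool> needs_rebuild_{false};
};
```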

a968a3c

The waitable did not rearm its guard condition if it had not been processed yet. In the long run I think we should have a better API for this, as it is a common pitfall.
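
A sketch of the pitfall, assuming a simplified waitable backed by an rclcpp::GuardCondition (the after_wait() hook and the flag name are hypothetical):

```cpp
#include <memory>
#include "rclcpp/guard_condition.hpp"

// Waiting on a guard condition consumes its trigger. If the waitable was
// ready but could not be executed before the next wait (e.g. its callback
// group was busy), the readiness is silently lost unless the guard
// condition is re-triggered ("rearmed").
struct WaitableSketch
{
  std::shared_ptr<rclcpp::GuardCondition> guard_condition =
    std::make_shared<rclcpp::GuardCondition>();
  bool ready_but_not_processed = false;

  // Hypothetical hook, called after a wait returns.
  void after_wait()
  {
    if (ready_but_not_processed) {
      // Rearm so the next wait reports this waitable as ready again.
      guard_condition->trigger();
    }
  }
};
```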

5d99e3b

After adding all of the timer_index++ handling in the outer code, the code was basically the same as the other next_XYZ() helpers, so I replaced it to make the code more uniform.

39e0d94

This fixes a possible data race: the on_trigger_callback might rely on the fact that the executor wakes up after the callback has executed.
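
A sketch of the ordering this implies, under the assumption that the fix is to wake the wait set before invoking the callback (names simplified; not the actual rclcpp GuardCondition code):

```cpp
#include <cstddef>
#include <functional>
#include <mutex>

// Simplified sketch, not the actual rclcpp implementation.
struct GuardConditionSketch
{
  std::mutex mutex;
  std::function<void(std::size_t)> on_trigger_callback;

  void trigger()
  {
    std::lock_guard<std::mutex> lock(mutex);
    // Wake the executor first ...
    trigger_underlying_handle();  // stand-in for rcl_trigger_guard_condition()
    // ... then run the callback, so the callback can rely on the executor
    // having been woken up. Doing this in the opposite order is the race
    // described above.
    if (on_trigger_callback) {
      on_trigger_callback(1u);
    }
  }

  void trigger_underlying_handle() {/* hypothetical */}
};
```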

mjcarroll (Member):

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

jmachowinski (Contributor, Author), Apr 10, 2024:

The unstable tests on Windows seem to be unrelated; see ros2/rmw_connextdds#136.

Comment on lines 682 to 699:

```diff
-  size_t current_timer_index = 0;
-  while (true) {
-    auto [timer, timer_index] = wait_result_->peek_next_ready_timer(current_timer_index);
-    if (nullptr == timer) {
-      break;
-    }
-    current_timer_index = timer_index;
+  while (auto timer = wait_result_->next_ready_timer()) {
     auto entity_iter = current_collection_.timers.find(timer->get_timer_handle().get());
     if (entity_iter != current_collection_.timers.end()) {
       auto callback_group = entity_iter->second.callback_group.lock();
-      if (callback_group && !callback_group->can_be_taken_from()) {
+      if (!callback_group || !callback_group->can_be_taken_from()) {
         continue;
       }
-      // At this point the timer is either ready for execution or perhaps
-      // it was canceled, based on the result of call(), but either way it
-      // should not be checked again from peek_next_ready_timer(), so clear
-      // it from the wait result.
-      wait_result_->clear_timer_with_index(current_timer_index);
```
Member:

We made the change to peek and conditionally clear after a lot of iterations, and I don't fully understand why this needs to be changed back to not peeking before clearing. Can you give a better explanation of why this isn't needed any more? Is it related to the logic change around checking the callback groups?

Specifically, if this function is called multiple times before the wait_result_ is discarded due to a new call to wait on the wait set, we need to ensure timers are not passed over due to callback group constraints on one call only to be ignored on subsequent calls. If we could ensure that wait is called again (and the wait_result_ is discarded) every time a callback group became available to be taken from, this wouldn't be needed. But I think for performance the idea is that wait results are fully processed before waiting again, in which case iterating over a timer, not calling it because of its callback group, but marking it as nullptr in the wait result anyway, would result in it getting skipped when it shouldn't be.
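
To make the scenario concrete, here is a heavily simplified sketch (hypothetical types; the real WaitResult API differs) of why "peek and conditionally clear" keeps a timer visible across multiple passes over the same wait result:

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical simplification of the wait result: a timer stays "ready"
// until it is explicitly cleared.
struct WaitResultSketch
{
  std::vector<bool> timer_ready;

  std::optional<std::size_t> peek_next_ready_timer(std::size_t start) const
  {
    for (std::size_t i = start; i < timer_ready.size(); ++i) {
      if (timer_ready[i]) {return i;}
    }
    return std::nullopt;
  }

  void clear_timer_with_index(std::size_t i) {timer_ready[i] = false;}
};

// Executor-side pass: a timer is only cleared once it is actually handled.
// If its callback group is busy, it is skipped but stays in the result, so
// a later pass over the same wait result can still execute it. An API that
// unconditionally clears on iteration would drop it on the first pass.
inline void process_timers(WaitResultSketch & result, bool group_can_be_taken)
{
  std::size_t index = 0;
  while (auto next = result.peek_next_ready_timer(index)) {
    index = *next;
    if (!group_can_be_taken) {
      ++index;  // keep the timer ready for a later pass
      continue;
    }
    // ... call the timer ...
    result.clear_timer_with_index(index);
  }
}
```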

jmachowinski (Contributor, Author):

I did not re-add this change for now, so we can go ahead with this PR.

Anyway, the current behavior of the WaitResult is inconsistent. Timers are the only objects that get the "do not clear" logic when the callback group is in use. I don't see a real reason for this; perhaps it was done to make some test happy?

All other objects in the WaitResult are cleared on iteration. Therefore I would suggest that we either also clear the timers, or move the callback_group->can_be_taken_from() check inside the WaitResult and not clear if blocked, to get consistent behavior.

wjwwood (Member), Apr 12, 2024:

The difference is that all the other waitable ROS entities, excluding guard conditions, stay ready on subsequent waits. And for guard conditions, if they represent something that needs to stay ready between waits, then that thing needs to re-trigger them, i.e. the PR I made about waitables and keeping guard conditions ready sometimes. That's why timers have a different API here.

This style of iteration is required to avoid having the wait result know anything about the callback group. We considered having an API that took a "can be executed callback" to do dependency injection, but we decided that was unnecessarily complicated and this API was simple by comparison and appeared to be more generic (remember the WaitSet and WaitResult objects were designed to be used directly as well as in the executor).

So while it was indeed done in pursuit of making a flaky test pass, I do think where we landed makes some sense, even if it's not immediately obvious why timers are special. It's possible that we want this style of "peek and clear" API for the other entities too, but I'd need to convince myself of that. This is all related to the decision of whether to wait again after each thing is executed or to instead burn down a wait result before waiting again.

jmachowinski (Contributor, Author), Apr 12, 2024:

> That's why timers have a different API here.

I'm confused about this statement. Timers stay ready on subsequent waits as well, as long as you don't call() them, which would not be done if the callback group is busy.

Review threads:
  • rclcpp/src/rclcpp/executor.cpp (resolved)
  • rclcpp/src/rclcpp/executors/executor_notify_waitable.cpp (outdated, resolved)
  • rclcpp/src/rclcpp/executors/executor_notify_waitable.cpp (outdated, resolved)
  • rclcpp/src/rclcpp/guard_condition.cpp (outdated, resolved)
  • rclcpp/src/rclcpp/executor.cpp (resolved)
Janosch Machowinski and others added 2 commits April 10, 2024 17:39
Signed-off-by: Janosch Machowinski <j.machowinski@cellumation.com>
Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Janosch Machowinski and others added 2 commits April 10, 2024 20:14
Before this change, the rebuild of the wait set would be triggered after the wait set was woken up. With bad timing, this could lead to the rebuild not happening with the multi-threaded executor.

Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Signed-off-by: Janosch Machowinski <j.machowinski@cellumation.com>
Janosch Machowinski and others added 3 commits April 11, 2024 11:22
Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Signed-off-by: Michael Carroll <mjcarroll@intrinsic.ai>
Co-authored-by: Janosch Machowinski <j.machowinski@nospam.org>
Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
wjwwood (Member) left a comment:

This LGTM, but I had some style fixes which I could not push because of permissions, so I opened #2500 instead; either integrate those changes here, or we can close this PR in favor of that one.

jmachowinski (Contributor, Author):

Closed in favor of #2500.

wjwwood (Member), Apr 12, 2024:

On my machine, the test TestAddCallbackGroupsToExecutor/StaticSingleThreadedExecutor.add_callback_groups_after_add_node_to_executor (for all rmw implementations) is timing out.

The following tests FAILED:
	 28 - test_add_callback_groups_to_executor__rmw_cyclonedds_cpp (Timeout)
	 29 - test_add_callback_groups_to_executor__rmw_fastrtps_cpp (Timeout)
	 30 - test_add_callback_groups_to_executor__rmw_fastrtps_dynamic_cpp (Timeout)

I'll run CI just to confirm:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

I wasn't able to see any obvious issue. The CI run is with my changes from #2500, but locally even this PR's changes have that problem.
