
Data race fixes #2494

Status: Closed · wants to merge 7 commits
Conversation

jmachowinski (Contributor):

@mjcarroll @wjwwood

Make the executors more stable with our stack.

mjcarroll (Member):

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

mjcarroll (Member):

All test failures here are related to ros2/rosidl_typesupport_fastrtps#117

wjwwood (Member), Apr 9, 2024:

Can you explain the changes?

jmachowinski (Contributor, Author):

@wjwwood did you look at the commit messages of the individual commits?

jmachowinski (Contributor, Author):

7d0b532

Any rebuild request coming from a second thread may get lost if it is issued while the execution thread is in Executor::collect_entities(). This change makes the clearing of the request an atomic operation to fix that.
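
A minimal sketch of the race and the atomic fix (all names here are hypothetical; the actual rclcpp members differ):

```cpp
#include <atomic>

// Hypothetical sketch; not the actual rclcpp implementation.
class ExecutorSketch
{
public:
  // Called from any thread when the set of entities changes.
  void request_rebuild() {needs_rebuild_.store(true);}

  // Called from the execution thread (cf. Executor::collect_entities()).
  void collect_entities()
  {
    // exchange() reads and clears the flag in a single atomic step.
    // A non-atomic "if (needs_rebuild_) { needs_rebuild_ = false; }"
    // would lose a request issued by a second thread between the load
    // and the store, which is exactly the race described above.
    if (needs_rebuild_.exchange(false)) {
      // ... rebuild the wait set ...
    }
  }

private:
  std::atomic<bool> needs_rebuild_{false};
};
```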

a968a3c

The waitable did not rearm its guard condition if it had not been processed yet. In the long run I think we should have a better API for this, as it is a common pitfall.
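
A sketch of the pitfall, assuming a simplified waitable backed by an rclcpp::GuardCondition (the after_wait() hook and the flag name are hypothetical):

```cpp
#include <memory>
#include "rclcpp/guard_condition.hpp"

// Waiting on a guard condition consumes its trigger. If the waitable was
// ready but could not be executed before the next wait (e.g. its callback
// group was busy), the readiness is silently lost unless the guard
// condition is re-triggered ("rearmed").
struct WaitableSketch
{
  std::shared_ptr<rclcpp::GuardCondition> guard_condition =
    std::make_shared<rclcpp::GuardCondition>();
  bool ready_but_not_processed = false;

  // Hypothetical hook, called after a wait returns.
  void after_wait()
  {
    if (ready_but_not_processed) {
      // Rearm so the next wait reports this waitable as ready again.
      guard_condition->trigger();
    }
  }
};
```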

5d99e3b

After adding all of the timer_index++ handling in the outer code, the code was basically the same as the other next_XYZ() helpers, so I replaced it to make the code more uniform.

39e0d94

This fixes a possible data race: the on_trigger_callback might rely on the fact that the executor wakes up after the callback has executed.
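
A sketch of the ordering this implies, under the assumption that the fix is to wake the wait set before invoking the callback (names simplified; not the actual rclcpp GuardCondition code):

```cpp
#include <cstddef>
#include <functional>
#include <mutex>

// Simplified sketch, not the actual rclcpp implementation.
struct GuardConditionSketch
{
  std::mutex mutex;
  std::function<void(std::size_t)> on_trigger_callback;

  void trigger()
  {
    std::lock_guard<std::mutex> lock(mutex);
    // Wake the executor first ...
    trigger_underlying_handle();  // stand-in for rcl_trigger_guard_condition()
    // ... then run the callback, so the callback can rely on the executor
    // having been woken up. Doing this in the opposite order is the race
    // described above.
    if (on_trigger_callback) {
      on_trigger_callback(1u);
    }
  }

  void trigger_underlying_handle() {/* hypothetical */}
};
```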

mjcarroll (Member):

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

jmachowinski (Contributor, Author), Apr 10, 2024:

The unstable tests on Windows seem to be unrelated; see ros2/rmw_connextdds#136.

Comment on lines 682 to 699:

```diff
-  size_t current_timer_index = 0;
-  while (true) {
-    auto [timer, timer_index] = wait_result_->peek_next_ready_timer(current_timer_index);
-    if (nullptr == timer) {
-      break;
-    }
-    current_timer_index = timer_index;
+  while (auto timer = wait_result_->next_ready_timer()) {
     auto entity_iter = current_collection_.timers.find(timer->get_timer_handle().get());
     if (entity_iter != current_collection_.timers.end()) {
       auto callback_group = entity_iter->second.callback_group.lock();
-      if (callback_group && !callback_group->can_be_taken_from()) {
+      if (!callback_group || !callback_group->can_be_taken_from()) {
         continue;
       }
-      // At this point the timer is either ready for execution or perhaps
-      // it was canceled, based on the result of call(), but either way it
-      // should not be checked again from peek_next_ready_timer(), so clear
-      // it from the wait result.
-      wait_result_->clear_timer_with_index(current_timer_index);
```
Member:

We made the change to peek and conditionally clear after a lot of iterations, and I don't fully understand why this needs to be changed back to not peeking before clearing. Can you give a better explanation of why this isn't needed any more? Is it related to the logic change around checking the callback groups?

Specifically, if this function is called multiple times before the wait_result_ is discarded due to a new call to wait on the wait set, we need to ensure timers are not passed over due to callback group constraints on one call only to be ignored on subsequent calls. If we could ensure that wait is called again (and the wait_result_ is discarded) every time a callback group became available to be taken from, this wouldn't be needed. But I think for performance the idea is that wait results are fully processed before waiting again, in which case iterating over a timer, not calling it because of its callback group, but marking it as nullptr in the wait result anyway, would result in it getting skipped when it shouldn't be.
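
To make the scenario concrete, here is a heavily simplified sketch (hypothetical types; the real WaitResult API differs) of why "peek and conditionally clear" keeps a timer visible across multiple passes over the same wait result:

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical simplification of the wait result: a timer stays "ready"
// until it is explicitly cleared.
struct WaitResultSketch
{
  std::vector<bool> timer_ready;

  std::optional<std::size_t> peek_next_ready_timer(std::size_t start) const
  {
    for (std::size_t i = start; i < timer_ready.size(); ++i) {
      if (timer_ready[i]) {return i;}
    }
    return std::nullopt;
  }

  void clear_timer_with_index(std::size_t i) {timer_ready[i] = false;}
};

// Executor-side pass: a timer is only cleared once it is actually handled.
// If its callback group is busy, it is skipped but stays in the result, so
// a later pass over the same wait result can still execute it. An API that
// unconditionally clears on iteration would drop it on the first pass.
inline void process_timers(WaitResultSketch & result, bool group_can_be_taken)
{
  std::size_t index = 0;
  while (auto next = result.peek_next_ready_timer(index)) {
    index = *next;
    if (!group_can_be_taken) {
      ++index;  // keep the timer ready for a later pass
      continue;
    }
    // ... call the timer ...
    result.clear_timer_with_index(index);
  }
}
```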

jmachowinski (Contributor, Author):

I did not re-add this change for now, so we can go ahead with this PR.

Anyway, the current behavior of the WaitResult is inconsistent. Timers are the only objects that get the "do not clear" logic when the callback group is in use. I don't see a real reason for this; perhaps it was done to make some test happy?

All other objects in the WaitResult are cleared on iteration. Therefore I would suggest that we either also clear the timers, or move the callback_group->can_be_taken_from() check inside the WaitResult and not clear if blocked, to get consistent behavior.

wjwwood (Member), Apr 12, 2024:

The difference is that all the other waitable ROS entities, excluding guard conditions, stay ready on subsequent waits. And for guard conditions, if they represent something that needs to stay ready between waits, then that thing needs to re-trigger them, i.e. the PR I made about waitables and keeping guard conditions ready sometimes. That's why timers have a different API here.

This style of iteration is required to avoid having the wait result know anything about the callback group. We considered having an API that took a "can be executed callback" to do dependency injection, but we decided that was unnecessarily complicated and this API was simple by comparison and appeared to be more generic (remember the WaitSet and WaitResult objects were designed to be used directly as well as in the executor).

So while it was indeed done in pursuit of making a flaky test pass, I do think where we landed makes some sense, even if it's not immediately obvious why timers are special. It's possible that we want this style of "peek and clear" API for the other entities too, but I'd need to convince myself of that. This is all related to the decision of whether to wait again after each thing is executed or to instead burn down a wait result before waiting again.

jmachowinski (Contributor, Author), Apr 12, 2024:

> That's why timers have a different API here.

I'm confused about this statement. Timers stay ready on subsequent waits as well, as long as you don't call() them, which would not be done if the callback group is busy.

Review threads:
  • rclcpp/src/rclcpp/executor.cpp (resolved)
  • rclcpp/src/rclcpp/executors/executor_notify_waitable.cpp (outdated, resolved)
  • rclcpp/src/rclcpp/executors/executor_notify_waitable.cpp (outdated, resolved)
  • rclcpp/src/rclcpp/guard_condition.cpp (outdated, resolved)
  • rclcpp/src/rclcpp/executor.cpp (resolved)
Janosch Machowinski and others added 2 commits April 10, 2024 17:39
Signed-off-by: Janosch Machowinski <j.machowinski@cellumation.com>
Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Janosch Machowinski and others added 2 commits April 10, 2024 20:14
Before this change, the rebuild of the wait set would be triggered after the wait set was woken up. With bad timing, this could lead to the rebuild not happening with the multi-threaded executor.

Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Signed-off-by: Janosch Machowinski <j.machowinski@cellumation.com>
Janosch Machowinski and others added 3 commits April 11, 2024 11:22
Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Signed-off-by: Michael Carroll <mjcarroll@intrinsic.ai>
Co-authored-by: Janosch Machowinski <j.machowinski@nospam.org>
Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
wjwwood (Member) left a comment:

This LGTM, but I had some style fixes which I could not push because of permissions, so I opened #2500 instead; either integrate those changes here, or we can close this PR in favor of that one.

jmachowinski (Contributor, Author):

Closed in favor of #2500.

wjwwood (Member), Apr 12, 2024:

On my machine, the test TestAddCallbackGroupsToExecutor/StaticSingleThreadedExecutor.add_callback_groups_after_add_node_to_executor (for all rmw implementations) is timing out.

The following tests FAILED:
	 28 - test_add_callback_groups_to_executor__rmw_cyclonedds_cpp (Timeout)
	 29 - test_add_callback_groups_to_executor__rmw_fastrtps_cpp (Timeout)
	 30 - test_add_callback_groups_to_executor__rmw_fastrtps_dynamic_cpp (Timeout)

I'll run CI just to confirm:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

I wasn't able to see any obvious issue. The CI run is with my changes from #2500, but locally even this PR's changes have that problem.
