Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fixed race condition in action server between is_ready and take"… #2531

Merged
merged 2 commits into from
Jun 27, 2024

Conversation

jmachowinski
Copy link
Contributor

… (#2495)

Some background information: is_ready, take_data and execute data may be called from different threads in any order. The code in the old state expected them to be called in series, without interruption. This lead to multiple race conditions, as the state of the pimpl objects was altered by the three functions in a non thread safe way.

This is a clean backport of #2495. This should superseed #2530

Note, this patch is not tested.

@firesufer can you try out this patch ?

@firesurfer
Copy link

@jmachowinski I just tested it on my local setup and it worked fine so far. I will deploy it on our machine and checkout if it fixes the race condition issue.

@jmachowinski
Copy link
Contributor Author

Hm, we run into a problem with this backport, as #2109 is one of the causes of this bug, and can not be reverted without API break.

As a workaround, I would propose to remove the exception, if take_data is called multiple times, and make execute() handle this case as well. Before going forward with this, I would like to hear you opinions on this @clalancette @wjwwood @mjcarroll

@firesurfer
Copy link

@jmachowinski Apparently this back port also does not fix the issue completely.

1715775009.7834265 [ros2_control_node-14] terminate called after throwing an instance of 'std::runtime_error'
1715775009.7838161 [ros2_control_node-14] what(): ServerBase::take_data() called but no data is ready
1715775009.7845201 [ros2_control_node-14] Stack trace (most recent call last) in thread 2061428:
1715775009.7851567 [ros2_control_node-14] #14 Object "", at 0xffffffffffffffff, in
1715775009.7882259 [ros2_control_node-14] #13 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f8392da484f, in
1715775009.7885315 [ros2_control_node-14] #12 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f8392d12ac2, in
1715775009.7904732 [ros2_control_node-14] #11 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f8392fa5252, in
1715775009.7919304 [ros2_control_node-14] #10 Object "/home/user/workspace/install/rclcpp/lib/librclcpp.so", at 0x7f83932b34d9, in rclcpp::executors::MultiThreadedExecutor::run(unsigned long)
1715775009.7929633 [ros2_control_node-14] #9 Object "/home/user/workspace/install/rclcpp/lib/librclcpp.so", at 0x7f83932a0f8c, in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
1715775009.7959473 [ros2_control_node-14] #8 Object "/home/user/workspace/install/rclcpp/lib/librclcpp.so", at 0x7f8393299066, in rclcpp::Executor::get_next_ready_executable_from_map(rclcpp::AnyExecutable&, std::map<std::weak_ptrrclcpp::CallbackGroup, std::weak_ptrrclcpp::node_interfaces::NodeBaseInterface, std::owner_less<std::weak_ptrrclcpp::CallbackGroup >, std::allocator<std::pair<std::weak_ptrrclcpp::CallbackGroup const, std::weak_ptrrclcpp::node_interfaces::NodeBaseInterface > > > const&)
1715775009.7967491 [ros2_control_node-14] #7 Object "/home/user/workspace/install/rclcpp_action/lib/librclcpp_action.so", at 0x7f8384326569, in rclcpp_action::ServerBase::take_data() [clone .cold]
1715775009.7971637 [ros2_control_node-14] #6 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f8392f774d7, in __cxa_throw
1715775009.7974977 [ros2_control_node-14] #5 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f8392f77276, in std::terminate()
1715775009.7977948 [ros2_control_node-14] #4 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f8392f7720b, in
1715775009.7980995 [ros2_control_node-14] #3 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f8392f6bb9d, in
1715775009.7983689 [ros2_control_node-14] #2 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f8392ca67f2, in abort
1715775009.7984428 [ros2_control_node-14] #1 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f8392cc0475, in raise
1715775009.7987077 [ros2_control_node-14] #0 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f8392d149fc, in pthread_kill
1715775009.7987759 [ros2_control_node-14] Aborted (Signal sent by tkill() 2061064 1000)

@jmachowinski
Copy link
Contributor Author

@jmachowinski Apparently this back port also does not fix the issue completely.

Yes, this is the problem that I was talking about above.

@mjcarroll
Copy link
Member

As a workaround, I would propose to remove the exception, if take_data is called multiple times, and make execute() handle this case as well.

Only for the backport or for rolling and jazzy as well?

@jmachowinski
Copy link
Contributor Author

As a workaround, I would propose to remove the exception, if take_data is called multiple times, and make execute() handle this case as well.

Only for the backport or for rolling and jazzy as well?

Only for backport to Iron, for humble this patch should work as it is right now.

@firesurfer
Copy link

firesurfer commented Jun 18, 2024

Whats the state with this PR? @mjcarroll @clalancette
This bug is an absolute showstopper for using actions in any kind of industrial application and the fix should definitely be back ported from my point of view.

@jmachowinski
Copy link
Contributor Author

@firesurfer I updated the patch, to ignore the double calls to take_data, can you test it ?

@clalancette @mjcarroll @wjwwood As this PR has been stale for some time without any response from you the attendees of the Client Working Group Meeting decided, that we will go for just ignoring the double calls to take_data and go on with this PR.

@firesurfer
Copy link

@jmachowinski Thanks for the patch. I'm rolling it out to our machine right now and will give you feedback this week.

@jmachowinski
Copy link
Contributor Author

@firesurfer one second, I missed a part....

@jmachowinski
Copy link
Contributor Author

Fixed, @firesurfer sorry for the confusion, now the patch should be good to go

@firesurfer
Copy link

firesurfer commented Jun 25, 2024

@jmachowinski I guess the issue is fixed on the server side in rclcpp with this fix. On the client side with rclpy we still get a related error. But as I said my guess is that this is not related wit hthe action server side (ros2control - jtc) in this case.

[WARN] [1719316017.260343581] [.action_client]: Ignoring unexpected goal response. There may be more than one action server for the action '<my_jtc>/follow_joint_trajectory'

EDIT: It seems this is also a race condition but in the action implementation in rclpy:
ros2/rclpy#1123

Janosch Machowinski and others added 2 commits June 25, 2024 14:27
…ros2#2495)

Some background information: is_ready, take_data and execute data
may be called from different threads in any order. The code in the old
state expected them to be called in series, without interruption.
This lead to multiple race conditions, as the state of the pimpl objects
was altered by the three functions in a non thread safe way.

Co-authored-by: William Woodall <william@osrfoundation.org>
Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
This adds a workaround for a known bug in the executor in iron.

Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
@jmachowinski
Copy link
Contributor Author

@alsora @fujitatomoya I guess we are good to go. Can one of you run the CI ?

@alsora
Copy link
Collaborator

alsora commented Jun 25, 2024

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@jmachowinski
Copy link
Contributor Author

@alsora I think you started the CI for the wrong OS / Distro...

Copy link
Collaborator

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rclcpp_action/src/client.cpp Show resolved Hide resolved
rclcpp_action/src/client.cpp Show resolved Hide resolved
@fujitatomoya
Copy link
Collaborator

I believe that this can keep the ABI compatibility but applying ABI compliance checker would be nice.

@alsora
Copy link
Collaborator

alsora commented Jun 26, 2024

The builds failed due to a strange error in rmw_connextdds_common... All the 380 previous packages built fine..
https://ci.ros2.org/job/ci_linux/21319/consoleFull

The gist was created from https://github.com/ros2/ros2/blob/iron/ros2.repos and it seems correct to me.

@jmachowinski
Copy link
Contributor Author

The gist was created from https://github.com/ros2/ros2/blob/iron/ros2.repos and it seems correct to me.

Here (https://ci.ros2.org/job/ci_linux/21319/) it says:
ubuntu_distro: noble,
el_release: 9,
ros_distro: rolling,

Therefore I assumed the CI is off, as ubuntu should be Jammy and ros_disto: Iron...

@alsora
Copy link
Collaborator

alsora commented Jun 26, 2024

New CI, this should be using Ubuntu Jammy and ROS 2 iron as a base

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@jmachowinski
Copy link
Contributor Author

@alsora @fujitatomoya merge ?

@clalancette
Copy link
Contributor

RHEL shouldn't be failing, but it looks like an infrastructure issue. I'll kick it off again.

@fujitatomoya
Copy link
Collaborator

@alsora CI is green, i am okay to merge this. can you merge this with your lgtm?

@alsora
Copy link
Collaborator

alsora commented Jun 27, 2024

Yes, the changes look good to me

@alsora alsora merged commit d588ccb into ros2:iron Jun 27, 2024
2 of 3 checks passed
mauropasse pushed a commit to mauropasse/rclcpp that referenced this pull request Aug 9, 2024
ros2#2531)

* fix: Fixed race condition in action server between is_ready and take" (ros2#2495)

Some background information: is_ready, take_data and execute data
may be called from different threads in any order. The code in the old
state expected them to be called in series, without interruption.
This lead to multiple race conditions, as the state of the pimpl objects
was altered by the three functions in a non thread safe way.

Co-authored-by: William Woodall <william@osrfoundation.org>
Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>

* fix: added workaround for call to double calls to take_data

This adds a workaround for a known bug in the executor in iron.

Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>

---------

Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Co-authored-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Co-authored-by: William Woodall <william@osrfoundation.org>
mauropasse pushed a commit to mauropasse/rclcpp that referenced this pull request Aug 13, 2024
ros2#2531)

* fix: Fixed race condition in action server between is_ready and take" (ros2#2495)

Some background information: is_ready, take_data and execute data
may be called from different threads in any order. The code in the old
state expected them to be called in series, without interruption.
This lead to multiple race conditions, as the state of the pimpl objects
was altered by the three functions in a non thread safe way.

Co-authored-by: William Woodall <william@osrfoundation.org>
Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>

* fix: added workaround for call to double calls to take_data

This adds a workaround for a known bug in the executor in iron.

Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>

---------

Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Co-authored-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Co-authored-by: William Woodall <william@osrfoundation.org>
apojomovsky pushed a commit to irobot-ros/rclcpp that referenced this pull request Aug 13, 2024
* Fixes for intra-process actions (#144)

* Fixes for intra-process Actions

* Fixes for Clang builds

* Fix deadlock

* Server to store results until client requests them

* Fix feedback/result data race

See ros2#2451

* Add missing mutex

* Check return value of intra_process_action_send

---------

Co-authored-by: Mauro Passerino <mpasserino@irobot.com>

* Fix IPC Actions data race (#147)

* Check if goal was sent through IPC before send responses
* Add intra_process_action_server_is_available API to intra-process Client


---------

Co-authored-by: Mauro Passerino <mpasserino@irobot.com>

* Fix data race in Actions: Part 2 (#148)

* Fix data race in Actions: Part 2

* Fix warning - copy elision

---------

Co-authored-by: Mauro Passerino <mpasserino@irobot.com>

* fix: Fixed race condition in action server between is_ready and take"… (ros2#2531)

* fix: Fixed race condition in action server between is_ready and take" (ros2#2495)

Some background information: is_ready, take_data and execute data
may be called from different threads in any order. The code in the old
state expected them to be called in series, without interruption.
This lead to multiple race conditions, as the state of the pimpl objects
was altered by the three functions in a non thread safe way.

Co-authored-by: William Woodall <william@osrfoundation.org>
Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>

* fix: added workaround for call to double calls to take_data

This adds a workaround for a known bug in the executor in iron.

Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>

---------

Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Co-authored-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Co-authored-by: William Woodall <william@osrfoundation.org>

---------

Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Co-authored-by: Mauro Passerino <mpasserino@irobot.com>
Co-authored-by: jmachowinski <jmachowinski@users.noreply.github.com>
Co-authored-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Co-authored-by: William Woodall <william@osrfoundation.org>
@akupferb
Copy link

Only for backport to Iron, for humble this patch should work as it is right now.

@jmachowinski Has this fix already been added as a patch in Humble? I continue to encounter this race bug in Humble on Ubuntu 22.

@jmachowinski
Copy link
Contributor Author

jmachowinski commented Aug 21, 2024

@jmachowinski Has this fix already been added as a patch in Humble? I continue to encounter this race bug in Humble on Ubuntu 22.

Nope, feel free to cherry pick the correct version and open a PR for it.

mauropasse pushed a commit to mauropasse/rclcpp that referenced this pull request Aug 22, 2024
mauropasse pushed a commit to mauropasse/rclcpp that referenced this pull request Aug 23, 2024
mauropasse added a commit to irobot-ros/rclcpp that referenced this pull request Aug 23, 2024
* Always publish inter-process on TRANSIENT_LOCAL pubs (#152)

* Backport upstream Action fixes

 - ros2#2531

* Fix actions feedback race

 - ros2#2451

* Fix data race in Actions: Part 3

---------

Co-authored-by: Mauro Passerino <mpasserino@irobot.com>
edgarcamilocamacho added a commit to Unlimited-Robotics/third_party_rclcpp that referenced this pull request Sep 29, 2024
@edgarcamilocamacho
Copy link

Hello @akupferb , I just backported the fix from iron branch to humble and created the PR #2635

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants