Fixes Action.*_async futures never complete #1308

jmblixt3 · 2024-06-27T22:43:57Z

Replaces #1125, since the original contributor went inactive.

If two separate client/server actions are running in separate executors, the future given to the ActionClient will never complete due to a race condition on the rcl handles.
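
The window behind this race can be illustrated with a toy model that does not use rclpy at all. In the sketch below, `ToyClient`, `FakeFuture`, and `completes` are hypothetical names, and the 0.2 s wait only widens the window the way the `sleep` calls in the example client do. It assumes the failure mode is a response being handled before its future has been recorded, which is a simplification of what happens around the rcl handles.

```python
import contextlib
import threading


class FakeFuture:
    """Stand-in for a future; it only tracks whether a result arrived."""

    def __init__(self):
        self._done = threading.Event()

    def set_result(self):
        self._done.set()

    def wait(self, timeout):
        return self._done.wait(timeout)


class ToyClient:
    """Hypothetical client: send a request, then record its pending future."""

    def __init__(self, use_lock):
        self._pending = {}
        self._next_seq = 0
        self._handled = threading.Event()
        # nullcontext() is a no-op context manager, i.e. "no locking".
        self._lock = threading.Lock() if use_lock else contextlib.nullcontext()
        self.dropped = 0

    def _send(self):
        self._next_seq += 1
        seq = self._next_seq
        # The "server" answers immediately on another thread.
        threading.Thread(target=self._on_response, args=(seq,)).start()
        return seq

    def _on_response(self, seq):
        with self._lock:
            future = self._pending.pop(seq, None)
            if future is None:
                self.dropped += 1  # response arrived before it was expected
            else:
                future.set_result()
        self._handled.set()

    def request_async(self):
        future = FakeFuture()
        with self._lock:
            seq = self._send()
            # Widen the race window so the response can arrive first.
            self._handled.wait(timeout=0.2)
            self._pending[seq] = future
        return future


def completes(use_lock):
    """Return True if the future of a single request ever completes."""
    return ToyClient(use_lock).request_async().wait(timeout=0.5)


print(completes(use_lock=True))   # True
print(completes(use_lock=False))  # False
```

Without the lock, the response is looked up before the future is stored, so it is dropped and the future stays pending. With the lock, the response handler has to wait until the request side has finished its bookkeeping.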

Tested using this from @apockill, which was adapted to match rolling, then tested with and without locks.

To reproduce the initial issue, remove the lock/sleep in the RaceyActionClient.

Adapted example code here

client.py

import rclpy
from rclpy import Future
from rclpy.action import ActionClient
from rclpy.action.client import ClientGoalHandle
from rclpy.callback_groups import MutuallyExclusiveCallbackGroup
from rclpy.executors import MultiThreadedExecutor
from rclpy.node import Node

from test_msgs.action import Fibonacci as SomeMsg
from time import sleep

class RaceyActionClient(ActionClient):
    def _get_result_async(self, goal_handle):
        """
        Request the result for an active goal asynchronously.

        :param goal_handle: Handle to the goal to get the result for.
        :type goal_handle: :class:`ClientGoalHandle`
        :return: a Future instance that completes when the get result request has been processed.
        :rtype: :class:`rclpy.task.Future` instance
        """
        if not isinstance(goal_handle, ClientGoalHandle):
            raise TypeError(
                'Expected type ClientGoalHandle but received {}'.format(type(goal_handle)))

        result_request = self._action_type.Impl.GetResultService.Request()
        result_request.goal_id = goal_handle.goal_id
        future = Future()
        with self._lock:
            sleep(1)
            sequence_number = self._client_handle.send_result_request(result_request)
            if sequence_number in self._pending_result_requests:
                raise RuntimeError(
                    'Sequence ({}) conflicts with pending result request'.format(sequence_number))
            sleep(1)
            self._pending_result_requests[sequence_number] = future
            sleep(1)
            self._result_sequence_number_to_goal_id[sequence_number] = result_request.goal_id
            sleep(1)
            future.add_done_callback(self._remove_pending_result_request)
            sleep(1)
            # Add future so executor is aware
            self.add_future(future)

        return future

class SomeActionClient(Node):
    def __init__(self):
        super().__init__("action_client")
        self._action_client = RaceyActionClient(
            self,
            SomeMsg,
            "some_action",
            callback_group=MutuallyExclusiveCallbackGroup(),
        )
        self.call_action_timer = self.create_timer(
            timer_period_sec=10,
            callback=self.send_goal,
            callback_group=MutuallyExclusiveCallbackGroup(),
        )

    def send_goal(self):
        goal_msg = SomeMsg.Goal()
        # self._action_client.wait_for_server()
        result = self._action_client.send_goal(goal_msg)
        print(result)


def main(args=None):
    rclpy.init(args=args)

    action_client = SomeActionClient()

    executor = MultiThreadedExecutor()
    rclpy.spin(action_client, executor)

if __name__ == "__main__":
    main()

server.py

import rclpy
from rclpy.action import ActionServer
from rclpy.node import Node

from test_msgs.action import Fibonacci as SomeMsg


class SomeActionServer(Node):
    def __init__(self):
        super().__init__("fibonacci_action_server")
        self._action_server = ActionServer(
            self, SomeMsg, "some_action", self.execute_callback
        )

    def execute_callback(self, goal_handle):
        self.get_logger().info("Executing goal...")
        result = SomeMsg.Result()
        goal_handle.succeed()
        return result


def main(args=None):
    rclpy.init(args=args)

    action_server = SomeActionServer()
    rclpy.spin(action_server)


if __name__ == "__main__":
    main()

Before and after log results

Before
client output

[WARN] [1720739707.125040450] [action_client.action_client]: Ignoring unexpected result response. There may be more than one action server for the action 'some_action'
^C

server output

[INFO] [1720739706.054013974] [fibonacci_action_server]: Executing goal...
^C

After
server output

[INFO] [1720739524.873096289] [fibonacci_action_server]: Executing goal...
[INFO] [1720739534.791491183] [fibonacci_action_server]: Executing goal...
[INFO] [1720739544.789914632] [fibonacci_action_server]: Executing goal...
[INFO] [1720739554.791130534] [fibonacci_action_server]: Executing goal...
[INFO] [1720739564.791638162] [fibonacci_action_server]: Executing goal...
[INFO] [1720739574.793290290] [fibonacci_action_server]: Executing goal...
[INFO] [1720739584.790164903] [fibonacci_action_server]: Executing goal...
[INFO] [1720739594.794427458] [fibonacci_action_server]: Executing goal...
^C

client output

test_msgs.action.Fibonacci_GetResult_Response(status=4, result=test_msgs.action.Fibonacci_Result(sequence=[]))
test_msgs.action.Fibonacci_GetResult_Response(status=4, result=test_msgs.action.Fibonacci_Result(sequence=[]))
test_msgs.action.Fibonacci_GetResult_Response(status=4, result=test_msgs.action.Fibonacci_Result(sequence=[]))
test_msgs.action.Fibonacci_GetResult_Response(status=4, result=test_msgs.action.Fibonacci_Result(sequence=[]))
test_msgs.action.Fibonacci_GetResult_Response(status=4, result=test_msgs.action.Fibonacci_Result(sequence=[]))
test_msgs.action.Fibonacci_GetResult_Response(status=4, result=test_msgs.action.Fibonacci_Result(sequence=[]))
test_msgs.action.Fibonacci_GetResult_Response(status=4, result=test_msgs.action.Fibonacci_Result(sequence=[]))
test_msgs.action.Fibonacci_GetResult_Response(status=4, result=test_msgs.action.Fibonacci_Result(sequence=[]))
^C

I also initially tried to add locks to just the action client, but I was getting test failures in test_action_graph on the get_names_and_types function, so I added them to the action server as well.

fujitatomoya

i would add @aditya2592 as co-author, since this borrows the code from #1125.

jmblixt3 · 2024-06-29T01:44:43Z

i would add @aditya2592 as co-author, since this borrows the code from #1125.

Done

jmblixt3 · 2024-07-02T00:15:23Z

Throwing this back to draft because it still doesn't fix all conditions where this breaks. If others have ideas on how to more reliably reproduce this that would be appreciated.

apockill · 2024-07-03T16:59:50Z

Throwing this back to draft because it still doesn't fix all conditions where this breaks. If others have ideas on how to more reliably reproduce this that would be appreciated.

What kinds of failures are you seeing still? This is the "fix" I've been using in production, and at least with our use cases of services we haven't seen failures since:

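from threading import RLock
from typing import Any

from rclpy.task import Future

# RobustActionClient is not defined in this snippet; presumably it is an
# ActionClient subclass from the poster's own codebase.
class PatchRclpyIssue1123(RobustActionClient):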

    _lock: RLock = None  # type: ignore

    @property
    def _cpp_client_handle_lock(self) -> RLock:
        if self._lock is None:
            self._lock = RLock()
        return self._lock

    async def execute(self, *args: Any, **kwargs: Any) -> None:
        # This is ugly- holding on to a lock in an async environment feels gross
        with self._cpp_client_handle_lock:
            return await super().execute(*args, **kwargs)  # type: ignore

    def send_goal_async(self, *args: Any, **kwargs: Any) -> Future:
        with self._cpp_client_handle_lock:
            return super().send_goal_async(*args, **kwargs)

    def _cancel_goal_async(self, *args: Any, **kwargs: Any) -> Future:
        with self._cpp_client_handle_lock:
            return super()._cancel_goal_async(*args, **kwargs)

    def _get_result_async(self, *args: Any, **kwargs: Any) -> Future:
        with self._cpp_client_handle_lock:
            return super()._get_result_async(*args, **kwargs)

I'm not sure if these matter, but here are the differences I can see between the fix above and this PR:

In my implementation I lock the async execute, since multithreaded executors still run async tasks "concurrently", leading to possible race conditions with the client handle.
I use an RLock rather than a Lock. Not sure if this matters in this case, but I figured I'd note it.
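
For reference on the Lock vs. RLock point, the sketch below shows the one behavioral difference that could matter when a locked method calls another locked method on the same thread. `can_reacquire` is a hypothetical helper, and whether the action client actually nests locked calls this way is not established here.

```python
import threading


def can_reacquire(lock):
    """Return True if the thread already holding `lock` can take it again."""
    with lock:
        # Non-blocking, so a plain Lock reports failure instead of deadlocking.
        got_it = lock.acquire(blocking=False)
        if got_it:
            lock.release()
        return got_it


print(can_reacquire(threading.Lock()))   # False
print(can_reacquire(threading.RLock()))  # True
```

With a blocking acquire, the plain Lock case would hang forever instead of returning False.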

jmblixt3 · 2024-07-03T17:13:16Z

I was defining a Python action client/server pair in two separate terminals, and the custom server also contained a Python client to another C++ nav2 action. After calling the custom python action server once it would work correctly the first time, but after wasn't even accepting the goal request to any subsequent calls. If I removed the nav2 action client from my custom action server, it wouldn't deadlock. I suspect a lock wasn't being released correctly with the call to another action client inside the action server. I'd like to look into this further to see whether the issue persists with either locking the entire execute function, as you've suggested, or using RLocks.

apockill · 2024-07-03T17:26:57Z

After calling the custom python action server once it would work correctly the first time, but after wasn't even accepting the goal request to any subsequent calls.

Ahh understood. I would hazard a guess that an RLock would work in this situation.

jmblixt3 · 2024-07-03T17:30:56Z

I tried to recreate an example outside of my environment, but have been unsuccessful

jmblixt3 · 2024-07-11T22:47:46Z

I tried to recreate an example outside of my environment, but have been unsuccessful

I was able to recreate this issue outside my environment now (#1313). It looks like there is an issue with an rclcpp server and an rclpy action client nested inside an rclpy action server, so that seems unrelated. I'll pull this back out of draft.

fujitatomoya

lgtm with green CI

fujitatomoya · 2024-07-21T13:05:43Z

@apockill if you are willing to do 2nd review on this, that would be really appreciated.

firesurfer · 2024-07-24T12:54:11Z

For whoever wants to test the patch on iron: I cherry-picked it onto iron in this fork:
https://github.com/firesurfer/rclpy/tree/iron

Btw. from our tests today it seems to work fine on iron with our setup.

apockill

This looks like it would fix the issue. My only commentary is that this feels brittle, and future changes might break it easily.

Even now I have a hard time knowing which dicts need strict thread safety. For example, _remove_pending_request, _remove_pending_goal_request, _remove_pending_result_request all use dictionaries that are holding state that can be edited in other callbacks.

I wonder if a future refactor could make this harder to mess up. Off the cuff, we could wrap up all the state that ties the C++ self._client to its Python stateful dictionaries into a single object, and protect that object. I'm just brainstorming here, but something a la:

class _ProtectedClientState:
    """Protect state between the C++ client and python that must be thread-safe"""
    
    def __init__(self):
        self.lock = RLock()
        
        # key: UUID in bytes, value: weak reference to ClientGoalHandle
        self.goal_handles = {}
        # key: goal request sequence_number, value: Future for goal response
        self.pending_goal_requests = {}
        # key: goal request sequence_number, value: UUID
        self.goal_sequence_number_to_goal_id = {}
        # key: cancel request sequence number, value: Future for cancel response
        self.pending_cancel_requests = {}
        # key: result request sequence number, value: Future for result response
        self.pending_result_requests = {}
        # key: result request sequence_number, value: UUID
        self.result_sequence_number_to_goal_id = {}
        # key: UUID in bytes, value: callback function
        self.feedback_callbacks = {}
        
    def __getattribute__(...):
       if not self.lock.acquired():
           raise HeyThatsBad("This state is meant to be protected while using these!"):
       return the requested attribute
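
A runnable variant of the brainstorm above might look like the following. This is only a sketch under stated assumptions: `ProtectedState` and `UnprotectedAccessError` are hypothetical names, ownership is tracked by hand because `RLock` has no public "held by this thread" check, and the guard only covers attribute lookup, so a dictionary reference obtained inside the lock can still leak out and be misused.

```python
import threading


class UnprotectedAccessError(RuntimeError):
    """Raised when guarded state is read without holding the lock."""


class ProtectedState:
    """Bundle of fields that may only be accessed inside `with state:`."""

    def __init__(self, **fields):
        self._lock = threading.RLock()
        self._owner = None  # id of the thread currently holding the lock
        self._depth = 0     # re-entrant acquisition count
        self._fields = dict(fields)

    def __enter__(self):
        self._lock.acquire()
        self._owner = threading.get_ident()
        self._depth += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._depth -= 1
        if self._depth == 0:
            self._owner = None
        self._lock.release()
        return False

    def __getattr__(self, name):
        # Only called for names not found normally, i.e. the guarded fields.
        fields = self.__dict__.get('_fields', {})
        if name not in fields:
            raise AttributeError(name)
        if self.__dict__.get('_owner') != threading.get_ident():
            raise UnprotectedAccessError(
                f"'{name}' must be accessed while holding the state lock")
        return fields[name]


state = ProtectedState(pending_result_requests={},
                       result_sequence_number_to_goal_id={})

with state:
    state.pending_result_requests[1] = 'future placeholder'

try:
    state.pending_result_requests
except UnprotectedAccessError as error:
    print(error)
```

The last access happens outside the `with` block, so it raises instead of silently racing with another thread.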

jmblixt3 · 2024-08-04T15:46:28Z

Just bumping this, to see if other maintainers can review/merge this

fujitatomoya · 2024-08-04T16:29:40Z

@jmblixt3 can you rebase this on rolling? and i will start the CI.

jmblixt3 · 2024-08-04T17:20:15Z

@fujitatomoya Rebased

fujitatomoya · 2024-08-05T15:53:47Z

CI:

Linux
Linux-aarch64
Linux-rhel
Windows

jmblixt3 · 2024-08-13T20:01:34Z

Anything else needed from me?

fujitatomoya · 2024-08-18T21:59:47Z

@sloretz @clalancette @adityapande-1995 can either of you take a look at this?

GinesLopezz · 2024-08-22T12:48:37Z

I'm experiencing the same issue on ROS 2 Humble. Could you backport this fix to humble? Thank you!

mqcmd196 · 2024-08-23T02:17:27Z

I'm experiencing the same issue on ROS 2 Humble. Could you backport this fix to humble? Thank you!

+1

leander2189 · 2024-09-05T06:44:22Z

I am also having this issue with Humble. Is there a timeframe for when the fix will be merged? Thanks!

Per rclpy:1123, if two separate client/server actions are running in separate executors, the future given to the ActionClient will never complete due to a race condition.

This fixes the calls to rcl handles potentially leading to deadlock scenarios by adding locks to their references.

Co-authored-by: Aditya Agarwal <aditya.kgp25@gmail.com>
Co-authored-by: Jonathan Blixt <jmblixt3@gmail.com>
Signed-off-by: Jonathan Blixt <jmblixt3@gmail.com>

jmblixt3 · 2024-09-15T15:55:25Z

Just rebased this again. Are there any issues that still need to be resolved?

fujitatomoya · 2024-09-16T05:34:24Z

@jmblixt3 no, we are just waiting for the 2nd review.

@clalancette @ahcorde @sloretz friendly ping.

jmblixt3 requested review from sloretz and adityapande-1995 as code owners June 27, 2024 22:43

fujitatomoya mentioned this pull request Jun 28, 2024

draft: avoid race condition in action client #1125

Closed

fujitatomoya requested changes Jun 28, 2024

jmblixt3 force-pushed the rolling branch from cdd7018 to 562f1d4 Compare June 29, 2024 01:44

jmblixt3 marked this pull request as draft June 29, 2024 02:49

jmblixt3 force-pushed the rolling branch from ac7f409 to 74ee427 Compare June 30, 2024 18:07

jmblixt3 marked this pull request as ready for review June 30, 2024 18:09

jmblixt3 marked this pull request as draft July 2, 2024 00:13

apockill reviewed Jul 2, 2024

jmblixt3 force-pushed the rolling branch from 74ee427 to bc40252 Compare July 3, 2024 16:40

apockill reviewed Jul 3, 2024

jmblixt3 marked this pull request as ready for review July 11, 2024 22:48

jmblixt3 requested review from fujitatomoya and apockill July 11, 2024 22:48

cottsay assigned fujitatomoya Jul 18, 2024

jmblixt3 force-pushed the rolling branch from bc40252 to dbee5ac Compare July 18, 2024 23:28

stelmik reviewed Jul 21, 2024

fujitatomoya approved these changes Jul 21, 2024

apockill approved these changes Jul 24, 2024

jmblixt3 force-pushed the rolling branch from dbee5ac to e9495e0 Compare August 4, 2024 17:15

jmblixt3 force-pushed the rolling branch from e9495e0 to 5d6ffd0 Compare August 13, 2024 20:43

jmblixt3 mentioned this pull request Aug 30, 2024

High-Latency networks: "generator already executing" when calling a service #1351

Open

jmblixt3 force-pushed the rolling branch from 5d6ffd0 to f901392 Compare September 15, 2024 15:54
