Race condition in ActionClient.*_async functions, resulting in futures that never complete
#1123
Comments
Hi @apockill, I ran into the same issue. I was wondering whether adding locks around the dictionary accesses would be enough. I believe that to synchronize that, we would need an event that is set only after the dictionaries are updated; a lock around the entire call path could also work.
I haven't personally noticed that, but I can't say I've spent much time looking for any race conditions on the server side. What logs have you seen printed when a request gets dropped, and are you sure it's not just a repeat of the client side dropping the ball?
Any thoughts on the best way to resolve this bug? At the moment this affects all users of ActionClients. I can contribute better once there's guidance from maintainers 😁
It seems like the change I attempted didn't fully fix the problem; I still see the following on some compute platforms:
I tried looking deeper into the code, and it still doesn't seem very clean. Another possible solution could be to use queues to pass the dictionaries between threads instead of referencing them.
I don't fully understand how a lock doesn't fix the issue in this case. Are you certain you were running the code with the lock when you observed this issue? 😅 It guarantees the dictionaries are updated, at least in the send_goal_async case, because the execute function can't receive a sequence number until the client has been used in send_goal_async (which is inside a lock). Unless I am misunderstanding this race condition? I have a similar fix on my end (I subclass ActionClient and lock the entirety of send_goal_async and execute) and it has solved the problem for me.
Hello @apockill! Other than the artificial sleep example, are you seeing this race condition happen in practice?
Hi @adityapande-1995! I gave the sleep example just to show how this race condition is purely caused by flaky timing, and luck currently prevents most users from seeing it. A much, much smaller wait can also trigger it. I'm seeing this flakiness happen on a regular basis on my stack. We happen to call actions on a regular basis (maybe 1-3 actions per second, on average). This bug will reveal itself sometimes minutes after startup, other times within a few hours of runtime. What mainly concerns me is that it's purely timing luck that this works at all, and the error message is confusing enough that other users might run into this but not know what happened. After all, the symptom is a confusing log message and a future that will never complete. If we could find a way to protect the relevant Python state tied to sequence_numbers, I believe that would solve the problem.
You are right, a lock does seem to solve the issue. Once I fixed the issue caused by the race condition, I had another unrelated bug in my setup which was misleading me into believing that the issue was not fixed, @adityapande-1995. Adding another viewpoint here: the dictionaries like _pending_goal_requests and _goal_sequence_number_to_goal_id are touched from multiple threads, so updating them needs to be synchronized with the calls into the C++ client handle.
To others who think they might be running into this issue, here's a quick hack you can use that doesn't require building rclpy:

from threading import RLock
from typing import Any

from rclpy.action import ActionClient
from rclpy.task import Future


class PatchRclpyIssue1123(ActionClient):
_lock: RLock = None # type: ignore
@property
def _cpp_client_handle_lock(self) -> RLock:
if self._lock is None:
self._lock = RLock()
return self._lock
async def execute(self, *args: Any, **kwargs: Any) -> None:
with self._cpp_client_handle_lock:
return await super().execute(*args, **kwargs) # type: ignore
def send_goal_async(self, *args: Any, **kwargs: Any) -> Future:
with self._cpp_client_handle_lock:
return super().send_goal_async(*args, **kwargs)
def _cancel_goal_async(self, *args: Any, **kwargs: Any) -> Future:
with self._cpp_client_handle_lock:
return super()._cancel_goal_async(*args, **kwargs)
def _get_result_async(self, *args: Any, **kwargs: Any) -> Future:
with self._cpp_client_handle_lock:
            return super()._get_result_async(*args, **kwargs)
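For reference, a minimal usage sketch of the workaround. It assumes PatchRclpyIssue1123 from the snippet above is defined in (or imported into) the same module, and custom_msgs.action.Home is just a placeholder action type, as in the reproduction example later in the thread:

import rclpy
from rclpy.node import Node

from custom_msgs.action import Home as SomeMsg  # placeholder action type


class MyNode(Node):
    def __init__(self):
        super().__init__("my_node")
        # Drop-in replacement for ActionClient: goal, cancel and result requests
        # now hold the RLock while the C++ client handle is being used.
        self._client = PatchRclpyIssue1123(self, SomeMsg, "some_action")


def main():
    rclpy.init()
    rclpy.spin(MyNode())


if __name__ == "__main__":
    main()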
On another note, @adityapande-1995:

def uses_cpp_client(fn):
def wrapper(self, *args: Any, **kwargs: Any) -> Any:
with self._cpp_client_handle_lock:
            return fn(self, *args, **kwargs)
return wrapper
class ActionClient(Waitable):
...
@uses_cpp_client
def send_goal_async(...):
....
@uses_cpp_client
def _cancel_goal_async(...):
....
@uses_cpp_client
def _get_result_async(...):
...
@async_uses_cpp_client
async def execute(...):
        ...
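async_uses_cpp_client is referenced above but not defined; a sketch of what it could look like, assuming the same _cpp_client_handle_lock property as in the workaround class:

def async_uses_cpp_client(fn):
    async def wrapper(self, *args: Any, **kwargs: Any) -> Any:
        # Same idea as uses_cpp_client, but the wrapped coroutine is awaited while
        # the lock is held, mirroring the execute() override in the workaround above.
        with self._cpp_client_handle_lock:
            return await fn(self, *args, **kwargs)

    return wrapper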
I was looking at the code for service clients, and it also uses a lock to sync the C++ state. I'll submit a PR!
@apockill thanks for being patient on this.
I think that is the correct assumption we have here.
Sorry for the delay in looking at this. I certainly can see how, in theory, we can get into this situation. However, so far I've been unable to replicate the problem as originally stated. Does someone have a relatively simple example that I could use to try to reproduce this?
Hi @clalancette, thank you for looking into this! I've completely forgotten to submit a PR for this. Do you think that the difference might be in the executor? Come to think of it, I'm guessing this would only occur in a MultiThreadedExecutor case, since a SingleThreadedExecutor wouldn't be able to have a race condition like this.
Yeah, that very well may be the case. If there is a simple example I can use to try to reproduce it, I'm certainly happy to take a look and review a PR.
Absolutely! Here's an example that replicated it on my end. You might have to switch out the custom_msgs action import for an action type available on your system:
from time import sleep
import rclpy
from rclpy.task import Future
from rclpy.action import ActionClient
from rclpy.callback_groups import MutuallyExclusiveCallbackGroup
from rclpy.executors import MultiThreadedExecutor
from rclpy.node import Node
from custom_msgs.action import Home as SomeMsg
class RaceyActionClient(ActionClient):
def send_goal_async(self, goal, feedback_callback=None, goal_uuid=None):
"""
Send a goal and asynchronously get the result.
The result of the returned Future is set to a ClientGoalHandle when receipt of the goal
is acknowledged by an action server.
:param goal: The goal request.
:type goal: action_type.Goal
:param feedback_callback: Callback function for feedback associated with the goal.
:type feedback_callback: function
:param goal_uuid: Universally unique identifier for the goal.
If None, then a random UUID is generated.
:type: unique_identifier_msgs.UUID
:return: a Future instance to a goal handle that completes when the goal request
has been accepted or rejected.
:rtype: :class:`rclpy.task.Future` instance
:raises: TypeError if the type of the passed goal isn't an instance of
the Goal type of the provided action when the service was
constructed.
"""
if not isinstance(goal, self._action_type.Goal):
raise TypeError()
request = self._action_type.Impl.SendGoalService.Request()
request.goal_id = (
self._generate_random_uuid() if goal_uuid is None else goal_uuid
)
request.goal = goal
sequence_number = self._client_handle.send_goal_request(request)
sleep(5) # <--- this line causes this race condition nearly 100% of the time
if sequence_number in self._pending_goal_requests:
raise RuntimeError(
"Sequence ({}) conflicts with pending goal request".format(
sequence_number
)
)
if feedback_callback is not None:
# TODO(jacobperron): Move conversion function to a general-use package
goal_uuid = bytes(request.goal_id.uuid)
self._feedback_callbacks[goal_uuid] = feedback_callback
future = Future()
self._pending_goal_requests[sequence_number] = future
self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id
future.add_done_callback(self._remove_pending_goal_request)
# Add future so executor is aware
self.add_future(future)
return future
class SomeActionClient(Node):
def __init__(self):
super().__init__("action_client")
self._action_client = RaceyActionClient(
self,
SomeMsg,
"some_action",
callback_group=MutuallyExclusiveCallbackGroup(),
)
self.call_action_timer = self.create_timer(
timer_period_sec=10,
callback=self.send_goal,
callback_group=MutuallyExclusiveCallbackGroup(),
)
def send_goal(self):
goal_msg = SomeMsg.Goal()
# self._action_client.wait_for_server()
result = self._action_client.send_goal(goal_msg)
print(result)
def main(args=None):
rclpy.init(args=args)
action_client = SomeActionClient()
executor = MultiThreadedExecutor()
rclpy.spin(action_client, executor)
if __name__ == "__main__":
main()
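# --- Corresponding action server, run as a separate script ---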
import rclpy
from rclpy.action import ActionServer
from rclpy.node import Node
from custom_msgs.action import Home as SomeMsg
class SomeActionServer(Node):
def __init__(self):
super().__init__("fibonacci_action_server")
self._action_server = ActionServer(
self, SomeMsg, "some_action", self.execute_callback
)
def execute_callback(self, goal_handle):
self.get_logger().info("Executing goal...")
result = SomeMsg.Result()
return result
def main(args=None):
rclpy.init(args=args)
action_server = SomeActionServer()
rclpy.spin(action_server)
if __name__ == "__main__":
    main()

Replication instructions:
Run the action server, then run the action client. Observe that send_goal never returns, because the underlying future never completes.
Are there any updates on this? At our company, we are also experiencing the same issue.
I am also seeing this when loading the system. Using the same callback group (Reentrant) did not solve the issue. I will try the other suggested workaround. Is this supposed to be fixed in current Humble, and if not, what is the progress? This is a showstopper for the application I have.
Still experiencing this issue on the current Humble release. In my case, I encountered it in the NavigateToPose action from nav2 when calling the action sequentially for several poses.
Looks like the issue still exists in Iron. @clalancette, if there is anything that can be done from the user side to help fix this issue, please let us know! The rclcpp side, by the way, seems to be finally fixed by ros2/rclcpp#2531 (at least in Iron).
+1 we're also encountering this error. We're on Humble.
Also still encountering this error on Humble... |
Bug report
Hi all, thank you for maintaining this wonderful library!
I've been encountering an intermittent bug that has left me scratching my head. Unfortunately, it recently started occurring more often, so I sat down and spent some time tracking down what was going on. Now I've got a good idea of the problem, but I need some more expert eyes to decide on a good solution. I'll lay out all my findings and some perceived paths forward in this bug report.
Required Info:
Steps to reproduce issue
There are multiple places where this race condition manifests itself. I will demonstrate how to reproduce it when calling send_goal_async. Take the ActionClient.send_goal_async method and add a sleep(#) directly below the line sequence_number = self._client_handle.send_goal_request(request). For example:
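(sketch of the relevant lines inside send_goal_async with the added delay; the complete modified method appears in the RaceyActionClient example in the comments above)

    request.goal = goal
    sequence_number = self._client_handle.send_goal_request(request)
    sleep(5)  # <-- the added delay; with it in place the race occurs nearly every time
    ...
    self._pending_goal_requests[sequence_number] = future
    self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id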
Expected behavior
I would expect 'send_goal_async' to run a bit slower than usual, then return a future. That future might complete quite quickly (or already be complete).
Actual behavior
The future never completes, and I see the following line in the logs
Additional information
By my sleuthing, it appears that this error is caused by the following important lines being called:
client:Thread1
1. send_goal_async is called, and within it:
2. sequence_number = self._client_handle.send_goal_request(request) runs
3. self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id runs

client:Thread2
1. is_ready(...) reports the goal response as available
2. execute(...) is called
3. if sequence_number in self._goal_sequence_number_to_goal_id: evaluates to False, because client:Thread1 has not reached its step 3
4. The taken_data is ignored, and thus the future will never have set_result called.

server:Thread1
Cause Summary
There is a race condition when sending goal requests, because the goal request can complete before the sequence_number is tracked in the self._goal_sequence_number_to_goal_id dictionary. Thus execute is called before it is tracked, and the taken_data is ignored, assumed to be a response from a duplicate action server.
I believe this same race condition affects all usages of the client API, such as _cancel_goal_async or _get_result_async.
Recommendations
Add a lock around usages of self._client_handle so that the Python-side objects can be edited and kept in sync. That way, if execute is called, it will block until the self._client_handle isn't being used, and the _goal_sequence_number_to_goal_id will be filled in. Alternatively, a change could be made inside execute.
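To make the interleaving and the lock-based recommendation concrete, here is a small self-contained toy model (no ROS involved; names such as fake_send_goal_request and goal_ids_by_sequence are invented for illustration, and handle_lock is the hypothetical lock proposed above). With the lock held across the send and the dictionary update, the consuming thread cannot observe the response before the dictionary is filled in; remove the lock in send_goal_async and the unknown-sequence-number branch is hit almost every time:

import queue
import threading
import time

# Toy stand-ins for the real objects (illustration only; not rclpy code).
goal_ids_by_sequence = {}        # plays the role of _goal_sequence_number_to_goal_id
responses = queue.Queue()        # plays the role of taken_data arriving from the C++ layer
handle_lock = threading.Lock()   # the proposed lock around handle use + bookkeeping


def fake_send_goal_request():
    # The "server" acknowledges instantly, so the response is available immediately.
    sequence_number = 1
    responses.put(sequence_number)
    return sequence_number


def send_goal_async():
    with handle_lock:  # without this lock, the race below occurs
        sequence_number = fake_send_goal_request()
        time.sleep(0.5)  # models the sleep(5) used in the reproduction above
        goal_ids_by_sequence[sequence_number] = "goal-id"


def execute():
    sequence_number = responses.get()  # response arrives while send_goal_async sleeps
    with handle_lock:  # blocks until send_goal_async has updated the dictionary
        if sequence_number in goal_ids_by_sequence:
            print("goal response handled; the future would get set_result")
        else:
            print("response dropped: unknown sequence number (the bug's symptom)")


t1 = threading.Thread(target=send_goal_async)
t2 = threading.Thread(target=execute)
t1.start()
t2.start()
t1.join()
t2.join()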