Race condition in ActionClient.*_async functions, resulting in futures that never complete
#1123
Comments
Hi @apockill, I ran into the same issue. I was wondering whether adding locks around the dictionary accesses would be enough. I believe that to synchronize that, we would need an event that is set only after the dictionaries are updated; a lock around the entire call path could also work.
I haven't personally noticed that, but I can't say I've spent much time looking for any race conditions on the server side. What logs have you seen printed when a request gets dropped, and are you sure it's not just a repeat of the client side dropping the ball?
Any thoughts on the best way to resolve this bug? At the moment this affects all users of ActionClients. I can contribute better once there's guidance from maintainers 😁
It seems like the change I attempted didn't fully fix the problem; I still see the following on some compute platforms:
I tried looking deeper into the code, and it still doesn't seem very clean. Another possible solution could be to use queues to pass the dictionaries between threads instead of referencing them.
I don't fully understand how a lock doesn't fix the issue in this case. Are you certain you were running the code with the lock when you observed this issue? 😅 It guarantees the dictionaries are updated, at least in the send_goal_async case, because the execute function can't receive a sequence number until the client has been used in send_goal_async (which is inside a lock). Unless I am misunderstanding this race condition? I have a similar fix on my end (I subclass ActionClient and lock the entirety of send_goal_async and execute) and it has solved the problem for me.
Hello @apockill! Other than the artificial sleep example, are you seeing this race condition happen in practice?
Hi @adityapande-1995! I gave the sleep example just to show how this race condition is purely caused by flaky timing, and luck currently prevents most users from seeing it. A much, much smaller wait can also trigger it. I'm seeing this flakiness happen on a regular basis on my stack. We happen to call actions on a regular basis (maybe 1-3 actions per second, on average). This bug will reveal itself sometimes minutes after startup, other times within a few hours of runtime. What mainly concerns me is that it's purely timing luck that this works at all, and the error message is confusing enough that other users might run into this but not know what happened. After all, the symptom is a confusing log message and a future that will never complete. If we could find a way to protect the relevant Python state tied to sequence_numbers, I believe that would solve the problem.
You are right, a lock does seem to solve the issue. Once I fixed the issue caused by the race condition, I had another unrelated bug in my setup which was misleading me into believing that the issue was not fixed, @adityapande-1995. Adding another viewpoint here: the dictionaries like _pending_goal_requests and _goal_sequence_number_to_goal_id are touched from multiple threads, so updating them needs to be synchronized with the calls into the C++ client handle.
To others who think they might be running into this issue, here's a quick hack you can use that doesn't require building rclpy:

from threading import RLock
from typing import Any

from rclpy.action import ActionClient
from rclpy.task import Future


class PatchRclpyIssue1123(ActionClient):
_lock: RLock = None # type: ignore
@property
def _cpp_client_handle_lock(self) -> RLock:
if self._lock is None:
self._lock = RLock()
return self._lock
async def execute(self, *args: Any, **kwargs: Any) -> None:
with self._cpp_client_handle_lock:
return await super().execute(*args, **kwargs) # type: ignore
def send_goal_async(self, *args: Any, **kwargs: Any) -> Future:
with self._cpp_client_handle_lock:
return super().send_goal_async(*args, **kwargs)
def _cancel_goal_async(self, *args: Any, **kwargs: Any) -> Future:
with self._cpp_client_handle_lock:
return super()._cancel_goal_async(*args, **kwargs)
def _get_result_async(self, *args: Any, **kwargs: Any) -> Future:
with self._cpp_client_handle_lock:
            return super()._get_result_async(*args, **kwargs)
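For reference, a minimal usage sketch of the workaround. It assumes PatchRclpyIssue1123 from the snippet above is defined in (or imported into) the same module, and custom_msgs.action.Home is just a placeholder action type, as in the reproduction example later in the thread:

import rclpy
from rclpy.node import Node

from custom_msgs.action import Home as SomeMsg  # placeholder action type


class MyNode(Node):
    def __init__(self):
        super().__init__("my_node")
        # Drop-in replacement for ActionClient: goal, cancel and result requests
        # now hold the RLock while the C++ client handle is being used.
        self._client = PatchRclpyIssue1123(self, SomeMsg, "some_action")


def main():
    rclpy.init()
    rclpy.spin(MyNode())


if __name__ == "__main__":
    main()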
On another note, @adityapande-1995:

def uses_cpp_client(fn):
def wrapper(self, *args: Any, **kwargs: Any) -> Any:
with self._cpp_client_handle_lock:
            return fn(self, *args, **kwargs)
return wrapper
class ActionClient(Waitable):
...
@uses_cpp_client
def send_goal_async(...):
....
@uses_cpp_client
def _cancel_goal_async(...):
....
@uses_cpp_client
def _get_result_async(...):
...
@async_uses_cpp_client
async def execute(...):
        ...
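async_uses_cpp_client is referenced above but not defined; a sketch of what it could look like, assuming the same _cpp_client_handle_lock property as in the workaround class:

def async_uses_cpp_client(fn):
    async def wrapper(self, *args: Any, **kwargs: Any) -> Any:
        # Same idea as uses_cpp_client, but the wrapped coroutine is awaited while
        # the lock is held, mirroring the execute() override in the workaround above.
        with self._cpp_client_handle_lock:
            return await fn(self, *args, **kwargs)

    return wrapper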
I was looking at the code for service clients, and it also uses a lock to sync the C++ state. I'll submit a PR!
@apockill thanks for being patient on this.
I think that is the correct assumption we have here.
Sorry for the delay in looking at this. I certainly can see how, in theory, we can get into this situation. However, so far I've been unable to replicate the problem as originally stated. Does someone have a relatively simple example that I could use to try to reproduce this?
Hi @clalancette, thank you for looking into this! I've completely forgotten to submit a PR for this. Do you think that the difference might be in the executor? Come to think of it, I'm guessing this would only occur in a MultiThreadedExecutor case, since a SingleThreadedExecutor wouldn't be able to have a race condition like this.
Yeah, that very well may be the case. If there is a simple example I can use to try to reproduce it, I'm certainly happy to take a look and review a PR.
Absolutely! Here's an example that replicated it on my end. You might have to switch out the custom_msgs action import for an action type available on your system:
from time import sleep
import rclpy
from rclpy.task import Future
from rclpy.action import ActionClient
from rclpy.callback_groups import MutuallyExclusiveCallbackGroup
from rclpy.executors import MultiThreadedExecutor
from rclpy.node import Node
from custom_msgs.action import Home as SomeMsg
class RaceyActionClient(ActionClient):
def send_goal_async(self, goal, feedback_callback=None, goal_uuid=None):
"""
Send a goal and asynchronously get the result.
The result of the returned Future is set to a ClientGoalHandle when receipt of the goal
is acknowledged by an action server.
:param goal: The goal request.
:type goal: action_type.Goal
:param feedback_callback: Callback function for feedback associated with the goal.
:type feedback_callback: function
:param goal_uuid: Universally unique identifier for the goal.
If None, then a random UUID is generated.
:type: unique_identifier_msgs.UUID
:return: a Future instance to a goal handle that completes when the goal request
has been accepted or rejected.
:rtype: :class:`rclpy.task.Future` instance
:raises: TypeError if the type of the passed goal isn't an instance of
the Goal type of the provided action when the service was
constructed.
"""
if not isinstance(goal, self._action_type.Goal):
raise TypeError()
request = self._action_type.Impl.SendGoalService.Request()
request.goal_id = (
self._generate_random_uuid() if goal_uuid is None else goal_uuid
)
request.goal = goal
sequence_number = self._client_handle.send_goal_request(request)
sleep(5) # <--- this line causes this race condition nearly 100% of the time
if sequence_number in self._pending_goal_requests:
raise RuntimeError(
"Sequence ({}) conflicts with pending goal request".format(
sequence_number
)
)
if feedback_callback is not None:
# TODO(jacobperron): Move conversion function to a general-use package
goal_uuid = bytes(request.goal_id.uuid)
self._feedback_callbacks[goal_uuid] = feedback_callback
future = Future()
self._pending_goal_requests[sequence_number] = future
self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id
future.add_done_callback(self._remove_pending_goal_request)
# Add future so executor is aware
self.add_future(future)
return future
class SomeActionClient(Node):
def __init__(self):
super().__init__("action_client")
self._action_client = RaceyActionClient(
self,
SomeMsg,
"some_action",
callback_group=MutuallyExclusiveCallbackGroup(),
)
self.call_action_timer = self.create_timer(
timer_period_sec=10,
callback=self.send_goal,
callback_group=MutuallyExclusiveCallbackGroup(),
)
def send_goal(self):
goal_msg = SomeMsg.Goal()
# self._action_client.wait_for_server()
result = self._action_client.send_goal(goal_msg)
print(result)
def main(args=None):
rclpy.init(args=args)
action_client = SomeActionClient()
executor = MultiThreadedExecutor()
rclpy.spin(action_client, executor)
if __name__ == "__main__":
main()
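# --- Corresponding action server, run as a separate script ---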
import rclpy
from rclpy.action import ActionServer
from rclpy.node import Node
from custom_msgs.action import Home as SomeMsg
class SomeActionServer(Node):
def __init__(self):
super().__init__("fibonacci_action_server")
self._action_server = ActionServer(
self, SomeMsg, "some_action", self.execute_callback
)
def execute_callback(self, goal_handle):
self.get_logger().info("Executing goal...")
result = SomeMsg.Result()
return result
def main(args=None):
rclpy.init(args=args)
action_server = SomeActionServer()
rclpy.spin(action_server)
if __name__ == "__main__":
    main()

Replication instructions:
Run the action server, then run the action client. Observe that send_goal never returns, because the underlying future never completes.
Are there any updates on this? At our company, we are also experiencing the same issue.
I am also seeing this when loading the system. Using the same callback group (Reentrant) did not solve the issue. I will try the other suggested workaround. Is this supposed to be fixed in current Humble, and if not, what is the progress? This is a showstopper for the application I have.
Still experiencing this issue on the current Humble release. In my case, I encountered it in the NavigateToPose action from nav2 when calling the action sequentially for several poses.
Looks like the issue still exists in Iron. @clalancette, if there is anything that can be done from the user side to help fix this issue, please let us know! The rclcpp side, by the way, seems to be finally fixed by ros2/rclcpp#2531 (at least in Iron).
+1 we're also encountering this error. We're on Humble.
Also still encountering this error on Humble... |
Bug report
Hi all, thank you for maintaining this wonderful library!
I've been encountering an intermittent bug that has left me scratching my head. Unfortunately, it recently started occurring more often, so I sat down and spent some time tracking down what was going on. Now I've got a good idea of the problem, but I need some more expert eyes to decide on a good solution. I'll lay out all my findings and some perceived paths forward in this bug report.
Required Info:
Steps to reproduce issue
There are multiple places where this race condition manifests itself. I will demonstrate how to reproduce it when calling send_goal_async. Take the ActionClient.send_goal_async method and add a sleep(#) directly below the line sequence_number = self._client_handle.send_goal_request(request). For example:
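(sketch of the relevant lines inside send_goal_async with the added delay; the complete modified method appears in the RaceyActionClient example in the comments above)

    request.goal = goal
    sequence_number = self._client_handle.send_goal_request(request)
    sleep(5)  # <-- the added delay; with it in place the race occurs nearly every time
    ...
    self._pending_goal_requests[sequence_number] = future
    self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id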
Expected behavior
I would expect 'send_goal_async' to run a bit slower than usual, then return a future. That future might complete quite quickly (or already be complete).
Actual behavior
The future never completes, and I see the following line in the logs
Additional information
By my sleuthing, it appears that this error is caused by the following important lines being called:
client:Thread1
1. send_goal_async is called, and within it:
2. sequence_number = self._client_handle.send_goal_request(request) runs
3. self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id runs

client:Thread2
1. is_ready(...) reports the goal response as available
2. execute(...) is called
3. if sequence_number in self._goal_sequence_number_to_goal_id: evaluates to False, because client:Thread1 has not reached its step 3
4. The taken_data is ignored, and thus the future will never have set_result called.

server:Thread1
Cause Summary
There is a race condition when sending goal requests, because the goal request can complete before the sequence_number is tracked in the self._goal_sequence_number_to_goal_id dictionary. Thus execute is called before it is tracked, and the taken_data is ignored, assumed to be a response from a duplicate action server.
I believe this same race condition affects all usages of the client API, such as _cancel_goal_async or _get_result_async.
Recommendations
Add a lock around usages of self._client_handle so that the Python-side objects can be edited and kept in sync. That way, if execute is called, it will block until the self._client_handle isn't being used, and the _goal_sequence_number_to_goal_id will be filled in. Alternatively, a change could be made inside execute.
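To make the interleaving and the lock-based recommendation concrete, here is a small self-contained toy model (no ROS involved; names such as fake_send_goal_request and goal_ids_by_sequence are invented for illustration, and handle_lock is the hypothetical lock proposed above). With the lock held across the send and the dictionary update, the consuming thread cannot observe the response before the dictionary is filled in; remove the lock in send_goal_async and the unknown-sequence-number branch is hit almost every time:

import queue
import threading
import time

# Toy stand-ins for the real objects (illustration only; not rclpy code).
goal_ids_by_sequence = {}        # plays the role of _goal_sequence_number_to_goal_id
responses = queue.Queue()        # plays the role of taken_data arriving from the C++ layer
handle_lock = threading.Lock()   # the proposed lock around handle use + bookkeeping


def fake_send_goal_request():
    # The "server" acknowledges instantly, so the response is available immediately.
    sequence_number = 1
    responses.put(sequence_number)
    return sequence_number


def send_goal_async():
    with handle_lock:  # without this lock, the race below occurs
        sequence_number = fake_send_goal_request()
        time.sleep(0.5)  # models the sleep(5) used in the reproduction above
        goal_ids_by_sequence[sequence_number] = "goal-id"


def execute():
    sequence_number = responses.get()  # response arrives while send_goal_async sleeps
    with handle_lock:  # blocks until send_goal_async has updated the dictionary
        if sequence_number in goal_ids_by_sequence:
            print("goal response handled; the future would get set_result")
        else:
            print("response dropped: unknown sequence number (the bug's symptom)")


t1 = threading.Thread(target=send_goal_async)
t2 = threading.Thread(target=execute)
t1.start()
t2.start()
t1.join()
t2.join()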