Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix recursive spinning from within a callback? #1211

Open
wants to merge 1 commit into
base: rolling
Choose a base branch
from

Conversation

meyerj
Copy link

@meyerj meyerj commented Jan 20, 2024

Resolves #1018. Also #919 seems related at first glance.

Suggested changes:

  • rclpy's global spinning functions (rclpy.spin*()) use the node's current executor by default, if set. Only in case the passed node does not have an executor assigned yet, fall back to the default executor, for example in the main function.
  • only remove the node from the executor if it has actually been added in the same spinning call. add_node() may return False if the node was already added to the executor before, like if spin_until_future_complete() was called recursively from within another callback. In that case it must not be removed in the same callback. Only the outermost call would do so.

Rationale:

This is to support the common case where rclpy.spin_until_future_complete(self, future) is used within another callback, by a running executor, where the node has already been added before. At the moment, without a patch like this, it seems very complex to call a service in a blocking way in this case, for example to forward a call to another service within a node? The synchronous client.call() cannot be used either, which would be the most obvious solution, because it does not spin the node while waiting or handles the response directly, and hence depends on a spinner for that node in another thread to function properly.

The basic example for calling a service from the official documentation only works because the service call is initiated from the main thread directly, which is unusual in a more complex, long running node. It would also work like that inside a callback, for example a timer callback like in #1018, but then the node is removed from the executor and no other callbacks will run anymore afterwards.

I am aware that this patch would be a quite significant change to rclpy, with the potential to break user code. So it is more meant to start a discussion on other potential solutions...

Alternatives considered:

  • Update examples and documentation and tell users to call node.executor.spin_until_future_complete(future) instead of rclpy.spin_until_future_complete(node, future) if that is used within a callback?
  • Throw an exception if the same node is added to an executor more than once, like in rclcpp, to enforce the above?

References:

For reference, my original comment in #1018 with some additional thoughts and suggestions:

We experienced similar issuses in different contexts. I think the problem is deeper, not only related to client calls and timers.

Essentially all implementations of rclpy.spin_*() methods in rclpy/__init__.py are somehow broken for the following reason, similar to what @iuhilnehc-ynos already outlined above:

  • If they are called recursively from within a callback, which is not explicitly forbidden, then the node instance was already added before to that executor. This results in False being returned from Executor.add_node(node), but the return value is not used in the spin functions.
  • Even if the node has not be added in this spin call, it will be removed from the executor unconditionally in the finally block.
  • So if any of those methods is used recursively, the node will not be spinned anymore afterwards.

I do think that recursive spinning is a desirable feature and apparently many users rely on that for implementing timeouts without blocking all other callbacks. Before recent #1188 calling a service with a timeout that works in any context was surprisingly hard, but the synchronous call() with a timeout does currently not allow other callbacks to run in between, like with recursive spinning. Even the most basic Python service client example in the official documentation uses rclpy.spin_until_future_complete(self, self.future), but in a method called directly from main() and not by an already running executor, like if a service would be called in another callback.

Suggestion:

So probably the "best" solution would be to try adding the node to the executor for spinning, but keep the return value and only remove it in finally if it was successfully added before? Something like

    try:
        was_added = executor.add_node(node)
        executor.spin_until_future_complete(future, timeout_sec)
    finally:
        if was_added:
            executor.remove_node(node)

of with a context manager that temporarily adds the node to the executor if it was not added before, instead of the repeated try ... finally construct.

Alternatively there must be a way to call executor.spin_until_future_complete(future, timeout_sec) directly without a specific node instance if it was already added to the executor before, like inside another callback of that node.

Something like

node.executor.spin_until_future_complete(future, timeout_sec)

instead of

rclpy.spin_until_future_complete(future, timeout_sec)

should also solve the problem described in this issue, and similar ones.

When checking on how to find the current executor of a node, I also found that Executor.add_node(node) is always called twice recursively, because it invokes the setter for node.executor which in turn calls back into executor.add_node(node). This only works because executor._nodes_lock is a recursive lock and add_node(node) checks whether the node was already added to self._nodes before.

Afaik there is currently no way to get the executor of the current thread, if any, without a reference to the node? I think asyncio.get_event_loop() uses thread local storage for that. So calls like rclpy.spin_until_future_complete(node, self.future) from within a callback would also fail, or spin the wrong executor, if used inside a callback that is executed by another than the global executor. If there was such a way, then using the executor of the current thread as the default executor in global rclpy.spin_*() functions would be a reasonable default, and only fall back to the global executor in case the call is from a thread context without any executor, like the main thread before any spin function was called. Of course that would be a quite severe change to the basics of rclpy...

Interestingly, in rclcpp recursively spinning a node results in an exception being thrown in Executor::add_node(), so it behaves fundamentally different (in version 6f6b5f2). See also @wjwwood's related TODO comment in rclcpp::spin_node_until_future_complete(). On the long term it would be nice if the two most widely used client libraries behave the same with this respect.

And just for completeness and because it is very much related, if an uncaught exception is thrown in any callback in rclcpp, the executor is left in an unclean state and the node cannot spin anymore, even if exceptions would be handled by the caller of the spinner: ros2/rclcpp#2017.

…fault

... and only remove the node from the executor if it has actually been added before in the same call.

Signed-off-by: Johannes Meyer <johannes@intermodalics.eu>
Copy link
Collaborator

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to add the check if the node is already associated with Executor,

def add_node(self, node: 'Node') -> bool:
"""
Add a node whose callbacks should be managed by this executor.
:param node: The node to add to the executor.
:return: ``True`` if the node was added, ``False`` otherwise.
"""
with self._nodes_lock:
if node not in self._nodes:
self._nodes.add(node)
node.executor = self
# Rebuild the wait set so it includes this new node
self._guard.trigger()
return True
return False

"""
executor = get_global_executor() if executor is None else executor
executor = executor or node.executor or get_global_executor()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although this makes #1018 (comment) works with node.executor,

If executor is specified, and node.executor is assigned, this will replace the node.executor implicitly. i do not think this is expected behavior by user. instead, we should generate the exception to let the user know that this node has been already associated with executor? i agree with #1018 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Asynchronous client calls in a timer cause the timer to never execute a second time.
2 participants