Exception handling in user callbacks? #2017

Contributor

@meyerj meyerj commented Sep 21, 2022

This pull request is meant more as a question than as an actual request to merge or a bug report. But I am quite sure that something is fishy here, and that the proposed patch may at least help to mitigate the issue. So here it is:

What is the expected behavior of rclcpp in case of an exception raised in a user callback?

I tried to find information about this topic in the documentation, in the code, on GitHub, on Discourse, and on ROS Answers, but failed to find anything conclusive, or maybe I used the wrong search terms. Only this post and this answer seem to be related. For the special case of service callbacks, I remember having seen a discussion/feature request to forward exceptions to the caller as a special response like in ROS 1, but I could not find it again.

  1. User callbacks must never throw?

    They do. I triggered the case by using the ros1_bridge with a service server in ROS 1 and a client calling it from ROS 2: if the ROS 1 service is no longer available, for example because the ROS 1 node died, the callback defined in ServiceFactory<ROS1_T, ROS2_T>::forward_2_to_1() throws a runtime error after the roscpp service call API returns false. I assume any ROS 2 middleware can also throw exceptions when the user callback itself invokes a publisher or service client. Apparently it is even recommended to handle errors by throwing exceptions.

    So if the rule were that user callbacks must handle exceptions internally, I guess ros1_bridge and numerous other node implementations would need to be fixed.

  2. Did I miss a place where this is already handled within rclcpp?

    Even rclcpp code itself may throw exceptions in the Executor code path while spinning, for example here.

    If that is not the case yet, a per-executor, per-node, or per-context flag would be nice to have. It would decide whether exceptions remain unhandled, as seems to be the case now, or whether rclcpp catches and logs them internally. Another option would be a mechanism to register a user callback that receives a std::exception_ptr and whose return value decides whether the executor continues or aborts.

  3. Always catch exceptions when spinning?

    As a last resort, I wanted to patch the main loop of the dynamic_bridge (and other nodes), such that exceptions get logged, but the node does not terminate and continues to forward other topics and service calls. But that is not possible without the patch proposed here:

    // ROS 2 spinning loop
    rclcpp::executors::SingleThreadedExecutor executor;
    while (ros1_node.ok() && rclcpp::ok()) {
      try {
        executor.spin_node_once(ros2_node);
      } catch (const std::exception & e) {
        // Log the exception and continue spinning...
      }
    }

    The problem is that it triggers the "Node has already been added to an executor" exception here in the next cycle after the exception, and hence keeps logging in a loop. So maybe the executor needs to be recreated to recover? Or I could call executor.remove_node(ros2_node) in the catch body as a workaround? That was the point where I started to investigate the problem and ended up here.

    The proposed patch would fix that, I think, by removing the node from the executor before the exception is rethrown to be handled in main() or wherever else spin_once() has been called from. I have not actually tested it yet by compiling rclcpp from source. I may also have missed other places where add_node() and remove_node() get called in pairs. Maybe a better design would involve a RAII-style class that adds a node in its constructor and removes it again in its destructor? It seems that RCPPUTILS_SCOPE_EXIT() is meant for exactly those use cases and should be applied instead of my try/catch block, but I only discovered it while writing this.

    The same pattern, a loop with rclcpp::ok() and rclcpp::spin_once() directly in main(), can be found in many other places too, e.g. here. I am not sure whether rclpy is also affected, but in ROS 2 Python examples the equivalent pattern is even dominant.

    For the simpler rclcpp::spin(node) call, an extra loop would need to be added to keep spinning after an exception.
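
The handler idea floated under question 2 can be sketched without rclcpp at all. The following is only a toy model under stated assumptions: `ToySpinner`, `set_exception_handler()`, and `make_logging_handler()` are hypothetical names invented for illustration and are not part of the rclcpp API. It shows a registered handler that receives a `std::exception_ptr` and returns `true` to keep spinning or `false` to let the original exception escape.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical policy hook: return true to keep spinning, false to abort.
using ExceptionHandler = std::function<bool(std::exception_ptr)>;

// Toy stand-in for an executor; NOT an rclcpp class.
class ToySpinner {
public:
  void add_callback(std::function<void()> cb) { callbacks_.push_back(std::move(cb)); }
  void set_exception_handler(ExceptionHandler handler) { handler_ = std::move(handler); }

  // Runs each queued callback once and returns how many completed normally.
  std::size_t spin_some() {
    std::size_t completed = 0;
    for (const auto & cb : callbacks_) {
      try {
        cb();
        ++completed;
      } catch (...) {
        // Without a handler, the exception escapes to the caller unchanged.
        if (!handler_) { throw; }
        // The handler decides: false means rethrow the original exception.
        if (!handler_(std::current_exception())) { throw; }
      }
    }
    return completed;
  }

private:
  std::vector<std::function<void()>> callbacks_;
  ExceptionHandler handler_;
};

// Example handler: records what() for std::exception types and continues,
// but asks the spinner to abort on anything it cannot inspect.
inline ExceptionHandler make_logging_handler(std::vector<std::string> & log) {
  return [&log](std::exception_ptr eptr) -> bool {
    assert(eptr != nullptr);  // std::rethrow_exception requires a non-null pointer
    try {
      std::rethrow_exception(eptr);
    } catch (const std::exception & e) {
      log.push_back(e.what());
      return true;
    } catch (...) {
      return false;
    }
  };
}
```

With no handler installed, the toy spinner behaves like the current situation described above: the first throwing callback ends the spin call. Whether such a hook belongs on the executor, the node, or the context is exactly the open design question.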

I can hardly believe that there is no foreseen or documented way to prevent any minor fault from terminating the whole process, or that this behavior is "by design". I am sorry in case there is something more obvious and I just missed it.

It is easy to reproduce the crash with the minimal_service example in ros2/examples, by adding a throw statement in the callback:

$ ros2 run examples_rclcpp_minimal_service service_main &
[1] 353822
$ ros2 service call /add_two_ints example_interfaces/srv/AddTwoInts "{}"
requester: making request: example_interfaces.srv.AddTwoInts_Request(a=0, b=0)

[INFO] [1663789664.837616992] [minimal_service]: request: 0 + 0
terminate called after throwing an instance of 'std::runtime_error'
  what():  some error
^C[1]+  Exit 250                ros2 run examples_rclcpp_minimal_service service_main
$ 

@meyerj meyerj force-pushed the fix/remove-node-from-executor-on-exception-while-spinning branch from 591f1f9 to 78df041 Compare September 21, 2022 20:08
…ntations

Signed-off-by: Johannes Meyer <johannes@intermodalics.eu>
@meyerj meyerj force-pushed the fix/remove-node-from-executor-on-exception-while-spinning branch from 78df041 to f046836 Compare September 21, 2022 20:09
@clalancette
Contributor

> This pull request is meant more as a question than as an actual request to merge or a bug report. But I am quite sure that something is fishy here, and that the proposed patch may at least help to mitigate the issue. So here it is:
>
> What is the expected behavior of rclcpp in case of an exception raised in a user callback?

Your question is totally valid, and is really a design consideration. As such, I think it deserves a larger conversation than in a pull request, as the answer to it could also affect other client libraries (like rclpy). What I'm going to do here is to close this pull request to try and keep the pull request list down. What I'll encourage you to do is to start a thread on https://discourse.ros.org that describes the problem as you see it, and possible design solutions. From there, we may end up migrating to a design document or to an REP.

@alsora
Collaborator

alsora commented Sep 22, 2022

+1 on having a design discussion on this topic.
@meyerj besides creating a thread on Discourse, we can also have a conversation in the ROS 2 client library WG (we meet every other Wednesday, see the calendar at the bottom of this page: https://docs.ros.org/en/rolling/The-ROS2-Project/Governance.html)

@ros-discourse

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/what-is-the-expected-behavior-of-rclcpp-in-case-of-an-exception-raised-in-a-user-callback/27527/1

@ros-discourse

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/what-is-the-expected-behavior-of-rclcpp-in-case-of-an-exception-raised-in-a-user-callback/27527/4

@meyerj
Contributor Author

meyerj commented Sep 18, 2023

Nothing has changed since this pull request was opened, and the discussion on ROS Discourse stalled. It may still be worth having that broader discussion on exception handling in rclcpp and other client libraries, but at the same time I consider the fix proposed here valid and necessary at a much lower level, without any change to the intended behavior. Exceptions are still not handled explicitly and are still propagated back to the caller of spin() or its variants; the only difference is that the side effects of having added a node to the executor are undone before returning. That issue is at the same level as other typical RAII bugs, for example locking a mutex and then not unlocking it when returning or throwing an exception before the end of the function.
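
To make the RAII argument concrete, here is a minimal, self-contained sketch. It does not use rclcpp: `ToyNode`, `ToyExecutor`, `ScopedNodeAttachment`, and the two `spin_node_once_*` functions are hypothetical stand-ins that only model the add/remove pairing, and the error string mirrors the "Node has already been added to an executor" message quoted earlier. The leaky variant skips `remove_node()` when the callback throws; the guarded variant undoes `add_node()` in a destructor, so the caller always sees the callback's own exception.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Toy stand-ins, NOT the rclcpp classes: just enough state to model the bug.
struct ToyNode {
  std::function<void()> callback;  // plays the role of the user callback
  bool has_executor = false;       // true while attached to an executor
};

class ToyExecutor {
public:
  void add_node(ToyNode & node) {
    if (node.has_executor) {
      throw std::runtime_error("Node has already been added to an executor.");
    }
    node.has_executor = true;
  }

  void remove_node(ToyNode & node) {
    assert(node.has_executor);  // removing an unattached node is a logic error here
    node.has_executor = false;
  }
};

// RAII guard: attaches in the constructor, detaches in the destructor.
class ScopedNodeAttachment {
public:
  ScopedNodeAttachment(ToyExecutor & executor, ToyNode & node)
  : executor_(executor), node_(node) { executor_.add_node(node_); }
  ~ScopedNodeAttachment() { executor_.remove_node(node_); }
  ScopedNodeAttachment(const ScopedNodeAttachment &) = delete;
  ScopedNodeAttachment & operator=(const ScopedNodeAttachment &) = delete;

private:
  ToyExecutor & executor_;
  ToyNode & node_;
};

// Leaky variant: remove_node() is never reached if the callback throws.
void spin_node_once_leaky(ToyExecutor & executor, ToyNode & node) {
  executor.add_node(node);
  node.callback();
  executor.remove_node(node);
}

// Guarded variant: the destructor runs during stack unwinding as well.
void spin_node_once_guarded(ToyExecutor & executor, ToyNode & node) {
  ScopedNodeAttachment attachment(executor, node);
  node.callback();
}

// Calls spin() twice, catching each failure, and returns the second message.
template <typename Spin>
std::string second_error(Spin spin) {
  std::string message;
  for (int i = 0; i < 2; ++i) {
    try {
      spin();
    } catch (const std::exception & e) {
      message = e.what();
    }
  }
  return message;
}
```

In this model, a catch-and-continue loop around the leaky variant reports the "already added" error from the second iteration onward, while the guarded variant keeps reporting the callback's own error and leaves the node detached. In rclcpp itself, the same effect might be achieved with RCPPUTILS_SCOPE_EXIT(), as mentioned in the original description.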

@meyerj meyerj deleted the fix/remove-node-from-executor-on-exception-while-spinning branch January 20, 2024 12:14