Exception handling in user callbacks? #2017
Closed
This pull request is meant more as a question than as an actual request to merge or a bug report. But I am quite sure that something is fishy here, and that the proposed patch may at least help to mitigate the issue. So here it is:
What is the expected behavior of rclcpp in case of an exception raised in a user callback?
I tried to find information about this topic in the documentation, in the code, on GitHub, on Discourse, and on ROS Answers, but failed to find anything conclusive, or maybe used the wrong search terms. Only this post and this answer seem to be related. For the special case of service callbacks I remember having seen a discussion/feature request to forward exceptions to the caller as a special response, like in ROS 1, but I could not find it again.
User callbacks must never throw?
They do. I triggered the case by using the ros1_bridge with a service server in ROS 1 and a client calling it from ROS 2: if the ROS 1 service is no longer available, for example because the ROS 1 node died, the callback defined in ServiceFactory<ROS1_T, ROS2_T>::forward_2_to_1() throws a runtime error after the roscpp service call API returns false. I assume any ROS 2 middleware can also throw exceptions when the user callback itself invokes a publisher or service client. Apparently it is even recommended to handle errors by throwing exceptions.
So if the rule were that user callbacks must handle exceptions internally, I guess ros1_bridge and numerous other node implementations would need to be fixed.
Did I miss a place where this is already handled within rclcpp?
Even rclcpp code itself may throw exceptions in the Executor code path while spinning, for example here.
If that is not the case yet, a per-executor, per-node, or per-context flag would be nice to have that decides whether exceptions stay unhandled, as seems to be the case now, or whether rclcpp catches and logs them internally. Or some mechanism to register a user callback that receives an std::exception_ptr and whose return value decides whether the executor continues or aborts...
Always catch exceptions when spinning?
As a last resort, I wanted to patch the main loop of the dynamic_bridge (and other nodes) so that exceptions get logged, but the node does not terminate and continues to forward other topics and service calls. But that is not possible without the patch proposed here:
The problem is that it triggers the "Node has already been added to an executor" exception here in the next cycle after the exception, and hence keeps logging in a loop. So maybe the executor needs to be recreated to recover? Or I could call executor.remove_node(ros2_node) in the catch body as a workaround? That was the point where I started to investigate the problem and ended up here.
The proposed patch would fix that, I think, by removing the node from the executor before the exception is rethrown to be handled in main() or wherever else spin_once() has been called from. I have not actually tested it yet by compiling rclcpp from source. I also may have missed other places where add_node() and remove_node() get called in pairs. Maybe a better design would involve a RAII-style class that adds a node in its constructor and removes it again in its destructor? It seems RCPPUTILS_SCOPE_EXIT() is meant for exactly those use cases and should be applied instead of my try/catch block, but I only discovered it while writing this.
The same pattern that involves a loop with rclcpp::ok() and rclcpp::spin_once() directly in main() can be found in many other places, too, e.g. here. I am not sure whether rclpy is also affected, but in ROS 2 Python examples the equivalent pattern is even dominant.
For the simpler rclcpp::spin(node) call an extra loop would need to be added to keep spinning after an exception.
I can hardly believe that there is no foreseen or documented way to prevent any minor fault from terminating the whole process, or that this behavior is "by design". I am sorry in case there is something more obvious, and I just missed it.
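To make the add_node()/remove_node() pairing problem concrete without a ROS installation, here is a minimal stand-alone sketch. ToyExecutor, ScopeExit, spin_once_unguarded() and spin_once_guarded() are hypothetical names invented for this illustration; they are not the rclcpp or rcpputils API, and the real patch may look different. The sketch only assumes that a spin-once helper adds the node, runs work that may throw, and then removes the node again.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Toy stand-in for an executor (hypothetical, NOT the rclcpp class).
// It only tracks whether a node is currently registered.
class ToyExecutor {
public:
  void add_node(const std::string & name) {
    if (has_node_) {
      throw std::runtime_error("Node has already been added to an executor.");
    }
    node_ = name;
    has_node_ = true;
  }
  void remove_node() { has_node_ = false; }
  bool has_node() const { return has_node_; }

private:
  std::string node_;
  bool has_node_ = false;
};

// Minimal scope guard (hypothetical): runs the stored function when it goes
// out of scope, both on normal return and during stack unwinding.
class ScopeExit {
public:
  explicit ScopeExit(std::function<void()> fn) : fn_(std::move(fn)) {}
  ~ScopeExit() { if (fn_) { fn_(); } }
  ScopeExit(const ScopeExit &) = delete;
  ScopeExit & operator=(const ScopeExit &) = delete;

private:
  std::function<void()> fn_;
};

// Unguarded pairing: if callback() throws, remove_node() is skipped and the
// node stays registered.
void spin_once_unguarded(
  ToyExecutor & exec, const std::string & node, const std::function<void()> & callback)
{
  exec.add_node(node);
  callback();
  exec.remove_node();
}

// Guarded pairing: the guard is created only after add_node() succeeded, so
// the node is removed on every exit path, including an exception.
void spin_once_guarded(
  ToyExecutor & exec, const std::string & node, const std::function<void()> & callback)
{
  exec.add_node(node);
  ScopeExit guard([&exec]() { exec.remove_node(); });
  callback();
}
```

With the unguarded variant, a throwing callback leaves the node registered, so the next cycle fails in add_node() instead of running the callback. With the guarded variant, the original exception still reaches the caller, but the next cycle can add the node again.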
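The idea of a registered handler that receives an std::exception_ptr and decides whether spinning continues could look roughly like the following. This is only a sketch under stated assumptions: ExceptionHandler, run_callbacks() and tolerate_runtime_errors() are hypothetical names, nothing like them is known to exist in rclcpp, and a plain vector of callbacks stands in for the executor's work queue.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical handler type: return true to keep going, false to abort.
using ExceptionHandler = std::function<bool(std::exception_ptr)>;

// Runs each callback in order and returns how many completed without throwing.
// When a callback throws, the handler is consulted. If there is no handler or
// it returns false, the original exception is rethrown to the caller.
std::size_t run_callbacks(
  const std::vector<std::function<void()>> & callbacks, const ExceptionHandler & on_error)
{
  std::size_t completed = 0;
  for (const auto & callback : callbacks) {
    try {
      callback();
      ++completed;
    } catch (...) {
      if (!on_error || !on_error(std::current_exception())) {
        throw;  // rethrows the callback's exception unchanged
      }
    }
  }
  return completed;
}

// Example policy (an assumption for illustration): tolerate std::runtime_error,
// such as an unavailable service, and abort on anything else.
bool tolerate_runtime_errors(std::exception_ptr error)
{
  try {
    std::rethrow_exception(error);
  } catch (const std::runtime_error &) {
    return true;   // a real handler would probably log the message here
  } catch (...) {
    return false;
  }
}
```

The handler's return value keeps the policy out of the loop itself: a bridge-like node could log and continue on transient failures, while other nodes could keep today's behavior by not registering a handler at all.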
It is easy to reproduce the crash with the minimal_service example in ros2/examples, by adding a throw statement in the callback: