
High-Latency networks: "generator already executing" when calling a service #1351

Open
mmatthebi opened this issue Aug 30, 2024 · 5 comments

Comments

@mmatthebi

mmatthebi commented Aug 30, 2024

Bug report

We have observed strange behaviour when high network latency occurs between an action client and server. Such a setup can happen, e.g., when two robots are connected via an LTE link. I could replicate the error with two Docker containers and corresponding traffic shaping.


Required Info:

  • Operating System: Ubuntu 22.04
  • Installation type: Binaries
  • Version or commit hash: ros-humble-desktop 0.10.0-1jammy.20240617.124725, but has been reproduced with Iron as well.
  • DDS implementation: Humble default
  • Client library (if applicable): probably rclpy

Steps to reproduce issue

See the attached zip file bugreport.zip, which contains a docker-compose script along with the source code for an action server and an action client. The architecture is as follows (a hypothetical sketch is shown after this list):

  • action_server.py provides a basic action
  • action_client.py provides:
    • an action client for the above action server
    • a service server /mode to enable/disable calling the action on the server
    • a subscription /data which triggers the action call on each message received on this topic, provided the action calls have been enabled via the service
  • a script to enable network latency between the two docker containers.
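
For orientation, here is a hypothetical sketch of the structure of action_client.py, reconstructed from the description above and the stack trace below; the node names, the action type (Fibonacci is only a placeholder) and the callback bodies are assumptions, not the code from the zip file.

import rclpy
from rclpy.action import ActionClient
from rclpy.node import Node
from example_interfaces.action import Fibonacci  # placeholder action type
from std_msgs.msg import String
from std_srvs.srv import SetBool

class AcClient(Node):
    """Inner node that owns the action client."""

    def __init__(self):
        super().__init__('ac_client')
        self._client = ActionClient(self, Fibonacci, 'fibonacci')
        self._goal_handle = None

    def send_goal(self):
        future = self._client.send_goal_async(Fibonacci.Goal(order=5))
        # Nested spin without an explicit executor (see the stack trace below).
        rclpy.spin_until_future_complete(self, future)
        self._goal_handle = future.result()

    def cancel_goal(self):
        if self._goal_handle is not None:
            future = self._goal_handle.cancel_goal_async()
            rclpy.spin_until_future_complete(self, future)

class ActionCaller(Node):
    """Outer node: /mode enables/disables the calls, /data triggers them."""

    def __init__(self):
        super().__init__('action_caller')
        self._acClient = AcClient()  # one node creating the other
        self._enabled = False
        self.create_service(SetBool, 'mode', self._onMode)
        self.create_subscription(String, 'data', self._onData, 10)

    def _onMode(self, request, response):
        self._enabled = request.data
        if not request.data:
            self._acClient.cancel_goal()
        response.success = True
        return response

    def _onData(self, msg):
        if self._enabled:
            self._acClient.send_goal()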

Expected behavior

The behaviour is as expected when there is a low-latency connection between the containers:

  1. start up the docker containers with docker-compose up -d
  2. log into both docker containers (in different terminals):
    docker exec -it client1 /bin/bash
    docker exec -it client2 /bin/bash
    
  3. in both containers, start tmux, move to /app, and source the ROS environment with . /opt/ros/humble/setup.bash
  4. in client1: python3 action_server.py
  5. in client2, open multiple tmux panes (pane1, pane2, pane3)
  6. pane1: python3 action_client.py
  7. pane2: ros2 service call /mode std_srvs/srv/SetBool '{data: true}'
  8. pane3: ros2 topic pub /data std_msgs/msg/String -r 10

From this point on, the action_server is constantly performing actions and the client is reporting on the console. Now, stop the action calls by calling the service:

ros2 service call /mode std_srvs/srv/SetBool '{data: false}'

The client stops executing the actions as expected.

Actual behavior

Now, add network latency between the containers by calling

./delay.sh  # equivalent to tc qdisc add dev eth0 root netem delay 200ms
  • Check that there is a total round-trip delay of 400 ms between the containers by pinging one container from the other.
  • Now, perform steps 4.-8. from the previous section. As before, the action calls are performed by the action server.
  • Then, trying to stop the action calls by calling the service results in a blocking service call:
    ros2 service call /mode std_srvs/srv/SetBool '{data: false}'
    
  • Upon Ctrl-C-ing this service call and calling it again, the client crashes with this stack trace:
      main()
  File "/app/action_client.py", line 89, in main
    rclpy.spin(action_client, e)
  File "/opt/ros/humble/local/l
ib/python3.10/dist-packages/rclpy/__init__.py", line 222, in spin
    executor.spin_once()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 794, in spin_once
    self._spin_once_impl(timeout_sec)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 791, in _spin_once_impl
    future.result()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 94, in result
    raise self.exception()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 239, in __call__
    self._handler.send(None)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 437, in handler
    await call_coroutine(entity, arg)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 391, in _execute_service
    response = await await_or_execute(srv.callback, request, srv.srv_type.Response())
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 107, in await_or_execute
    return callback(*args)
  File "/app/action_client.py", line 75, in _onMode
    self._acClient.cancel_goal()
  File "/app/action_client.py", line 42, in cancel_goal
    rclpy.spin_until_future_complete(self, future)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/__init__.py", line 248, in spin_until_future_complete
    executor.spin_until_future_complete(future, timeout_sec)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 303, in spin_until_future_complete
    self.spin_once_until_future_complete(future, timeout_sec)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 746, in spin_once_until_future_complete
    self._spin_once_impl(timeout_sec)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 728, in _spin_once_impl
    handler, entity, node = self.wait_for_ready_callbacks(timeout_sec=timeout_sec)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 711, in wait_for_ready_callbacks
    return next(self._cb_iter)
ValueError: generator already executing

Additional information

The blocking of the service call depends on the rate of the /data publisher. Essentially, it seems that if the service call to stop the actions arrives while a packet transmission is ongoing, the call blocks forever and the crash occurs.

I know that the architecture of action_client.py is a bit strange, as there are two nodes in the process and one creates the other. However, we have seen such an architecture in the wild where this bug occurred. It would be of great help to understand what the cause is and how to fix it.

@jmblixt3
Contributor

This certainly sounds like #1123, so potentially try to apply pr #1308 and see if it fixes your issue.

@fujitatomoya
Collaborator

This certainly sounds like #1123, so potentially try to apply pr #1308 and see if it fixes your issue.

@jmblixt3

I do not really see the logical reason why #1308 would fix this... can you explain your thought a bit more?

IMO, this is because the same SingleThreadedExecutor object (the global executor) is called in a multi-threaded manner by the MultiThreadedExecutor; it calls _spin_once_impl concurrently, and finally the ValueError is raised by the generator _wait_for_ready_callbacks.

The reason we see this problem only with the extra 200 ms delay is that the 1st service call is still waiting to be completed because of this delay when the 2nd call comes in. (Without this delay, the service call completes much more quickly, so the 1st spin_until_future_complete on the global executor finishes before the 2nd request comes in.)
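
A minimal sketch of that failure mode, with assumed names (this is not the reporter's code): a MultiThreadedExecutor dispatches two callbacks concurrently, and each one falls back to the global SingleThreadedExecutor inside rclpy.spin_until_future_complete, so the global executor's callback generator gets re-entered from two threads.

import rclpy
from rclpy.callback_groups import ReentrantCallbackGroup
from rclpy.executors import MultiThreadedExecutor
from rclpy.node import Node
from std_srvs.srv import Trigger

class NestedSpin(Node):
    def __init__(self):
        super().__init__('nested_spin')
        group = ReentrantCallbackGroup()
        self._cli = self.create_client(Trigger, 'slow_remote', callback_group=group)
        self.create_service(Trigger, 'mode', self._on_mode, callback_group=group)

    def _on_mode(self, request, response):
        future = self._cli.call_async(Trigger.Request())
        # No executor argument: this falls back to the global default executor,
        # even though this callback is already running under the
        # MultiThreadedExecutor below. With a 400 ms round trip the call is
        # slow, so a second /mode request arrives in the meantime and spins the
        # same global executor from another thread -> "generator already executing".
        rclpy.spin_until_future_complete(self, future)
        response.success = future.done()
        return response

def main():
    rclpy.init()
    node = NestedSpin()
    executor = MultiThreadedExecutor()
    rclpy.spin(node, executor)

if __name__ == '__main__':
    main()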

@mmatthebi thanks for the detailed information.

Can you try the following patch in your application and see if that works?

from rclpy.executors import SingleThreadedExecutor, MultiThreadedExecutor
...
        se = SingleThreadedExecutor()  # HERE: a dedicated executor for the nested spin
        rclpy.spin_until_future_complete(self, send_goal_future, executor=se)  # HERE: avoid the global executor
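
For the cancel path from the stack trace above, the same idea would look roughly like this (a fragment only; the method name follows the trace, the body is an assumption):

from rclpy.executors import SingleThreadedExecutor
...
    def cancel_goal(self):
        future = self._goal_handle.cancel_goal_async()
        se = SingleThreadedExecutor()  # dedicated executor, not the global one
        try:
            rclpy.spin_until_future_complete(self, future, executor=se)
        finally:
            se.shutdown()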

fujitatomoya added a commit to fujitatomoya/ros2_test_prover that referenced this issue Sep 4, 2024
…ervice

  ref: ros2/rclpy#1351

Signed-off-by: Tomoya Fujita <Tomoya.Fujita@sony.com>
@mmatthebi
Author

Thank you all for your help! The hint from #1123 unfortunately did not fix it. However, when running with Rolling the error does not occur, so it seems to have been fixed somewhere internally already.

However, huge huge thanks to @fujitatomoya for providing a workaround! I have implemented it in the test case and the error goes away. I will try this in production and see if it helps there as well. I'll let you know. Thanks again!

@fujitatomoya
Collaborator

@mmatthebi thanks for checking that.

Though, if running with rolling, the error does not occur, so it seems to be fixed somewhere internally already.

One question: did you build from source or just use the released package? Can you share the version/commit hash of either the package or the source code?

@mmatthebi
Author

I ran the tests again. With Rolling, the service call still blocks sometimes, but a subsequent call does not crash with the "generator already executing" bug. However, it seems that the action calls to the action server do not get through smoothly but in bursts. In contrast, when using your proposed solution with the dedicated executor, the service calls do not block and the action calls also go through more smoothly.

I used the packaged Rolling version from the ros2:rolling Docker container:

root@client2:/app# apt-cache show ros-rolling-desktop
Package: ros-rolling-desktop
Version: 0.11.0-1noble.20240820.020842
Architecture: amd64
Maintainer: Geoffrey Biggs <geoff@openrobotics.org>
