Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LaunchService.shutdown() fails to terminate run loop #126

Open
crdelsey opened this issue Aug 22, 2018 · 14 comments
Open

LaunchService.shutdown() fails to terminate run loop #126

crdelsey opened this issue Aug 22, 2018 · 14 comments
Labels
bug Something isn't working

Comments

@crdelsey
Copy link

Bug report

Required Info:

  • Operating System:
    • Ubuntu 16.04
  • Python Version:
  • Python 3.5.2
  • Installation type:
    • source
  • Version or commit hash:
    • 0.6.0
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):

Steps to reproduce issue

Run the following python code

#!/usr/bin/env python

import sys
import time
import threading
from launch import LaunchDescription
from launch import LaunchService
import launch_ros.actions

def main():
    ld = LaunchDescription([
        launch_ros.actions.Node(
            package='demo_nodes_cpp', node_executable='listener', output='screen'),
            ])
    ls = LaunchService()
    ls.include_launch_description(ld)
    t = threading.Thread(target=ls.run)
    t.start()
    time.sleep(5) # wait for listener to fully start
    ls.shutdown()
    t.join()

if __name__=="__main__":
    main()

Expected behavior

I expect the listener executable to terminate and the python code to exit.

Actual behavior

The python script produces the following output but never terminates.

[INFO] [launch]: process[listener-1]: started with pid [20208]
Starting shutdown
[INFO] [launch]: sending signal 'SIGINT' to process[listener-1]
[ERROR] [launch]: process[listener-1] failed to terminate '5' seconds after receiving 'SIGINT', escalating to 'SIGTERM'
[INFO] [launch]: sending signal 'SIGTERM' to process[listener-1]
[ERROR] [launch]: process[listener-1] failed to terminate '10.0' seconds after receiving 'SIGTERM', escalating to 'SIGKILL'
[INFO] [launch]: sending signal 'SIGKILL' to process[listener-1]

Additional information

The listener executable in fact terminates when the SIGINT is sent and remains as a zombie process.

20208 pts/20   Z+     0:00 [listener] <defunct>
@wjwwood
Copy link
Member

wjwwood commented Aug 22, 2018

I think this issue comes from using asyncio.subprocess from a thread, if I run on my macOS machine with Python 3.7 I get:

% python3 ./test.py
[ERROR] [launch]: exception occurred while executing process[listener-1]:
Traceback (most recent call last):
  File "/Users/william/ros2_ws/build/launch/launch/actions/execute_process.py", line 417, in __execute_process
    stderr_to_stdout=(self.__output == 'screen'),
  File "/Users/william/ros2_ws/build/osrf_pycommon/osrf_pycommon/process_utils/async_execute_process_asyncio/impl.py", line 140, in async_execute_process
    stderr_to_stdout)
  File "/Users/william/ros2_ws/build/osrf_pycommon/osrf_pycommon/process_utils/async_execute_process_asyncio/impl.py", line 48, in _async_execute_process_nopty
    stderr=stderr, close_fds=False)
  File "/Users/william/.pyenv/versions/3.7.0/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py", line 1516, in subprocess_exec
    bufsize, **kwargs)
  File "/Users/william/.pyenv/versions/3.7.0/Python.framework/Versions/3.7/lib/python3.7/asyncio/unix_events.py", line 193, in _make_subprocess_transport
    self._child_watcher_callback, transp)
  File "/Users/william/.pyenv/versions/3.7.0/Python.framework/Versions/3.7/lib/python3.7/asyncio/unix_events.py", line 924, in add_child_handler
    "Cannot add child handler, "
RuntimeError: Cannot add child handler, the child watcher does not have a loop attached

[ERROR] [asyncio]: Task was destroyed but it is pending!
task: <Task pending coro=<BaseSubprocessTransport._connect_pipes() done, defined at /Users/william/.pyenv/versions/3.7.0/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_subprocess.py:157> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x1031a3ca8>()]>>

If I modify the script like so (I'm avoiding the shebang by invoking with python3 directly):

import sys
import time
import threading
from launch import LaunchDescription
from launch import LaunchService
import launch_ros.actions

ls = None


def run():
    global ls
    ld = LaunchDescription([
        launch_ros.actions.Node(
            package='demo_nodes_cpp', node_executable='listener', output='screen'),
            ])
    ls = LaunchService()
    ls.include_launch_description(ld)
    ls.run()


def main():
    t = threading.Thread(target=run)
    t.start()
    time.sleep(5) # wait for listener to fully start
    assert ls is not None
    ls.shutdown()
    t.join()

if __name__=="__main__":
    main()

I get this:

% python3 ./test.py
failed to set signal handlers in 'launch.utilities.signal_management.py'
this function must be called in the main thread
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Users/william/.pyenv/versions/3.7.0/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/Users/william/.pyenv/versions/3.7.0/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "./test.py", line 19, in run
    ls = LaunchService()
  File "/Users/william/ros2_ws/build/launch/launch/launch_service.py", line 86, in __init__
    install_signal_handlers()
  File "/Users/william/ros2_ws/build/launch/launch/utilities/signal_management.py", line 140, in install_signal_handlers
    signal.signal(signal.SIGINT, __on_sigint)
  File "/Users/william/.pyenv/versions/3.7.0/Python.framework/Versions/3.7/lib/python3.7/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread

Traceback (most recent call last):
  File "./test.py", line 33, in <module>
    main()
  File "./test.py", line 28, in main
    assert ls is not None
AssertionError

Which is a combination of errors, some are ours, e.g. the failed to set signal handlers in 'launch.utilities.signal_management.py' this function must be called in the main thread, and the other is from Python, and finally the assertion I added.

See:

install_signal_handlers() must have been called in the main thread before.
It is called automatically by the constructor of `launch.LaunchService`.

I guess my assumption was that running LaunchService from a thread would be ok so long as you constructed it in the main thread (what your script does), but perhaps none of the tests do all of that and execute a process.

This launch file does work:

import threading
import time

import launch.actions
from launch import LaunchDescription
from launch import LaunchService


def main():
    ld = LaunchDescription([launch.actions.LogInfo(msg='Hello World')])
    ls = LaunchService()
    ls.include_launch_description(ld)
    t = threading.Thread(target=ls.run)
    t.start()
    time.sleep(5)  # wait for listener to fully start
    ls.shutdown()
    t.join()


if __name__ == '__main__':
    main()

I'll see if I can reproduce your exact issue with Python 3.5 in Xenial.

@wjwwood
Copy link
Member

wjwwood commented Aug 23, 2018

I can reproduce your issue on Xenial as well.

So the question is what to do about it. For now I'd advise never running LaunchService.run in a thread if you can avoid it.

If running it in a thread is a must, I can look at the child watcher policies that asyncio has. There is asyncio.SafeChildWatcher which uses polling rather than calling os.waitpid(-1), but that incurs significant overhead as the number of processes rises, which might be the case for launch. I might be able to come up with a solution that only does that if you run outside of the main thread, and put a warning in the docs that running in a thread is inefficient. In both cases, I would not recommend running in a thread.

I'll wait to hear from you guys about why you need to run in a thread, and if it's not a hard requirement I'll likely just document that you shouldn't and put a better/earlier warning when you try to do that.

@wjwwood
Copy link
Member

wjwwood commented Aug 23, 2018

There's a little info about the child process watchers here: https://stackoverflow.com/questions/48604341/what-is-event-loop-policy-and-why-is-it-needed-in-python-asyncio

@crdelsey
Copy link
Author

What I'm trying to do is write some integration tests for my node in pytest. I was trying to use launch to start up the nodes-under-test as part of a pytest fixture, run the tests, and then shutdown() as part of the teardown process.

I needed to put run() in a thread so the fixture setup wouldn't block.

The alternatives I can think of are:

  • Start a new process during the fixture setup using either the python multiprocessing module or ros2 launch. Those would be fine for the simple case where I don't care about the node-under-test console output, but doesn't work so well if I wanted to compare the output as part of the test case using launch_testing
  • Create a separate launch file for each configuration I want to test, and make the launch file also launch pytest. It would then have to terminate when pytest terminates. This could work if I don't have a lot of configurations, but it prevents me from tearing down and setting up clean between each test in the file.

@wjwwood
Copy link
Member

wjwwood commented Aug 23, 2018

Ok, so in ROS 1 the equivalent to this was rostest and it worked by defining test files which were just XML based launch files with one of the "nodes" marked as a test, so that it's return code would influence the test running. And the process marked as node test would usually do the actual testing (e.g. check if a topic is there, offer a service and check that it was called, etc...). Integrating that with something like pytest was not that straightforward.

I can see the allure of using launch to start a bunch of stuff, wait for an event to occur, then use the "test body" in pytest to poke at the things running under launch using rclpy.

So I guess we should look at how to support running outside of the main thread, but that's not something I have time to dig into at the moment and in my opinion there's a chance it just never works right, especially with Python 3.5.2.

Lots of asyncio stuff was fixed in 3.5.3 and even more in 3.6+ and in fact the first stable version of asyncio wasn't until 3.6. So another recommendation would be to get a newer Python, possibly with a new Ubuntu, or even possibly with something like pyenv.

The other thing I'll mention is that when you use launch directly (rather than ros2 launch) you will be missing the stuff added by launch_ros.get_default_launch_description(), which includes a custom rclpy node that launch_ros uses to monitor nodes with lifecycle and stuff like that.

@wjwwood
Copy link
Member

wjwwood commented Aug 23, 2018

I'll also say that for now we're still using the legacy system in launch_testing for this kind of thing. The new launch system hasn't tackled testing yet. @dirk-thomas just took a stab at doing a more "ROS 1 rostest" like approach using the new launch system: #127

@crdelsey
Copy link
Author

I've got no problems with going to a newer Python version, but based on your test with 3.7, it doesn't really solve my problem. It just gets me a better error message :-)

I'll take a look at what @dirk-thomas is doing. I'd like to follow the same pattern as the rest of the ROS code, but I just couldn't figure out how to use the existing launch_testing to do what I wanted.

In the meantime, I'll work around

@wjwwood
Copy link
Member

wjwwood commented Aug 23, 2018

I've got no problems with going to a newer Python version, but based on your test with 3.7, it doesn't really solve my problem. It just gets me a better error message :-)

That's true, but it might also be required for one of the potential solutions for multi-threading (like the asyncio.SafeChildWatcher option).

@wjwwood
Copy link
Member

wjwwood commented Aug 23, 2018

Also, using multiprocess is not a bad option (I didn't try it myself but I think it would work), at least as a temporary replacement for running it in a thread and so long as you don't need access to the launch service during the test.

@crdelsey
Copy link
Author

For now, I think I'll just spawn ros2 launch from my test fixture. That has the added benefit that I can reuse the launch files for other things. Also, I'd been meaning to look into the lifecycle stuff you mentioned that I get with that, because one of the problems with my current approach is that I can't reliably tell when all the nodes are up and ready to receive messages.

@crdelsey crdelsey changed the title LuanchService.shutdown() fails to terminate run loop LaunchService.shutdown() fails to terminate run loop Aug 23, 2018
@dirk-thomas dirk-thomas added the bug Something isn't working label Aug 30, 2018
@pbaughman
Copy link
Contributor

I just encountered this exact same problem and then stumbled on this issue (woo google!). Based on this comment here it sure seems like LaunchService.run is intended to be runnable from another thread, right?

If LaunchService.shutdown is not meant to shut down a LaunchService that got run on another thread, what's it supposed to be for? Or are you only meant to do LaunchService.run from the main thread, and then some other worker thread calls LaunchService.shutdown?

Thanks,
Pete

@wjwwood
Copy link
Member

wjwwood commented Jan 7, 2019

I recommend not running launch service in anything but the main thread (it is special in Python). That comment implies that shutdown may be called from a non-main thread but will do nothing if the launch service isn't running.

@hidmic
Copy link
Contributor

hidmic commented Mar 18, 2019

Alright, I just stumbled upon this issue too. It's quite inconvenient, as pushing the service to a thread instead of pushing the tests (as @pbaughman had to do) simplifies things quite a bit. I'll dig in a bit.

@hidmic
Copy link
Contributor

hidmic commented Mar 18, 2019

Ok, there seems to be a fundamental limitation in asyncio when it comes to subprocesses. At least on Unix platforms and as of Python 3.6, all child watchers rely on SIGCHLD to monitor subprocesses termination and loop signal handlers can only be serve in the main thread event loop. Thus, there's no way to push the LaunchService to a background thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants