
asyncio: MultiLoopWatcher has a race condition (test_asyncio: test_close_kill_running() hangs on AMD64 RHEL7 Refleaks 3.x) #82504

Closed
vstinner opened this issue Sep 30, 2019 · 50 comments
Labels
3.9, tests, topic-asyncio, type-bug

Comments

@vstinner
Member

BPO 38323
Nosy @vstinner, @njsmith, @asvetlov, @cjerdonek, @1st1, @jstasiak, @pablogsal, @miss-islington, @aeros, @Fidget-Spinner, @shreyanavigyan
PRs
  • bpo-38323: Suppress the hang #18457
  • bpo-38323: Temporarily skip close_kill_running() for MultiLoopWatcher in test_asyncio #20013
  • [WIP] bpo-38323: Fix MultiLoopChildWatcher hangs #20142
  • bpo-38323: Add guard clauses in MultiLoopChildWatcher. #22756
  • [3.9] bpo-38323: Add guard clauses in MultiLoopChildWatcher. (GH-22756) #23806
  • [3.8] bpo-38323: Add guard clauses in MultiLoopChildWatcher. (GH-22756) #23807
  • bpo-38323, asyncio: Fix MultiLoopChildWatcher race condition #26536
  • bpo-38323: Remove test_cancel_make_subprocess_transport_exec as it has a race condition #26541
  • bpo-38323: Skip SubprocessMultiLoopWatcherTest as they can hang the test suite #26542
  • [3.9] bpo-38323: Skip SubprocessMultiLoopWatcherTest as they can hang the test suite (GH-26542) #26543
  • [3.10] bpo-38323: Skip SubprocessMultiLoopWatcherTest as they can hang the test suite (GH-26542) #26544
  • bpo-38323: Change MultiLoopChildWatcher to install handlers for all the event loops #26574
  • bpo-44368: Improve syntax errors with invalid as pattern targets #26632
  • [3.8] bpo-38323: Skip SubprocessMultiLoopWatcherTest as they can hang the test suite (GH-26542) #26670
Files
  • test-kill.py
  • test-kill2.py
  • test-kill3.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2019-09-30.12:57:35.040>
    labels = ['type-bug', 'tests', '3.9', 'expert-asyncio']
    title = 'asyncio: MultiLoopWatcher has a race condition (test_asyncio: test_close_kill_running() hangs on AMD64 RHEL7 Refleaks 3.x)'
    updated_at = <Date 2021-11-05.17:45:33.738>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2021-11-05.17:45:33.738>
    actor = 'byllyfish'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Tests', 'asyncio']
    creation = <Date 2019-09-30.12:57:35.040>
    creator = 'vstinner'
    dependencies = []
    files = ['49144', '49145', '49150']
    hgrepos = []
    issue_num = 38323
    keywords = ['patch']
    message_count = 49.0
    messages = ['353568', '354079', '355150', '355157', '355159', '355362', '355363', '359909', '359910', '359928', '360975', '361576', '361691', '361801', '368061', '368183', '368280', '368467', '368468', '368501', '368511', '368514', '368516', '368532', '368544', '368549', '368561', '368563', '368564', '368699', '368725', '369086', '380420', '383194', '383197', '383199', '395090', '395113', '395119', '395126', '395138', '395139', '395142', '395143', '395376', '395693', '395718', '396684', '405813']
    nosy_count = 12.0
    nosy_names = ['vstinner', 'njs', 'asvetlov', 'chris.jerdonek', 'yselivanov', 'jstasiak', 'pablogsal', 'miss-islington', 'aeros', 'kj', 'shreyanavigyan', 'byllyfish']
    pr_nums = ['18457', '20013', '20142', '22756', '23806', '23807', '26536', '26541', '26542', '26543', '26544', '26574', '26632', '26670']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue38323'
    versions = ['Python 3.9']

    @vstinner
    Member Author

test_asyncio failed once on AMD64 RHEL7 Refleaks 3.x, and then test_close_kill_running() was killed after 3 h 15 min. I guess that it hangs, but I'm not 100% sure. When test_asyncio was re-run, it seems like it ran successfully 3 times but hung at the 4th run. Refleaks runs each test 6 times.

    https://buildbot.python.org/all/#/builders/264/builds/10

    0:17:04 load avg: 3.01 [416/419/1] test_asyncio failed (12 min 21 sec) -- running: test_multiprocessing_fork (6 min 34 sec), test_concurrent_futures (10 min 52 sec), test_multiprocessing_spawn (13 min 53 sec)
    beginning 6 repetitions
    123456
    .Unknown child process pid 25032, will report returncode 255
    Loop <_UnixSelectorEventLoop running=False closed=True debug=False> that handles pid 25032 is closed
    .Unknown child process pid 19349, will report returncode 255
    Child watcher got an unexpected pid: 19349
    Traceback (most recent call last):
      File "/home/buildbot/buildarea/3.x.cstratak-RHEL7-x86_64.refleak/build/Lib/asyncio/unix_events.py", line 1213, in _do_waitpid
        loop, callback, args = self._callbacks.pop(pid)
    KeyError: 19349
    test test_asyncio failed -- Traceback (most recent call last):
      File "/home/buildbot/buildarea/3.x.cstratak-RHEL7-x86_64.refleak/build/Lib/test/test_asyncio/test_subprocess.py", line 156, in test_shell
        self.assertEqual(exitcode, 7)
    AssertionError: 255 != 7

    (...)
    0:21:59 load avg: 0.52 [419/419/1] test_concurrent_futures passed (15 min 46 sec)
    (...)
    == Tests result: FAILURE ==

    404 tests OK.

    10 slowest tests:

    • test_concurrent_futures: 15 min 46 sec
    • test_multiprocessing_spawn: 14 min 7 sec
    • test_asyncio: 12 min 21 sec
    • test_multiprocessing_forkserver: 8 min 59 sec
    • test_multiprocessing_fork: 7 min 13 sec
    • test_io: 3 min 47 sec
    • test_subprocess: 3 min 30 sec
    • test_pickle: 2 min 51 sec
    • test_capi: 2 min 47 sec
    • test_zipfile: 2 min 41 sec

    1 test failed:
    test_asyncio

    14 tests skipped:
    test_devpoll test_gdb test_ioctl test_kqueue test_msilib
    test_ossaudiodev test_startfile test_tix test_tk test_ttk_guionly
    test_winconsoleio test_winreg test_winsound test_zipfile64

    Re-running failed tests in verbose mode

    Re-running test_asyncio in verbose mode
    beginning 6 repetitions
    123456
    (...)

    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    ...
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ... ok
    ...
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessSafeWatcherTests) ... ok
    ...
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessThreadedWatcherTests) ... ok
    ...

    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    ...
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ... ok
    ...
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessSafeWatcherTests) ... ok
    ...
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessThreadedWatcherTests) ... ok
    ...

    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    ...
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ... ok
    ...
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessSafeWatcherTests) ... ok
    ...
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessThreadedWatcherTests) ... ok
    ...

    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    ...
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ... ok
    ...
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessSafeWatcherTests) ... ok
    ...
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessThreadedWatcherTests) ... ok
    ...
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_communicate (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_communicate_ignore_broken_pipe (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_create_subprocess_exec_text_mode_fails (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_create_subprocess_exec_with_path (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_create_subprocess_shell_text_mode_fails (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_devnull_error (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_devnull_input (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_devnull_output (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_empty_input (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_exec_loop_deprecated (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_kill (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_pause_reading (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_popen_error (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_popen_error_with_stdin_pipe (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_process_create_warning (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_read_stdout_after_process_exit (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_send_signal (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_shell (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_shell_loop_deprecated (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_start_new_session (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_stdin_broken_pipe (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_stdin_not_inheritable (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_stdin_stdout (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_subprocess_protocol_create_warning (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_terminate (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests) ... ok
    test_cancel_make_subprocess_transport_exec (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ... ok
    test_cancel_post_init (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ... ok
    test_cancel_process_wait (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ... ok
    test_close_dont_kill_finished (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ... ok
    ...Timeout (3:15:00)!
    Thread 0x00007f666a7a1740 (most recent call first):
    File "/home/buildbot/buildarea/3.x.cstratak-RHEL7-x86_64.refleak/build/Lib/selectors.py", line 468 in select
    File "/home/buildbot/buildarea/3.x.cstratak-RHEL7-x86_64.refleak/build/Lib/asyncio/base_events.py", line 1837 in _run_once
    File "/home/buildbot/buildarea/3.x.cstratak-RHEL7-x86_64.refleak/build/Lib/asyncio/base_events.py", line 589 in run_forever
    File "/home/buildbot/buildarea/3.x.cstratak-RHEL7-x86_64.refleak/build/Lib/asyncio/base_events.py", line 621 in run_until_complete
    File "/home/buildbot/buildarea/3.x.cstratak-RHEL7-x86_64.refleak/build/Lib/test/test_asyncio/test_subprocess.py", line 484 in test_close_kill_running

    @vstinner vstinner added the 3.9, tests, topic-asyncio labels Sep 30, 2019
    @vstinner
    Member Author

    vstinner commented Oct 7, 2019

    Fail on x86 Gentoo Refleaks 3.x:
    https://buildbot.python.org/all/#/builders/1/builds/737

    (...)
    Loop <_UnixSelectorEventLoop running=False closed=True debug=False> that handles pid 8654 is closed
    .Timeout (3:15:00)!
    Thread 0xb7b92700 (most recent call first):
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/selectors.py", line 468 in select
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/asyncio/base_events.py", line 1837 in _run_once
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/asyncio/base_events.py", line 589 in run_forever
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/asyncio/base_events.py", line 621 in run_until_complete
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/test_asyncio/test_subprocess.py", line 484 in test_close_kill_running
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/case.py", line 616 in _callTestMethod
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/case.py", line 659 in run
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/case.py", line 719 in __call__
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 122 in run
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 84 in __call__
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 122 in run
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 84 in __call__
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 122 in run
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 84 in __call__
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 122 in run
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 84 in __call__
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 122 in run
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 84 in __call__
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/support/testresult.py", line 162 in run
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/support/__init__.py", line 2032 in _run_suite
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/support/__init__.py", line 2128 in run_unittest
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/runtest.py", line 209 in _test_module
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/refleak.py", line 87 in dash_R
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/runtest.py", line 232 in _runtest_inner2
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/runtest.py", line 270 in _runtest_inner
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/runtest.py", line 153 in _runtest
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/runtest.py", line 193 in runtest
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/runtest_mp.py", line 67 in run_tests_worker
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/main.py", line 654 in _main
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/main.py", line 634 in main
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/main.py", line 712 in main
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/regrtest.py", line 43 in _main
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/regrtest.py", line 47 in <module>
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/runpy.py", line 85 in _run_code
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/runpy.py", line 192 in _run_module_as_main

    == Tests result: FAILURE ==

    410 tests OK.

    10 slowest tests:

    • test_concurrent_futures: 23 min 2 sec
    • test_multiprocessing_spawn: 22 min 6 sec
    • test_multiprocessing_forkserver: 15 min 51 sec
    • test_multiprocessing_fork: 10 min 15 sec
    • test_regrtest: 10 min 10 sec
    • test_gdb: 8 min 26 sec
    • test_lib2to3: 7 min 31 sec
    • test_pickle: 7 min
    • test_io: 6 min 50 sec
    • test_subprocess: 6 min 37 sec

    1 test failed:
    test_asyncio

    8 tests skipped:
    test_devpoll test_kqueue test_msilib test_startfile
    test_winconsoleio test_winreg test_winsound test_zipfile64
    4:25:59 load avg: 1.55
    4:25:59 load avg: 1.55 Re-running failed tests in verbose mode
    4:25:59 load avg: 1.55 Re-running test_asyncio in verbose mode
    (...)
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ... Timeout (3:15:00)!
    Thread 0xb7c51700 (most recent call first):
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/selectors.py", line 468 in select
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/asyncio/base_events.py", line 1837 in _run_once
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/asyncio/base_events.py", line 589 in run_forever
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/asyncio/base_events.py", line 621 in run_until_complete
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/test_asyncio/test_subprocess.py", line 484 in test_close_kill_running
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/case.py", line 616 in _callTestMethod
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/case.py", line 659 in run
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/case.py", line 719 in __call__
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 122 in run
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 84 in __call__
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 122 in run
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 84 in __call__
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 122 in run
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 84 in __call__
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 122 in run
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 84 in __call__
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 122 in run
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/suite.py", line 84 in __call__
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/unittest/runner.py", line 176 in run
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/support/__init__.py", line 2032 in _run_suite
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/support/__init__.py", line 2128 in run_unittest
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/runtest.py", line 209 in _test_module
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/refleak.py", line 87 in dash_R
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/runtest.py", line 232 in _runtest_inner2
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/runtest.py", line 270 in _runtest_inner
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/runtest.py", line 153 in _runtest
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/runtest.py", line 193 in runtest
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/main.py", line 318 in rerun_failed_tests
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/main.py", line 691 in _main
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/main.py", line 634 in main
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/libregrtest/main.py", line 712 in main
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/__main__.py", line 2 in <module>
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/runpy.py", line 85 in _run_code
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/runpy.py", line 192 in _run_module_as_main
    make: *** [Makefile:1163: buildbottest] Error 1

    @vstinner
    Member Author

    Also on x86 Gentoo Refleaks 3.x:

    https://buildbot.python.org/all/#/builders/1/builds/754

    6:11:43 load avg: 1.44 [419/419/2] test_asyncio crashed (Exit code 1)
    beginning 6 repetitions
    123456
    Timeout (3:15:00)!
    Thread 0xb7c4d700 (most recent call first):
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/selectors.py", line 468 in select
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/asyncio/base_events.py", line 1837 in _run_once
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/asyncio/base_events.py", line 589 in run_forever
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/asyncio/base_events.py", line 621 in run_until_complete
    File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/test_asyncio/test_subprocess.py", line 484 in test_close_kill_running

    @vstinner vstinner changed the title test_ayncio: test_close_kill_running() hangs on AMD64 RHEL7 Refleaks 3.x test_asyncio: test_close_kill_running() hangs on AMD64 RHEL7 Refleaks 3.x Oct 22, 2019
    @vstinner
    Member Author

    I'm able to reproduce the issue locally using the command:

    ./python -m test test_asyncio --match=test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests.test_close_kill_running -v -F -j20 --timeout=30.0

    Example:

    ...
    0:00:47 load avg: 14.35 [287] test_asyncio passed -- running: test_asyncio (31.1 sec)
    0:00:47 load avg: 14.35 [288] test_asyncio passed -- running: test_asyncio (31.3 sec)
    0:00:47 load avg: 14.35 [289/1] test_asyncio crashed (Exit code 1)
    Timeout (0:00:30)!
    Thread 0x00007f3bbd098740 (most recent call first):
    File "/home/vstinner/python/master/Lib/selectors.py", line 468 in select
    File "/home/vstinner/python/master/Lib/asyncio/base_events.py", line 1837 in _run_once
    File "/home/vstinner/python/master/Lib/asyncio/base_events.py", line 589 in run_forever
    File "/home/vstinner/python/master/Lib/asyncio/base_events.py", line 621 in run_until_complete
    File "/home/vstinner/python/master/Lib/test/test_asyncio/test_subprocess.py", line 484 in test_close_kill_running
    ...

    I added some debug traces.

    Logs when it works:
    ---
    0:01:39 load avg: 21.21 [512] test_asyncio passed -- running: test_asyncio (31.0 sec)
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ... MultiLoopChildWatcher: attach_loop <_UnixSelectorEventLoop running=False closed=False debug=False>
    MultiLoopChildWatcher: SIGCHLD signal registered
    MultiLoopChildWatcher: attach_loop None
    MultiLoopChildWatcher: attach_loop None
    MultiLoopChildWatcher: SIGCHLD signal registered
    ok

    ----------------------------------------------------------------------

    Ran 1 test in 0.054s

    OK
    MultiLoopChildWatcher: _do_waitpid 27887
    MultiLoopChildWatcher: _do_waitpid (0, 0)
    kill 27887 9
    MultiLoopChildWatcher: _sig_chld
    MultiLoopChildWatcher: _do_waitpid_all
    MultiLoopChildWatcher: _do_waitpid 27887
    MultiLoopChildWatcher: _do_waitpid (27887, 9)
    MultiLoopChildWatcher: _do_waitpid: call (<bound method _UnixSelectorEventLoop._child_watcher_callback of <_UnixSelectorEventLoop running=True closed=False debug=False>>, <_UnixSelectorEventLoop running=True closed=False debug=False>)
    ---

    Logs when it hangs:
    ---
    0:01:40 load avg: 21.21 [514/1] test_asyncio crashed (Exit code 1)
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ... MultiLoopChildWatcher: attach_loop <_UnixSelectorEventLoop running=False closed=False debug=False>
    MultiLoopChildWatcher: SIGCHLD signal registered
    MultiLoopChildWatcher: _do_waitpid 26010
    MultiLoopChildWatcher: _do_waitpid (0, 0)
    kill 26010 9
    Timeout (0:00:30)!
    Thread 0x00007f4b73f35740 (most recent call first):
    File "/home/vstinner/python/master/Lib/selectors.py", line 468 in select
    File "/home/vstinner/python/master/Lib/asyncio/base_events.py", line 1837 in _run_once
    File "/home/vstinner/python/master/Lib/asyncio/base_events.py", line 589 in run_forever
    File "/home/vstinner/python/master/Lib/asyncio/base_events.py", line 621 in run_until_complete
    File "/home/vstinner/python/master/Lib/test/test_asyncio/test_subprocess.py", line 484 in test_close_kill_running
    ...
    ---

    It seems like sometimes, MultiLoopChildWatcher._sig_chld() is not called.
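    For context, the SIGCHLD-driven reaping that MultiLoopChildWatcher builds on can be sketched as follows. This is a hypothetical, minimal, POSIX-only illustration, not asyncio's actual implementation; if the Python-level handler never runs, as the traces above suggest, the exit status is never collected and anything waiting on the child hangs.

    ```python
    import os
    import signal
    import time

    reaped = []

    def on_sigchld(signum, frame):
        # Reap every exited child without blocking; (0, 0) means some
        # children are still running, ECHILD means there are none left.
        while True:
            try:
                pid, status = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                return
            if pid == 0:
                return
            reaped.append((pid, os.WEXITSTATUS(status)))

    signal.signal(signal.SIGCHLD, on_sigchld)

    child = os.fork()
    if child == 0:
        os._exit(7)

    # Wait (with a timeout) for the handler to observe the child's exit.
    # In the hang being debugged, the equivalent of this wait never ends.
    deadline = time.monotonic() + 5
    while not reaped and time.monotonic() < deadline:
        time.sleep(0.01)

    print(reaped)
    ```

    The watcher-style design hinges entirely on the handler running for every delivered SIGCHLD; a single missed invocation leaves an unreaped child and a permanently pending future.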

    @vstinner vstinner changed the title test_asyncio: test_close_kill_running() hangs on AMD64 RHEL7 Refleaks 3.x asyncio: MultiLoopWatcher has a race condition (test_asyncio: test_close_kill_running() hangs on AMD64 RHEL7 Refleaks 3.x) Oct 22, 2019
    @vstinner
    Member Author

    It seems like MultiLoopChildWatcher doesn't respect PEP 475: siginterrupt(False) must not be used. But the following change isn't enough to fix this issue (test_close_kill_running still hangs).

    diff --git a/Lib/asyncio/unix_events.py b/Lib/asyncio/unix_events.py
    index d8f653045a..887f837bad 100644
    --- a/Lib/asyncio/unix_events.py
    +++ b/Lib/asyncio/unix_events.py
    @@ -1192,9 +1192,6 @@ class MultiLoopChildWatcher(AbstractChildWatcher):
                                    "restore to default handler on watcher close.")
                     self._saved_sighandler = signal.SIG_DFL
     
    -            # Set SA_RESTART to limit EINTR occurrences.
    -            signal.siginterrupt(signal.SIGCHLD, False)
    -
         def _do_waitpid_all(self):
             for pid in list(self._callbacks):
                 self._do_waitpid(pid)
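    The effect of the removed siginterrupt(signal.SIGCHLD, False) call (i.e. requesting SA_RESTART) can be sketched with SIGALRM, which keeps the demo self-contained. This is an illustration of the syscall-restart behavior under assumed 1 s/2 s timings, not asyncio code: with SA_RESTART the kernel restarts a blocking read() after the signal instead of failing it with EINTR, so the Python-level handler only runs once the syscall finally returns.

    ```python
    import os
    import signal
    import threading
    import time

    hits = []
    signal.signal(signal.SIGALRM, lambda signum, frame: hits.append(time.monotonic()))
    signal.siginterrupt(signal.SIGALRM, False)   # request SA_RESTART for SIGALRM

    r, w = os.pipe()
    # Unblock the read about 2 seconds from now; the alarm fires after 1 second.
    threading.Timer(2.0, os.write, args=(w, b"x")).start()
    signal.alarm(1)

    start = time.monotonic()
    # SIGALRM arrives at ~1 s while read() blocks; with SA_RESTART the
    # syscall is restarted, so the Python handler's timestamp lands near
    # the 2 s mark, when the read returns and the interpreter regains control.
    data = os.read(r, 1)
    elapsed = time.monotonic() - start

    print(data, round(elapsed, 1), len(hits))
    ```

    With siginterrupt(sig, True) instead, the read fails with EINTR at ~1 s, Python runs the handler immediately, and PEP 475 retries the syscall, which is why delaying handler execution via SA_RESTART is suspect in a watcher that must react promptly to SIGCHLD.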

    @aeros
    Contributor

    aeros commented Oct 25, 2019

    I'm able to reproduce the issue locally using the command:
    ./python -m test test_asyncio --match=test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests.test_close_kill_running -v -F -j20 --timeout=30.0

    I was unable to reproduce this locally on Arch Linux 5.3.7 x86_64 (CPU: Intel i5-4460) from the latest commit (heads/master:96b06aefe2) within a reasonable number of tests (~10000). I also tried differing numbers of jobs, up to 100. What OS were you able to reproduce it on locally using that configuration?

    This is of course still an issue, since it's still occurring intermittently on multiple buildbots and Victor was able to reproduce it locally, but I was not able to do so on my setup. This will be significantly harder to debug without a reliable means of reproducing the failure.

    @vstinner
    Member Author

    I did my tests on Fedora 30 on my laptop, which has 4 cores (8 logical CPUs).

    @vstinner
    Member Author

    The test still hangs randomly. Can it be disabled or fixed?

    s390x Debian 3.x:
    https://buildbot.python.org/all/#builders/105/builds/167

    test_close_dont_kill_finished (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ... ok
    Timeout (0:15:00)!
    Thread 0x000003ffaa9f8700 (most recent call first):
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/selectors.py", line 468 in select
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/asyncio/base_events.py", line 1852 in _run_once
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/asyncio/base_events.py", line 596 in run_forever
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/asyncio/base_events.py", line 629 in run_until_complete
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/test_asyncio/test_subprocess.py", line 485 in test_close_kill_running
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/unittest/case.py", line 616 in _callTestMethod
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/unittest/case.py", line 659 in run
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/unittest/case.py", line 719 in __call__
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/unittest/suite.py", line 122 in run
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/unittest/suite.py", line 84 in __call__
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/unittest/suite.py", line 122 in run
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/unittest/suite.py", line 84 in __call__
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/unittest/suite.py", line 122 in run
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/unittest/suite.py", line 84 in __call__
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/unittest/suite.py", line 122 in run
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/unittest/suite.py", line 84 in __call__
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/unittest/suite.py", line 122 in run
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/unittest/suite.py", line 84 in __call__
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/unittest/runner.py", line 176 in run
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/support/__init__.py", line 2079 in _run_suite
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/support/__init__.py", line 2201 in run_unittest
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/libregrtest/runtest.py", line 209 in _test_module
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/libregrtest/runtest.py", line 234 in _runtest_inner2
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/libregrtest/runtest.py", line 270 in _runtest_inner
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/libregrtest/runtest.py", line 153 in _runtest
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/libregrtest/runtest.py", line 193 in runtest
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/libregrtest/main.py", line 318 in rerun_failed_tests
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/libregrtest/main.py", line 691 in _main
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/libregrtest/main.py", line 634 in main
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/libregrtest/main.py", line 712 in main
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/__main__.py", line 2 in <module>
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/runpy.py", line 86 in _run_code
    File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/runpy.py", line 193 in _run_module_as_main
    test_close_kill_running (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ... Makefile:1164: recipe for target 'buildbottest' failed
    make: *** [buildbottest] Error 1

    @asvetlov
    Contributor

    I'd like to spend time by reproducing the issue locally.
    Please give me a few days.

    @asvetlov
    Contributor

    FYI, I'm able to reproduce the hang by running "python -m test -F test_asyncio -m test_close_kill_running".

    Working on the fix.

    @vstinner
    Member Author

    FYI, I'm able to reproduce the hang by running "python -m test -F test_asyncio -m test_close_kill_running". Working on the fix.

    Any update on this issue? It's still failing randomly on buildbots. Example:
    https://buildbot.python.org/all/#/builders/158/builds/64

    Another option is to skip test_close_kill_running() until it's fixed, which usually means forgetting about it until a user hits the bug for real in production.

    @vstinner
    Member Author

    vstinner commented Feb 7, 2020

    I marked bpo-38182 as a duplicate of this issue: "test_asyncio: SubprocessMultiLoopWatcherTests.test_stdin_stdout() failed on AMD64 FreeBSD 10-STABLE Non-Debug 3.x".

    @vstinner
    Member Author

    This issue is still failing frequently on buildbots. What can be done to stop getting test failures on buildbots?

    @asvetlov
    Contributor

    Victor, sorry.
    I spent about 4 days in total trying to figure out the reason for the hang.
    I see that the signal handler is installed; but, in a rare situation, it is not called when the process is killed.

    If I reduce the amount of code executed in the signal handler -- by spawning an auxiliary thread, setting a threading.Event in the signal handler, and doing all the dirty work in this secondary thread -- the problem goes away.
    I'm not happy with this "patch" because it looks like reducing the chance of the race rather than eliminating the root cause.

    I've put together a quick hack to skip the problematic test on a hang; it may unblock other developers.
    I want to spend more time on diagnosing the issue. Unfortunately, that is not easy because the problem source is non-deterministic.

    @vstinner
    Member Author

    vstinner commented May 4, 2020

    Issue open since 2019-09-30 and tests still hang randomly. What's the progress on this issue?

    @aeros
    Contributor

    aeros commented May 5, 2020

    I have also explored a few different solutions and was unable to find a fix for this issue. If it's still causing intermittent hangs after Andrew's patch, we may want to consider skipping test_close_kill_running for MultiLoopWatcher until we can find one. This clearly seems to be a complex issue without an easy solution, and I don't think we want it to potentially mask other issues in the meantime.

    @vstinner
    Member Author

    vstinner commented May 6, 2020

    we may want to consider skipping test_close_kill_running for MultiLoopWatcher until we can find one

    There are more MultiLoopWatcher tests that hang randomly; it's not only test_close_kill_running().

    I'm fine with skipping tests until someone can come up with a fix.

    @aeros
    Contributor

    aeros commented May 8, 2020

    There are more MultiLoopWatcher tests that hang randomly; it's not only test_close_kill_running().

    What other MultiLoopWatcher tests are currently having random hangs? From searching "MultiLoopWatcher" on bpo, I'm only finding this issue. For now, I'll just work on a PR to temporarily skip test_close_kill_running() until we can find a fix.

    @aeros
    Contributor

    aeros commented May 8, 2020

    What other MultiLoopWatcher tests are currently having random hangs?

    Oh, never mind. I see test_stdin_stdout is also hanging in https://bugs.python.org/issue38182. I would like to take a closer look at that one before skipping it as well.

    @cjerdonek
    Member

    I looked into this a little after reproducing it locally.

    What I found is that MultiLoopChildWatcher._sig_chld() *is* called. It's just that it's only called immediately after timeout=5 has elapsed. (The timeout=5 was added by Andrew here:

    await asyncio.wait_for(transport._wait(), 5)
    )

    Consider this line in asyncio.tasks.wait_for(), which is used to trigger the timeout:

        timeout_handle = loop.call_later(timeout, _release_waiter, waiter)

    I put some debug statements inside _release_waiter, and I found that _sig_chld() is called after the timeout has elapsed and before _release_waiter starts. So basically, it looks like CPython is holding onto the signal, and waiting for the loop to do something more before running the handler and calling the _sig_chld().

    The code base already has logic to skip running signal handlers in various cases, but I don't know whether it relates to this issue:

    cpython/Python/ceval.c

    Lines 1410 to 1425 in 7f7e706

    /* Few cases where we skip running signal handlers and other
    pending calls:
    - If we're about to enter the 'with:'. It will prevent
    emitting a resource warning in the common idiom
    'with open(path) as file:'.
    - If we're about to enter the 'async with:'.
    - If we're about to enter the 'try:' of a try/finally (not
    *very* useful, but might help in some cases and it's
    traditional)
    - If we're resuming a chain of nested 'yield from' or
    'await' calls, then each frame is parked with YIELD_FROM
    as its next opcode. If the user hit control-C we want to
    wait until we've reached the innermost frame before
    running the signal handler and raising KeyboardInterrupt
    (see bpo-30039).
    */

    It seems like there are a number of issues on the tracker related to signals (some solved and others not, e.g. https://bugs.python.org/issue21895 ). So it looks to me like this could point to a deeper issue between asyncio and CPython's signal handling.

    @cjerdonek
    Member

    I'm attaching a stand-alone script that can reproduce the issue. It doesn't use unittest or even MultiLoopChildWatcher.

    It starts an event loop and then repeatedly calls loop.subprocess_exec() with 0.2 seconds in between until the "hang" happens (which shows up as a timeout). I recommend running the script for about 15 seconds, and if the hang doesn't happen, re-run it. You might need to run it a half-dozen or a dozen times to see the hang, but it can also happen right away.

    I'm sure the script can be cleaned up and simplified a lot more. This is just a start. I wanted to see how much of the cruft I could strip out quickly.

    This is what the output looks like after one of the hangs:

    /.../cpython/Lib/subprocess.py:1048: ResourceWarning: subprocess 3282 is still running
    _warn("subprocess %s is still running" % self.pid,
    ResourceWarning: Enable tracemalloc to get the object allocation traceback
    killing pid: 3283
    BaseSubprocessTransport: awaiting in _wait
    _sig_child: started
    releasing waiter: okay
    okay
    [82]: 16.99
    /.../cpython/Lib/subprocess.py:1048: ResourceWarning: subprocess 3283 is still running
    _warn("subprocess %s is still running" % self.pid,
    ResourceWarning: Enable tracemalloc to get the object allocation traceback
    killing pid: 3284
    BaseSubprocessTransport: awaiting in _wait
    _sig_child: started
    releasing waiter: **TIMEOUT**
    not okay: **TIMEOUT**

    You can ignore the ResourceWarnings. You can also see at the end that the _sig_child() handler was called even in the timeout case (right before the loop.call_later(TIMEOUT, ...) callback began).

    @cjerdonek
    Member

    I'm attaching a slightly simpler version of the script.

    @cjerdonek
    Member

    I'm adding Nathaniel in case he can recognize this as one of the signal handler cases he's already come across.

    @njsmith
    Contributor

    njsmith commented May 9, 2020

    I don't have any particular insight into the bug, but I do have an observation: this seems like an awful lot of energy to spend on some code that's not even used by default. Would it make sense to deprecate it and stick with ThreadedChildWatcher?


    @aeros
    Contributor

    aeros commented May 9, 2020

    Would it make sense to deprecate it and stick with ThreadedChildWatcher?

    I had proposed deprecating it in another bpo issue, but it was brought up that MultiLoopWatcher had only been added recently in 3.8. So, it might be a bit too soon to deprecate it already.

    Although I suppose that if there are several complex and difficult-to-resolve issues with the implementation, it may very well be less damaging to deprecate it now rather than after it's gained a decent amount of usage.

    @njsmith
    Contributor

    njsmith commented May 10, 2020

    How do you know that your reproducer is showing the same bug, or anything related to signals? IIUC subprocess waiting by default doesn't involve a signal handler.

    On Sat, May 9, 2020, 14:40 Chris Jerdonek <report@bugs.python.org> wrote:

    Chris Jerdonek <chris.jerdonek@gmail.com> added the comment:

    > this seems like an awful lot of energy to spend on some code that's not even used by default.

    True, but it seems likely to me that this signals :) a deeper, more general issue about CPython / signals (which is why I spent time on it). For example, my reproducer script doesn't use MultiLoopWatcher. I'd like to see if it can be reproduced with signals other than SIGCHLD, too.



    @cjerdonek
    Member

    I came up with the script by (1) running the test locally and seeing the same hang, (2) moving the test function to its own script separate from the unit tests and seeing the same hang, and (3) successively stripping away code while continuing to check for the same hang. So it should be equivalent.

    As for why it's related to signals, it's because of what the script does (it's not waiting on a subprocess). All it does is start an event loop and then do the following repeatedly:

    1. start a subprocess that sleeps indefinitely
    2. create an empty future
    3. set a SIGCHLD handler that calls set_result() on the future
    4. use call_later() to terminate the future after 5 seconds
    5. kill the subprocess
    6. await on the future

    Almost all of the time, (5) completes immediately (because the handler is called immediately). But sometimes, (5) takes 5 seconds (which means the timeout fired). And in the cases it takes 5 seconds, I'm able to observe both that (a) Python received the SIGCHLD right away, and (b) the signal handler only gets called when the loop is woken up by the call_later(). So during the await in (5), it seems like Python is holding onto the signal for 5 seconds without calling its signal handler.

    @cjerdonek
    Member

    1. use call_later() to terminate the future after 5 seconds

    You should read that as "3.5" (I inserted it later).

    @cjerdonek
    Member

    I was able to simplify the script a lot more and continue to reproduce the hang. It's attached as test-kill3.py (80 lines). It doesn't use subprocess_exec() or a watcher anymore -- just subprocess.Popen() followed by popen.kill(), and then awaiting on a future.

    The right amount of time has to elapse between the popen.kill() and the await, so I introduced a random bit of variability in between. The right range / amount of time to put in between probably depends on the machine. (What you want is a narrow range right on the borderline, where sometimes the signal fires right before the await, and sometimes right after.) I also added a printf() statement at the beginning of signalmodule.c's trip_signal(), so I can see in the console whether the signal is firing before or after the await. In the timeout / hang case, the signal will be firing after. The hang is very infrequent with the script, though (less frequent than the original unittest). It can take multiple 1-minute runs.

    It seems similar issues have happened a number of times in the past around the signal-handling code. See e.g. https://bugs.python.org/issue30038 and changes to the neighboring code since then.

    @cjerdonek
    Member

    I think I have a possible explanation for this now.

    In my reproducer script and in the original test, the wakeup file descriptor isn't set. I think this explains why the loop isn't waking up to call SIGCHLD's handler when the signal comes in. The reason the wakeup file descriptor isn't set in the original test is that MultiLoopChildWatcher registers the SIGCHLD handler using signal.signal():

    def attach_loop(self, loop):
        # Don't save the loop but initialize itself if called first time
        # The reason to do it here is that attach_loop() is called from
        # unix policy only for the main thread.
        # Main thread is required for subscription on SIGCHLD signal
        if self._saved_sighandler is None:
            self._saved_sighandler = signal.signal(signal.SIGCHLD, self._sig_chld)

    rather than using loop.add_signal_handler(), which calls signal.set_wakeup_fd():

        signal.set_wakeup_fd(self._csock.fileno())

    Is there a reason MultiLoopChildWatcher.attach_loop() isn't calling loop.add_signal_handler()? Since attach_loop() has to be called from the main thread anyway, it seems like it could be okay.

    I also wonder if the documentation could perhaps be more specific as to the difference between loop.add_signal_handler() and signal.signal(). Currently, it just says callbacks registered with add_signal_handler() are "allowed to interact with the event loop":
    https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.add_signal_handler
    But it doesn't give any warnings about using signal.signal(). For example, it might be worth saying something about the possibility of hangs if add_signal_handler() isn't used.

    @cjerdonek
    Member

    I posted a draft PR for this issue: #20142

    @cjerdonek cjerdonek added the type-bug An unexpected behavior, bug, or error label May 17, 2020
    @jstasiak
    Mannequin

    jstasiak mannequin commented Nov 5, 2020

    FYI your PR 20142 together with my PR #23154 allow me to run the whole test_asyncio test suite on OpenIndiana 5.11:

    $ ./python -m unittest -v test.test_asyncio
    (...)
    Ran 2298 tests in 71.668s

    OK (skipped=52)

    Without your PR, at least one of the tests hangs forever. So this PR improves the Solaris/SunOS/illumos/OpenIndiana side of things as well.

    @asvetlov
    Contributor

    New changeset 66d3b58 by Chris Jerdonek in branch 'master':
    bpo-38323: Add guard clauses in MultiLoopChildWatcher. (GH-22756)
    66d3b58

    @miss-islington
    Contributor

    New changeset bf0eed3 by Miss Islington (bot) in branch '3.9':
    bpo-38323: Add guard clauses in MultiLoopChildWatcher. (GH-22756)
    bf0eed3

    @miss-islington
    Contributor

    New changeset dd262ef by Miss Islington (bot) in branch '3.8':
    bpo-38323: Add guard clauses in MultiLoopChildWatcher. (GH-22756)
    dd262ef

    @vstinner
    Member Author

    vstinner commented Jun 4, 2021

    This issue is not solved. I can still reproduce the hang using:

    ./python -m test test_asyncio -m SubprocessMultiLoopWatcherTests -v -F -j20 --timeout=30.0

    Example with test_cancel_make_subprocess_transport_exec:

    $ ./python -m test test_asyncio -m test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests.test_cancel_make_subprocess_transport_exec -v -F -j20 --timeout=30.0
    ...
    0:00:37 load avg: 10.97 [163] test_asyncio passed -- running: test_asyncio (30.9 sec)
    test_cancel_make_subprocess_transport_exec (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ... ok

    Ran 1 test in 0.032s

    OK
    0:00:37 load avg: 10.97 [164] test_asyncio passed -- running: test_asyncio (31.0 sec)
    test_cancel_make_subprocess_transport_exec (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ... ok

    ----------------------------------------------------------------------

    Ran 1 test in 0.036s

    OK
    0:00:37 load avg: 10.97 [165/1] test_asyncio crashed (Exit code 1)
    test_cancel_make_subprocess_transport_exec (test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests) ...
    Timeout (0:00:30)!
    Thread 0x00007fb583cbd740 (most recent call first):
    File "/home/vstinner/python/main/Lib/selectors.py", line 469 in select
    File "/home/vstinner/python/main/Lib/asyncio/base_events.py", line 1845 in _run_once
    File "/home/vstinner/python/main/Lib/asyncio/base_events.py", line 595 in run_forever
    File "/home/vstinner/python/main/Lib/asyncio/base_events.py", line 628 in run_until_complete
    File "/home/vstinner/python/main/Lib/test/test_asyncio/test_subprocess.py", line 442 in test_cancel_make_subprocess_transport_exec
    ...

    The test completes in less than 1 second, but sometimes it hangs for at least 30 seconds.

    @cjerdonek
    Member

    This issue is not solved.

    Yes, nothing was changed. After diagnosing this issue and trying some things out in a draft PR, my conclusion is that an asyncio maintainer really needs to weigh in on what to do (especially Andrew who authored the class). The reason is that the hang is closely tied to MultiLoopChildWatcher's documented purpose. The only way I could see to fix MultiLoopChildWatcher would change its documented behavior and make it the same as SafeChildWatcher, which would defeat the purpose of having a separate class: #20142 (comment)
    Maybe there is a way to sidestep the hangs in the tests without fixing MultiLoopChildWatcher, but I didn't look into that.

    @vstinner
    Member Author

    vstinner commented Jun 4, 2021

    When I reproduce the test_cancel_make_subprocess_transport_exec() hang, the problem is that the C signal handler is called with SIGCHLD when the child process completes, but the Python signal handler is not called.

    Python is "blocked" in a selector (maybe select.select(), it doesn't matter). I guess that the selector is interrupted by the signal (even though asyncio calls signal.siginterrupt(SIGCHLD, False)), but since the signal handler doesn't raise an exception, the syscall is restarted: see PEP 475.

    I understood that the asyncio event loop only gets the opportunity to call the Python signal handler if there is a pending asyncio event (call_soon, call_timer, event on a tracked FD, whatever). If the signal arrives when the event loop is idle, the Python signal handler will never be called, since the selector is called without a timeout (blocking mode).

    MultiLoopChildWatcher must ensure that the event loop is awakened when it receives a signal, by using signal.set_wakeup_fd(). This is done by _UnixSelectorEventLoop.add_signal_handler(). Maybe MultiLoopChildWatcher could reuse this function rather than calling signal.signal() directly.

    @cjerdonek
    Member

    MultiLoopChildWatcher must ensure that the event loop is awakened when it receives a signal, by using signal.set_wakeup_fd(). This is done by _UnixSelectorEventLoop.add_signal_handler(). Maybe MultiLoopChildWatcher could reuse this function,

    This is the conclusion I came to, too. But changing MultiLoopChildWatcher to use loop.add_signal_handler() would contradict the class's documented purpose and make it the same as SafeChildWatcher. This is why I think a maintainer needs to weigh in. The class's purpose / design might fundamentally not be workable. If it can't be made to work, maybe it should be removed or be documented as susceptible to hangs.

    @pablogsal
    Member

    I'm landing #26542 as a temporary solution while we decide what to do with PR 26536.

    @pablogsal
    Member

    New changeset f171877 by Pablo Galindo in branch 'main':
    bpo-38323: Skip SubprocessMultiLoopWatcherTest as they can hang the test suite (GH-26542)
    f171877

    @miss-islington
    Contributor

    New changeset b3c50b2 by Miss Islington (bot) in branch '3.9':
    bpo-38323: Skip SubprocessMultiLoopWatcherTest as they can hang the test suite (GH-26542)
    b3c50b2

    @pablogsal
    Member

    New changeset 0d441d2 by Miss Islington (bot) in branch '3.10':
    bpo-38323: Skip SubprocessMultiLoopWatcherTest as they can hang the test suite (GH-26542) (GH-26544)
    0d441d2

    @1st1
    Member

    1st1 commented Jun 9, 2021

    MultiLoopChildWatcher must ensure that the event loop is awakened when it receives a signal, by using signal.set_wakeup_fd(). This is done by _UnixSelectorEventLoop.add_signal_handler(). Maybe MultiLoopChildWatcher could reuse this function rather than calling signal.signal() directly.

    I think this is a good idea. MultiLoopChildWatcher could use set_wakeup_fd() with some no-op callback just to wake up the loop.

    @pablogsal
    Member

    New changeset af5fb67 by Miss Islington (bot) in branch '3.8':
    bpo-38323: Skip SubprocessMultiLoopWatcherTest as they can hang the test suite (GH-26542) (GH-26670)
    af5fb67

    @vstinner
    Member Author

    I think this is a good idea. MultiLoopChildWatcher could use set_wakeup_fd() with some no-op callback just to wake up the loop.

    A no-op doesn't solve the issue. It doesn't wake up the event loop. There are only two options to wake up the event loop:

    • Raise an exception in the Python signal handler
    • Generate any event which stops the event loop polling. In practice, it means to write into the event loop self pipe.

    This issue is similar to call_soon() which doesn't wake up the event loop if called from a different thread: you must call call_soon_threadsafe() which... writes into the self pipe. The self pipe again!

    --

    MultiLoopChildWatcher cannot use set_wakeup_fd() if it's also used by an event loop to register a signal handler (like CTRL+C). Python currently supports a single signal wakeup FD.

    The event loop *must* set the wakeup FD to its own self pipe. Otherwise, the asynchronous Python signal handler registered by add_signal_handler() is not called. Extract of the Unix add_signal_handler():

                # Register a dummy signal handler to ask Python to write the signal
                # number in the wakeup file descriptor. _process_self_data() will
                # read signal numbers from this file descriptor to handle signals.
                signal.signal(sig, _sighandler_noop)

    Currently, MultiLoopChildWatcher is not tied to any event loop, so it cannot access such a "self pipe".

    If MultiLoopChildWatcher is modified to be attached to an event loop... doesn't that contradict the whole purpose of MultiLoopChildWatcher?

    @vstinner
    Member Author

    MultiLoopChildWatcher still has a race condition. I see different options:

    • Deprecate/Remove MultiLoopChildWatcher
    • Find a magic fix for MultiLoopChildWatcher. It sounds complicated since it's *designed* to not be related to any event loop. See my failed attempt: PR 26536.
    • Document the limitation and adapt the unit test to work around the issue.

    If an event loop frequently gets events, the worst case is that a child process completion will be notified late, but at least the event is handled. If the event loop is idle, the completion is never notified.

    A workaround is to schedule an asynchronous task which runs frequently, like once per minute. It doesn't even have to do anything. It can be a no-op task.

    I'm not really excited about documenting the limitation and suggesting a hack to work around it. MultiLoopChildWatcher could belong in a PyPI module.

    But for the base asyncio in the stdlib, I would prefer not to ship any known race conditions :-( So I'm more in favor of deprecating/removing it.

    @byllyfish
    Mannequin

    byllyfish mannequin commented Nov 5, 2021

    asyncio.MultiLoopChildWatcher has two problems that create a race condition.

    1. The SIGCHLD signal handler does not guard against interruption/re-entry.
    2. The SIGCHLD signal handler can interrupt add_child_handler's self._do_waitpid(pid).

    Symptoms:

    Log messages that look like this:

    1634935451.761 WARNING Unknown child process pid 8747, will report returncode 255
    ...
    1634935451.762 WARNING Child watcher got an unexpected pid: 8747
    Traceback (most recent call last):
      File "/Users/runner/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/asyncio/unix_events.py", line 1306, in _do_waitpid
        loop, callback, args = self._callbacks.pop(pid)
    KeyError: 8747

    Background:

    I've been working on a library to make calling asyncio subprocesses more convenient. As part of my testing, I've been stress testing asyncio using different child watcher policies. My CI build runs more than 200 tests each on macOS, Linux and FreeBSD. I've noticed a small percentage of sporadic failures using MultiLoopChildWatcher.

    My understanding of Python signal functions is that:

    1. Upon receipt of a signal, the native "C" signal handler sets a flag that indicates the signal arrived.
    2. The main thread checks the signal flags after each bytecode instruction. If a signal flag is set, Python saves the call stack, runs the signal handler on the main thread immediately, then pops the stack when it returns (assuming no exception raised by signal handler).
    3. If you are in the middle of a signal handler running on the main thread and Python detects another signal flag, your signal handler can be interrupted.
    4. Stacked signal handlers run in LIFO order. The last one that enters will run to completion without interruption.

    Explanation:

    I wrapped MultiLoopChildWatcher's _sig_chld function in a decorator that logs when it is entered and exited. This clearly shows that _sig_chld is being re-entered. In the following log snippet, I'm running two commands in a pipeline, "tr | cat".

    1634935451.743 DEBUG process '/usr/bin/tr' created: pid 8747
    ...
    1634935451.746 DEBUG process '/bin/cat' created: pid 8748
    ...
    1634935451.761 DEBUG enter '_sig_chld' 20
    1634935451.761 DEBUG enter '_sig_chld' 20
    1634935451.761 WARNING Unknown child process pid 8747, will report returncode 255
    1634935451.762 DEBUG process 8748 exited with returncode 0
    1634935451.762 DEBUG exit '_sig_chld' 20
    1634935451.762 WARNING Child watcher got an unexpected pid: 8747
    Traceback (most recent call last):
      File "/Users/runner/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/asyncio/unix_events.py", line 1306, in _do_waitpid
        loop, callback, args = self._callbacks.pop(pid)
    KeyError: 8747
    1634935451.763 WARNING Unknown child process pid 8748, will report returncode 255
    1634935451.763 WARNING Child watcher got an unexpected pid: 8748
    Traceback (most recent call last):
      File "/Users/runner/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/asyncio/unix_events.py", line 1306, in _do_waitpid
        loop, callback, args = self._callbacks.pop(pid)
    KeyError: 8748
    1634935451.763 DEBUG exit '_sig_chld' 20

    Here is the breakdown of what happens:

    1. Pid 8747 exits and we enter _sig_chld #1.
    2. _sig_chld #1 calls os.waitpid, which gives (pid, status) = (8747, 0).
    3. Before _sig_chld #1 has a chance to call self._callbacks.pop(pid), it is interrupted.
    4. _sig_chld #2 calls os.waitpid for pid 8747. We get a ChildProcessError and then "Unknown child process pid 8747, will report returncode 255".
    5. _sig_chld #2 invokes the callback for pid 8747 saying the returncode=255.
    6. _sig_chld #2 continues to completion. It reaps pid 8748 normally.
    7. _sig_chld #1 picks up where it left off. We get an error when we try to pop the callback for 8747.
    8. _sig_chld #1 calls os.waitpid for pid 8748. This gives us failure messages because it was already reaped by _sig_chld #2.

    The issue of interruption can also happen in the case of running a single process. If the _sig_chld interrupts the call to self._do_waitpid(pid) in add_child_handler, a similar interleaving can occur.

    Work-Around:

    In my tests, I patched MultiLoopChildWatcher and so far, it appears to be more reliable. In add_child_handler, I call raise_signal(SIGCHLD) so that all the work is done in the signal handler.

    import asyncio
    import functools
    import signal
    import threading

    class PatchedMultiLoopChildWatcher(asyncio.MultiLoopChildWatcher):
        "Test race condition fixes in MultiLoopChildWatcher."
    
        def add_child_handler(self, pid, callback, *args):
            loop = asyncio.get_running_loop()
            self._callbacks[pid] = (loop, callback, args)
    
            # Prevent a race condition in case signal was delivered before
            # callback added.
            signal.raise_signal(signal.SIGCHLD)
    
        @_serialize
        def _sig_chld(self, signum, frame):
            super()._sig_chld(signum, frame)

    _serialize is a decorator that looks like this:

    def _serialize(func):
        """Decorator to serialize a non-reentrant signal function.
        If one client is already in the critical section, set a flag to run the
        section one more time. Testing purposes only.
        """
    
        lock = threading.Lock()  # Used as atomic test-and-set.
        retry = False
    
        @functools.wraps(func)
        def _decorator(*args, **kwargs):
            nonlocal retry
    
            while True:
                if lock.acquire(blocking=False):  # pylint: disable=consider-using-with
                    try:
                        retry = False
                        func(*args, **kwargs)
                    finally:
                        lock.release()
                    if retry:
                        continue
                else:
                    # A signal handler that interrupts an existing handler will
                    # run to completion (LIFO).
                    retry = True
                break
    
        return _decorator

    @kumaraditya303
    Contributor

    MultiLoopChildWatcher is deprecated now, so closing as "won't fix".

    @kumaraditya303 closed this as not planned (won't fix, can't repro, duplicate, stale) on Oct 9, 2022