Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure distributed.comm.core.connect can always be cancelled #6064

Closed
wants to merge 1 commit into from

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Apr 4, 2022

I noticed that tests like test_worker_waits_for_scheduler are not properly running even though the test logic is extremely simple.

The test test_worker_waits_for_scheduler particularly hangs in the asyncio.wait_for statement waiting for the worker to come up.
After some debugging, it appears the worker is stuck in the connect retry implementation trying to connect to the scheduler. This connect should've been cancelled after the wait_for hits its timeout but it wasn't.

It appears there is a race condition when cancelling a task. If the Task is cancelled right after it is finished and the exception is set, it will not raise a CancelledError but the original Exception. I don't know if this is a bug in CPython but it is not the behaviour I anticipated. I wrote the ensure_cancellation wrapper to make sure this behaviour is deterministic

cc @graingert

Closes #5861

@github-actions
Copy link
Contributor

github-actions bot commented Apr 4, 2022

Unit Test Results

       16 files  ±  0         16 suites  ±0   7h 27m 30s ⏱️ - 11m 23s
  2 750 tests +  5    2 663 ✔️ +  2       80 💤 ±0  7 +3 
21 885 runs  +40  20 840 ✔️ +35  1 037 💤 +2  8 +3 

For more details on these failures, see this check.

Results for commit 676bb18. ± Comparison against base commit fc5b460.

♻️ This comment has been updated with latest results.

@fjetter fjetter force-pushed the cancel_core_connect branch from b544ac6 to bda34ea Compare April 7, 2022 12:35
@fjetter fjetter force-pushed the cancel_core_connect branch 4 times, most recently from 5d2fce9 to 73a4da2 Compare April 10, 2022 14:50
Comment on lines +1626 to +1645
task.cancel()
await watcher.wait()
raise
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure if I need to await the task itself here. There are no warnings

cc @graingert

@fjetter fjetter force-pushed the cancel_core_connect branch from 73a4da2 to 8d125a0 Compare April 12, 2022 11:16
],
)
@gen_test()
async def test_connection_pool_cancellation(monkeypatch, closing, backend):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests do not capture this race condition but the rewrite is much cleaner and covers more ground so I'd like to keep it

@fjetter
Copy link
Member Author

fjetter commented Apr 19, 2022

There is a CPython issue which looks like the root cause, see python/cpython#86296

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky test_worker_waits_for_scheduler
1 participant