Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't connect to cluster subprocesses at shutdown #6829

Merged
merged 2 commits into from
Aug 5, 2022

Conversation

gjoseph92
Copy link
Collaborator

Closes #6828. See that issue for explanation.

The one thing we lose with this over the RPC is a clean close call—kill is more abrupt. If we'd like to maintain the clean close, I can easily do that by sending a SIGINT, waiting, then SIGKILL only if necessary.

  • Tests added / passed
  • Passes pre-commit run --all-files

@gjoseph92 gjoseph92 requested review from graingert and fjetter August 4, 2022 16:24
@@ -586,7 +590,7 @@ def cluster(
nanny=False,
worker_kwargs=None,
active_rpc_timeout=10,
disconnect_timeout=20,
shutdown_timeout=20,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This argument name was unused across the codebase, so changing it seems fine.

Comment on lines -685 to -700
async def close():
logger.debug("Closing out test cluster")
alive_workers = [
w["address"]
for w in workers_by_pid.values()
if w["proc"].is_alive()
]
await disconnect_all(
alive_workers,
timeout=disconnect_timeout,
rpc_kwargs=rpc_kwargs,
)
if scheduler.is_alive():
await disconnect(
saddr, timeout=disconnect_timeout, rpc_kwargs=rpc_kwargs
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is that I just deleted this entire finally section (and therefore removed the try block). The stack.callback(_kill_join, scheduler, shutdown_timeout) will accomplish an equivalent thing without RPCs.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 4, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0         15 suites  ±0   6h 53m 55s ⏱️ + 13m 1s
  2 989 tests ±0    2 895 ✔️  - 3       89 💤 ±0  4 +3  1 🔥 ±0 
22 165 runs   - 1  21 111 ✔️  - 5  1 047 💤  - 1  6 +5  1 🔥 ±0 

For more details on these failures and errors, see this check.

Results for commit 993f420. ± Comparison against base commit 4af2d0a.

@gjoseph92
Copy link
Collaborator Author

Failures look unrelated:

proc.join()
def _kill_join(proc, timeout):
proc.kill()
proc.join(timeout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rely on the pytest timeout?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also maybe kill them all at the same time then join them all

@gjoseph92 gjoseph92 merged commit e1f3779 into dask:main Aug 5, 2022
@gjoseph92 gjoseph92 deleted the cluster-contextmanager-no-rpc-at-end branch August 5, 2022 17:10
@gjoseph92 gjoseph92 mentioned this pull request Aug 5, 2022
2 tasks
gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race condition connecting to vs shutting down subprocesses in tests using cluster()
2 participants