Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove all invocations of IOLoop.run_sync from CLI #6205

Merged
merged 36 commits into from
May 18, 2022

Conversation

hendrikmakait
Copy link
Member

@hendrikmakait hendrikmakait commented Apr 26, 2022

This PR removes the usage of IOLoop.run_sync from dask_scheduler.py and dask_worker.py and adds tests for graceful shutdown. It also adds a graceful shutdown to the scheduler which previously ran into a hard scheduler.stop() instead. I have made slight changes to the logging, i.e. we always log the received signal.

Out of scope:

  • Graceful shutdown in dask_spec.py
  • Graceful shutdown of workers used without nannies in dask_worker.py
  • Preserving behavior of preload modules if they are using loops about whether or not they are initialized before the loop is created.

@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

@quasiben
Copy link
Member

add to allowlist

@github-actions
Copy link
Contributor

github-actions bot commented Apr 26, 2022

Unit Test Results

       15 files   -        1         15 suites   - 1   7h 13m 40s ⏱️ + 23m 8s
  2 800 tests +     31    2 720 ✔️ +  33    78 💤  -   3  2 +1 
20 762 runs   - 1 060  19 839 ✔️  - 982  920 💤  - 79  3 +1 

For more details on these failures, see this check.

Results for commit edc6893. ± Comparison against base commit e390609.

♻️ This comment has been updated with latest results.

with popen(["dask-worker", s.address, nanny], flush_output=False) as worker:
await c.wait_for_workers(1)
worker.send_signal(signal.SIGINT)
logs = [worker.stdout.readline().decode().lower() for _ in range(25)]
Copy link
Member

@graingert graingert Apr 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think worker.communicate() is better here - because you need to consume stdout and stderr concurrently

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done but leads to the issue with with popen laid out in another comment.

async def test_ctrl_break_event(c, s, nanny):
try:
worker = subprocess.Popen(
["dask-worker", s.address, nanny],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beware, a developer running tests may have a different dask-worker executable on their path. This is actually pretty common because people install a released version of Dask, but then don't properly install the git version.

To resolve this you could use python -m distributed.cli.dask_worker or you could use the popen context manager in utils_test.py

You might look at using popen regardless. It has some advantages like printing out stdout/stderr. It's also nice to be consistent across tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using with popen, I ran into an issue where my manual use of .communicate() on the yielded Popen object to obtain the contents of stdout and stderr closed stdout and the popen context manager attempted to do the same on __exit__, running into a ValueError: Read of a closed file. If we want to use with popen consistently, one way to avoid manual use of .communicate() while allowing the user to obtain stdout and stderr might be to set stdout and stderr on the yielded object object here instead of writing to local variables that will be dropped after.

if flush_output:
out, err = flush_future.result()
ex.shutdown()
else:
out, err = proc.communicate()
assert not err

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think once these tests are working, we'll be able to see what the common parts of utils_test.popen can be refactored out


logger = logging.getLogger(__name__)
Copy link
Member Author

@hendrikmakait hendrikmakait Apr 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to pass a logger into wait_for_signal or refactor the logging into the user of it. Otherwise, this will log to distributed.cli.utils.

distributed/cli/utils.py Outdated Show resolved Hide resolved
Co-authored-by: Thomas Grainger <tagrain@gmail.com>
Co-authored-by: Thomas Grainger <tagrain@gmail.com>
Copy link
Member

@graingert graingert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I'd like another set of eyes on this if possible

distributed/cli/utils.py Outdated Show resolved Hide resolved
Copy link
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change LGTM. Did anybody confirm that KeyboardInterrupts are indeed still working on windows?

@hendrikmakait
Copy link
Member Author

@fjetter: I verified it with @sjperkins in an intermediate version and we have since mainly fixed edge cases. I can setup a Windows environment to verify functionality on the final version.

@hendrikmakait
Copy link
Member Author

@fjetter: Windows behaves the same way as Ubuntu/macOS.

@fjetter
Copy link
Member

fjetter commented May 18, 2022

Thanks for checking

@fjetter fjetter merged commit 0ff8c29 into dask:main May 18, 2022
@jacobtomlinson
Copy link
Member

It looks like we are using install_signal_handlers in some downstream places. I've noticed issues in dask-ctl this morning. I imagine I copy/pasted the IOLoop code from distributed rather than consciously choosing to use it.

https://github.com/dask-contrib/dask-ctl/blob/ef87eecf0c837ac83000a39456a17576ee7cc08f/dask_ctl/utils.py

@hendrikmakait what is the migration plan that I should follow? Am I safe to just remove the call or do I need to replace it with something else?

@hendrikmakait
Copy link
Member Author

@jacobtomlinson, if this breaks stuff downstream, I'd suggest we store the function and deprecate it instead of dropping it. Thanks for the heads-up, I wasn't aware of any downstream dependencies in other repos.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove all invocations of IOLoop.run_sync from CLI
9 participants