Expose scheduler idle via RPC and HTTP API #7642

Merged: @jrbourbeau merged 2 commits into dask:main from jacobtomlinson:expose-check-idle on Mar 13, 2023

Conversation

@jacobtomlinson (Member) commented on Mar 10, 2023:

xref dask/dask-kubernetes#672

It would be really helpful in dask-kubernetes if it were possible to ask the scheduler if it is idle and how long it has been idle for.

This PR tweaks the check_idle method in Scheduler so that it returns whether the scheduler is idle and for how long, and exposes that method via the RPC. Previously, the periodic callback that calls check_idle to watch for idleness was only set up when idle_timeout was configured; I've changed it so that the callback always runs, and when no timeout is set it checks four times a second.
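
Roughly, the new shape looks like this (an illustrative sketch, not the exact implementation; `_is_busy` stands in for the scheduler's real activity checks):

```python
from __future__ import annotations

import time

from tornado.ioloop import PeriodicCallback


class SchedulerSketch:
    """Illustrative excerpt only; the real Scheduler is much larger."""

    def __init__(self, idle_timeout: float | None = None) -> None:
        self.idle_timeout = idle_timeout
        self.idle_since: float | None = None

    def _is_busy(self) -> bool:
        return False  # stand-in for the real checks on tasks and workers

    def close(self) -> None:
        print("idle timeout reached, shutting down")

    def check_idle(self) -> float | None:
        """Return the timestamp since which the scheduler has been idle,
        or None if it is currently busy."""
        if self._is_busy():
            self.idle_since = None
            return None
        if self.idle_since is None:
            self.idle_since = time.time()
        if self.idle_timeout and time.time() > self.idle_since + self.idle_timeout:
            self.close()
        return self.idle_since

    def start_periodic_callbacks(self) -> None:
        # The idle check now runs unconditionally; without a configured
        # timeout it polls four times a second (every 250 ms).
        interval_ms = self.idle_timeout / 4 * 1000 if self.idle_timeout else 250
        PeriodicCallback(self.check_idle, interval_ms).start()
```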

I also added an endpoint to the HTTP API to expose this so it can be called by the dask-kubernetes operator controller (although the controller will use the RPC as a fallback).
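
Calling it could look something like this (assuming the scheduler's HTTP API routes are enabled and served on the default dashboard port; the exact route name mirrors the handler and is an assumption here):

```python
from urllib.request import urlopen

# Hypothetical route name mirroring the new RPC handler.
with urlopen("http://localhost:8787/api/v1/check_idle") as resp:
    print(resp.read().decode())  # idle-since timestamp, or an indication that it's busy
```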

I've added a test for the HTTP API and tweaked existing check_idle tests to cover the changes here.

  • Tests added / passed
  • Passes pre-commit run --all-files

@jacobtomlinson changed the title from "Expose schedule idle via RPC and HTTP API" to "Expose scheduler idle via RPC and HTTP API" on Mar 10, 2023
@fjetter (Member) commented on Mar 10, 2023:

cc @ntabris we talked about the HTTP API at some point. This is just an example of how this API can be used. Nothing to do for you.

@github-actions bot (Contributor) commented on Mar 10, 2023:

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

26 files (+1), 26 suites (+1), runtime 13h 8m 9s ⏱️ (+1h 9m 59s)
3 498 tests (+1): 3 392 ✔️ passed (−1), 103 💤 skipped (±0), 2 ❌ failed (+1), 1 🔥 errored (+1)
44 215 runs (+2 181): 42 144 ✔️ passed (+2 106), 2 068 💤 skipped (+73), 2 ❌ failed (+1), 1 🔥 errored (+1)

For more details on these failures and errors, see this check.

Results for commit 04be190. ± Comparison against base commit fd4b91b.

♻️ This comment has been updated with latest results.

@jrbourbeau (Member) left a comment:

Just noting that the distributed/tests/test_scheduler.py::test_idle_timeout_no_workers failures here seem to be related to the changes in this PR (that test isn't showing up in the flaky test failure report at least)

@jacobtomlinson (Member Author) commented:

Fixed up the test and it's passing consistently for me again locally.

@jrbourbeau (Member) left a review.

@jrbourbeau merged commit 9f88685 into dask:main on Mar 13, 2023
@jacobtomlinson deleted the expose-check-idle branch on March 13, 2023 at 17:19
@jacobtomlinson (Member Author) commented:

Thanks for the merge @jrbourbeau and @fjetter for the review.

jacobtomlinson added a commit to jacobtomlinson/dask-kubernetes that referenced this pull request Mar 13, 2023
ypogorelova pushed a commit to ypogorelova/distributed that referenced this pull request Mar 16, 2023
@@ -3643,6 +3643,7 @@ def __init__(
     "dump_cluster_state_to_url": self.dump_cluster_state_to_url,
     "benchmark_hardware": self.benchmark_hardware,
     "get_story": self.get_story,
+    "check_idle": self.check_idle,
A Collaborator commented on this diff:

@jacobtomlinson just curious, where is this RPC used? I don't see a corresponding client method, or use on workers. Are you planning to call it in dask-k8s?

@jacobtomlinson (Member Author) replied:

Yeah, using it in dask-kubernetes, but only as a fallback if the HTTP API is not enabled.
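
For reference, the RPC fallback amounts to something like this (a sketch; the return-value convention is assumed from the PR description):

```python
import asyncio

from distributed.core import rpc


async def scheduler_idle_since(address: str):
    # "check_idle" is the handler this PR registers on the scheduler.
    async with rpc(address) as scheduler:
        return await scheduler.check_idle()


print(asyncio.run(scheduler_idle_since("tcp://localhost:8786")))
```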

A Member replied:

Good point, I actually missed this. I think "best practice" should be to register this in dask-k8s with a scheduler extension. If I were just reading this code base, it would look like dead code and I might remove it.

We rarely do these refactorings, but there is nothing here that tells us what is actually "public" and what isn't, so there is always a risk.

@jacobtomlinson (Member Author) replied:

The fallback of the fallback is to register a scheduler extension so that we can support older versions of distributed. My worry is that scheduler extensions still depend on scheduler internals like Scheduler.idle_since, which could also change.
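
For older versions it would be something like this hypothetical extension (installed e.g. via a scheduler preload), which illustrates the coupling: it reaches straight into the internal idle_since attribute:

```python
from __future__ import annotations

from distributed import Scheduler


class IdleCheckExtension:
    """Hypothetical fallback: add a check_idle handler on distributed
    versions that predate the built-in one."""

    def __init__(self, scheduler: Scheduler) -> None:
        self.scheduler = scheduler
        scheduler.handlers["check_idle"] = self.check_idle

    def check_idle(self) -> float | None:
        # Depends on the internal Scheduler.idle_since attribute,
        # which is exactly the kind of coupling that could break.
        return getattr(self.scheduler, "idle_since", None)
```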

There are tests in dask-kubernetes that cover this, so in terms of risk there is some mitigation. But it would be really nice to try and move towards a world where we have well defined public APIs in terms of the scheduler object (for extensions), the RPC and the HTTP API. Otherwise it is always going to be risky developing against the scheduler.

We could also expose this via the client and treat that as the public API instead of prodding the RPC directly.
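
i.e. something like this (hypothetical; no such Client method exists today):

```python
from distributed import Client


def check_idle(client: Client):
    # Wrap the scheduler RPC behind the client, mirroring how existing
    # Client methods invoke scheduler handlers.
    return client.sync(client.scheduler.check_idle)
```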

A Member replied:

> There are tests in dask-kubernetes that cover this, so in terms of risk there is some mitigation. But it would be really nice to try and move towards a world where we have well defined public APIs in terms of the scheduler object (for extensions), the RPC and the HTTP API. Otherwise it is always going to be risky developing against the scheduler.

This is a much bigger thing than I want it to be, but yes, I would love to see nothing of the actual Scheduler object be public. It is just too big, there are too many small things, it's changing too quickly, ...

> We could also expose this via the client and treat that as the public API instead of prodding the RPC directly.

That's typically something I feel very comfortable with since it is very explicit that this is intended for external usage.


Just to be very clear: it's fine keeping it like this. "Fixing" this API leakage is very hard. I'm open to any suggestions, but handing users (both in extensions and especially in plugins) the entire object isn't maintainable, particularly not as long as there is no proper discipline around "using underscores".

@jacobtomlinson (Member Author) replied:

> This is a much bigger thing than I want it to be, but yes, I would love to see nothing of the actual Scheduler object be public.

I totally sympathise with that, although scheduler plugins/extensions/run_on_scheduler are all well-established patterns, so backing out of those at this point will be hard.

> That's typically something I feel very comfortable with since it is very explicit that this is intended for external usage.

Most, if not all, of the cluster managers open an RPC to the scheduler and invoke methods directly. I think part of the challenge here is that some cluster managers live in distributed and some live in dask/dask-{kubernetes,jobqueue,yarn,gateway,cloudprovider,etc}.

For a long time the dask-foo projects were considered part of core Dask, and it wasn't unusual to make coupled PRs to both distributed and the cluster manager you were working on. However, this is a maintenance challenge because tests for code in distributed run in other repos.

Perhaps a good step forward would be for dask-foo projects to only interact with the scheduler via the Client and the HTTP API. These are API surfaces that we can consider to be public.

The question is what do we do with the base Cluster and SpecCluster classes in distributed? These both use the RPC directly and are intended to be subclassed by third-party libraries. Maybe these classes should be updated to stop using the RPC and to use a Client instead?

jacobtomlinson added a commit to dask/dask-kubernetes that referenced this pull request Jun 12, 2023
* Add autoCleanupTimeout option

* Add idle_timeout

* Update to use dask/distributed#7642

* Rename spec keyword

* Updated with changes in distributed

* Nudge CI

* Set idle_timeout longer than resource_timeout

* Rename cluster to avoid collision

* Fix timeout default value

* Reduce logging noise and add warning when autoshutdown can't reach scheduler