Expose scheduler idle via RPC and HTTP API #7642
Conversation
cc @ntabris we talked about the HTTP API at some point. This is just an example of how this API can be used. Nothing to do for you.
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

26 files +1   26 suites +1   13h 8m 9s ⏱️ +1h 9m 59s

For more details on these failures and errors, see this check.

Results for commit 04be190. ± Comparison against base commit fd4b91b.

♻️ This comment has been updated with latest results.
Just noting that the `distributed/tests/test_scheduler.py::test_idle_timeout_no_workers` failures here seem to be related to the changes in this PR (that test isn't showing up in the flaky test failure report, at least).
Fixed up the test and it's passing consistently for me again locally.
Thanks @jacobtomlinson
Thanks for the merge @jrbourbeau and @fjetter for the review.
@@ -3643,6 +3643,7 @@ def __init__(
             "dump_cluster_state_to_url": self.dump_cluster_state_to_url,
             "benchmark_hardware": self.benchmark_hardware,
             "get_story": self.get_story,
+            "check_idle": self.check_idle,
@jacobtomlinson just curious, where is this RPC used? I don't see a corresponding client method, or use on workers. Are you planning to call it in dask-k8s?
Yeah, I'm using it in dask-kubernetes, but only as a fallback if the HTTP API is not enabled.
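Roughly the pattern being described, sketched out below. The `/api/v1/check_idle` route, the `idle_since` response key, and the use of `aiohttp` are assumptions for illustration, not necessarily what dask-kubernetes does verbatim.

```python
# Sketch of "HTTP API first, RPC as fallback". The /api/v1/check_idle path,
# the "idle_since" response key, and the aiohttp client are assumptions.
import aiohttp

from distributed.core import rpc


async def scheduler_idle_since(http_address: str, comm_address: str):
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(f"{http_address}/api/v1/check_idle") as resp:
                resp.raise_for_status()
                data = await resp.json()
                return data.get("idle_since")
    except aiohttp.ClientError:
        # HTTP API routes not enabled (or unreachable): fall back to the RPC.
        async with rpc(comm_address) as scheduler:
            return await scheduler.check_idle()
```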
Good point, I actually missed this. I think "best practice" should be to register this in dask-k8s with a scheduler extension. If I just looked at this code base, it would look like dead code and I might remove it.
We rarely do these refactorings, but there is nothing here that tells us what is actually "public" and what isn't, so there is always a risk.
The fallback of the fallback is to register a scheduler extension so that we can support older versions of distributed. My worry is that scheduler extensions still depend on scheduler internals like `Scheduler.idle_since`, which could also change.
There are tests in dask-kubernetes that cover this, so in terms of risk there is some mitigation. But it would be really nice to try and move towards a world where we have well defined public APIs in terms of the scheduler object (for extensions), the RPC and the HTTP API. Otherwise it is always going to be risky developing against the scheduler.
We could also expose this via the client and treat that as the public API instead of prodding the RPC directly.
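To make the "fallback of the fallback" concrete, a sketch of what such a scheduler extension could look like. It leans on the internal `Scheduler.idle_since` attribute, which is exactly the coupling being worried about here; names and wiring are illustrative only.

```python
# Illustrative scheduler extension that exposes idleness on older distributed
# versions by registering an equivalent handler. It reads the internal
# Scheduler.idle_since attribute, which may change between releases.
from distributed import Scheduler


class IdleExtension:
    def __init__(self, scheduler: Scheduler):
        self.scheduler = scheduler
        scheduler.handlers["check_idle"] = self.check_idle

    def check_idle(self):
        return getattr(self.scheduler, "idle_since", None)


# In practice this would be wired up from dask-kubernetes, e.g. via a scheduler
# preload or the Scheduler(extensions=...) argument; details omitted here.
```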
> There are tests in dask-kubernetes that cover this, so in terms of risk there is some mitigation. But it would be really nice to try and move towards a world where we have well defined public APIs in terms of the scheduler object (for extensions), the RPC and the HTTP API. Otherwise it is always going to be risky developing against the scheduler.
This is a much bigger thing than I want it to be but yes, I would love to see nothing of the actual Scheduler object be public. It is just too big, there are too many small things, it's changing too quickly, ...
> We could also expose this via the client and treat that as the public API instead of prodding the RPC directly.
That's typically something I feel very comfortable with since it is very explicit that this is intended for external usage.
Just to be very clear: it's fine keeping it like this. "Fixing" this API leakage is very hard. I'm open to any suggestions, but handing users (both in extensions and especially in plugins) the entire object isn't maintainable, particularly not as long as there is no proper discipline around "using underscores".
> This is a much bigger thing than I want it to be but yes, I would love to see nothing of the actual Scheduler object be public.
I totally sympathise with that, although scheduler plugins/extensions/`run_on_scheduler` are all well-established patterns, so backing out from those at this point will be hard.
> That's typically something I feel very comfortable with since it is very explicit that this is intended for external usage.
Most, if not all, of the cluster managers open an RPC to the scheduler and invoke methods directly. I think part of the challenge here is that some cluster managers live in `distributed` and some live in `dask/dask-{kubernetes,jobqueue,yarn,gateway,cloudprovider,etc}`.
For a long time the `dask-foo` projects were considered part of core Dask, and it wasn't unusual to make coupled PRs to both distributed and the cluster manager you were working on. However, this is a maintenance challenge because tests for code in `distributed` run in other repos.
Perhaps a good step forward would be for `dask-foo` projects to only interact with the scheduler via the `Client` and the HTTP API. These are API surfaces that we can consider to be public.
The question is what do we do with the base `Cluster` and `SpecCluster` classes in `distributed`? These both use the RPC directly and are intended to be subclassed by third-party libraries. Maybe these classes should be updated to stop using the RPC and to use a `Client` instead?
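As one concrete illustration of that direction, the idleness question could already be asked through a `Client` using the long-established `run_on_scheduler` hook; this still reads the internal `idle_since` attribute, so it is a sketch rather than a recommended public pattern.

```python
# Sketch: querying idleness through a Client rather than a raw scheduler RPC.
# run_on_scheduler injects the Scheduler instance as `dask_scheduler`; the
# idle_since attribute it reads remains a scheduler internal.
from distributed import Client


def idle_since(dask_scheduler=None):
    return getattr(dask_scheduler, "idle_since", None)


with Client("tcp://127.0.0.1:8786") as client:  # placeholder address
    print("Scheduler idle since:", client.run_on_scheduler(idle_since))
```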
* Add autoCleanupTimeout option
* Add idle_timeout
* Update to use dask/distributed#7642
* Rename spec keyword
* Updated with changes in distributed
* Nudge CI
* Set idle_timeout longer than resource_timeout
* Rename cluster to avoid collision
* Fix timeout default value
* Reduce logging noise and add warning when autoshutdown can't reach scheduler
xref dask/dask-kubernetes#672
It would be really helpful in `dask-kubernetes` if it were possible to ask the scheduler whether it is idle and how long it has been idle for.

This PR tweaks the `check_idle` method in `Scheduler` so that it returns whether it is idle and for how long, and exposes that method via the RPC. Also, the periodic callback that calls this to watch for idleness was only being set up if the `idle_timeout` was configured; I've changed it so that it is always running, and if the timeout isn't set it checks 4 times a second.

I also added an API method to the HTTP API to expose this so it can be called by the `dask-kubernetes` operator controller (although it will use the RPC as a fallback).

I've added a test for the HTTP API and tweaked the existing `check_idle` tests to cover the changes here.

* `pre-commit run --all-files`
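For reference, the timeout mentioned above is the existing `distributed.scheduler.idle-timeout` setting; a quick sketch of configuring it (the value here is an arbitrary example).

```python
# Arbitrary example: ask the scheduler to shut itself down after one hour idle.
# Note this must be set in the environment/process where the scheduler starts.
import dask

dask.config.set({"distributed.scheduler.idle-timeout": "1h"})
```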