Expose scheduler idle via RPC and HTTP API #7642

Merged: @jrbourbeau merged 2 commits into dask:main from jacobtomlinson:expose-check-idle on Mar 13, 2023

Conversation

@jacobtomlinson (Member) commented on Mar 10, 2023:

xref dask/dask-kubernetes#672

It would be really helpful in dask-kubernetes if it were possible to ask the scheduler if it is idle and how long it has been idle for.

This PR tweaks the check_idle method in Scheduler so that it returns whether the scheduler is idle and for how long, and exposes that method via the RPC. Previously, the periodic callback that calls check_idle to watch for idleness was only set up when idle_timeout was configured; I've changed it so that the callback always runs, and when no timeout is set it checks four times a second.
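
Roughly, the new shape looks like this (an illustrative sketch, not the exact implementation; `_is_busy` stands in for the scheduler's real activity checks):

```python
from __future__ import annotations

import time

from tornado.ioloop import PeriodicCallback


class SchedulerSketch:
    """Illustrative excerpt only; the real Scheduler is much larger."""

    def __init__(self, idle_timeout: float | None = None) -> None:
        self.idle_timeout = idle_timeout
        self.idle_since: float | None = None

    def _is_busy(self) -> bool:
        return False  # stand-in for the real checks on tasks and workers

    def close(self) -> None:
        print("idle timeout reached, shutting down")

    def check_idle(self) -> float | None:
        """Return the timestamp since which the scheduler has been idle,
        or None if it is currently busy."""
        if self._is_busy():
            self.idle_since = None
            return None
        if self.idle_since is None:
            self.idle_since = time.time()
        if self.idle_timeout and time.time() > self.idle_since + self.idle_timeout:
            self.close()
        return self.idle_since

    def start_periodic_callbacks(self) -> None:
        # The idle check now runs unconditionally; without a configured
        # timeout it polls four times a second (every 250 ms).
        interval_ms = self.idle_timeout / 4 * 1000 if self.idle_timeout else 250
        PeriodicCallback(self.check_idle, interval_ms).start()
```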

I also added an endpoint to the HTTP API to expose this so it can be called by the dask-kubernetes operator controller (although the controller will use the RPC as a fallback).
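
Calling it could look something like this (assuming the scheduler's HTTP API routes are enabled and served on the default dashboard port; the exact route name mirrors the handler and is an assumption here):

```python
from urllib.request import urlopen

# Hypothetical route name mirroring the new RPC handler.
with urlopen("http://localhost:8787/api/v1/check_idle") as resp:
    print(resp.read().decode())  # idle-since timestamp, or an indication that it's busy
```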

I've added a test for the HTTP API and tweaked existing check_idle tests to cover the changes here.

  • Tests added / passed
  • Passes pre-commit run --all-files

@jacobtomlinson changed the title from "Expose schedule idle via RPC and HTTP API" to "Expose scheduler idle via RPC and HTTP API" on Mar 10, 2023
@fjetter (Member) commented on Mar 10, 2023:

cc @ntabris we talked about the HTTP API at some point. This is just an example of how this API can be used. Nothing to do for you.

@github-actions bot (Contributor) commented on Mar 10, 2023:

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

26 files (+1), 26 suites (+1), runtime 13h 8m 9s ⏱️ (+1h 9m 59s)
3 498 tests (+1): 3 392 ✔️ passed (−1), 103 💤 skipped (±0), 2 ❌ failed (+1), 1 🔥 errored (+1)
44 215 runs (+2 181): 42 144 ✔️ passed (+2 106), 2 068 💤 skipped (+73), 2 ❌ failed (+1), 1 🔥 errored (+1)

For more details on these failures and errors, see this check.

Results for commit 04be190. ± Comparison against base commit fd4b91b.

♻️ This comment has been updated with latest results.

@jrbourbeau (Member) left a comment:

Just noting that the distributed/tests/test_scheduler.py::test_idle_timeout_no_workers failures here seem to be related to the changes in this PR (that test isn't showing up in the flaky test failure report at least)

@jacobtomlinson (Member Author) commented:

Fixed up the test and it's passing consistently for me again locally.

@jrbourbeau (Member) left a review.

@jrbourbeau merged commit 9f88685 into dask:main on Mar 13, 2023
@jacobtomlinson deleted the expose-check-idle branch on March 13, 2023 at 17:19
@jacobtomlinson (Member Author) commented:

Thanks for the merge @jrbourbeau and @fjetter for the review.

jacobtomlinson added a commit to jacobtomlinson/dask-kubernetes that referenced this pull request Mar 13, 2023
ypogorelova pushed a commit to ypogorelova/distributed that referenced this pull request Mar 16, 2023
@@ -3643,6 +3643,7 @@ def __init__(
     "dump_cluster_state_to_url": self.dump_cluster_state_to_url,
     "benchmark_hardware": self.benchmark_hardware,
     "get_story": self.get_story,
+    "check_idle": self.check_idle,
A Collaborator commented on this diff:

@jacobtomlinson just curious, where is this RPC used? I don't see a corresponding client method, or use on workers. Are you planning to call it in dask-k8s?

@jacobtomlinson (Member Author) replied:

Yeah, using it in dask-kubernetes, but only as a fallback if the HTTP API is not enabled.
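
For reference, the RPC fallback amounts to something like this (a sketch; the return-value convention is assumed from the PR description):

```python
import asyncio

from distributed.core import rpc


async def scheduler_idle_since(address: str):
    # "check_idle" is the handler this PR registers on the scheduler.
    async with rpc(address) as scheduler:
        return await scheduler.check_idle()


print(asyncio.run(scheduler_idle_since("tcp://localhost:8786")))
```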

A Member replied:

Good point, I actually missed this. I think "best practice" should be to register this in dask-k8s with a scheduler extension. If I were just reading this code base, it would look like dead code and I might remove it.

We rarely do these refactorings, but there is nothing here that tells us what is actually "public" and what isn't, so there is always a risk.

@jacobtomlinson (Member Author) replied:

The fallback of the fallback is to register a scheduler extension so that we can support older versions of distributed. My worry is that scheduler extensions still depend on scheduler internals like Scheduler.idle_since, which could also change.
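
For older versions it would be something like this hypothetical extension (installed e.g. via a scheduler preload), which illustrates the coupling: it reaches straight into the internal idle_since attribute:

```python
from __future__ import annotations

from distributed import Scheduler


class IdleCheckExtension:
    """Hypothetical fallback: add a check_idle handler on distributed
    versions that predate the built-in one."""

    def __init__(self, scheduler: Scheduler) -> None:
        self.scheduler = scheduler
        scheduler.handlers["check_idle"] = self.check_idle

    def check_idle(self) -> float | None:
        # Depends on the internal Scheduler.idle_since attribute,
        # which is exactly the kind of coupling that could break.
        return getattr(self.scheduler, "idle_since", None)
```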

There are tests in dask-kubernetes that cover this, so in terms of risk there is some mitigation. But it would be really nice to try and move towards a world where we have well defined public APIs in terms of the scheduler object (for extensions), the RPC and the HTTP API. Otherwise it is always going to be risky developing against the scheduler.

We could also expose this via the client and treat that as the public API instead of prodding the RPC directly.
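
i.e. something like this (hypothetical; no such Client method exists today):

```python
from distributed import Client


def check_idle(client: Client):
    # Wrap the scheduler RPC behind the client, mirroring how existing
    # Client methods invoke scheduler handlers.
    return client.sync(client.scheduler.check_idle)
```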

A Member replied:

> There are tests in dask-kubernetes that cover this, so in terms of risk there is some mitigation. But it would be really nice to try and move towards a world where we have well defined public APIs in terms of the scheduler object (for extensions), the RPC and the HTTP API. Otherwise it is always going to be risky developing against the scheduler.

This is a much bigger thing than I want it to be, but yes, I would love to see nothing of the actual Scheduler object be public. It is just too big, there are too many small things, it's changing too quickly, ...

> We could also expose this via the client and treat that as the public API instead of prodding the RPC directly.

That's typically something I feel very comfortable with since it is very explicit that this is intended for external usage.


Just to be very clear: it's fine keeping it like this. "Fixing" this API leakage is very hard. I'm open to any suggestions, but handing users (both in extensions and especially in plugins) the entire object isn't maintainable, particularly not as long as there is no proper discipline around "using underscores".

@jacobtomlinson (Member Author) replied:

> This is a much bigger thing than I want it to be, but yes, I would love to see nothing of the actual Scheduler object be public.

I totally sympathise with that, although scheduler plugins/extensions/run_on_scheduler are all well-established patterns, so backing out of those at this point will be hard.

> That's typically something I feel very comfortable with since it is very explicit that this is intended for external usage.

Most, if not all, of the cluster managers open an RPC to the scheduler and invoke methods directly. I think part of the challenge here is that some cluster managers live in distributed and some live in dask/dask-{kubernetes,jobqueue,yarn,gateway,cloudprovider,etc}.

For a long time the dask-foo projects were considered part of core Dask, and it wasn't unusual to make coupled PRs to both distributed and the cluster manager you were working on. However, this is a maintenance challenge because tests for code in distributed run in other repos.

Perhaps a good step forward would be for dask-foo projects to only interact with the scheduler via the Client and the HTTP API. These are API surfaces that we can consider to be public.

The question is what do we do with the base Cluster and SpecCluster classes in distributed? These both use the RPC directly and are intended to be subclassed by third-party libraries. Maybe these classes should be updated to stop using the RPC and to use a Client instead?

jacobtomlinson added a commit to dask/dask-kubernetes that referenced this pull request Jun 12, 2023
* Add autoCleanupTimeout option

* Add idle_timeout

* Update to use dask/distributed#7642

* Rename spec keyword

* Updated with changes in distributed

* Nudge CI

* Set idle_timeout longer than resource_timeout

* Rename cluster to avoid collision

* Fix timeout default value

* Reduce logging noise and add warning when autoshutdown can't reach scheduler