Add HTTP API to scheduler #6270

Matt711 · 2022-05-04T21:59:21Z

Closes #5935

This PR exposes some of the scheduler endpoints we need to replace the Dask RPC in the Operator.

Tests added / passed
Passes pre-commit run --all-files

GPUtester · 2022-05-04T21:59:23Z

Can one of the admins verify this patch?

quasiben · 2022-05-04T22:34:22Z

add to allowlist

github-actions · 2022-05-05T00:19:54Z

Unit Test Results

      15 files ±  0       15 suites ±0 6h 55m 22s ⏱️ +18s
  2 792 tests +  4   2 713 ✔️ +  4   78 💤 ±0 1 ❌ ±0
20 706 runs +28 19 795 ✔️ +25 910 💤 +3 1 ❌ ±0

For more details on these failures, see this check.

Results for commit 15ad992. ± Comparison against base commit af3b93e.

♻️ This comment has been updated with latest results.


jacobtomlinson

This is looking great! It's exactly what we need for external resource managers to interact with the scheduler.

We should definitely add a documentation page about this API too.

distributed/http/scheduler/api.py

distributed/http/scheduler/tests/test_scheduler_http.py

Matt711 · 2022-05-13T12:43:13Z

> This is looking great! It's exactly what we need for external resource managers to interact with the scheduler.
>
> We should definitely add a documentation page about this API too.

Great! I can create a follow-up PR for more documentation.

jacobtomlinson

This looks good to me.

@fjetter would you mind reviewing here? Given you were involved in the early conversations about this it would be good to check you are happy with this approach.

jacobtomlinson

Actually thinking a bit more this probably needs a couple more changes.

We don't have an implementation here for workers_to_close. We have retire_workers, which allows you to retire by name, but we don't have a way of retiring n workers. Exposing workers_to_close would allow that.

distributed/distributed/scheduler.py

Line 5829 in 77cfc73

def workers_to_close(

I also think we need a little more documentation than this to get this PR in. Perhaps we could add another section to that page, with a title like "Scheduler API", that lists these methods and gives an example of the request body each one expects.

jacobtomlinson

Thanks for the quick turnaround. I'm happy with this as a good foundation we can build on.

docs/source/http_services.rst

jacobtomlinson · 2022-05-16T08:15:16Z

I intend to merge this in 24 hours if there are no further comments.

fjetter

I think there are a few cosmetic questions around how we want to deal with exceptions but this should not block the PR. The more fundamental question is how we want to maintain API stability. Merely passing through kwargs is not what I had in mind when we first discussed this API

distributed/http/scheduler/api.py


distributed/http/scheduler/api.py

fjetter

Just a few nits about documentation (this is intended to be public API for users, therefore I think we should invest a bit in documentation even though the API is simple)

@jacobtomlinson I would trust you with this. If you think the additional docs are not necessary, feel free to merge. These comments should not necessarily block anybody.

fjetter · 2022-05-17T12:26:43Z

docs/source/http_services.rst

+Scheduler methods exposed by the API with an example of the request body they take
+
+- ``/api/v1/retire_workers`` : retire certain workers on the scheduler
+
+.. code-block:: json
+
+    {
+        "workers":["tcp://127.0.0.1:53741", "tcp://127.0.0.1:53669"]
+    }
+
+- ``/api/v1/get_workers`` : get all workers on the scheduler
+- ``/api/v1/adaptive_target`` : get the target number of workers based on the scheduler's load 


I think it would be helpful to document an example response
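
For context, the request body documented in the diff above could be sent from Python roughly as follows. This is a minimal sketch under assumptions, not a documented client: it assumes the route accepts a JSON body via POST, and the base URL `http://127.0.0.1:8787` and the helper name `build_retire_workers_request` are illustrative. Only the route and the body shape come from the diff. The response format is not shown in this thread, so the sketch stops at building the request.

```python
import json
from urllib.request import Request


def build_retire_workers_request(base_url, workers):
    """Build (but do not send) a request for /api/v1/retire_workers.

    Assumption: the route takes a JSON body via POST.
    """
    body = json.dumps({"workers": list(workers)}).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + "/api/v1/retire_workers",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_retire_workers_request(
    "http://127.0.0.1:8787",  # hypothetical scheduler HTTP address
    ["tcp://127.0.0.1:53741", "tcp://127.0.0.1:53669"],
)
```

Passing `request` to `urllib.request.urlopen` would send it to a running scheduler.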

fjetter · 2022-05-17T12:28:58Z

docs/source/http_services.rst

+
+Scheduler methods exposed by the API with an example of the request body they take
+
+- ``/api/v1/retire_workers`` : retire certain workers on the scheduler


It would be nice if this mentioned specifically that addresses are expected and not names. Sometimes they can be interchangeable, but this API expects addresses. The example below might still be misleading since we're setting names, by default, to the addresses.
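
One way a caller might guard against passing names is a simple shape check before building the request body. This is a sketch only, assuming addresses always take the `scheme://host:port` form seen in the example body; `looks_like_address` is a hypothetical helper, not part of the scheduler or of this PR.

```python
from urllib.parse import urlsplit


def looks_like_address(value):
    """Return True for strings shaped like scheme://host:port."""
    parts = urlsplit(value)
    try:
        port = parts.port
    except ValueError:  # non-numeric or out-of-range port
        return False
    return bool(parts.scheme and parts.hostname and port is not None)
```

A check like this cannot catch every mistake: as noted above, a worker name defaults to its address, so a name can itself look like an address.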

jacobtomlinson · 2022-05-17T12:45:50Z

We are blocked on this PR for some work we want to finish this week, so I think I'll just ask @Matt711 to make a follow up to improve the docs if that's ok. Totally agree with the comments though.

psontag · 2022-05-17T13:03:49Z

Hey I just saw this PR.
My current understanding is that there are two modes for the /api/v1/retire_workers route:

1. Retire n workers based on the workers_to_close method of the scheduler
2. Retire the provided workers based on their address

But currently there is no API that exposes the workers_to_close method directly.
At least in our use case that would be really helpful though since our graceful shutdown implementation has two phases:

1. Retrieve workers to close from the scheduler and do some custom graceful shutdown logic
2. Actually make the retire_worker call to the scheduler.

@jacobtomlinson/@Matt711 Do you think that this might also be interesting for you?
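
The two modes described above could be expressed as two different request bodies. The following is a hedged sketch: the `workers` key matches the example in the documentation diff, but the `n` key is an assumption that this thread does not confirm, and `retire_workers_body` is a hypothetical helper.

```python
def retire_workers_body(*, n=None, workers=None):
    """Return a JSON-serialisable body for one of the two modes."""
    if (n is None) == (workers is None):
        raise ValueError("pass exactly one of n or workers")
    if n is not None:
        if n < 1:
            raise ValueError("n must be a positive integer")
        return {"n": n}  # assumed key name, not confirmed in the thread
    return {"workers": list(workers)}  # key taken from the docs diff
```

Requiring exactly one argument keeps the two modes from being mixed in a single call.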

jakirkham · 2022-05-17T17:49:11Z

@philipp-sontag-by would recommend raising a new issue to track this. In fact we might want to do that with any of the meatier tasks left in this PR (if any)

Matt711 · 2022-05-17T19:56:58Z

@philipp-sontag-by I think we can do this, especially since the point of the API is to cover most of the scheduler's methods. I'll create a follow-up PR to add this if @jacobtomlinson is okay with it.

jacobtomlinson · 2022-05-17T20:41:03Z

Yeah let's discuss this in a new issue.

I'm curious what shutdown logic you want to do that retire workers isn't suitable for? As I see it the best way to gracefully shutdown is to allow the scheduler to end the worker process and then clean up the completed pods after.

gjoseph92 · 2022-05-20T20:48:41Z

I'm concerned about a security regression here. By default, this is opening up an API that allows anyone to change cluster state (via retire_workers currently, but I imagine other things might be added someday too).

Prior to this, the only way to do things that affected cluster state was through the client. All the HTTP routes were effectively read-only. (Whether there is a vulnerability in the bokeh dashboard is another topic; it's pretty possible there is, but I'm just talking here in principle.)

I think it's rather common to expose the HTTP routes to the public internet. For example, I believe dask-cloudprovider does this:

> By default a Dask security group will be created with ports 8786 and 8787 exposed to the internet
> https://cloudprovider.dask.org/en/latest/aws.html#dask_cloudprovider.aws.EC2Cluster

You want those ports exposed for convenience, so you can connect to them. But you don't want anyone to be able to do stuff to the cluster, so you set up TLS using temporary credentials. dask-cloudprovider does this for you as well:

> When a cluster is launched with any of these cluster managers a set of temporary keys will be generated and distributed to the cluster nodes via their startup script. All communication between the client, scheduler and workers will then be encrypted and only clients and workers with valid certificates will be able to connect to the scheduler.
> https://cloudprovider.dask.org/en/latest/security.html#authentication-and-encryption

Currently, if you set up TLS for your cluster, this is mTLS, meaning the scheduler verifies the client's certificate (docs, code). This serves as a form of authentication and authorization: if you've set up cluster security, you can only tell the scheduler to do things if you hold a valid certificate.

However, the HTTP routes have no authentication (they use standard TLS, not mTLS, because mTLS would be very inconvenient when you want to look at the dashboard with a web browser).

So after this change, someone who had gone to the trouble to set up mTLS for their cluster (or was using the defaults of their cluster deployment system) would, by default, have an unauthenticated endpoint running that allowed anyone with access to :8787 (aka the dashboard) to affect cluster state.

I think we should do two things:

1. Short-term: disable the HTTP API if TLS is specified on the scheduler. This is a reasonable default.
2. Long-term: figure out a security posture and authentication for the HTTP API that's consistent with the security posture of other things that can affect cluster state (aka the client).
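
The short-term proposal amounts to a small conditional at startup. Here is a minimal sketch, assuming the scheduler assembles its HTTP routes from a list of module names and knows whether TLS was configured. The function `select_route_modules`, the `tls_enabled` flag, and the example module list are hypothetical; only the module path is derived from the file touched in this PR.

```python
# Dotted form of distributed/http/scheduler/api.py, the file added in this PR.
API_MODULE = "distributed.http.scheduler.api"


def select_route_modules(route_modules, tls_enabled):
    """Drop the state-changing API routes when TLS is configured."""
    if not tls_enabled:
        return list(route_modules)
    return [name for name in route_modules if name != API_MODULE]


# Illustrative module list; not the scheduler's actual defaults.
modules = ["example.dashboard", API_MODULE]
secured = select_route_modules(modules, tls_enabled=True)
unsecured = select_route_modules(modules, tls_enabled=False)
```

With TLS enabled, only the read-only routes remain; without it, the list is returned unchanged.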

jakirkham · 2022-05-20T20:57:27Z

@gjoseph92 can you please file this as a new issue to make it easier to track?

Add HTTP API to scheduler

a954646

Run pre-commit checks

1811725

Matt711 added 5 commits May 10, 2022 12:46

Add adaptive_target and get_workers

b27770a

Add tests

b5b26d2

Add tests for api endpoints

dee5c8e

Fix tests adaptive_target and retire_workers

015501d

Change key in adaptive_target response and remove

785f33b

Matt711 marked this pull request as ready for review May 12, 2022 19:42

Merge branch 'main' of github.com:dask/distributed into feature/scheduler-api

1587345

jacobtomlinson reviewed May 13, 2022

Matt711 added 2 commits May 13, 2022 08:28

Address comments json, headers, update tests

a83ff30

Add documentation

c55dbf3

jacobtomlinson approved these changes May 13, 2022

jacobtomlinson requested a review from fjetter May 13, 2022 13:18

jacobtomlinson requested changes May 13, 2022

Add workers_to_close method and more documentation

7927b8a

jacobtomlinson approved these changes May 13, 2022

docs/source/http_services.rst (outdated, resolved)

docs/source/http_services.rst (outdated, resolved)

Apply suggestions from code review

ba95502

fjetter reviewed May 16, 2022

Address comments from fjetter

f2c8961

fjetter reviewed May 16, 2022

distributed/http/scheduler/api.py (outdated, resolved)

Matt711 added 3 commits May 16, 2022 12:03

Address more comments

97275b7

Merge branch 'main' of github.com:dask/distributed into feature/scheduler-api

dc7b228

Remove workers_to_close from doc

65c622f

jacobtomlinson reviewed May 16, 2022

distributed/http/scheduler/api.py (outdated, resolved)

Add support for closing n workers

15ad992

jacobtomlinson requested a review from fjetter May 17, 2022 08:44

fjetter reviewed May 17, 2022

jacobtomlinson merged commit 63cdddd into dask:main May 17, 2022

Matt711 mentioned this pull request May 19, 2022

Replace RPC with Scheduler HTTP API dask/dask-kubernetes#499

Merged

gjoseph92 mentioned this pull request May 20, 2022

Release 2022.05.1 dask/community#245

Closed

gjoseph92 mentioned this pull request May 20, 2022

Don't expose insecure HTTP API #6407

Open

2 tasks

psontag mentioned this pull request May 23, 2022

Expose workers_to_close as a HTTP route #6416

Open

jacobtomlinson mentioned this pull request May 24, 2022

Add authentication to HTTP API #6431

Open

jrbourbeau mentioned this pull request Jul 2, 2024

Add close worker button to worker info page #8742

Merged
