-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HTTP API to scheduler #6270
Conversation
Can one of the admins verify this patch? |
add to allowlist |
Unit Test Results 15 files ± 0 15 suites ±0 6h 55m 22s ⏱️ +18s For more details on these failures, see this check. Results for commit 15ad992. ± Comparison against base commit af3b93e. ♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great! It's exactly what we need for external resource managers to interact with the scheduler.
We should definitely add a documentation page about this API too.
Great! I can create a follow up PR for more documentation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
@fjetter would you mind reviewing here? Given you were involved in the early conversations about this it would be good to check you are happy with this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually thinking a bit more this probably needs a couple more changes.
We don't have an implementation here for workers_to_close
. We have retire_workers
which allows you to retire by name, but we don't have a way or retiring n
workers. Exposing workers_to_close
would allow that.
distributed/distributed/scheduler.py
Line 5829 in 77cfc73
def workers_to_close( |
I also think we need a little more documentation than this to get this PR in. Perhaps we could add another section to that page with a title like Scheduler API
with these methods listed and an example of the request body they are expecting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick turnaround. I'm happy with this as a good foundation we can build on.
I intend to merge this in 24 hours if there are no further comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are a few cosmetic questions around how we want to deal with exceptions but this should not block the PR. The more fundamental question is how we want to maintain API stability. Merely passing through kwargs is not what I had in mind when we first discussed this API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few nits about documentation (this is intended to be public API for users, therefore I think we should invest a bit in documentation even though the API is simple)
@jacobtomlinson I would trust you with this. If you think the additional docs are not necessary, feel free to merge. These comments should not necessarily block anybody.
Scheduler methods exposed by the API with an example of the request body they take | ||
|
||
- ``/api/v1/retire_workers`` : retire certain workers on the scheduler | ||
|
||
.. code-block:: json | ||
|
||
{ | ||
"workers":["tcp://127.0.0.1:53741", "tcp://127.0.0.1:53669"] | ||
} | ||
|
||
- ``/api/v1/get_workers`` : get all workers on the scheduler | ||
- ``/api/v1/adaptive_target`` : get the target number of workers based on the scheduler's load |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be helpful to document an example response
|
||
Scheduler methods exposed by the API with an example of the request body they take | ||
|
||
- ``/api/v1/retire_workers`` : retire certain workers on the scheduler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if this mentioned specifically that addresses are expected and not names. Sometimes they can be interchangable but this API expects addresses. The example below might still be misleading since we're setting names, by default, to the addresses.
We are blocked on this PR for some work we want to finish this week, so I think I'll just ask @Matt711 to make a follow up to improve the docs if that's ok. Totally agree with the comments though. |
Hey I just saw this PR.
But currently there is no API that exposes the
@jacobtomlinson/@Matt711 Do you think that this might also be interesting for you? |
@philipp-sontag-by would recommend raising a new issue to track this. In fact we might want to do that with any of the meatier tasks left in this PR (if any) |
@philipp-sontag-by I think we can do this, especially since the point of the API is to cover most of the scheduler's methods. I'll create a follow-up PR to add this if @jacobtomlinson is okay with it. |
Yeah let's discuss this in a new issue. I'm curious what shutdown logic you want to do that retire workers isn't suitable for? As I see it the best way to gracefully shutdown is to allow the scheduler to end the worker process and then clean up the completed pods after. |
I'm concerned about a security regression here. By default, this is opening up an API that allows anyone to change cluster state (via Prior to this, the only way to do things that affected cluster state was through the client. All the HTTP routes were effectively read-only. (Whether there is a vulnerability in the bokeh dashboard is another topic; it's pretty possible there is, but I'm just talking here in principle.) I think it's rather common to expose the HTTP routes to the public internet. For example, I believe dask-cloudprovider does this:
You want those ports exposed for convenience, so you can connect to them. But you don't want anyone to be able to do stuff to the cluster, so you set up TLS using temporary credentials. dask-cloudprovider does this for you as well:
Currently, if you set up TLS for your cluster, this is mTLS, meaning the scheduler verifies the client's certificate (docs, code). This serves as a form of authentication and authorization: if you've set up cluster security, you can only tell the scheduler to do things if you hold a valid certificate. However, the HTTP routes have no authentication (they use standard TLS, not mTLS, because mTLS would be very inconvenient when you want to look at the dashboard with a web browser). So after this change, someone who had gone to the trouble to set up mTLS for their cluster (or was using the defaults of their cluster deployment system) would, by default, have an unauthenticated endpoint running that allowed anyone with access to I think we should do two things:
|
@gjoseph92 can you please file this as a new issue to make it easier to track? |
Closes #5935
This PR exposes some of the scheduler endpoints we need to replace the Dask RPC in the Operator.
pre-commit run --all-files