Add REST API to scheduler #5935

Closed
jacobtomlinson opened this issue Mar 11, 2022 · 3 comments · Fixed by #6270

Comments

@jacobtomlinson
Member

I've had a couple of independent conversations lately where folks want to interact with the scheduler from some external service. The specific use cases were external process managers that can scale Dask clusters up and down. Scaling down gracefully requires calling the retire_workers and workers_to_close methods on the scheduler RPC.

Using the RPC for this is problematic because success depends heavily on the Dask, distributed, and Python versions used by the scheduler and the external manager. Mismatches can result in failure.

A workaround for this is exposing those methods via a RESTful endpoint. This would allow for a wider range of versions to be supported and means the external process manager doesn't even have to be written in Python.
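
As a rough illustration of the idea, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for the example: `FakeScheduler` is a stand-in rather than the real scheduler, and the `/api/v1/retire_workers` route and JSON shape are hypothetical. It is not how the plugin mentioned below works, nor a proposal for the final design.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class FakeScheduler:
    """Hypothetical stand-in for the scheduler; not the real distributed API."""

    def __init__(self, workers):
        self.workers = set(workers)

    def retire_workers(self, workers):
        # Retire only workers we know about and report which ones were removed.
        retired = sorted(self.workers & set(workers))
        self.workers -= set(retired)
        return retired


def make_handler(scheduler):
    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/api/v1/retire_workers":  # hypothetical route
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            retired = scheduler.retire_workers(payload.get("workers", []))
            body = json.dumps({"retired": retired}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):  # keep the example quiet
            pass

    return Handler


def demo():
    scheduler = FakeScheduler(["tcp://10.0.0.1:1234", "tcp://10.0.0.2:1234"])
    server = ThreadingHTTPServer(("127.0.0.1", 0), make_handler(scheduler))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # The caller only needs HTTP and JSON, so it could be written in any language.
        conn = HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        data = json.dumps({"workers": ["tcp://10.0.0.1:1234"]}).encode()
        conn.request("POST", "/api/v1/retire_workers", body=data,
                     headers={"Content-Type": "application/json"})
        result = json.loads(conn.getresponse().read())
        conn.close()
        return result, sorted(scheduler.workers)
    finally:
        server.shutdown()
        server.server_close()
```

The point of the sketch is that the caller depends only on the HTTP route and the JSON payload, not on matching Python or library versions with the scheduler.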

In a conversation with @stephan-erb-by and @philipp-sontag-by around the Kubernetes operator in dask/dask-kubernetes#256 they mentioned they had done this via a scheduler plugin that added extra HTTP routes to the Dashboard web server. This is fine but does require a plugin to be installed for all distributed clusters managed by the external process manager (the operator in this case).

I wanted to open this issue to gauge the feeling of adding a more official REST API to the scheduler that exposes some general RPC methods via HTTP in a language-agnostic way.

@fjetter
Member

fjetter commented Mar 14, 2022

I have a couple of concerns, but the primary one is our public API surface. I do not consider our RPC handlers public. Generally, we're doing a very poor job of defining our public API the way it should be. Not everything that is "not underscored" is really a public piece of the API, and this is obviously already a big problem with plugins. Right now the only way to interact with the scheduler is the Client or the Cluster. The two examples mentioned are supported on the Client, but we'd still need to talk about what the API call would look like, e.g. retire_workers has a different signature on the client side vs. the scheduler side.

If we go down this road I would like us to be very mindful about what we add to this API and ensure that it is properly versioned from the start.

For instance, I'm relatively at ease if we want to support things like get_versions but would never want to support update_graph. There is a lot of grey in between.
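
To make the "explicit and versioned" idea concrete, here is a small sketch of an allow-list keyed by API version. The `/api/<version>/<handler>` path layout and the `EXPOSED` table are assumptions for illustration only; which handlers belong in it is exactly the open question.

```python
# Hypothetical allow-list: only handler names listed here are reachable over HTTP.
# update_graph is deliberately absent.
EXPOSED = {
    "v1": {"get_versions", "retire_workers", "workers_to_close"},
}


def resolve(path):
    """Map '/api/<version>/<handler>' to (version, handler) or raise LookupError."""
    parts = path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "api":
        raise LookupError(f"not an API path: {path}")
    _, version, handler = parts
    if handler not in EXPOSED.get(version, ()):
        raise LookupError(f"{handler!r} is not exposed in API {version!r}")
    return version, handler
```

With a table like this, adding a handler is a deliberate change to a reviewed list rather than a side effect of a method not being underscored.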


Is this about a REST API or more generally about any HTTP API? retire_workers is not what I would consider RESTful (but I also don't want to get into semantics. Just wondering what we're looking for).


What handlers/API calls are we talking about in the initial iteration? Is there anything other than retire_workers and workers_to_close? Are these actually the primitives we need for an HTTP/REST API, or are they merely there and convenient to call?
Are there use cases beyond the K8s operator?

@jacobtomlinson
Member Author

jacobtomlinson commented Mar 14, 2022

Thanks for the response @fjetter. I share the same concerns about our public API.

I am by no means wedded to REST in this discussion, a gRPC or GraphQL API would also be fine. I'm not sure what would be most appropriate. The challenge I'm facing is how can external process managers interact with the scheduler in a language and version agnostic way to perform scaling operations.

We currently expose Prometheus metrics, which is arguably a RESTful endpoint, so extending that is one path with reduced friction.

I have a second internal use case, but it is very much in the same vein as the k8s operator.

I think things that would be useful to me are:

  • List workers
  • Get the scheduler's desired number of workers
  • Drain a specified number of workers ahead of removal
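
A sketch of how an external process manager might combine those three calls is below. The `SchedulerAPI` class, its method names, and all the paths are hypothetical; the transport is injected so the example runs against an in-memory fake instead of a real scheduler.

```python
from typing import Any, Callable, Optional

# A transport takes (method, path, json_body) and returns the decoded JSON reply.
Transport = Callable[[str, str, Optional[dict]], Any]


class SchedulerAPI:
    """Hypothetical client; every path below is made up for illustration."""

    def __init__(self, transport: Transport):
        self._call = transport

    def list_workers(self) -> list:
        return self._call("GET", "/api/v1/workers", None)

    def desired_workers(self) -> int:
        return self._call("GET", "/api/v1/desired_workers", None)

    def drain(self, n: int) -> list:
        return self._call("POST", "/api/v1/drain", {"n": n})


def reconcile(api: SchedulerAPI) -> list:
    """Drain surplus workers so the manager can remove them afterwards."""
    surplus = len(api.list_workers()) - api.desired_workers()
    return api.drain(surplus) if surplus > 0 else []


def fake_transport(state: dict) -> Transport:
    """In-memory stand-in for the HTTP layer, used only to exercise the sketch."""

    def call(method, path, body):
        if (method, path) == ("GET", "/api/v1/workers"):
            return list(state["workers"])
        if (method, path) == ("GET", "/api/v1/desired_workers"):
            return state["desired"]
        if (method, path) == ("POST", "/api/v1/drain"):
            drained = state["workers"][: body["n"]]
            state["workers"] = state["workers"][body["n"]:]
            return drained
        raise ValueError(f"unknown route {method} {path}")

    return call
```

For example, with three workers and a desired count of one, `reconcile` asks for two workers to be drained and leaves the choice of which ones to the scheduler side.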

I would actually prefer a little more control around draining workers, currently workers exit once they are retired, but that can cause some process managers to restart them. It would be better for them to continue running but to be free of memory and tasks and ready for a signal to exit.
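
Roughly, the lifecycle I have in mind looks like the sketch below. The state names and transitions are hypothetical and do not correspond to the worker statuses distributed uses today; the only point is that "drained" and "closed" are separate steps, with the exit triggered by the process manager.

```python
from enum import Enum


class WorkerState(Enum):
    RUNNING = "running"
    DRAINING = "draining"  # accepts no new tasks; data is being moved elsewhere
    DRAINED = "drained"    # idle and still alive, waiting for an exit signal
    CLOSED = "closed"


# Hypothetical transition table: a worker cannot jump from RUNNING to CLOSED.
ALLOWED = {
    WorkerState.RUNNING: {WorkerState.DRAINING},
    WorkerState.DRAINING: {WorkerState.DRAINED},
    WorkerState.DRAINED: {WorkerState.CLOSED},
    WorkerState.CLOSED: set(),
}


def advance(current: WorkerState, target: WorkerState) -> WorkerState:
    """Return the new state, or raise ValueError if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


def safe_to_remove(state: WorkerState) -> bool:
    """The process manager should only send the exit signal to drained workers."""
    return state is WorkerState.DRAINED
```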

@fjetter
Member

fjetter commented Mar 14, 2022

> I am by no means wedded to REST in this discussion, a gRPC or GraphQL API would also be fine. I'm not sure what would be most appropriate. The challenge I'm facing is how can external process managers interact with the scheduler in a language and version agnostic way to perform scaling operations.

I'm fine with HTTP but I'm not settled on REST. You can do HTTP without REST. You can do RPC with HTTP. I'm not sure if you can do REST without HTTP 🤔
That may be nitpicking and I don't want us to spiral into discussions about semantics :)

> We currently expose prometheus metrics which is arguably a RESTful endpoint so extending that is one path with reduced friction.

From a technical POV I'm not concerned. We have an HTTP server running and we'd simply need to implement the API handlers.

> I would actually prefer a little more control around draining workers, currently workers exit once they are retired, but that can cause some process managers to restart them.

This functionality is theoretically possible since, internally, we obviously do it this way. That would be a "pause and evict worker" functionality, but we do not expose this publicly right now since the only case where it comes in handy is downscaling. However, this is a great example where I'm not entirely sure if simply exposing our existing RPC handlers is what we're looking for.
IIRC, what @stephan-erb-by and @philipp-sontag-by are looking for would also be more in this direction.
