Add REST API to scheduler #5935
Comments
I have a couple of concerns, but the primary one is our public API surface. I do not consider our RPC handlers public. Generally, we're doing a very poor job of defining our public API the way it should be. Not everything that is "not underscored" is really a public piece of the API, and this is obviously already a big problem with plugins. Right now the only way to interact with the scheduler is the Client or the Cluster. The two examples mentioned are supported on the Client, but we'd still need to talk about what the API call would look like, e.g.
If we go down this road, I would like us to be very mindful about what we add to this API and ensure that it is properly versioned from the start. For instance, I'm relatively at ease if we want to support things like
Is this about a REST API or more generally about any HTTP API? What handlers/API calls are we talking about in the initial iteration? Is there anything other than
Thanks for the response @fjetter. I share the same concerns about our public API. I am by no means wedded to REST in this discussion; a gRPC or GraphQL API would also be fine. I'm not sure what would be most appropriate. The challenge I'm facing is how external process managers can interact with the scheduler in a language- and version-agnostic way to perform scaling operations. We currently expose Prometheus metrics, which is arguably a RESTful endpoint, so extending that is one path with reduced friction. I have a second internal use case, but it is very much in the same vein as the k8s operator. I think things that would be useful to me are:
I would actually prefer a little more control around draining workers. Currently workers exit once they are retired, but that can cause some process managers to restart them. It would be better for them to continue running, free of memory and tasks, and ready for a signal to exit.
I'm fine with HTTP but I'm not settled on REST. You can do HTTP without REST. You can do RPC with HTTP. I'm not sure if you can do REST without HTTP 🤔
From a technical POV I'm not concerned. We have an HTTP server running and we'd simply need to implement the API handlers.
This functionality is theoretically possible since internally, we obviously do it this way. That would be a "pause and evict worker" functionality, but we do not expose this publicly right now since the only situation where it comes in handy is downscaling. However, this is a great example where I'm not entirely sure if simply exposing our existing RPC handlers is what we're looking for.
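To make the "curated, versioned handlers rather than raw RPC handlers" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `FakeScheduler`, the `/api/<version>/<name>` path layout, and the `{"n": ...}` payload are hypothetical and are not taken from `distributed`.

```python
import json


class FakeScheduler:
    """Stand-in for the scheduler (hypothetical, not distributed.Scheduler)."""

    def __init__(self, workers):
        self.workers = list(workers)

    def retire_workers(self, n):
        # Remove and return the first n workers.
        retired, self.workers = self.workers[:n], self.workers[n:]
        return retired


# Only explicitly listed (version, call) pairs are reachable over HTTP;
# nothing is exposed just because it happens to be an RPC handler.
ROUTES = {
    ("v1", "retire_workers"): lambda sched, body: sched.retire_workers(int(body.get("n", 0))),
}


def dispatch(scheduler, path, raw_body):
    """Map '/api/<version>/<name>' to a curated handler; return (status, JSON text)."""
    parts = path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "api":
        return 404, json.dumps({"error": "unknown path"})
    handler = ROUTES.get((parts[1], parts[2]))
    if handler is None:
        return 404, json.dumps({"error": "unsupported version or call"})
    try:
        body = json.loads(raw_body) if raw_body else {}
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body is not valid JSON"})
    if not isinstance(body, dict):
        return 400, json.dumps({"error": "body must be a JSON object"})
    return 200, json.dumps({"result": handler(scheduler, body)})
```

Because the version is part of the route key, a later `v2` call could change its payload without breaking clients that still target `v1`.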
I've had a couple of independent conversations lately where folks want to interact with the scheduler from some external service. The specific use cases were external process managers that can scale Dask clusters up and down. Scaling down gracefully requires calling the `retire_workers` and `workers_to_close` methods on the scheduler RPC.

Using the RPC for this is problematic because success is heavily dependent on the `dask`, `distributed` and `python` versions used by the scheduler and the external manager. Mismatches can result in failure.

A workaround for this is exposing those methods via a RESTful endpoint. This would allow for a wider range of versions to be supported and means the external process manager doesn't even have to be written in Python.
In a conversation with @stephan-erb-by and @philipp-sontag-by around the Kubernetes operator in dask/dask-kubernetes#256 they mentioned they had done this via a scheduler plugin that added extra HTTP routes to the Dashboard web server. This is fine but does require a plugin to be installed for all distributed clusters managed by the external process manager (the operator in this case).
I wanted to open this issue to gauge the feeling of adding a more official REST API to the scheduler that exposes some general RPC methods via HTTP in a language-agnostic way.
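For illustration, this is roughly what a scale-down call from an external process manager could look like if such an endpoint existed. It is only a sketch: the `/api/v1/retire_workers` path, the `{"n": ...}` payload, and the `scheduler` hostname are assumptions, not an existing API. It uses only the Python standard library, and the same request could be issued from any language with an HTTP client.

```python
import json
import urllib.request


def build_retire_request(base_url, n):
    """Build a POST request asking the scheduler to retire n workers.

    The route and the payload shape are hypothetical.
    """
    url = base_url.rstrip("/") + "/api/v1/retire_workers"
    body = json.dumps({"n": n}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host; the request is only constructed here, not sent.
req = build_retire_request("http://scheduler:8787/", 2)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen` against a running server. Since only JSON over HTTP crosses the boundary, the manager would not need matching `dask`, `distributed` or `python` versions.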