Redesign Cluster Managers #2235
Just a gentle ping here. It would be good to get the group to start thinking about this problem. I think that some moderate energy here might be a better use of time than patching around the current system.
Some thoughts here:

Concerning the `ClusterManager` implementation, for the moment I don't see the need for a major change in how things are done from the abstract class/implementation point of view.
Can you expand on this? I think that several groups find it valuable to keep "the thing that launches workers" on a separate machine from the scheduler.
I was just answering your question above:

Sorry if I did not make myself clear enough. I totally agree that separating the `ClusterManager` and `Scheduler` is valuable.
Ah, I see. Got it.
This all sounds reasonable to me. It would be useful for me to see the reasons why we can't use the current configuration better spelled out. In dask-jobqueue, I think we have this configuration:

where the … in each …
In my custom deployment, both the scheduler and workers are spun up remotely; there is no … . My users create and manage a cluster with … . I'm not across the current … .
@jhamman some reasons to separate the scheduler from the "thing that controls starting and stopping workers":
How should we move forward here? Should we try to design something, with rough class modelling (`ClusterManager`, `Adaptive` and `Scheduler`, `Client`) and interaction?
Yes, when I tried this briefly I tried to make something within the dask/distributed repository that had the separation desired, but still made the distributed/deploy/tests/test_local.py test suite pass (though perfect backwards compatibility is not necessary).

I suspect that once we have some draft it will become easier for people to comment and discuss the issue.
@mrocklin Is there any thought to making the CPU/memory per worker part of the client request, rather than the cluster specification? That is, I might have one request that needs more memory, and another request that needs less (or GPUs, or some other spec). Or is the intent that I would start a different cluster for each request?
@mturok there are some discussions about this: #2118 and #2208 (comment). The currently accepted solution is to have a set of different worker pools. Users then need to be able to submit tasks to a particular pool through Dask APIs.
@mturok maybe you're referring to something more fine-grained, like the following: http://distributed.dask.org/en/latest/resources.html ? Ideally, adaptive deployments would look at the resource requirements of the current set of tasks when requesting workers from a cluster manager. This is an open problem, though. I know of groups that have done this sort of thing in-house (the adaptive class is not hard to modify/subclass), but there is no drop-in solution in Dask today (though of course there could be with moderate effort).
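As a concrete illustration of the worker-resources feature linked above: workers advertise abstract resources at start-up, and tasks declare what they need. A minimal sketch, assuming a scheduler at a placeholder address and a hypothetical `train_model` function:

```python
from distributed import Client

# Workers advertise abstract resources when they start, e.g.:
#   dask-worker scheduler-address:8786 --resources "GPU=2"

client = Client("scheduler-address:8786")  # placeholder address

def train_model(data):
    # hypothetical stand-in for GPU-bound work
    return sum(data)

# The resources= keyword restricts this task to workers that
# advertise at least one free "GPU" resource.
future = client.submit(train_model, [1, 2, 3], resources={"GPU": 1})
```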
Just cross-posting some findings from another discussion here, for consolidation of the `ClusterManager` approach:
I'm wondering: should we consolidate the ClusterManager/Adaptive/Scheduler refactor needs somewhere other than in this issue? We could do this in the project wiki, for example, but I see it is not used. I would like to avoid copy-pasting the entire design each time a modification is identified. Any ideas?
Maybe a PR with just a design document. The PR can serve as a discussion forum, and you can add commits as the discussion shapes the design. The actual implementation could be done in a separate PR, to keep the discussion around implementation details separate from the design discussion. The design document doesn't need to be merged at the end if it has served its purpose. It may well be a good basis for some docs, though.
I would be tempted to say this has been done. Thoughts @mrocklin?
Yup. Thanks for flagging @jacobtomlinson
Over the past few months we've learned more about the requirements for deploying Dask on different cluster resource managers. This has created several projects like `dask-yarn`, `dask-jobqueue`, and `dask-kubernetes`. These projects share some code within this repository, but also add their own constraints. Now that we have more experience, this might be a good time to redesign things, both in the external projects and in the central API.
There are a few changes that have been proposed that likely affect everyone:

- Separating the `Cluster` manager object from the Scheduler in some cases
- Changing how users specify the size of their cluster
- Sharing UI elements

The first two will likely require synchronized changes to all of the downstream projects.
Separate ClusterManager, Scheduler, and Client
I'm going to start calling the object that starts and stops workers, like `KubeCluster`, the `ClusterManager` from here on.

Previously the `ClusterManager`, `Scheduler`, and `Client` were started from within a notebook, and so all lived in the same process and could act on the same event loop thread. It didn't matter so much which part did which action. We're now considering separating all of these. This forces us to think about the kinds of communication they'll have to do, and where certain decisions and actions take place. At first glance we might assign actions like the following:

- run the `Adaptive` logic, and determine how many workers should exist given current load

Some of these actions depend on information from the others. In particular:
- the `ClusterManager` needs to send the following to the `Scheduler`: …
- the `Scheduler` needs to send the following to the `ClusterManager`: …
This is all just a guess. I suspect that other things might arise when actually trying to build everything here.
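To make the proposed separation more concrete, here is a rough sketch of how the responsibilities might divide. All class and method names below are hypothetical placeholders for discussion, not existing distributed API:

```python
from abc import ABC, abstractmethod

class ClusterManager(ABC):
    """The thing that starts and stops workers (e.g. KubeCluster)."""

    @abstractmethod
    def scale_up(self, n):
        """Launch workers on the resource manager until n exist."""

    @abstractmethod
    def scale_down(self, workers):
        """Retire the given workers and release their resources."""

class Adaptive:
    """Lives alongside the Scheduler; decides how many workers should exist."""

    def __init__(self, scheduler, cluster_manager):
        self.scheduler = scheduler
        self.cluster_manager = cluster_manager

    def adapt(self):
        # The Scheduler reports load; the ClusterManager acts on it.
        target = self.scheduler.desired_workers()    # e.g. from task queue depth
        current = len(self.scheduler.workers)
        if target > current:
            self.cluster_manager.scale_up(target)
        elif target < current:
            idle = self.scheduler.workers_to_close() # safe-to-retire workers
            self.cluster_manager.scale_down(idle)
```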
How to specify cluster size
This came up in #2208 by @guillaumeeb. Today we ask users for a number of desired workers, but we might instead want to allow users to specify an amount of desired cores or memory. And in the future, people will probably also ask for more complex compositions, like some small workers and some big workers, some with GPUs, and so on.
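For illustration, the difference might look something like this. The keyword forms are hypothetical, following the discussion in #2208, and assume an existing `cluster` object such as a `KubeCluster`:

```python
# assuming: cluster = KubeCluster(...)  (or any ClusterManager)

# Today: request a concrete number of workers
cluster.scale(10)

# Possible future: request resources instead (hypothetical signatures)
cluster.scale(cores=100)
cluster.scale(memory="1 TB")

# Or, further out, heterogeneous compositions (purely speculative)
cluster.scale(workers=8 * [{"memory": "4 GB"}] + 2 * [{"memory": "32 GB", "gpus": 1}])
```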
If we now have to establish a communication protocol between the `ClusterManager` and the `Scheduler`/`Adaptive`, then it might be an interesting challenge to make that future-proof. Further discussion on this topic should probably remain in #2208, but I wanted to mention it here.
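One way to make such a protocol future-proof, purely as a hypothetical sketch, is to phrase recommendations in resource terms rather than worker counts, so new resource kinds can be added without breaking the message format:

```python
# Hypothetical Adaptive -> ClusterManager message; not an existing protocol.
recommendation = {
    "op": "scale",
    # Resource terms rather than a fixed worker count, so new keys
    # (e.g. "gpus") can be added later without breaking old managers:
    "target": {"cores": 100, "memory": "500 GB"},
}
```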
Shared UI
@ian-r-rose has done some excellent work on the JupyterLab extension for Dask. He has also mentioned thoughts on how to include the `ClusterManager` within something like that as a server extension. This would optionally move the `ClusterManager` outside of the notebook and into the JupyterLab sidebar. A number of people have asked for this in the past. If we're going to redesign things, I thought we should also include UI in that process to make sure we capture any constraints.

Organization
Should the `ClusterManager` object continue to be a part of this repository, or should it be spun out? There are probably costs and benefits both ways.

cc @jcrist (dask-yarn), @jacobtomlinson (dask-kubernetes), @jhamman, @guillaumeeb, @lesteve (dask-jobqueue), @ian-r-rose (dask-labextension) for feedback.