Add runner #4710
Conversation
I'm still a bit confused about the intention of the `Runner` class. I'm asking so many questions because I'm currently reviewing the interfaces around `Cluster` deployment.
I have similar thoughts to @fjetter. I'm also curious about the motivations for this and why `Cluster` doesn't work in those situations. I haven't perceived any frustration around having `MPICluster` inherit from `Cluster`. I may not be well attuned there though.
I started out using `Cluster`. Typically all `Cluster` implementations assume a parent process which creates and manages the scheduler and worker processes. The paradigm in Dask MPI is different. There is no parent process. Instead the same script is submitted and executed many times in parallel via MPI, and those processes negotiate among themselves about which one becomes the scheduler, which become workers, and which continues running the user's client code.

I started writing an Azure ML example

An example of where this would be useful to me is Azure ML. That platform has an API which allows you to submit batch jobs which are run via MPI. Constructing the job definition means deciding how many processes should run, what datasets get mounted, what workspaces are available, what the runtime is, etc. But you have to provide a single script or command for all the processes to run. In that case I expect we would want a runner rather than a cluster manager.

Faster VM startup example

In many of the VM based cluster managers in dask-cloudprovider the scheduler VM has to be created first so that its address can be passed to the workers, which slows down cluster startup. I would prefer to write a runner for each cloud platform which can use platform native methods to have all the processes negotiate about who is the scheduler. In that case we would still need a cluster manager to create the processes, but we would leverage a runner to do the actual setup of the cluster from within those processes.

Homogenous cluster startup

This is a future thought and a bit of a tangent, but I would love to see a homogenous command for starting Dask clusters in general which leverages some distributed locking/synchronization or leader election method. Platform agnostic tools such as etcd, consul or zookeeper could be useful here. The current paradigm of running a cluster means that the scheduler must be started first:

```
$ dask-scheduler
$ # wait for scheduler
$ dask-worker <scheduler IP>
$ dask-worker <scheduler IP>
$ dask-worker <scheduler IP>
```

Instead it would be preferable to run:

```
$ dask-etcd <etcd address>
$ dask-etcd <etcd address>
$ dask-etcd <etcd address>
$ dask-etcd <etcd address>
```

Etcd would handle deciding which process is the scheduler and which are the workers. This allows all processes to be started concurrently without worrying about race conditions. Workers could proxy port 8786 to the scheduler process, which would allow a client to connect to any process in the cluster. If the scheduler process is lost, a worker could switch roles and start a new scheduler. This would effectively be a cluster restart and all work would be lost, but it would allow reuse of resources in a scheduler failure state.
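As a toy, single-machine illustration of the idea, the same election could be sketched with a shared directory standing in for etcd. Everything here is illustrative: the paths are arbitrary and there is no failover or port proxying.

```python
import asyncio
import json
import os

SHARED = "/tmp/dask-homogeneous"  # single-machine toy; stands in for a shared filesystem or KV store


async def main():
    from distributed import Scheduler, Worker

    os.makedirs(SHARED, exist_ok=True)
    leader_file = os.path.join(SHARED, "leader")
    address_file = os.path.join(SHARED, "scheduler.json")
    try:
        # Atomic create-if-absent: exactly one invocation wins the election
        # and becomes the scheduler.
        os.close(os.open(leader_file, os.O_CREAT | os.O_EXCL))
        async with Scheduler() as scheduler:
            # Publish the scheduler address for the other invocations to find.
            with open(address_file + ".tmp", "w") as f:
                json.dump({"address": scheduler.address}, f)
            os.replace(address_file + ".tmp", address_file)
            await scheduler.finished()
    except FileExistsError:
        # Everyone else waits for the winner to publish its address and then
        # starts a worker pointing at it.
        while not os.path.exists(address_file):
            await asyncio.sleep(0.1)
        with open(address_file) as f:
            address = json.load(f)["address"]
        async with Worker(address) as worker:
            await worker.finished()


if __name__ == "__main__":
    asyncio.run(main())
```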
I'm still struggling a bit to understand the ultimate intention. From what I understand, you want a class which is not actually managing the cluster but rather manages a single process/node, but what does it mean to manage this process/node? What I mean by managing a cluster is in particular scaling it up and down and shutting it down again. From what I understand you do not want any of this, since MPI (I'm only superficially aware of how MPI works) doesn't work this way. What is this runner class supposed to manage, then? If I take the homogeneous deployment as an example or long term goal, I would actually argue that we would not want a dedicated cluster manager class at all.
I can certainly see the motivation for faster VM startup (we've run into this recently at Coiled). I've raised #4715 with some other thoughts and maybe an alternative for that one. The dask-etcd thing sounds fun too. This sounds like an extension/generalization of the use-another-service-to-find-the-scheduler approach that would be useful for faster VM startup time. In both cases I'm curious if there is a more focused abstraction we could add around "something somewhere else that allows Dask servers to coordinate safely". This seems more orthogonal to our current abstractions, and so might be easier to motivate. Thoughts?
Today Dask MPI uses the `initialize()` method to handle this coordination. Users import and call this method at the top of their script. Only the client process continues beyond this point; all other processes start up the scheduler and workers. I dislike that this is a bit magic. The scheduler address is updated in the Dask config so that the client can find it. Part of my motivation here was to move to a context manager for this, to give transparency to what is going on. I could just make that PR to the dask-mpi codebase.
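For reference, the current dask-mpi pattern looks roughly like this (simplified; see the dask-mpi docs for the exact behaviour):

```python
# Run with something like: mpirun -np 4 python myscript.py
from dask_mpi import initialize
from dask.distributed import Client

initialize()  # rank 0 becomes the scheduler, ranks 2+ become workers and never
              # return from this call; only rank 1 continues as the client

client = Client()  # finds the scheduler via the address that initialize() stored in the Dask config
```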
Maybe. But that should probably be implemented as a separate abstraction. I feel like this is orthogonal to this discussion though. The goal here is to coordinate Dask components from within an existing parallel job via a context manager.
Yeah I was thinking out loud here. Happy to discuss another time.
I agree that moving away from the scheduler/worker startup model would remove the need for this completely. I tacked that on to this discussion as a bit of an afterthought, but perhaps that is a better route to go down than the runner approach proposed here.
```python
assert await c.submit(lambda x: x + 1, 10).result() == 11
assert await c.submit(lambda x: x + 1, 20).result() == 21

await asyncio.gather(*[run_code(commworld) for _ in range(4)])
```
IIUC, this spawns four async tasks (or equivalently four MPI jobs, four processes, etc.), of which one decides to become the scheduler, two decide to become workers, and one becomes the client. The scheduler and workers will effectively ignore the context manager body and act as if they were ordinary server nodes, while the client will actually connect and execute the code. So, effectively, this is an async local cluster with two workers in disguise.

This implementation negotiates the "role" (which I was calling node type) via `AsyncCommWorld`, which in this case is simply an async lock, but in general this interface would need to implement some sort of leader election via a shared filesystem or a distributed KV store (zookeeper, etcd, ...). This interface is currently discussed in #4715.

Is this an accurate summary of what's going on?
Yep that's it! The key difference between this and a `Cluster` is that the roles are worked out after the processes (or coroutines in the reference implementation) have been created.
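A rough sketch of what that negotiation can look like; the `AsyncCommWorld` internals and the role ordering here are illustrative, not the PR's actual implementation:

```python
import asyncio


class AsyncCommWorld:
    """Illustrative stand-in: shared state plus a lock used to hand out roles."""

    def __init__(self):
        self.lock = asyncio.Lock()
        self.pending_roles = ["scheduler", "client"]  # one-off roles still unclaimed


async def negotiate_role(commworld):
    # Whoever takes the lock first claims the next one-off role;
    # everyone after that becomes a worker.
    async with commworld.lock:
        if commworld.pending_roles:
            return commworld.pending_roles.pop(0)
    return "worker"


async def main():
    commworld = AsyncCommWorld()
    roles = await asyncio.gather(*[negotiate_role(commworld) for _ in range(4)])
    print(roles)  # e.g. ['scheduler', 'client', 'worker', 'worker']


if __name__ == "__main__":
    asyncio.run(main())
```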
> The key difference between this and a `Cluster` is that the roles are worked out after the processes have been created.

From an implementation perspective the two may look alike, but conceptually I feel these are very different concepts, which is why I had a hard time following.

Conceptually I see the `Cluster` as the entity which is allowed, or even required, to talk to the hardware or resource manager (trying to avoid the "cluster" term here). It will talk to this hardware manager via some kind of API to spawn new `ServerNode` instances of type `{Scheduler|Worker}` somewhere else (a different process, a different VM, ...), while the `Runner` will spawn one `ServerNode` next to it, similar to what the `Nanny` does with the `Worker`.

Anyhow, I think that's a nice concept, it just took me a while to understand, and I think it is valuable to document this properly.

I'm wondering if we need some standardized interface for `AsyncCommWorld` or whether this is too backend specific.
> Anyhow, I think that's a nice concept, it just took me a while to understand, and I think it is valuable to document this properly.

Great. I think this is useful, but it may not be the final incarnation of it. I had a chat with @mrocklin yesterday about service discovery and leader election which is related to this change.

> I'm wondering if we need some standardized interface for `AsyncCommWorld` or whether this is too backend specific.

I think that may be too specific. That class was more of a necessary evil because the lock had to live somewhere.
```python
from ..worker import Worker


class Role(Enum):
```
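Presumably this enum covers the three roles a process can take; the values here are a guess, not the PR's actual definitions:

```python
from enum import Enum


class Role(Enum):
    scheduler = "scheduler"
    worker = "worker"
    client = "client"
```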
I think I would prefer a term like "type". In deploy environments, role reminds me of some IAM entity. Maybe that's just me :)
Perhaps `ServerRole` or `ServerType` would be better here.
Both would work for me. Dealer's choice
```python
async def before_scheduler_start(self) -> None:
    return None
```
Do you have an example of what kind of functionality would be executed in these hooks? Adding more functions later on is usually simpler than removing them again.
If you have a look at the MPI PR, some of the hooks are used there, but not all of them.
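As an illustration, a platform-specific subclass might use such a hook like this (`FooRunner` and the hook body are hypothetical, not taken from the MPI PR):

```python
class FooRunner(Runner):  # Runner is the base class added in this PR
    async def before_scheduler_start(self) -> None:
        # Hypothetical: prepare platform-specific state before the scheduler
        # starts, e.g. write out a config file or mount shared storage.
        print("about to start the scheduler on this node")
```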
Summarizing my viewpoint from my conversation with @jacobtomlinson: I think that we might not want to combine the Scheduler/Worker/Client hybrid class with the get/set-scheduler-address coordinator. I think that those two abstractions might be better living apart.
```python
commworld = AsyncCommWorld()


async def run_code(commworld):
    with suppress(SystemExit):
```
This seems atypical in Dask tests. If we go with some sort of dask-runner or dask-server class, I think that we should reconsider explicitly calling `sys.exit(0)` when that class finishes up.
This PR adds a new `Runner` base class with a similar intention to `Cluster`, in that it is for other projects to subclass and flesh out platform-specific implementations.

This work was inspired by dask-mpi, which has a deployment paradigm unlike any of the existing cluster management tooling.
Dask MPI assumes that the user wants to submit a single Python script to a parallel batch scheduler, resulting in all processes executing this script. Dask MPI then handles deciding which process will be the scheduler, which will be the workers and which will continue executing the user's client code. Negotiation between processes is done via the MPI ranking and comm.
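As a rough illustration of that rank-based negotiation (not dask-mpi's actual code, which lives in `dask_mpi.initialize()`):

```python
# Rough illustration only: assign roles from the MPI rank.
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()

if rank == 0:
    role = "scheduler"
elif rank == 1:
    role = "client"  # this rank carries on running the user's script
else:
    role = "worker"

print(f"rank {rank}: {role}")
```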
This kind of functionality would also be useful on similar systems which do not use MPI. The concept of running a single script and having all but one of the invocations bootstrap a Dask cluster, leaving one to complete the client work, is appealing.
The new `Runner` class takes the concepts from Dask MPI and attempts to mix them with the structure of the `Cluster` object in terms of asyncio support and the use of context managers.

Usage
The general usage for a runner should look like this.
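A minimal sketch, assuming a hypothetical `FooRunner` subclass and that the runner (or its scheduler address) can be handed to `Client`:

```python
from dask.distributed import Client

from foo_platform import FooRunner  # hypothetical package and Runner subclass

with FooRunner() as runner:
    # Every copy of this script reaches this line. The runner negotiates which
    # process becomes the scheduler, which become workers, and which carries on
    # as the client; non-client processes block in the runner until shutdown.
    with Client(runner) as client:  # assumes Client accepts the runner like a cluster object
        assert client.submit(lambda x: x + 1, 10).result() == 11
```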
This script should be submitted many times in parallel using an appropriate execution engine.
When the `FooRunner` is created, each instance of the script will negotiate via some platform specific means to decide which role each process will take. All but the client will block here and run their components. Then, when the client finishes and the runner context manager closes, all components will be shut down via the Dask comm.

Reference implementation
This PR also includes a reference implementation which uses asyncio to execute four instances of the `AsyncioRunner` concurrently. These instances negotiate via an `asyncio.Lock` to decide who is the scheduler, who continues with the client code and who are the workers.

Dask MPI
See dask/dask-mpi#69 for an implementation of this to replace the current code in dask-mpi.