
Allow the scheduler to dynamically add/remove workers #149

Open
jpsamaroo opened this issue Oct 5, 2020 · 8 comments


@jpsamaroo
Member

As discussed in #147 , it may benefit certain use cases to know when a worker is unused by Dagger entirely (specifically, no data cached on the worker) so that the worker can be removed from the Distributed pool.

@jpsamaroo jpsamaroo changed the title Provide indicator of when a worker is no longer used by any scheduler Allow the scheduler to dynamically add/remove processors Jun 21, 2021
@jpsamaroo jpsamaroo changed the title Allow the scheduler to dynamically add/remove processors Allow the scheduler to dynamically add/remove workers Jun 21, 2021
@jpsamaroo
Member Author

Expanding on this, it would be great if the scheduler could dynamically add new workers via Distributed whenever it believes that having extra workers would help decrease total runtime of the currently-loaded DAG. The scheduler would call a user-defined function to add workers, which could call into a custom ClusterManager. We would want to be able to specify what kinds of nodes are available (what kinds of processors and how many per node) so that, for example, GPU-only tasks would always have GPUs available. This would be interesting for interactive uses on HPC clusters or for accessing cloud platforms.
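A minimal sketch of what such a user-supplied hook might look like. The name `request_workers`, its signature, and the `gpus` keyword are purely illustrative assumptions; Dagger defines no such API today:

```julia
using Distributed

# Hypothetical callback the scheduler could invoke when it estimates
# that extra workers would shorten the DAG's total runtime. All names
# here are illustrative; this is not an existing Dagger API.
function request_workers(n::Int; gpus::Bool=false)
    if gpus
        # A custom ClusterManager could request GPU nodes here
        # (e.g. via a Slurm or cloud manager from ClusterManagers.jl).
        error("GPU node provisioning not implemented in this sketch")
    else
        return addprocs(n)  # fall back to plain local workers
    end
end
```

The scheduler would only need to know the function's contract (return the new worker pids); how the nodes are actually provisioned stays entirely in user code.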

@DrChainsaw I see from Discourse that this is probably something you'd be interested in.

@DrChainsaw
Contributor

This is something that would certainly come in handy for me! Let me know if you want me to test something out.

I do have a fear that I might have added some kind of seed of chaos here with #147 though. The day after #147 was merged there was this discussion in ClusterManagers and it seems like Distributed.jl is not designed for this type of dynamic usage (I felt a little bit like the intern who just pushed Integration Test Email #1 into production).

Or perhaps your proposed method will be more Distributed-friendly?

@jpsamaroo
Member Author

I think @vchuravy was pointing out that because Distributed was originally designed for HPC clusters where startup is all at once, not all cluster managers will handle this well, and that's to be expected. But that doesn't preclude Distributed from handling this properly for cluster managers that do support dynamic worker changes (such as the LocalManager and probably SSHManager). I don't see why Dagger shouldn't be able to rely on Distributed to support this in at least some cases.
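For reference, the `LocalManager` used by plain `addprocs(n)` already handles this kind of dynamic churn, which is the behavior Dagger would be relying on:

```julia
using Distributed

# Distributed's LocalManager supports adding and removing workers at
# any point in a session, not just at startup.
pids = addprocs(2)     # dynamically start two local workers
println(nworkers())    # 2
rmprocs(pids)          # dynamically remove them again
println(nworkers())    # 1 (with no workers, the master counts as one)
```

Whether a given `ClusterManager` tolerates this is up to that manager; the `LocalManager` and `SSHManager` paths do.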

@DrChainsaw
Contributor

I don't see why Dagger shouldn't be able to rely on Distributed to support this in at least some cases.

Alright, just wanted to point it out.

Oh, and in case the above was a polite request for a contribution, I'd be happy to help, but I feel a bit insecure w.r.t. how to make "it believes that having extra workers would help decrease total runtime".

Is there a straightforward way to do this? I suppose one could just trigger when the scheduler hits the hook with scheduled tasks and no workers, or even just outsource everything to the user (e.g. here is the current state, do whatever you want).

@jpsamaroo
Member Author

Oh, and in case the above was a polite request for a contribution I'd be happy to help

Not necessarily, I'm happy to do it as well (and the logic for starting/stopping workers is pretty trivial since you already added the logic to handle that in the scheduler).

but I feel a bit insecure w.r.t. how to make "it believes that having extra workers would help decrease total runtime".
Is there a straightforward way to do this? I suppose one could just trigger when the scheduler hits the hook with scheduled tasks and no workers, or even just outsource everything to the user (e.g. here is the current state, do whatever you want).

Yeah, that's the key thing to be determined. This is one of those features where it's probably best to let the user define when this should happen, but we could provide some default code for this (say, trigger when it's been X seconds without any scheduling progress, or if the estimated time to DAG completion is greater than X minutes).
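One way the "X seconds without scheduling progress" default could be sketched. Everything below (the struct, field names, and functions) is hypothetical, not actual Dagger code:

```julia
# Hypothetical default trigger: ask for more workers when no task has
# completed for `stall_limit` seconds while tasks are still queued.
mutable struct GrowthHeuristic
    last_progress::Float64   # time() of the most recent task completion
    stall_limit::Float64     # seconds without progress before growing
end

# The scheduler would call this each scheduling round.
function should_grow(h::GrowthHeuristic, queued_tasks::Int)
    return queued_tasks > 0 && (time() - h.last_progress) > h.stall_limit
end

# ...and this whenever a task finishes.
note_progress!(h::GrowthHeuristic) = (h.last_progress = time())
```

A user-defined trigger would just replace `should_grow` with their own predicate over the scheduler state.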

@DrChainsaw
Contributor

This is one of those features where it's probably best to let the user define when this should happen, but we could provide some default code for this

Sounds like a reasonable approach to me. Don't hesitate to ping if there is anything added in #147 which is confusing or if there is something to try out.

@kolia

kolia commented Aug 5, 2021

What is the story around initial loading of code on newly spun-up workers? Do you pass in a quote with all your `using Package` commands to be `eval`ed in the worker's `Main`?

@jpsamaroo
Copy link
Member Author

Generally I use `@everywhere using Package1, Package2, ...`, which works fine. Distributed's code-loading story isn't great right now, but it's what we've got.
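For workers added after an initial `@everywhere` call, the macro can also be targeted at just the new pids. A small sketch, using the stdlib `LinearAlgebra` as a stand-in for a real package:

```julia
using Distributed

# `@everywhere` runs the expression on every current process, so it
# covers all workers that exist at the time of the call:
addprocs(1)
@everywhere using LinearAlgebra

# Workers added *afterwards* did not see that call; pass their pids
# to @everywhere to load the code on just those processes:
new_pids = addprocs(1)
@everywhere new_pids using LinearAlgebra
```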
