
Allow the scheduler to dynamically add/remove workers #149

Open
jpsamaroo opened this issue Oct 5, 2020 · 8 comments


@jpsamaroo
Member

As discussed in #147 , it may benefit certain use cases to know when a worker is unused by Dagger entirely (specifically, no data cached on the worker) so that the worker can be removed from the Distributed pool.

@jpsamaroo jpsamaroo changed the title Provide indicator of when a worker is no longer used by any scheduler Allow the scheduler to dynamically add/remove processors Jun 21, 2021
@jpsamaroo jpsamaroo changed the title Allow the scheduler to dynamically add/remove processors Allow the scheduler to dynamically add/remove workers Jun 21, 2021
@jpsamaroo
Member Author

Expanding on this, it would be great if the scheduler could dynamically add new workers via Distributed whenever it believes that having extra workers would help decrease total runtime of the currently-loaded DAG. The scheduler would call a user-defined function to add workers, which could call into a custom ClusterManager. We would want to be able to specify what kinds of nodes are available (what kinds of processors and how many per node) so that, for example, GPU-only tasks would always have GPUs available. This would be interesting for interactive uses on HPC clusters or for accessing cloud platforms.
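A minimal sketch of what such a user-supplied hook might look like. The name `request_workers`, its signature, and the `gpus` keyword are purely illustrative assumptions; Dagger defines no such API today:

```julia
using Distributed

# Hypothetical callback the scheduler could invoke when it estimates
# that extra workers would shorten the DAG's total runtime. All names
# here are illustrative; this is not an existing Dagger API.
function request_workers(n::Int; gpus::Bool=false)
    if gpus
        # A custom ClusterManager could request GPU nodes here
        # (e.g. via a Slurm or cloud manager from ClusterManagers.jl).
        error("GPU node provisioning not implemented in this sketch")
    else
        return addprocs(n)  # fall back to plain local workers
    end
end
```

The scheduler would only need to know the function's contract (return the new worker pids); how the nodes are actually provisioned stays entirely in user code.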

@DrChainsaw I see from Discourse that this is probably something you'd be interested in.

@DrChainsaw
Contributor

This is something that would certainly come in handy for me! Let me know if you want me to test something out.

I do have a fear that I might have added some kind of seed of chaos here with #147 though. The day after #147 was merged there was this discussion in ClusterManagers and it seems like Distributed.jl is not designed for this type of dynamic usage (I felt a little bit like the intern who just pushed Integration Test Email #1 into production).

Or perhaps your proposed method will be more Distributed-friendly?

@jpsamaroo
Member Author

I think @vchuravy was pointing out that because Distributed was originally designed for HPC clusters where startup is all at once, not all cluster managers will handle this well, and that's to be expected. But that doesn't preclude Distributed from handling this properly for cluster managers that do support dynamic worker changes (such as the LocalManager and probably SSHManager). I don't see why Dagger shouldn't be able to rely on Distributed to support this in at least some cases.
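For reference, the `LocalManager` used by plain `addprocs(n)` already handles this kind of dynamic churn, which is the behavior Dagger would be relying on:

```julia
using Distributed

# Distributed's LocalManager supports adding and removing workers at
# any point in a session, not just at startup.
pids = addprocs(2)     # dynamically start two local workers
println(nworkers())    # 2
rmprocs(pids)          # dynamically remove them again
println(nworkers())    # 1 (with no workers, the master counts as one)
```

Whether a given `ClusterManager` tolerates this is up to that manager; the `LocalManager` and `SSHManager` paths do.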

@DrChainsaw
Contributor

I don't see why Dagger shouldn't be able to rely on Distributed to support this in at least some cases.

Alright, just wanted to point it out.

Oh, and in case the above was a polite request for a contribution, I'd be happy to help, but I feel a bit insecure w.r.t. how to make "it believes that having extra workers would help decrease total runtime".

Is there a straightforward way to do this? I suppose one could just trigger when the scheduler hits the hook with scheduled tasks and no workers, or even just outsource everything to the user (e.g. here is the current state, do whatever you want).

@jpsamaroo
Member Author

Oh, and in case the above was a polite request for a contribution I'd be happy to help

Not necessarily, I'm happy to do it as well (and the logic for starting/stopping workers is pretty trivial since you already added the logic to handle that in the scheduler).

but I feel a bit insecure w.r.t. how to make "it believes that having extra workers would help decrease total runtime".
Is there a straightforward way to do this? I suppose one could just trigger when the scheduler hits the hook with scheduled tasks and no workers, or even just outsource everything to the user (e.g. here is the current state, do whatever you want).

Yeah, that's the key thing to be determined. This is one of those features where it's probably best to let the user define when this should happen, but we could provide some default code for this (say, trigger when it's been X seconds without any scheduling progress, or if the estimated time to DAG completion is greater than X minutes).
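One way the "X seconds without scheduling progress" default could be sketched. Everything below (the struct, field names, and functions) is hypothetical, not actual Dagger code:

```julia
# Hypothetical default trigger: ask for more workers when no task has
# completed for `stall_limit` seconds while tasks are still queued.
mutable struct GrowthHeuristic
    last_progress::Float64   # time() of the most recent task completion
    stall_limit::Float64     # seconds without progress before growing
end

# The scheduler would call this each scheduling round.
function should_grow(h::GrowthHeuristic, queued_tasks::Int)
    return queued_tasks > 0 && (time() - h.last_progress) > h.stall_limit
end

# ...and this whenever a task finishes.
note_progress!(h::GrowthHeuristic) = (h.last_progress = time())
```

A user-defined trigger would just replace `should_grow` with their own predicate over the scheduler state.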

@DrChainsaw
Contributor

This is one of those features where it's probably best to let the user define when this should happen, but we could provide some default code for this

Sounds like a reasonable approach to me. Don't hesitate to ping if there is anything added in #147 which is confusing or if there is something to try out.

@kolia

kolia commented Aug 5, 2021

What is the story around initial loading of code on newly spun-up workers? Do you pass in a quote with all your `using Package` commands to be `eval`ed in the worker's `Main`?

@jpsamaroo
Copy link
Member Author

Generally I use `@everywhere using Package1, Package2, ...`, which works fine. Distributed's code-loading story isn't great right now, but it's what we've got.
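For workers added after an initial `@everywhere` call, the macro can also be targeted at just the new pids. A small sketch, using the stdlib `LinearAlgebra` as a stand-in for a real package:

```julia
using Distributed

# `@everywhere` runs the expression on every current process, so it
# covers all workers that exist at the time of the call:
addprocs(1)
@everywhere using LinearAlgebra

# Workers added *afterwards* did not see that call; pass their pids
# to @everywhere to load the code on just those processes:
new_pids = addprocs(1)
@everywhere new_pids using LinearAlgebra
```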
