
Heterogeneous scheduler? #103

Closed
bstrdsmkr opened this issue Jul 25, 2018 · 5 comments

@bstrdsmkr

We have a PBS cluster, but we also have several specialized nodes that are not part of the cluster, and we'd like to allow Dask tasks to be run either on the cluster's compute nodes or on the specialized nodes, transparently based on resources.

For example, we have some routine tasks that don't really require anything in order to run; they just need to be executed periodically. We also have computationally expensive tasks that COULD run anywhere, but should really be run through the cluster. Finally, we have tasks that require specialized hardware (such as GPUs), which is only attached to certain machines.

This looks like it would require a new type of scheduler, but before we start down that rabbit hole, is there something we're missing that makes this work already?

@mrocklin
Member

As you suggest, from a pure Dask perspective you can handle this by starting various dask-worker processes on the different machines and annotating them with things like resources.

What's missing is a nice way to let users control those different groups of workers using something like dask-jobqueue. This is also discussed in dask/distributed#2118 in the core project.
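
For reference, a minimal sketch of that resources approach, assuming a plain distributed setup; the scheduler address and the GPU resource name below are placeholders, not from this thread:

```python
from dask.distributed import Client

# Connect to the shared scheduler (placeholder address)
client = Client("tcp://scheduler-host:8786")

def expensive(x):
    return x ** 2

# Workers advertise resources at startup, e.g.:
#   dask-worker tcp://scheduler-host:8786 --resources "GPU=1"
# This task will then only be scheduled on workers carrying a GPU tag:
future = client.submit(expensive, 10, resources={"GPU": 1})
print(future.result())
```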

@guillaumeeb
Member

guillaumeeb commented Aug 1, 2018

What you're asking for currently seems better handled by a job scheduler like PBS or Slurm than by Dask.

Anyway, I feel this is more of a dask/distributed question than a dask-jobqueue one. Should we move this upstream to distributed?

Edit: I did not see the link to the distributed issue...

@bstrdsmkr
Author

I'm not sure where the discussion belongs, but it seems to me that any required change would need to be made in PBSCluster.

I think what I need is for PBSCluster to be able to dispatch jobs to attached workers that are not running on nodes controlled by the PBS server to which PBSCluster is attached.

It's entirely possible that PBSCluster already does this and I'm just doing it wrong. It is also entirely possible that the fix would be upstream in distributed.

@guillaumeeb
Member

I'm going to close this as stale.

One more answer, looking back at the last comment:

> I think what I need is for PBSCluster to be able to dispatch jobs to attached workers that are not running on nodes controlled by the PBS server to which PBSCluster is attached.
PBSCluster will only be able to start workers on PBS cluster(s) it has access to. However, you could start the Dask Scheduler on your cluster via PBSCluster, and then manually start other workers on nodes outside the PBS cluster, as long as they have access to the correct network.
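
Roughly, that manual approach might look like the following sketch, assuming a recent dask-jobqueue; the queue name, sizing options, and addresses are placeholders:

```python
from dask_jobqueue import PBSCluster

# Start the scheduler plus PBS-managed workers through the queue system
cluster = PBSCluster(queue="regular", cores=24, memory="100GB")
cluster.scale(jobs=2)  # two PBS-managed worker jobs

print(cluster.scheduler_address)
# On each node outside PBS (reachable on the same network),
# start a worker by hand and point it at that address:
#   dask-worker tcp://<scheduler-address> --resources "GPU=1"
```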

I'm not sure how this will work with the recent SpecCluster changes... but maybe the solution lies in implementing a more complex SpecCluster.

@mrocklin
Member

I think that the best answer today is to use SpecCluster rather than PBSCluster.

Fortunately, you can now mix and match job types. So you could have several PBSJob objects alongside a few distributed.deploy.ssh.SSHJob objects in your SpecCluster's worker_spec, and everything should work ok.

So nothing is done automatically for you, but all of the PBS and SSH logic should be handled. It's up to you how you arrange these jobs, though, and operations like scale won't necessarily work (unless you write a bit of code).

https://github.com/dask/distributed/blob/c2cc1a98cbaadc3e7952ea66e5096f0126492539/distributed/deploy/spec.py#L123-L150
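
A rough sketch of such a mixed worker_spec, assuming a dask-jobqueue version that exposes the PBSJob class; all options below are placeholders:

```python
from distributed import Scheduler, SpecCluster
from dask_jobqueue.pbs import PBSJob  # assumes a SpecCluster-based dask-jobqueue

# Placeholder job options; real values depend on your site configuration.
pbs = {"cls": PBSJob, "options": {"queue": "regular", "cores": 24, "memory": "100GB"}}

cluster = SpecCluster(
    scheduler={"cls": Scheduler, "options": {"dashboard_address": ":8787"}},
    workers={
        "pbs-0": pbs,
        "pbs-1": pbs,
        # Specialized nodes could be added here with an SSH-based job class,
        # as mentioned above; scale() across mixed job types needs extra code.
    },
)
```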
