
Support Multiple Worker Types #384

Closed
ljstrnadiii opened this issue Jan 18, 2022 · 5 comments

@ljstrnadiii

As a machine learning engineer, I would like to specify the resources for specific operations, just like in this example from the Dask tutorial.

I am pretty new to Dask, Kubernetes, and Helm, but I have had success deploying with Helm on GKE, and I think I have seen enough examples of using nodeSelector for GPU node pools, a DaemonSet for GPU driver installs, etc. that I could get those up and running easily. However, the machines with accelerators are often limited, or we need to scale CPU and GPU workers independently.

Specifically, I want to run essentially the example from that link:

with dask.annotate(resources={'GPU': 1}):
    # run each `process` call only on workers that advertise a GPU resource
    processed = [client.submit(process, d) for d in data]
with dask.annotate(resources={'MEMORY': 70e9}):
    # run the aggregation on a worker that advertises at least 70 GB of memory
    final = client.submit(aggregate, processed)
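For context: these annotations only take effect if the workers themselves are started advertising the matching resources, which is exactly the per-worker-type configuration this issue is about. A minimal sketch for sanity-checking that, with the scheduler address as a placeholder and the check poking at the scheduler's internal worker state:

# Workers declare abstract resources at start-up, e.g. on the CLI:
#   dask-worker tcp://<scheduler>:8786 --resources "GPU=1"
from dask.distributed import Client

client = Client("tcp://<scheduler>:8786")  # placeholder address

# Ask the scheduler which resources each connected worker advertises.
advertised = client.run_on_scheduler(
    lambda dask_scheduler: {
        addr: dict(ws.resources) for addr, ws in dask_scheduler.workers.items()
    }
)
print(advertised)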

From what I gather, this is not possible as-is with a Kubernetes-based deployment using HelmCluster or KubeCluster. I am also noticing that .scale is not designed for multiple worker types, which makes me think this might be a heavier lift than I expect, and that this issue might actually belong in dask/distributed.

async def _scale(self, n_workers):

I am seeing the blocker from 2018 here: dask/distributed#2118

I have been able to manually add a pod (with kubectl) using the worker-spec.yaml example here, and the scheduler shows it as available, whereas cluster.scale() has no effect on it. However, I honestly have no clue how to build a Deployment to control that pod spec, even if this hack worked.

Note: it looks like Ray is doing this here.

Anyways, a few questions:

  1. Is it currently possible to support multiple worker types? (in case I missed something)
  2. Is there a deployment hack to take a worker-spec.yaml and deploy an auxiliary set of these workers, i.e. an intermediate step (without support from dask-distributed) to easily supplement an existing worker set with high-memory or GPU workers, etc.?
@ljstrnadiii (Author) commented Jan 18, 2022

Example of a CPU limitation for a given GPU node:
[screenshot: Screen Shot 2022-01-17 at 5.32.14 PM]

@jacobtomlinson (Member)

Thanks for raising this @ljstrnadiii. Excited to see users asking for this.

Scaling multiple worker groups is not well supported by any of the helper tools in Dask today, including dask-kubernetes. The annotations support is still relatively new.

We intend to address this as part of #256.

The only workaround I can suggest today is that you install the Dask Helm Chart and then manually create a second deployment of workers with GPUs and handle the scaling yourself.
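
For the "handle the scaling yourself" part, a rough sketch using the official kubernetes Python client; the Deployment name and namespace below are placeholders for a manually created worker Deployment whose pods run dask-worker against the Helm release's scheduler Service with --resources "GPU=1" and a GPU nodeSelector:

# Rough sketch only; "dask-gpu-workers" and "default" are placeholders for a
# manually created GPU worker Deployment and its namespace.
from kubernetes import client as k8s, config

config.load_kube_config()
apps = k8s.AppsV1Api()

# Scale the auxiliary GPU worker Deployment by patching its scale subresource.
apps.patch_namespaced_deployment_scale(
    name="dask-gpu-workers",
    namespace="default",
    body={"spec": {"replicas": 4}},  # e.g. bring up 4 GPU workers
)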

@ljstrnadiii (Author)

This should be closed, yeah?

@ljstrnadiii (Author)

Really stoked to see progress on this!!!

def scale(self, n_workers, worker_group=None):

If this is working, which I hope to try this week, then I think this issue can be closed completely. I am not sure exactly where the annotation effort stands, though.
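
A rough usage sketch of where this appears to be heading; the scale(n_workers, worker_group=...) signature is the one quoted above, while the import path, the KubeCluster arguments, and the add_worker_group helper are assumptions about the operator-based API rather than confirmed details:

from dask_kubernetes.operator import KubeCluster  # assumed import path

cluster = KubeCluster(name="mlcluster")            # creates the default worker group
cluster.add_worker_group(name="gpu", n_workers=0)  # assumed helper for a second worker type
cluster.scale(10)                                  # scale the default group
cluster.scale(2, worker_group="gpu")               # scale only the "gpu" group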

@jacobtomlinson (Member)

Yeah, I'm going to close this out now. We have a blog post in the pipeline to demo this functionality: dask/dask-blog#130
