Adaptive doesn't scale quickly for long-running jobs #3627

Comments
The decision to scale up or down is currently coupled to the measurement of a task's runtime (similar tasks are grouped, see TaskPrefix). We're facing the same issue, see #3516. A workaround is to configure default task durations, which are used as long as no measurements are available yet, e.g. as sketched below.
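A minimal sketch of that workaround, assuming the distributed.scheduler.default-task-durations config key; the prefix name ("train-model") and the two-hour estimate are illustrative placeholders:

import dask

# Used by the scheduler for tasks whose key prefix is "train-model"
# until real runtime measurements become available.
dask.config.set(
    {"distributed.scheduler.default-task-durations": {"train-model": "2h"}}
)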
|
Does this get picked up dynamically, say when I start a cluster scheduler via dask-gateway, or for a particular graph? In other words, does it work to set it via the config context manager, or is it a startup-only configuration? And how are underscores vs. dashes handled in the prefix names (as far as I know they are generally interchangeable)?
|
afaik, yes, it is dynamic and the value of the config is read during task initialization.

Very good question; I believe dashes and underscores are not interchangeable here. Theoretically, the prefix should be whatever key_split infers from the key name. If you set key names yourself:

In [14]: key_split(f"my-func-name-{uuid.uuid4()}")
Out[14]: 'my-func-name'

In [15]: key_split(f"my-func_name-{uuid.uuid4()}")
Out[15]: 'my'

If you're executing your own functions via client.submit and/or dask.delayed, this snippet is probably more relevant. From my understanding, user-defined functions will never have a dash in the prefix, since the function name is taken literally; the dashes come in with dask internals, where the computations are "handwritten":

In [16]: def my_func_name():
    ...:     pass

In [19]: key_split(funcname(my_func_name) + "-" + str(uuid.uuid4()))
Out[19]: 'my_func_name' |
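Regarding the context-manager question, a minimal sketch assuming a local scheduler; the function name and the two-hour value are illustrative, and since the value appears to be read on the scheduler side, for dask-gateway or similar deployments the config would have to be set where the scheduler actually runs:

import dask
from dask.distributed import Client


def my_func_name():
    pass  # stand-in for a long-running task


# Per the key_split discussion above, the prefix for a plain user-defined
# function is just its (underscored) name.
with dask.config.set(
    {"distributed.scheduler.default-task-durations": {"my_func_name": "2h"}}
):
    client = Client()  # the in-process scheduler sees the config
    future = client.submit(my_func_name)
    future.result()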
Is there a correct incantation of this that would rectify the problem in #4471? |
I'm using dask to coordinate some long-running machine learning jobs. I've set up an adaptive cluster (with dask_jobqueue) that has a minimum of 5 workers and a maximum of 10. Each task I dispatch takes about two hours to run and consistently uses ~100% of the CPU available to it. However, the adaptive cluster doesn't seem to want to add any more workers: it sits at the minimum number and never increases. Is there some way to modify the scheduling policy so that the cluster scales up more aggressively?

I'm aware this isn't exactly the sort of job dask is designed to schedule; my impression is that it wants smaller, faster tasks. I think you might be able to modify Adaptive to use a different policy that's better suited for long-running jobs? But I spent some time digging into the source and got kind of lost. Any pointers would be helpful :)

My current workaround is ignoring Adaptive and scaling the cluster by hand, but I feel bad for taking up nodes longer than I need.
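For concreteness, a minimal sketch of an adaptive dask_jobqueue setup like the one described, assuming SLURM as the queueing system; the cluster class, resource values, and the training function are illustrative:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster


def train_model(seed):
    ...  # stand-in for a ~2 hour, CPU-bound training job
    return seed


# (The default-task-durations workaround discussed above would be applied
# here, before any tasks are submitted.)
cluster = SLURMCluster(cores=8, memory="32GB", walltime="04:00:00")
cluster.adapt(minimum=5, maximum=10)  # the limits described in the issue
client = Client(cluster)

futures = client.map(train_model, range(20))
results = client.gather(futures)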