workload not balancing during scale up on dask-gateway #5599
Comments
My workers and tasks do have resource constraints. In this case, each worker has 6 CPUs and 45 GB of memory, and this particular task requests 45 GB. Would the recent fix 96d4fd4 potentially solve this problem?
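For reference (not part of the original comment), this is roughly how such resource constraints are declared in Dask: the worker advertises a resource on its command line and each task requests some of it. The resource name `MEMORY`, the scheduler address, and `process_chunk` are placeholders, not taken from the report.

```python
from dask.distributed import Client

# Worker side (dask-worker CLI): advertise the resource, e.g.
#   dask-worker <scheduler-address> --resources "MEMORY=45e9"


def process_chunk(i):
    # placeholder for the real ~20-minute, memory-hungry task
    return i


client = Client("<scheduler-address>")  # placeholder scheduler address

# Task side: each task requires 45 GB of the advertised resource, so at most
# one such task can run at a time on a 45 GB worker.
futures = client.map(process_chunk, range(30), resources={"MEMORY": 45e9})
results = client.gather(futures)
```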
I pushed a docker image to our cluster with 96d4fd4 and it didn't seem to make a difference.
@chrisroat this sounds very similar to #5564 to me, which was recently fixed but not released yet. Can you try on …
Agreed with @gjoseph92's point that this seems similar to #5564. FWIW #5572 was included in the …
I reported this at 2021.12.0, and again later at 96d4fd4. In both cases I see this behavior. I've attached the scheduler/worker dumps in case they help with understanding what is happening.
Could the needs_info label be removed? I reported this at the requested release. I am starting to dig into these large cluster dump files and would appreciate any tips. I am now running 2022.1.0 and am attaching more cluster dump logs; in this case the workload doesn't fully balance even without autoscaling.
Even a patch/hack I could apply in a personal fork would be acceptable.
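Not from the original thread, but a minimal sketch of how one might start poking at the attached dumps, assuming they are plain gzip-compressed pickles of scheduler/worker state dictionaries:

```python
import gzip
import pickle

# Assuming the attached file is a gzipped pickle of a state dict
with gzip.open("scheduler_20211214.pkl.gz", "rb") as f:
    scheduler_state = pickle.load(f)

# Inspect the top-level structure before drilling into task/worker entries
print(type(scheduler_state))
if isinstance(scheduler_state, dict):
    print(list(scheduler_state.keys()))
```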
What happened:
I have a fully parallel workload of 30 tasks (no dependencies), each of which takes ~20 minutes. My cluster autoscales between 10 and 50 workers. When I start the task graph, the 30 jobs are distributed three apiece across the initial workers. Eventually the cluster scales to 30 workers, but sometimes the tasks are not redistributed.
Even after some jobs finish and their runtimes are known, I have seen 7 remaining jobs distributed 2-2-2-1 across 4 workers while plenty of workers sit empty.
What you expected to happen:
As new workers come online, they steal tasks from existing workers.
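As an aside (not from the original report), the expected behavior, idle workers taking tasks from busy peers, depends on the scheduler's work-stealing setting. It is on by default, and checking it in the deployment's config is a cheap first step:

```python
import dask

# Work stealing is enabled by default; if the deployment's config has turned
# it off, newly added workers will not take tasks from busy peers.
print(dask.config.get("distributed.scheduler.work-stealing"))
```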
Minimal Complete Verifiable Example:
I tried to replicate this on a local cluster but could not: things work as expected there, and tasks are redistributed immediately.
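For illustration only, here is a sketch of the kind of local reproduction attempt described above, with a short sleep standing in for the ~20-minute task and an adaptive `LocalCluster` standing in for dask-gateway's autoscaling:

```python
import time

from dask.distributed import Client, LocalCluster


def work(i):
    time.sleep(60)  # stand-in for the ~20-minute task
    return i


if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    cluster.adapt(minimum=2, maximum=10)  # mimic autoscaling between bounds
    client = Client(cluster)

    # 30 independent tasks; as the cluster scales up, they should spread out
    futures = client.map(work, range(30))
    print(client.gather(futures))
```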
Anything else we need to know?:
I'm attaching debug scheduler/worker logs, captured per @fjetter's script, at the point where 30 workers are available and 10 workers each hold 3 tasks.
Environment:
scheduler_20211214.pkl.gz
worker_20211214.pkl.gz