
workload not balancing during scale up on dask-gateway #5599

Open
chrisroat opened this issue Dec 14, 2021 · 6 comments

Labels: adaptive (All things relating to adaptive scaling), needs info (Needs further information from the user), stealing

Comments

@chrisroat (Contributor)

What happened:

I have a fully parallel workload of 30 tasks (no dependencies), each of which takes ~20 minutes. My cluster autoscales between 10 and 50 workers. When I start the task graph, the 30 tasks are distributed three apiece on each of the 10 initial workers. Eventually the cluster scales up to 30 workers, but sometimes the tasks do not redistribute.

Even after some jobs have finished, so their timing should be known, I have seen the 7 remaining jobs distributed 2-2-2-1 across 4 workers while plenty of workers sat idle.
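
Roughly, the setup looks like this (a sketch, not the original code; `process_chunk` and the Gateway defaults are stand-ins):

```python
# Sketch of the reported setup: a dask-gateway cluster autoscaling between
# 10 and 50 workers, running 30 independent ~20-minute tasks.
from dask_gateway import Gateway

gateway = Gateway()
cluster = gateway.new_cluster()
cluster.adapt(minimum=10, maximum=50)   # autoscaling range from the report
client = cluster.get_client()

def process_chunk(i):
    ...  # placeholder for ~20 minutes of independent work per task

futures = client.map(process_chunk, range(30))  # 30 tasks, no dependencies
results = client.gather(futures)
```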

What you expected to happen:

As new workers come online, they steal tasks from existing workers.

Minimal Complete Verifiable Example:

I tried to replicate this on a local cluster but could not. There, things work as expected and tasks are immediately redistributed.

Anything else we need to know?:

I'm attaching debug scheduler/worker logs, produced with @fjetter's script, captured at the point where 30 workers are available and 10 workers each have 3 tasks.
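
For what it's worth, recent distributed releases also provide Client.dump_cluster_state, which can produce an equivalent dump; a sketch, not the exact script used here:

```python
from distributed import Client

# Connect to the running scheduler (address is a placeholder) and ask it to
# dump scheduler and worker state to a gzip-compressed msgpack file.
# Requires a distributed release that includes Client.dump_cluster_state.
client = Client("tls://<scheduler-address>:8786")
client.dump_cluster_state("cluster_dump", format="msgpack")
```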

Environment:

  • Dask version: 2021.12.0
  • Python version: 3.8.12
  • Operating System: Ubuntu 20
  • Install method (conda, pip, source): conda

scheduler_20211214.pkl.gz

worker_20211214.pkl.gz

@chrisroat (Contributor, Author)

My workers and tasks do have resource constraints. In this case, each worker has 6 CPUs and 45 GB of memory, and this particular task requests 45 GB. Would the recent fix 96d4fd4 potentially solve this problem?
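
For context, the resource setup is along these lines (a sketch; the resource name "MEMORY" and the worker startup details are stand-ins for how our dask-gateway deployment is actually configured):

```python
from distributed import Client

# Workers advertise a custom resource at startup, e.g.
#   dask-worker <scheduler> --nthreads 6 --resources "MEMORY=45e9"
# Each task then requests that resource, so at most one of these 45 GB tasks
# runs on a worker at a time.
client = Client("tls://<scheduler-address>:8786")   # placeholder address

def process_chunk(i):
    ...  # stand-in for the real ~20-minute task

futures = client.map(process_chunk, range(30), resources={"MEMORY": 45e9})
```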

@chrisroat (Contributor, Author)

I deployed a Docker image with 96d4fd4 to our cluster and it didn't seem to make a difference.

@gjoseph92 (Collaborator)

@chrisroat this sounds very similar to #5564 to me, which was recently fixed but not released yet. Can you try on main or with #5572?

@gjoseph92 added the needs info label on Dec 16, 2021
@jrbourbeau (Member)

Agreed with @gjoseph92's point that this seems similar to #5564. FWIW #5572 was included in the distributed==2021.12.0 release, so you should be able to test things out by updating to that release too

@chrisroat (Contributor, Author)

I reported this at 2021.12.0 and again later at 96d4fd4; I see this behavior in both cases. I've attached the scheduler/worker dumps in case they help show what is happening.

@chrisroat (Contributor, Author)

Could the needs info label be removed? I reported this at the requested release. I am starting to dig into these large cluster dump files and would appreciate any tips.

I am now running 2022.1.0, and I'm attaching another cluster dump. In this case the workload doesn't fully balance even without autoscaling:

  • It's a 3D map_overlap, so there are lots of short upstream re-arranging tasks
  • stack-trim-getitem is the group of 680 fused tasks containing the long per-chunk calculation
  • I set the config default task duration for stack-trim-getitem to 500s, which helps mitigate this issue in my local scheduler tests (see the sketch after this list)
  • The cluster has 50+ workers with 15 threads each, i.e. 750+ threads
  • Not all 680 tasks are ready yet; some are stuck behind unfinished upstream re-arranging tasks
  • Only 12-13 threads are being used
  • One task has lots of the "pre-overlap" stitching tasks
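
The duration override mentioned above is set roughly like this (a sketch; it needs to be in effect where the scheduler runs, e.g. via the scheduler's config file or environment, and the task-name prefix comes from the fused task names above):

```python
import dask

# Tell the scheduler to assume 500s for tasks whose name starts with this
# prefix, instead of its default duration guess, which feeds the occupancy
# estimates used for assigning and stealing work.
dask.config.set({
    "distributed.scheduler.default-task-durations": {"stack-trim-getitem": "500s"}
})
```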

Even a patch or hack I could apply in a personal fork would be acceptable.

dump_765b081abd78484495e033a613140c02_20220122.msgpack.gz
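
In case it helps anyone else open it, this is roughly how I'm reading the dump (a sketch that assumes a gzip-compressed msgpack mapping with scheduler and worker sections; adjust to the actual layout):

```python
import gzip

import msgpack

# Assumes the dump is gzip-compressed msgpack; key names and nesting may differ.
with gzip.open("dump_765b081abd78484495e033a613140c02_20220122.msgpack.gz", "rb") as f:
    state = msgpack.unpack(f, raw=False, strict_map_key=False)

print(state.keys())  # expecting scheduler and worker sections
tasks = state.get("scheduler", {}).get("tasks", {})
print(len(tasks), "tasks in scheduler state")
```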
