Root-ish tasks all schedule onto one worker #6573
This is a work stealing problem. If we disable work stealing, it works as expected:

```python
import dask
import distributed

with dask.config.set({"distributed.scheduler.work-stealing": False}):
    client = distributed.Client(n_workers=4, threads_per_worker=1)

root = dask.delayed(lambda n: "x" * n)(dask.utils.parse_bytes("1MiB"), dask_key_name="root")
results = [dask.delayed(lambda *args: None)(root, i) for i in range(10000)]
dask.compute(results)
```
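To confirm the placement, the per-worker key counts can be printed the same way as in the snippet further down. This is a sketch, not part of the original comment: it assumes a `LocalCluster` (so the scheduler object is reachable in-process via `client.cluster`) and that the results are persisted rather than computed, so the keys stay on the workers.

```python
# Sketch: persist instead of compute so the result keys remain on the workers.
persisted = dask.persist(results)
distributed.wait(persisted)

# LocalCluster only: the scheduler object is reachable in-process.
for ws in client.cluster.scheduler.workers.values():
    print(ws.address, len(ws.has_what))
```

If scheduling works as expected, each of the four workers should report roughly a quarter of the 10,000 keys.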
I looked into this briefly today and could narrow it down to a couple of issues. Whether or not unknown tasks (i.e. tasks with an unknown duration) are allowed to be stolen is actually a disputed topic.
The initial imbalance is caused by work stealing selecting potential victims greedily when there are no saturated workers around. Specifically, the following lines:

distributed/distributed/stealing.py Lines 424 to 429 in 99a2db1
Therefore, as soon as at least one of the workers is classified as idle because it is slightly faster than another worker, this idle worker is allowed to steal work from basically everywhere, causing all work to gravitate towards this specific worker. Finally, this for-loop exhausts the entire stealable set of the targeted victim/"saturated" worker without accounting for any in-flight occupancy; in-flight occupancy is only used for sorting and therefore for picking a victim.
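As an illustration of the behaviour described above, here is a minimal, self-contained sketch of the greedy fallback; it is not the actual code in `stealing.py`, and all names and data structures are made up:

```python
# Illustrative sketch of the greedy victim selection described above.
# NOT the actual distributed/stealing.py code; names and structures are invented.

def pick_victims(workers: dict, idle: set, saturated: set) -> list:
    """Choose workers to steal from."""
    if saturated:
        return list(saturated)
    # Greedy fallback: with no saturated workers around, every non-idle worker
    # becomes a potential victim, so a single "idle" worker may steal from everyone.
    return [w for w in workers if w not in idle]

def steal(thief: str, victims: list, stealable: dict, assignments: dict) -> None:
    for victim in victims:
        # The victim's entire stealable set is drained; the thief's growing
        # in-flight occupancy is never charged against it while iterating.
        while stealable[victim]:
            task = stealable[victim].pop()
            assignments[task] = thief

# Tiny demo: w1 finished slightly earlier and is therefore classified as idle.
workers = {"w1": None, "w2": None, "w3": None, "w4": None}
stealable = {"w1": set(), "w2": {"t1", "t2"}, "w3": {"t3"}, "w4": {"t4", "t5"}}
assignments = {}
steal("w1", pick_victims(workers, idle={"w1"}, saturated=set()), stealable, assignments)
print(assignments)  # all five tasks end up on w1
```

The point is only the shape of the logic: once one worker looks idle and nobody is saturated, work gravitates to that one worker.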
I investigated what's causing the initial spike of stealing events. Very early in the computation we see that slightly less than 7.5k stealing decisions are enacted, which causes this initial imbalance. Assuming perfect initial task placement, this is explained by the way we update occupancies:

- Right after the root task finishes and all tasks are assigned, we can see that the occupancies are already biased due to #7004.
- This bias is amplified in this example by a double-counting problem of occupancies (#7003).
- After an update of the task duration, the round-robin occupancy recalculation will reevaluate this, causing the occupancy of the selected worker to drop dramatically in this example, since the unknown-task duration defaults to 0.5s but the actual runtime of these tasks is much smaller. This drastic drop causes this worker to be classified as idle, and it will steal all the tasks.
- Even if this round-robin recalculation is replaced with an exact, "always compute all workers" function, the bias introduced by the double counting causes the worker holding the dependency to always be classified as idle, which causes heavy stealing as well (xref #5243).
- Due to how we determine whether keys are allowed to be stolen, this imbalance may never be corrected again, see distributed/distributed/stealing.py Lines 225 to 233 in 3655f13.

In this specific case, it will never be rebalanced again because of the fast compute time of the tasks.
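To make the occupancy effect concrete, here is a back-of-the-envelope illustration. All numbers are assumed for illustration (the 0.5s default for unknown task durations is the one mentioned above), and the idle check is deliberately simplified:

```python
# Back-of-the-envelope illustration of the occupancy drop described above.
# Numbers are assumed; the idle classification is a simplification.
n_tasks_per_worker = 2500     # 10,000 tasks spread over 4 workers
unknown_duration = 0.5        # default estimate for tasks of unknown duration (s)
measured_duration = 0.001     # assumed actual runtime of these tiny tasks (s)

# Before the duration update, every worker is estimated at the default.
occupancy_before = n_tasks_per_worker * unknown_duration    # 1250 s

# After one worker's tasks are re-estimated from a real measurement,
# its occupancy collapses while the others keep the inflated estimate.
occupancy_after = n_tasks_per_worker * measured_duration    # 2.5 s

average = (3 * occupancy_before + occupancy_after) / 4      # ~938 s
looks_idle = occupancy_after < average / 2                  # True -> starts stealing
print(occupancy_before, occupancy_after, looks_idle)
```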
The reproducer no longer works after #7036.

```python
import dask
import distributed

client = distributed.Client(n_workers=4, threads_per_worker=1)

root = dask.delayed(lambda n: "x" * n)(dask.utils.parse_bytes("1MiB"), dask_key_name="root")
results = [dask.delayed(lambda *args: None)(root, i) for i in range(10000)]
r2 = dask.persist(results)
distributed.wait(r2)

for ws in client.cluster.scheduler.workers.values():
    print(ws.address, len(ws.has_what))
```

Before #7036:

After #7036:
Initially a few `results` tasks run on other workers, but after about 0.5 sec, all tasks are running on a single worker and the other three are idle. I would have expected these tasks to be evenly assigned to all workers up front.

Some variables to play with: if `dask_key_name="root"` is removed, then all tasks (including the root) will run on the same worker. I assume this is because they have similar key names (`lambda`) and therefore the same task group, and some scheduling heuristics are based not on graph structure but on naming heuristics.

Distributed version: 2022.6.0
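For reference, this is a sketch of the variant described above, based on the reproducer from the first comment: without the explicit `dask_key_name="root"`, the root task gets an auto-generated `lambda-...` key like its dependents, which is the naming overlap the hypothesis above refers to.

```python
import dask
import distributed

client = distributed.Client(n_workers=4, threads_per_worker=1)

# Same reproducer as above, but without dask_key_name="root": the root task now
# gets an auto-generated "lambda-<token>" key, just like its dependents, so they
# share the name prefix that the naming-based heuristics mentioned above look at.
root = dask.delayed(lambda n: "x" * n)(dask.utils.parse_bytes("1MiB"))
results = [dask.delayed(lambda *args: None)(root, i) for i in range(10000)]
dask.compute(results)
```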