Consider candidates that don't hold any dependencies in decide_worker #4925

Open
wants to merge 25 commits into base: main

Changes from 6 commits (25 commits total)
5b17f55  Test workers without deps are considered (gjoseph92, Jun 17, 2021)
4a81ca8  Consider random subset of workers in decide_worker (gjoseph92, Jun 17, 2021)
c57fd72  no-sleep test (gjoseph92, Jun 17, 2021)
d810d2d  Comment fastpath. Maybe this is still unnecessary? (gjoseph92, Jun 17, 2021)
768d660  Pick from idle workers first (gjoseph92, Jun 17, 2021)
346ab17  Update `many_independent_leaves` test (gjoseph92, Jun 18, 2021)
420c99e  Uppercase Mb (gjoseph92, Jun 18, 2021)
0a004b2  move N_RANDOM_WORKERS within conditional (gjoseph92, Jun 18, 2021)
b050d14  Pass in sortedcontainers values, not pydict values (gjoseph92, Jun 18, 2021)
9e99b7f  Use sleep test again (gjoseph92, Jun 18, 2021)
f6acdc4  Simpler logic (gjoseph92, Jun 18, 2021)
524da73  20 -> 10 (gjoseph92, Jun 18, 2021)
a5d37ae  Over-optimized (gjoseph92, Jun 18, 2021)
5842ca8  Revert "Over-optimized" (gjoseph92, Jun 18, 2021)
a159245  `random_choices_iter`. over-optimized for now. (gjoseph92, Jun 18, 2021)
bb991d1  use `random.choices` (gjoseph92, Jun 18, 2021)
58b4bf8  REBASEME Actor: don't hold key references on workers (gjoseph92, Jun 19, 2021)
13975cb  Remove flaky data-length check (gjoseph92, Jun 21, 2021)
fcb165e  No randomness if < 10 workers to choose from (gjoseph92, Jun 21, 2021)
cd382c6  Ensure `decide_worker` args are plain dict_values (gjoseph92, Jun 21, 2021)
cc57a8b  1 worker for `test_statistical_profiling` (gjoseph92, Jun 22, 2021)
13911bc  no conditional on compiled (gjoseph92, Jun 22, 2021)
f2445fe  rerun tests (gjoseph92, Jun 22, 2021)
38e6b57  Merge remote-tracking branch 'upstream/main' into decide_worker/add-r… (gjoseph92, Jul 20, 2021)
5794540  fix errant actor test (gjoseph92, Jul 20, 2021)
28 changes: 24 additions & 4 deletions distributed/scheduler.py
@@ -2325,6 +2325,7 @@ def decide_worker(self, ts: TaskState) -> WorkerState:
             ws = decide_worker(
                 ts,
                 self._workers_dv.values(),
+                self._idle_dv.values(),
                 valid_workers,
                 partial(self.worker_objective, ts),
             )
@@ -7459,14 +7460,19 @@ def _reevaluate_occupancy_worker(state: SchedulerState, ws: WorkerState):
 @cfunc
 @exceptval(check=False)
 def decide_worker(
-    ts: TaskState, all_workers, valid_workers: set, objective
+    ts: TaskState,
+    all_workers: sortedcontainers.SortedValuesView,
+    idle_workers: sortedcontainers.SortedValuesView,
+    valid_workers: set,
+    objective,
 ) -> WorkerState:
     """
     Decide which worker should take task *ts*.

-    We choose the worker that has the data on which *ts* depends.
+    We consider all workers which hold dependencies of *ts*,
+    plus a sample of 20 random workers (with preference for idle ones).

-    If several workers have dependencies then we choose the less-busy worker.
+    From those, we choose the worker where the *objective* function is minimized.

     Optionally provide *valid_workers* of where jobs are allowed to occur
     (if all workers are allowed to take the task, pass None instead).
@@ -7476,6 +7482,9 @@ def decide_worker(
     of bytes sent between workers. This is determined by calling the
     *objective* function.
     """
+    # TODO should it be a bounded fraction of `len(all_workers)`?
+    N_RANDOM_WORKERS: Py_ssize_t = 20
+
     ws: WorkerState = None
     wws: WorkerState
     dts: TaskState
@@ -7486,6 +7495,17 @@ def decide_worker(
         candidates = set(all_workers)
     else:
         candidates = {wws for dts in deps for wws in dts._who_has}
+        # Add some random workers into `candidates`, starting with idle ones
+        # TODO shuffle to prevent hotspots?
+        candidates.update(idle_workers[:N_RANDOM_WORKERS])
+        if len(idle_workers) < N_RANDOM_WORKERS:
+            sample_from = (
+                list(valid_workers) if valid_workers is not None else all_workers
+            )
+            candidates.update(
+                random.sample(sample_from, min(N_RANDOM_WORKERS, len(sample_from)))
+                # ^ NOTE: `min` because `random.sample` errors if `len(sample) < k`
+            )
     if valid_workers is None:
         if not candidates:
             candidates = set(all_workers)
Member:

Now that we have idle_workers in this function should this be idle_workers or all_workers?

Collaborator Author:

Interesting; yes it probably should be.
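
For reference, the change being floated in this thread might look something like the sketch below. It is not part of the diff; it assumes `idle_workers` can be empty, so the all-workers fallback is kept:

if valid_workers is None:
    if not candidates:
        # Prefer idle workers when the task gives us nothing else to go on,
        # falling back to the full worker set if none are idle
        candidates = set(idle_workers) or set(all_workers)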

@@ -7495,7 +7515,7 @@ def decide_worker(
             candidates = valid_workers
             if not candidates:
                 if ts._loose_restrictions:
-                    ws = decide_worker(ts, all_workers, None, objective)
+                    ws = decide_worker(ts, all_workers, idle_workers, None, objective)
                 return ws

     ncandidates: Py_ssize_t = len(candidates)
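
Stripped of the Cython annotations, the selection logic in this diff amounts to roughly the standalone sketch below. It is an illustration only; `choose_worker` and its parameter names are made up here, while the constant and the sampling fallback mirror the hunks above and the final pick minimizes the *objective*, as the docstring describes:

import random

N_RANDOM_WORKERS = 20  # same constant as in the diff above

def choose_worker(dep_holders, idle_workers, all_workers, objective):
    # Start from workers that already hold a dependency of the task
    candidates = set(dep_holders)
    # Top up with idle workers first, then a random sample of the whole pool
    candidates.update(idle_workers[:N_RANDOM_WORKERS])
    if len(idle_workers) < N_RANDOM_WORKERS:
        pool = list(all_workers)
        candidates.update(random.sample(pool, min(N_RANDOM_WORKERS, len(pool))))
    # With no dependencies and no idle workers, fall back to everyone
    if not candidates:
        candidates = set(all_workers)
    # Among the candidates, pick the one minimizing the objective
    # (estimated start time, bytes to transfer, ...)
    return min(candidates, key=objective) if candidates else None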
26 changes: 22 additions & 4 deletions distributed/tests/test_scheduler.py
@@ -100,14 +100,16 @@ async def test_recompute_released_results(c, s, a, b):
     assert result == 1


-@gen_cluster(client=True)
+@gen_cluster(client=True, config={"distributed.scheduler.bandwidth": "1mb"})
 async def test_decide_worker_with_many_independent_leaves(c, s, a, b):
+    # Make data large to penalize scheduling dependent tasks on other workers
+    ballast = b"\0" * int(s.bandwidth)
     xs = await asyncio.gather(
-        c.scatter(list(range(0, 100, 2)), workers=a.address),
-        c.scatter(list(range(1, 100, 2)), workers=b.address),
+        c.scatter([bytes(i) + ballast for i in range(0, 100, 2)], workers=a.address),
+        c.scatter([bytes(i) + ballast for i in range(1, 100, 2)], workers=b.address),
     )
     xs = list(concat(zip(*xs)))
-    ys = [delayed(inc)(x) for x in xs]
+    ys = [delayed(lambda s: s[0])(x) for x in xs]

     y2s = c.persist(ys)
     await wait(y2s)
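
As a quick sanity check on why the ballast matters, using the numbers configured in this test (a rough estimate, mirroring the scheduler's bytes-over-bandwidth transfer model):

bandwidth = 1_000_000              # "1mb" from the @gen_cluster config, in bytes/s
ballast_bytes = bandwidth          # ballast = b"\0" * int(s.bandwidth)
transfer_estimate = ballast_bytes / bandwidth  # ~1 second per dependency moved
# A ~1 s estimated transfer dwarfs the microsecond-scale task, so each
# dependent task should be scheduled on the worker already holding its input.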
@@ -126,6 +128,22 @@ async def test_decide_worker_with_restrictions(client, s, a, b, c):
     assert x.key in a.data or x.key in b.data


+@gen_cluster(
+    client=True,
+    nthreads=[("127.0.0.1", 1)] * 3,
+    config={"distributed.scheduler.work-stealing": False},
+)
+async def test_decide_worker_select_candidate_holding_no_deps(client, s, a, b, c):
+    root = await client.scatter(1)
+    assert sum(root.key in worker.data for worker in [a, b, c]) == 1
+
+    tasks = client.map(inc, [root] * 6, pure=False)
+    await wait(tasks)
+
+    assert all(root.key in worker.data for worker in [a, b, c])
+    assert len(a.data) == len(b.data) == len(c.data) == 3
Member:

Actually, now that I look at this again, maybe we do need the sleep.

If we're scheduling incs then it's not unreasonable to schedule six of these things on one worker; that's where the data is. The computation is very cheap, but then again, the dependency is also very small. We may need the scheduler to learn that slowinc is slow, and so heavy relative to the int dependency.

Maybe

await client.submit(slowinc, 10, delay=0.1)  # learn that slowinc is slow
root = await client.scatter(1)

futures = client.map(slowinc, [root] * 6, delay=0.1, pure=False)
await wait(futures)

This way we're confident that the computational cost of the future will far outweigh the communication cost, but things are still fast-ish.

Collaborator Author:

Yeah, that's why I originally went with sleep. It also feels more like the case we're trying to test for. Though the test still passes with 6d91816 (and note that that test fails on main), so I'm not sure what we're thinking about wrong?

Member:

Currently 28 bytes / 100 MB/s is still smaller than however long inc takes to run (a few microseconds, probably). It would be safer to bump this up to milliseconds, though.

gen_cluster tests can run in less than a second, but more than 100ms. So sleeps of around that 100ms time are more welcome than 1s sleeps.

Collaborator Author:

The test is passing as-is; do you still think I should change it back to sleep?

Member:

It would become a problem if inc ran in 28 bytes / 100 MB/s = 0.28 µs or less. Inc does actually run this fast.

In [1]: 28 / 100e6
Out[1]: 2.8e-07

In [2]: def inc(x):
   ...:     return x + 1
   ...: 

In [3]: %timeit inc(1)
64.2 ns ± 0.914 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

But presumably the infrastructure around the inc is taking longer. If our timing infrastructure got much faster then this might result in intermittent errors. It probably doesn't matter at this scale, but it's probably a good idea to make the change if it only takes a few minutes.

We might also introduce a penalty of something like 1ms for any communication (I'm surprised that we don't do this today actually) which might tip the scales in the future.

Collaborator Author:

I'll switch back to a short sleep for now.

> We might also introduce a penalty of something like 1ms for any communication

That seems very worth doing.
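
Put together, the sleep-based variant being settled on here might look roughly like the following. This is a sketch only: it assumes the test module's existing imports (gen_cluster, wait) plus slowinc from distributed.utils_test, and omits the data-length assertion (removed as flaky in a later commit).

@gen_cluster(
    client=True,
    nthreads=[("127.0.0.1", 1)] * 3,
    config={"distributed.scheduler.work-stealing": False},
)
async def test_decide_worker_select_candidate_holding_no_deps(client, s, a, b, c):
    await client.submit(slowinc, 10, delay=0.1)  # learn that slowinc is slow
    root = await client.scatter(1)
    assert sum(root.key in worker.data for worker in [a, b, c]) == 1

    tasks = client.map(slowinc, [root] * 6, delay=0.1, pure=False)
    await wait(tasks)

    # Task cost now far outweighs the tiny transfer cost, so the work
    # (and a copy of `root`) should end up on every worker
    assert all(root.key in worker.data for worker in [a, b, c])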

Member:

We maybe used to do this? It might be worth checking the logs. The worker_objective function is pretty clean if you wanted to add this. We should probably wait until we find a case where it comes up in an impactful way, though, just to keep from cluttering things.

Collaborator Author:

Yeah I think it should be a separate PR for cleanliness
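
For context, a fixed per-transfer penalty along the lines discussed above could look roughly like this simplified objective. It is a sketch only, not the actual Scheduler.worker_objective; the attribute names and the 1 ms constant are illustrative:

PENALTY_PER_TRANSFER = 0.001  # 1 ms per dependency transfer; illustrative only

def worker_objective_sketch(ts, ws, bandwidth):
    """Rough 'earliest start time' estimate for scheduling task *ts* on worker *ws*."""
    comm_bytes = 0
    n_transfers = 0
    for dts in ts.dependencies:
        if ws not in dts.who_has:  # this dependency would have to be transferred
            comm_bytes += dts.get_nbytes()
            n_transfers += 1
    stack_time = ws.occupancy / ws.nthreads
    return stack_time + comm_bytes / bandwidth + n_transfers * PENALTY_PER_TRANSFER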



@gen_cluster(client=True, nthreads=[("127.0.0.1", 1)] * 3)
async def test_move_data_over_break_restrictions(client, s, a, b, c):
[x] = await client.scatter([1], workers=b.address)