Tune work stealing to be data size aware #278
Conversation
               sum(self.nbytes.get(d, 1000) for d in
                   self.dependencies[t]) > 1000000]
        bads = set(bad)
        good = [t for t in tasks if t not in bads]
This was a hack and needs to be replaced.
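For context, the filter shown above can be read as the following standalone sketch; the wrapper function and its signature are invented for illustration, only the inner logic mirrors the snippet:

```python
# Hypothetical standalone version of the size-based filter above: a task
# is a bad stealing candidate if its dependencies total more than ~1 MB.
def split_tasks_by_size(tasks, dependencies, nbytes, limit=1000000):
    # nbytes maps key -> size in bytes; unknown sizes default to 1 kB
    bad = [t for t in tasks
           if sum(nbytes.get(d, 1000) for d in dependencies[t]) > limit]
    bads = set(bad)
    good = [t for t in tasks if t not in bads]
    return good, bads
```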
@jcrist @martindurant if either of you are interested in eventually learning more about the distributed scheduler, this PR is somewhat bite-sized and could use review.
Force-pushed from 8834d99 to a654da7
OK, this fails in some workloads (takes up 80% of our scheduling budget) because it's non-constant per new task. When we have many idle and saturated workers with tasks that can't easily be shared (too much data transfer to warrant the movement) we end up cycling through all workers over and over again on each update. I'll change things around to only respond to newly saturated or newly idle workers.
OK, this is efficient for the shuffle problem (which is particularly demanding).
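A hedged sketch of that incremental approach; the class, attribute names, and threshold below are made up for illustration and are not the scheduler's real data structures:

```python
# Illustrative only: react to workers whose state just changed instead of
# rescanning every worker on each update.
class StealingState(object):
    def __init__(self, saturation_threshold=2):
        self.saturation_threshold = saturation_threshold
        self.idle = set()
        self.saturated = set()

    def worker_changed(self, worker, queue_length):
        """Report only *newly* idle or newly saturated workers, so the
        balancing step can touch just those rather than all workers."""
        if queue_length == 0 and worker not in self.idle:
            self.saturated.discard(worker)
            self.idle.add(worker)
            return 'idle'        # a new potential thief
        if (queue_length > self.saturation_threshold
                and worker not in self.saturated):
            self.idle.discard(worker)
            self.saturated.add(worker)
            return 'saturated'   # a new potential victim
        return None
```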
                # self.ensure_occupied(new_worker)
            else:
                self.ready.appendleft(key)
                # self.ensure_idle_ready()

    def should_steal(self, key, bandwidth=100e6):
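Judging from the signature above and the PR description ("we now inspect the expected communication and computation time before deciding to steal a task"), the check plausibly compares compute time against transfer time. A standalone sketch of that idea, not the PR's actual code:

```python
# Size-aware stealing check (illustrative): steal only when the expected
# computation time outweighs the cost of moving the task's dependencies.
def worth_stealing(compute_time, dependency_nbytes, bandwidth=100e6):
    # compute_time in seconds, dependency_nbytes in bytes,
    # bandwidth in bytes per second (100 MB/s by default, as above)
    transfer_time = dependency_nbytes / bandwidth
    return compute_time > transfer_time
```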
Is there any attempt to estimate the bandwidth?
No
Maybe using a module level constant and/or an environment variable would allow people with high end hardware to tune this to better reflect their empirically observed bandwidth.
Added a module level constant. I think we should specify these sorts of things with config files eventually when there are other such parameters like choice of compression.
At the moment I'm not too concerned with bandwidth for work-stealing. For most tasks I've seen the computation / communication ratio is either very high or very low.
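For reference, the reviewer's suggestion could look roughly like this; the environment variable name is hypothetical, and the PR itself only adds a plain constant:

```python
import os

# Module-level default bandwidth in bytes per second, optionally
# overridden from the environment for high-end hardware.
BANDWIDTH = float(os.environ.get('DASK_BANDWIDTH', 100e6))
```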
I see no check for whether the potential thief is on the same machine as the victim, whereas I would have thought that's important for communication bandwidth. Do you plan to also steal based on memory exhaustion sometime?
--------------------------

If a task has been specifically restricted to run on particular workers (such
as is the case when special hardware is required) then we do not steal.
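In code, that rule reads roughly as the sketch below, assuming the scheduler keeps a mapping of key -> allowed workers (the name `restrictions` is assumed here):

```python
# Restricted tasks are never stolen; stealing within the allowed set is
# simply not implemented rather than prohibited in principle.
def steal_allowed(key, restrictions):
    return key not in restrictions
```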
Stealing amongst the set of allowed workers is prohibited for a hard restriction?
It's not prohibited, just not implemented. Generally there is a high cost for adding any logic here (error prone, very performance sensitive) so I've tended to avoid some of the more fringe cases. They're definitely valid; they just haven't yet shown themselves to be worth the development and maintenance cost.
True, although in practice intra-node communication (300MB/s + serialization) is not hugely different from inter-node communication (100MB/s + serialization) so at the moment I'm tempted to avoid caring.
Perhaps, as applications demand.
The general lesson so far has been that stealing work definitely helps robustness for computations. Lots of computations are pretty terrible without it. However, at the same time it introduces a lot of complexity that is hard to reason about and verify. It's been a bit of a rough ride so far.
@gen_cluster(executor=True, ncores=[('127.0.0.1', 1)] * 10)
def test_worksteal_many_thieves(e, s, *workers):
    np = pytest.importorskip('numpy')
numpy is unused here.
Fixed
Future work here I think is to have workers send tasks back to the scheduler once it's clear that they're over-burdened. I've run into cases where this occurs, especially when the length of functions varies significantly. This could be done by maintaining excess tasks on the worker in a queue and running a periodic callback to filter out any task that had not been sent to the thread-pool-executor since the previous iteration.
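A hedged sketch of that future-work idea, not an implemented feature; `excess_tasks` and `return_to_scheduler` are hypothetical worker attributes used only to show the shape of the periodic callback:

```python
from tornado.ioloop import PeriodicCallback

def install_task_return(worker, interval_ms=500):
    seen = set()  # keys that were already queued at the previous tick

    def check():
        # Anything still waiting since the last tick never reached the
        # thread pool executor, so hand it back for rescheduling.
        stale = [key for key in worker.excess_tasks if key in seen]
        if stale:
            worker.return_to_scheduler(stale)  # hypothetical method
        seen.clear()
        seen.update(worker.excess_tasks)

    pc = PeriodicCallback(check, interval_ms)
    pc.start()
    return pc
```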
Force-pushed from 5179198 to bdff777
If there are no other comments then I may merge this tomorrow.
That's an interesting idea. I think it would be worth maintaining a suite of workloads that highlight both common and pathological scheduling behaviors to make sure that we do not introduce performance regressions when refactoring the scheduling logic.
``log(n)`` cost to the common case.

Instead we allow Python to iterate through the set of saturated workers however
it finds to be the most efficient.
I don't understand the phrasing of this sentence. Could you please try to make it more explicit?
Rephrased
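For illustration, a tiny sketch of the trade-off the passage describes, with made-up data: keeping victims in an ordered structure costs log(n) per update, while a plain set can be scanned in whatever order is cheapest:

```python
import heapq

occupancy = {'worker-1': 5, 'worker-2': 0, 'worker-3': 9}

# Avoided: an ordered structure that must be maintained on every update.
heap = [(occ, w) for w, occ in occupancy.items()]
heapq.heapify(heap)

# Described approach: an unordered set of saturated workers.
saturated = {w for w, occ in occupancy.items() if occ > 1}
for victim in saturated:
    pass  # consider stealing from this worker
```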
This sounds good to me. Can you recommend benchmark suites from a couple of other projects? I'd like to see how other people do this. What do you suggest?
Currently my solution to avoid performance regressions is to add unit tests to ensure specific behavior:

```python
@gen_cluster(executor=True, ncores=[('127.0.0.1', 1)] * 2)
def test_even_load_on_startup(e, s, a, b):
    x, y = e.map(inc, [1, 2])
    yield _wait([x, y])
    assert len(a.data) == len(b.data) == 1
```

However, I usually add these only after I notice that something is broken. I've been testing applications by hand. It'd be nice to automate alerts here.
I'm planning to handle performance benchmarks after this. This branch holds some fairly important fixes. I'd like to merge. Any further comments?
LGTM. I have not tested this branch on a real cluster / workload but I guess we can always fine-tune later.
BTW +1 for asv as a benchmarking framework. While I have never used it myself, it looks really neat.
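For reference, an asv benchmark is just a class with `time_*` methods; the suite below is a made-up sketch of the kind of workload one might track, not code from this repository:

```python
class ShuffleLikeSuite(object):
    # asv discovers setup() and time_* methods automatically.
    def setup(self):
        self.data = list(range(100000))

    def time_group_and_sum(self):
        groups = {}
        for x in self.data:
            groups.setdefault(x % 10, []).append(x)
        sum(sum(v) for v in groups.values())
```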
This was oddly causing a lot of CPU use
We now inspect the expected communication and computation time before deciding to steal a task.
Work stealing can be inefficient if it pulls large datasets around. We avoid these cases.