Data lingers in memory due to imbalance of worker priorities #1747

mrocklin · 2018-02-07T15:17:04Z

We experience some excess memory use because different workers are processing tasks of different priorities.

When we create task graphs we run dask.order on them, which provides a good ordering in order to minimize memory use. When this graph goes out to the workers it gets cut up, and tasks that are very close to each other in the ordering may end up on different workers. Those workers may then get distracted by different things, which means that while some tasks early in the ordering are complete, their co-dependents may not be complete, and are instead trapped on another worker not running, despite their high priority.

We might resolve this in a few ways:

Preferentially steal tasks by their priority. This is possibly expensive (sorting is hard) but might be worth it for tasks without dependencies, or in cases where the number of tasks is not high
Revert back to scheduling tasks only as needed. Currently we schedule all runnable tasks immediately. This helps ensure saturation of hardware. We could rethink this move.
Don't do anything, and rely on mechanisms to slow down workers when they get too much data in memory, allowing their peers to catch up.

The text was updated successfully, but these errors were encountered:

mrocklin · 2018-02-07T15:17:32Z

This possibly a partial cause to cause pangeo-data/pangeo#99

rbubley · 2018-02-07T22:14:36Z

If you go for (1), I don’t think you need a full (expensive) sort: you only need the top few, which can be retrieve with a single scan. I.e. O(n) not O(n log n)

caseyjlaw · 2018-02-08T15:01:44Z

FWIW, I am seeing this lingering memory issue in my use case. I use the submit method and chain together a series of futures in graphs than open and close like this:

           |-> process0 ->|
read0----->|-> process1 ->| -> merge0
           |-> process2 ->|

This is repeated for tens of reads/merges and the process step produces a hundred times as many function calls. Nothing too demanding.
I'd like the scheduler to push through the process step in order to free up the read memory. In practice, when I submit many of these graphs, all the read functions get scheduled first and the memory use blows up.

mrocklin · 2018-02-08T15:07:45Z

I suspect that you have a different issue, especially if you are using client.submit. I recommend raising another issue.

…

On Thu, Feb 8, 2018 at 10:01 AM, Casey Law ***@***.***> wrote: FWIW, I am seeing this lingering memory issue in my use case. I use the submit method and chain together a series of futures in graphs than open and close like this: |-> process0 ->| read0----->|-> process1 ->| -> merge0 |-> process2 ->| This is repeated for tens of reads/merges and the process step produces a hundred times as many function calls. Nothing too demanding. I'd like the scheduler to push through the process step in order to free up the read memory. In practice, when I submit many of these graphs, all the read functions get scheduled first and the memory use blows up. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#1747 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AASszB5Ll0MLO3lDTCRMpWdp51TCx3Taks5tSwxYgaJpZM4R83_u> .

sjperkins · 2018-02-14T19:54:14Z

When this graph goes out to the workers it gets cut up, and tasks that are very close to each other in the ordering may end up on different workers. Those workers may then get distracted by different things, which means that while some tasks early in the ordering are complete, their co-dependents may not be complete, and are instead trapped on another worker not running, despite their high priority.

I'd like to re-raise the idea of grouping tasks into partitions that are each assigned to a worker (assignment occurs when first task in the partition starts to execute, as suggested in #1559).

Then, would it not be possible to linearly subdivide the ordering priority space into bins and assigns tasks to each bin? Something like:

task_bins = np.linspace(order_low, order_high, nworkers)
task_order = [t.order for t in tasks]
task_worker = np.digitize(task_order, task_bins)

for task, worker in zip(tasks, task_worker):
   submit(task, worker=worker)

This is probably highly naive when considering actual scheduler resource constraints, but the basic idea might be useful/adaptable when trying to minimise I/O costs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data lingers in memory due to imbalance of worker priorities #1747

Data lingers in memory due to imbalance of worker priorities #1747

mrocklin commented Feb 7, 2018

mrocklin commented Feb 7, 2018

rbubley commented Feb 7, 2018

caseyjlaw commented Feb 8, 2018

mrocklin commented Feb 8, 2018 via email

sjperkins commented Feb 14, 2018

Data lingers in memory due to imbalance of worker priorities #1747

Data lingers in memory due to imbalance of worker priorities #1747

Comments

mrocklin commented Feb 7, 2018

mrocklin commented Feb 7, 2018

rbubley commented Feb 7, 2018

caseyjlaw commented Feb 8, 2018

mrocklin commented Feb 8, 2018 via email

sjperkins commented Feb 14, 2018