
fast weighted sum #1224

Closed
crusaderky opened this issue Jan 23, 2017 · 5 comments

Comments

@crusaderky
Contributor

crusaderky commented Jan 23, 2017

In my project I'm struggling with weighted sums of 2000-4000 dask-backed xarray arrays. The time to build the final dask-backed array, the size of the final dask dict, and the time to compute the actual result are all horrendous.

So I wrote the code below which - as laborious as it may look - gives a performance boost nothing short of miraculous. At the bottom of the gist you'll find some benchmarks as well.

https://gist.github.com/crusaderky/62832a5ffc72ccb3e0954021b0996fdf

In my project, this deflated the size of the final dask dict from 5.2 million keys to 3.3 million and cut 30% off the time required to define it.

I think it's generic enough to be a good addition to the core xarray module. Impressions?

@shoyer
Member

shoyer commented Jan 23, 2017

Interesting -- thanks for sharing! I am interested in performance improvements but also a little reluctant to add specialized optimizations directly into xarray.

You write that this is equivalent to sum(a * w for a, w in zip(arrays, weights)). How does this compare to stacking and doing the sum in xarray, e.g., (arrays * weights).sum('stacked'), where arrays and weights are now DataArray objects with a 'stacked' dimension? Or maybe arrays.dot(weights)?
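For illustration, a minimal sketch of what that could look like on toy data (the array sizes and the weight values here are made up):

import numpy as np
import xarray as xr

# toy stand-ins: three 1D arrays stacked along a new 'stacked' dimension
arrays = xr.concat(
    [xr.DataArray(np.random.rand(5), dims="x") for _ in range(3)],
    dim="stacked",
)
# one scalar weight per stacked array
weights = xr.DataArray([0.2, 0.3, 0.5], dims="stacked")

# vectorized weighted sum over the 'stacked' dimension
result = (arrays * weights).sum("stacked")

# equivalent contraction over the shared 'stacked' dimension
result_dot = arrays.dot(weights)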

Using vectorized operations feels a bit more idiomatic (though also maybe more verbose). It also may be more performant. Note that the builtin sum is not optimized well by dask because it's basically equivalent to a loop:

def sum(xs):
    # sequential reduction: each addition depends on the previous result
    result = 0
    for x in xs:
        result += x
    return result

In contrast, dask.array.sum builds up a tree so it can do the sum in parallel.
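For comparison, a rough sketch of the two reduction styles on plain dask arrays (the sizes, chunking, and weights here are arbitrary):

import dask.array as da

# toy stand-ins: many chunked arrays with scalar weights
arrays = [da.random.random(500000, chunks=50000) for _ in range(100)]
weights = [0.01] * 100

# loop-style reduction: one long sequential chain of additions
loop_sum = sum(a * w for a, w in zip(arrays, weights))

# tree-style reduction: stack the weighted arrays, then let
# dask.array.sum reduce them in parallel along the new axis
tree_sum = da.stack([a * w for a, w in zip(arrays, weights)]).sum(axis=0)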

There has also been discussion in #422 about adding a dedicated method for weighted mean.

@crusaderky
Contributor Author

(arrays * weights).sum('stacked') was my first attempt. It performed considerably worse than sum(a * w for a, w in zip(arrays, weights)) - mostly because xarray.concat() is not terribly performant (I did not look deeper into it).

I did not try dask.array.sum() - worth playing with.

@shoyer
Member

shoyer commented Jan 23, 2017 via email

@crusaderky
Contributor Author

Both. One of the biggest problems is that the data I'm interested in is a mix of

  • 1D arrays with dims=(scenario, ) and shape=(500000, ) (stressed financial instruments under a Monte Carlo stress set)
  • 0D arrays with dims=() (financial instruments that are impervious to the Monte Carlo stresses and never change values)

So before you do concat(), you need to call broadcast(), which effectively means that doing the sums on your bunch of very fast 0D instruments suddenly requires repeating them over 500k points.

Even when keeping the two lots separate (which is what fastwsum does), the concat-based approach performed considerably slower.
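For illustration only, a rough sketch of the keep-the-lots-separate idea (this is not the actual fastwsum code from the gist; split_weighted_sum is a made-up name, and the inputs are assumed to be xarray DataArrays):

def split_weighted_sum(arrays, weights):
    # Accumulate 0D terms as a plain float so they are never
    # broadcast to the Monte Carlo ('scenario') dimension.
    scalar_total = 0.0
    vector_terms = []
    for a, w in zip(arrays, weights):
        if a.ndim == 0:
            scalar_total += float(a) * w
        else:
            vector_terms.append(a * w)
    # Sum the 1D terms, then add the scalar contribution once at the end.
    return sum(vector_terms) + scalar_total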

However, this was over a year ago and well before xarray.dot() and dask.einsum(), so I'll need to tinker with it again.

@crusaderky
Contributor Author

Retiring this as it is way too specialized for the main xarray library.
