WIP: Fast implementation of reduction along dims #5294
Conversation
Totally awesome! |
This is really cool. It would be nice if some of these design rationales could be captured in our documentation rather than staying buried in issues and PRs. |
Why |
Isn't the solution here simpler, and likely quite a bit faster? It doesn't require any recursion. |
Actually, I just tested this; speed-wise they're very similar (if, like in your code, one inserts …). At least in your gist, the timings are biased slightly in your favor, because you do one less iteration for … |
@timholy I just fixed the bug of using one less iteration on my part (it was just a typo or something), and updated the gist to include the Cartesian solution you proposed. Here it is:

```
# et0: original implementation in julia base
# etc: using Tim's Cartesian solution
# et1: the solution proposed here
#
region = 1         : et0 = 0.1867, etc = 0.0927, et1 = 0.0649, et0/et1 = 2.8745x,  etc/et1 = 1.4278x
region = 2         : et0 = 0.3525, etc = 0.0896, et1 = 0.0966, et0/et1 = 3.6511x,  etc/et1 = 0.9278x
region = 3         : et0 = 0.7997, etc = 0.0852, et1 = 0.0750, et0/et1 = 10.6661x, etc/et1 = 1.1365x
region = 4         : et0 = 0.8138, etc = 0.1617, et1 = 0.1426, et0/et1 = 5.7079x,  etc/et1 = 1.1340x
region = (1,2)     : et0 = 0.1689, etc = 0.0937, et1 = 0.0533, et0/et1 = 3.1721x,  etc/et1 = 1.7598x
region = (1,3)     : et0 = 0.2591, etc = 0.0894, et1 = 0.0551, et0/et1 = 4.6984x,  etc/et1 = 1.6208x
region = (1,4)     : et0 = 0.1966, etc = 0.0899, et1 = 0.0557, et0/et1 = 3.5276x,  etc/et1 = 1.6121x
region = (2,3)     : et0 = 0.5810, etc = 0.0894, et1 = 0.0902, et0/et1 = 6.4449x,  etc/et1 = 0.9913x
region = (2,4)     : et0 = 0.4919, etc = 0.0874, et1 = 0.0893, et0/et1 = 5.5066x,  etc/et1 = 0.9781x
region = (3,4)     : et0 = 1.0960, etc = 0.0853, et1 = 0.0715, et0/et1 = 15.3305x, etc/et1 = 1.1930x
region = (1,2,3)   : et0 = 0.1645, etc = 0.0909, et1 = 0.0515, et0/et1 = 3.1954x,  etc/et1 = 1.7667x
region = (1,2,4)   : et0 = 0.1673, etc = 0.0916, et1 = 0.0511, et0/et1 = 3.2743x,  etc/et1 = 1.7926x
region = (1,3,4)   : et0 = 0.2159, etc = 0.0881, et1 = 0.0529, et0/et1 = 4.0791x,  etc/et1 = 1.6637x
region = (2,3,4)   : et0 = 1.2277, etc = 0.0843, et1 = 0.0854, et0/et1 = 14.3812x, etc/et1 = 0.9872x
region = (1,2,3,4) : et0 = 0.1663, etc = 0.0910, et1 = 0.0511, et0/et1 = 3.2542x,  etc/et1 = 1.7809x
```

Overall, the PR here is a little faster than the Cartesian solution, probably because of the use of linear indexing (these functions update the offset incrementally instead of recomputing it in the inner loop). As for recursion, two tricks alleviate its overhead: the recursion is unrolled when the rank drops to two or below, and a second trick called rank compression, which I may explain later in a separate document. The benefit of recursion is that it can truly handle arrays of arbitrary dimension (without a dictionary of functions or anything like that). |
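The rank-compression trick mentioned above is easy to illustrate outside Julia. Below is a minimal Python sketch (my own illustration; the function name and representation are not from the PR): for a contiguous array, adjacent dimensions that are either both reduced or both kept can be fused into one, so the recursion only ever sees a lower-rank problem.

```python
def compress_dims(shape, reduced):
    """Fuse adjacent dims of a contiguous array when both are reduced or
    both are kept. `reduced` is a set of 0-based dimension indices.
    Returns the compressed shape and, per compressed dim, whether it is reduced."""
    new_shape, new_reduced = [], []
    for d, n in enumerate(shape):
        r = d in reduced
        if new_shape and new_reduced[-1] == r:
            new_shape[-1] *= n          # same role as the previous dim: fuse
        else:
            new_shape.append(n)
            new_reduced.append(r)
    return new_shape, new_reduced

# Summing a contiguous (2,3,4,5) array over dims {0,1} touches the same
# memory, in the same order, as summing a (6,20) array over dim {0}:
print(compress_dims((2, 3, 4, 5), {0, 1}))  # ([6, 20], [True, False])
```

Note that the fusion is only valid for contiguous storage, where adjacent dimensions have nested strides; that is exactly why this PR restricts itself to contiguous arrays.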
Those are good tricks, and indeed I really do like the fact that you don't need a dictionary. (Of course Julia builds separate methods for each dimensionality anyway, so there is a dictionary of sorts in the background, but your version makes the dispatch "transparent" and in a way that works at compile time.) So indeed this is great stuff. However, I did find a problem: if you generalize your gist for …, then you get … (In my version, et2 is the Cartesian-based version, and I didn't compute any ratios with it.) When you can't use linear indexing, Cartesian indexing gives you gains that are bigger than anything you see realized here. Since we're all hoping to move towards array views, this is a concern. |
@timholy The current implementation is aimed at contiguous arrays (note my argument annotation as …). For general strided arrays, of course, linear indexing is not the optimal choice, and I am working on a version that uses subscripts (instead of linear indices) for them. At this point, non-contiguous arrays still fall back to the original implementation in Base. |
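The contiguity distinction can be made concrete with a toy offset calculation (a Python sketch of my own, not from the PR): for a contiguous array the element offsets form a single arithmetic progression, so one linear index with an incrementally updated offset suffices, while for a strided view they do not.

```python
def offsets_column_major(shape, strides):
    """Yield the memory offsets of all elements of a strided view,
    first dimension varying fastest (column-major, as in Julia)."""
    if not shape:
        yield 0
        return
    for rest in offsets_column_major(shape[1:], strides[1:]):
        for i in range(shape[0]):
            yield i * strides[0] + rest

# A contiguous 2x3 column-major array has strides (1, 2): a plain linear scan.
print(list(offsets_column_major((2, 3), (1, 2))))  # [0, 1, 2, 3, 4, 5]
# A 2x3 view taking every other column of a 2x6 parent has strides (1, 4):
# the offsets are no longer one arithmetic progression.
print(list(offsets_column_major((2, 3), (1, 4))))  # [0, 1, 4, 5, 8, 9]
```

In the second case a single running offset cannot visit the elements in order without per-dimension bookkeeping, which is why Cartesian (subscript) indexing wins for strided views.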
The overall strategy here is recursive slicing. For example, given an array of size …, … How the computation is conducted/decomposed on each slice is determined by whether there is reduction along the first and last dimensions of that slice -- that's why you see the two internal methods … So, in theory, there's nothing preventing us from using Cartesian indexing instead of linear indexing here when general strided arrays are input. |
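As a rough illustration of the recursive-slicing idea, here is a hedged Python sketch of my own (not the PR's Julia code, and without the unrolling and rank-compression optimizations): peel off the last dimension of a column-major contiguous array, recurse into each slice, and decide at each level whether that dimension is part of the reduction region.

```python
from math import prod

def sum_region(a, shape, reduced, offset=0):
    """Sum the flat column-major list `a` of the given `shape` over the
    dims listed in `reduced` (0-based), by recursively peeling off the
    last dimension. Kept dims produce nested lists."""
    if len(shape) == 1:
        block = a[offset:offset + shape[0]]
        return sum(block) if 0 in reduced else list(block)
    stride = prod(shape[:-1])            # memory stride of the last dimension
    slices = [sum_region(a, shape[:-1], reduced, offset + i * stride)
              for i in range(shape[-1])]
    if len(shape) - 1 in reduced:
        # reduction along the last dim: combine per-slice results elementwise
        out = slices[0]
        for s in slices[1:]:
            out = _combine(out, s)
        return out
    return slices                        # last dim kept: one result per slice

def _combine(x, y):
    return [_combine(u, v) for u, v in zip(x, y)] if isinstance(x, list) else x + y

# 2x3 column-major array with columns [0,1], [2,3], [4,5]:
a = list(range(6))
print(sum_region(a, (2, 3), {0}))     # [1, 5, 9]
print(sum_region(a, (2, 3), {1}))     # [6, 9]
print(sum_region(a, (2, 3), {0, 1}))  # 15
```

The base case where the first (fastest-varying) dimension is reduced is the contiguous-accumulation case; in the real implementation that is the rank-1/rank-2 unrolled kernel.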
I wrote a document to explain the rationales underlying my design: https://gist.github.com/lindahua/8255431. |
Broken link? |
There was a typo in the link, which has been fixed. Would you please try again? |
Now the code-gen allows generating reduction functions that take more than one input array.
I think this PR is ready for merge. |
This PR also fixes #2325. |
I'm happy to merge this. |
This is tangential, but I really would like to start gathering notes such as the one @lindahua wrote in one place. It really shows off the quality of the work, and also makes it easier for newcomers to come up to speed. Perhaps we can have … |
+1 Sounds like a job that's perfect for the Julia Page Program™ |
I think it's a solid blog post. Why not the Julia Blog? |
+1 I also enjoyed the @lindahua post. |
(This is pretty much how multidimensional loops are handled in FFTW: recursively peeling off one dimension at a time, merging of loops that represent a single fixed-stride region, and coarsened rank-1 or rank-2 base cases to amortize the recursion overhead. See the end of section IV-C-5 of our paper on peeling off of one loop at a time, typically choosing the loop with the smallest stride, and IV-B about merging/canonicalization of "vector loops" that represent a constant-offset sequence of memory locations.) |
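The loop-merging/canonicalization step @stevengj describes can be sketched as follows (Python, my own simplification; FFTW's actual code is in C and handles offsets and negative strides more carefully). Each loop is a (length, stride) pair describing the addresses `{i*stride : 0 <= i < length}`; an outer loop can be fused into an inner one whenever the outer stride equals the inner loop's total span.

```python
def merge_loops(loops):
    """Canonicalize a list of (length, stride) loops: sort by |stride| and
    fuse any adjacent pair where outer.stride == inner.stride * inner.length,
    i.e. where the outer loop steps exactly where the inner one leaves off."""
    merged = []
    for n, s in sorted(loops, key=lambda ls: abs(ls[1])):
        if merged and s == merged[-1][1] * merged[-1][0]:
            merged[-1] = (merged[-1][0] * n, merged[-1][1])  # fuse into one loop
        else:
            merged.append((n, s))
    return merged

# The three loops of a contiguous 4x3x5 array collapse to one linear scan:
print(merge_loops([(5, 12), (4, 1), (3, 4)]))  # [(60, 1)]
# A non-contiguous outer loop is left alone, but the inner pair still fuses:
print(merge_loops([(4, 1), (3, 4), (5, 100)]))  # [(12, 1), (5, 100)]
```

Note that, as described for FFTW, this fuses every mergeable adjacent pair rather than requiring all loops to be mergeable at once, which is the subset-merging behavior discussed below.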
FYI, here is the FFTW loop-merging/canonicalization code in case it is of any use. (Here, an FFTW "tensor" represents a set of …) |
The Julia version is #3778 |
@timholy, is the Julia version less general than FFTW's algorithm? It looks like it only merges the ranges if all of the ranges are mergeable, whereas FFTW merges any subset(s) of the loops that can be merged. |
Currently, Julia's subarrays are just a mess. I believe when @StefanKarpinski's array view lands, many of these algorithms can be expressed more elegantly without compromising efficiency. |
@stevengj, for SubArrays I think it doesn't (or didn't, at the time of that PR) matter; if you can't do them all, you are forced to use some variant of linear indexing. That said, since the function is general it would be better, if we need it at all, to generalize it. |
This is a new implementation of reduction along dims, mainly to improve performance.
Currently, I have implemented a generic code-generation function & macro, and a specialized method for `sum` over contiguous arrays. Key features of the implementation: …
This implementation is based on the gist: https://gist.github.com/lindahua/8251556
(with some modification to take care of corner cases, such as empty input).
The design is explained here: https://gist.github.com/lindahua/8255431.
The script in the gist above runs a benchmark comparing the performance of this new implementation with the original one. On my MacBook Pro, it yields a 2x - 16x performance gain (depending on the region):
This is still work in progress. I will continue to add other methods, such as `maximum`, `minimum`, `all`, `any`, `prod`, etc. I may also write a short document to explain the rationale behind the implementation.