
WIP: Fast implementation of reduction along dims #5294

Merged 12 commits on Jan 5, 2014

Conversation

@lindahua (Contributor) commented Jan 4, 2014

This is a new implementation of reduction along dims, mainly to improve performance.

Currently, I have implemented a generic code-generation function & macro, and a specialized method for sum over contiguous arrays. Key features of the implementation:

  • Cache-friendly access. The input elements are accessed in the same order as they are stored in memory, no matter what region the user specifies.
  • Based on three vector kernels (instead of just a binary operation). This allows a variety of optimizations to be incorporated (e.g. when SIMD is available, we can easily take advantage of it by rewriting the vector kernels with SIMD instructions).
  • Functions in this implementation recurse over lower-rank slices of the array. This way, it no longer needs the trick of caching functions in a dictionary (which the original implementation relies on).
  • The implementation is carefully tested.
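To make the cache-friendly point concrete, here is a minimal sketch (hypothetical code, not the PR's actual implementation) of summing along the first dimension of a matrix while walking memory in column-major order:

```julia
# Hypothetical sketch, not the PR's code: sum along dim 1 of a matrix,
# visiting elements in the order they are laid out in memory (column-major).
function colwise_sum(A::Matrix{Float64})
    m, n = size(A)
    r = zeros(n)
    o = 0                      # linear offset of the current column
    @inbounds for j = 1:n
        s = 0.0
        for i = 1:m            # each column is contiguous in memory
            s += A[o + i]
        end
        r[j] = s
        o += m                 # advance the offset instead of recomputing it
    end
    return r
end
```

For example, `colwise_sum([1.0 2.0; 3.0 4.0])` gives `[4.0, 6.0]`, matching a sum along dimension 1 up to shape.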

This implementation is based on the gist: https://gist.github.com/lindahua/8251556
(with some modification to take care of corner cases, such as empty input).

The design is explained here: https://gist.github.com/lindahua/8255431.

The script in the gist above runs a benchmark comparing the performance of this new implementation against the original one. On my MacBook Pro, it yields a 2x to 16x performance gain (depending on the region):

# the result is obtained by running it 10 times over an array of size (100,100,100,10)
# et0: elapsed seconds using original method
# et1: elapsed seconds using new method
#
region = 1           :  et0 =  0.1245,  et1 =  0.0605,  gain =   2.0587x
region = 2           :  et0 =  0.3095,  et1 =  0.0885,  gain =   3.4970x
region = 3           :  et0 =  0.4836,  et1 =  0.0708,  gain =   6.8286x
region = 4           :  et0 =  0.5405,  et1 =  0.1278,  gain =   4.2284x
region = (1,2)       :  et0 =  0.1098,  et1 =  0.0464,  gain =   2.3665x
region = (1,3)       :  et0 =  0.1959,  et1 =  0.0485,  gain =   4.0368x
region = (1,4)       :  et0 =  0.1445,  et1 =  0.0502,  gain =   2.8805x
region = (2,3)       :  et0 =  0.5159,  et1 =  0.0785,  gain =   6.5687x
region = (2,4)       :  et0 =  0.4115,  et1 =  0.0784,  gain =   5.2470x
region = (3,4)       :  et0 =  1.0379,  et1 =  0.0645,  gain =  16.0804x
region = (1,2,3)     :  et0 =  0.1063,  et1 =  0.0461,  gain =   2.3080x
region = (1,2,4)     :  et0 =  0.1077,  et1 =  0.0460,  gain =   2.3426x
region = (1,3,4)     :  et0 =  0.1587,  et1 =  0.0476,  gain =   3.3352x
region = (2,3,4)     :  et0 =  1.1473,  et1 =  0.0767,  gain =  14.9629x
region = (1,2,3,4)   :  et0 =  0.1110,  et1 =  0.0459,  gain =   2.4203x

This is still work in progress. I will continue to add other methods, such as maximum, minimum, all, any, prod, etc.

I may also write a short document to explain the rationale behind the implementation.

@JeffBezanson (Member)

Totally awesome!

@ViralBShah (Member)

This is really cool. It would be nice if some of these design rationales could be captured in our documentation rather than staying buried in issues and PRs.

@wlbksy (Contributor) commented Jan 4, 2014

Why does et1 for region = 4 take significantly longer than in the other cases?
Is running it only 10 times too few to evaluate the performance reliably?

@timholy (Member) commented Jan 4, 2014

Isn't the solution here simpler, and likely quite a bit faster? It doesn't require any recursion.

@timholy (Member) commented Jan 4, 2014

Actually, I just tested this; speedwise they're very similar (if, like in your code, one inserts @inbounds appropriately). So recursion doesn't cost you anything. Very nice.

At least in your gist, the timings are biased slightly in your favor, because you do one fewer iteration for fsum than for sum. But of course your overall point remains, because the gain is much larger than 10%.

@lindahua (Contributor, Author) commented Jan 4, 2014

@timholy I just fixed the bug of running one fewer iteration on my part (it was just a typo or something). I have also updated the gist to include the Cartesian solution you proposed. Here it is:

# et0:  original implementation in julia base
# etc:  using Tim's Cartesian solution
# et1:  the solution proposed here
#
region = 1           :  et0 =  0.1867,  etc =  0.0927,  et1 =  0.0649,  et0/et1 =   2.8745x, etc/et1 =   1.4278x
region = 2           :  et0 =  0.3525,  etc =  0.0896,  et1 =  0.0966,  et0/et1 =   3.6511x, etc/et1 =   0.9278x
region = 3           :  et0 =  0.7997,  etc =  0.0852,  et1 =  0.0750,  et0/et1 =  10.6661x, etc/et1 =   1.1365x
region = 4           :  et0 =  0.8138,  etc =  0.1617,  et1 =  0.1426,  et0/et1 =   5.7079x, etc/et1 =   1.1340x
region = (1,2)       :  et0 =  0.1689,  etc =  0.0937,  et1 =  0.0533,  et0/et1 =   3.1721x, etc/et1 =   1.7598x
region = (1,3)       :  et0 =  0.2591,  etc =  0.0894,  et1 =  0.0551,  et0/et1 =   4.6984x, etc/et1 =   1.6208x
region = (1,4)       :  et0 =  0.1966,  etc =  0.0899,  et1 =  0.0557,  et0/et1 =   3.5276x, etc/et1 =   1.6121x
region = (2,3)       :  et0 =  0.5810,  etc =  0.0894,  et1 =  0.0902,  et0/et1 =   6.4449x, etc/et1 =   0.9913x
region = (2,4)       :  et0 =  0.4919,  etc =  0.0874,  et1 =  0.0893,  et0/et1 =   5.5066x, etc/et1 =   0.9781x
region = (3,4)       :  et0 =  1.0960,  etc =  0.0853,  et1 =  0.0715,  et0/et1 =  15.3305x, etc/et1 =   1.1930x
region = (1,2,3)     :  et0 =  0.1645,  etc =  0.0909,  et1 =  0.0515,  et0/et1 =   3.1954x, etc/et1 =   1.7667x
region = (1,2,4)     :  et0 =  0.1673,  etc =  0.0916,  et1 =  0.0511,  et0/et1 =   3.2743x, etc/et1 =   1.7926x
region = (1,3,4)     :  et0 =  0.2159,  etc =  0.0881,  et1 =  0.0529,  et0/et1 =   4.0791x, etc/et1 =   1.6637x
region = (2,3,4)     :  et0 =  1.2277,  etc =  0.0843,  et1 =  0.0854,  et0/et1 =  14.3812x, etc/et1 =   0.9872x
region = (1,2,3,4)   :  et0 =  0.1663,  etc =  0.0910,  et1 =  0.0511,  et0/et1 =   3.2542x, etc/et1 =   1.7809x

Overall, the PR here is a little faster than the Cartesian solution, probably because of its use of linear indexing (these functions update the offset incrementally instead of computing it in the inner loop).

In terms of recursion, two tricks are used here to alleviate its overhead: the recursion is unrolled when the rank is less than or equal to two, and another trick called rank compression is applied, which I may explain later in a separate document. The benefit of using recursion here is that it can truly handle arrays of arbitrary dimension (without resorting to a dictionary of functions or anything similar).
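As a rough illustration of what rank compression can mean (the helper below is hypothetical; the PR's actual trick may differ in detail), consecutive dimensions of a contiguous array can be merged into one before recursing, lowering the effective rank:

```julia
# Hypothetical sketch of rank compression: merge the first `nlead`
# dimensions of a contiguous array's size tuple into a single dimension,
# so e.g. reducing over dims (1,2) of a (3,4,5,6) array becomes reducing
# over dim 1 of a (12,5,6) array.
function compress_leading(siz::Tuple, nlead::Int)
    n = prod(siz[1:nlead])
    return tuple(n, siz[nlead+1:end]...)
end
```

For example, `compress_leading((3,4,5,6), 2)` returns `(12, 5, 6)`: the inner loops over dims 1 and 2 collapse into a single loop over 12 contiguous elements.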

@timholy (Member) commented Jan 4, 2014

Those are good tricks, and indeed I really do like the fact that you don't need a dictionary. (Of course Julia builds separate methods for each dimensionality anyway, so there is a dictionary of sorts in the background, but your version makes the dispatch "transparent" and in a way that works at compile-time.) So indeed this is great stuff.

However, I did find a problem: if you generalize your gist for AbstractArrays and then do this:

y = sub(x, :, :, :, :)

then you get

julia> do_perf(y, 1)
region = 1           :  et0 =  0.4839,  et1 =  5.4938, et2 =  0.5763 gain =   0.0881x

julia> do_perf(y, (2,4))
region = (2,4)       :  et0 =  1.1106,  et1 =  5.6182, et2 =  0.5561 gain =   0.1977x

(In my version, et2 is the Cartesian-based version, and I didn't compute any ratios with it.) When you can't use linear indexing, using Cartesian indexing gives you gains that are bigger than anything you see realized here. Since we're all hoping to move towards array views, this is a concern.
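The contrast can be sketched as follows (hypothetical code, assuming only the generic AbstractArray interface): explicit nested loops index each dimension directly, so a non-contiguous view never has to convert a linear index back into subscripts on every element access.

```julia
# Hedged sketch: summing a 2-d array (or view) with explicit subscripts.
# On a SubArray this avoids the per-element linear-to-Cartesian index
# conversion that makes naive linear indexing slow.
function sum_cartesian(B::AbstractMatrix{Float64})
    s = 0.0
    m, n = size(B)
    @inbounds for j = 1:n, i = 1:m
        s += B[i, j]    # direct subscript access
    end
    return s
end
```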

@lindahua (Contributor, Author) commented Jan 4, 2014

@timholy The current implementation here is aimed at contiguous arrays (note the argument annotation as Array, which may later be relaxed to DenseArray when Jeff's new type hierarchy is implemented).

For general strided arrays, of course, linear indexing is not the optimal choice, and I am working on a version that uses subscripts (instead of linear indices) for them.

At this point, non-contiguous arrays still fallback to the original implementation in Base.

@lindahua (Contributor, Author) commented Jan 4, 2014

The overall strategy here is recursive slicing. For example, given an array of size (3, 4, 5, 6) -- the computation is first decomposed into the computation of 6 slices of size (3, 4, 5), and for each (3, 4, 5)-slice, the computation is further decomposed into the computation of 5 slices of size (3, 4).

How the computation is conducted/decomposed on each slice is determined by whether there is reduction along the first & last dimension of that slice -- that's why you see two internal methods $(fa!) and $(fb!).

So, in theory, there's nothing preventing us from using Cartesian indexing instead of linear indexing here when general strided arrays are the input.
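The recursive-slicing strategy can be sketched like this (a simplified, hypothetical version: it recurses all the way down to rank 1, whereas the PR unrolls the recursion at ranks one and two and distinguishes the $(fa!)/$(fb!) cases):

```julia
# Hypothetical sketch of recursive slicing: a full sum over a contiguous
# array, decomposed into sums over lower-rank slices.
function rsum(a::Array{Float64}, siz::Tuple, offset::Int)
    nd = length(siz)
    if nd == 1
        s = 0.0
        @inbounds for i = 1:siz[1]
            s += a[offset + i]
        end
        return s
    end
    m = prod(siz[1:nd-1])        # number of elements per lower-rank slice
    s = 0.0
    for k = 1:siz[nd]            # recurse over slices of rank nd-1
        s += rsum(a, siz[1:nd-1], offset)
        offset += m              # slices are contiguous: just shift the offset
    end
    return s
end

rsum(a::Array{Float64}) = rsum(a, size(a), 0)
```

For a (3, 4, 5, 6) array this computes 6 sums over (3, 4, 5) slices, each of which computes 5 sums over (3, 4) slices, and so on.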

@lindahua (Contributor, Author) commented Jan 4, 2014

I wrote a document to explain the rationales underlying my design: https://gist.github.com/lindahua/8255431.

@gitfoxi (Contributor) commented Jan 4, 2014

Broken link?
On Jan 4, 2014 10:59 AM, "Dahua Lin" notifications@github.com wrote:

I wrote a document to explain the rationales underlying my design:
https://gist.github.com/lindahua/825543.



@lindahua (Contributor, Author) commented Jan 4, 2014

There was a typo in the link, which has been fixed. Would you please try again?
https://gist.github.com/lindahua/8255431

@lindahua (Contributor, Author) commented Jan 4, 2014

I think this PR is ready for merge.

@lindahua (Contributor, Author) commented Jan 4, 2014

This PR also fixes #2325.

@StefanKarpinski (Member)

I'm happy to merge this.

lindahua added a commit that referenced this pull request Jan 5, 2014
WIP: Fast implementation of reduction along dims
@lindahua merged commit 85cb30f into JuliaLang:master Jan 5, 2014
@ViralBShah (Member)

This is tangential, but I really would like to start gathering notes such as the one @lindahua wrote in one place. It really shows off the quality of the work, and also makes it easier for newcomers to come up to speed.

Perhaps we can have juleps under our docs, or as a separate repository.

@jiahao (Member) commented Jan 5, 2014

I really would like to start gathering notes such as the one @lindahua wrote in one place

+1

Sounds like a job that's perfect for the Julia Page Program™

@StefanKarpinski (Member)

I think it's a solid blog post. Why not the Julia Blog?

@lindahua mentioned this pull request Jan 5, 2014
@gitfoxi (Contributor) commented Jan 5, 2014

+1

I also enjoyed @lindahua's post.

@stevengj (Member) commented Jan 6, 2014

(This is pretty much how multidimensional loops are handled in FFTW: recursively peeling off one dimension at a time, merging loops that represent a single fixed-stride region, and coarsened rank-1 or rank-2 base cases to amortize the recursion overhead. See the end of section IV-C-5 of our paper on peeling off one loop at a time, typically choosing the loop with the smallest stride, and section IV-B on merging/canonicalization of "vector loops" that represent a constant-offset sequence of memory locations.)

@stevengj (Member) commented Jan 6, 2014

FYI, here is the FFTW loop-merging/canonicalization code in case it is of any use. (Here, an FFTW "tensor" represents a set of rnk (rank) nested loops over some input and output arrays, to be executed in arbitrary order, where each loop i has length dims[i].n and has input/output strides dims[i].is/dims[i].os. In Julia parlance, this acts on strided subarrays of dimension rnk.)
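In loose sketch form (hypothetical names, not FFTW's actual API), the merging rule is: an adjacent pair of loops (n1, s1) and (n2, s2), with the smaller-stride loop listed first, covers a single constant-stride range and can be fused into one loop of length n1*n2 whenever s2 == s1*n1:

```julia
# Hedged sketch of FFTW-style loop merging. Each loop is a (length, stride)
# pair, listed innermost (smallest stride) first.
mergeable(n1, s1, n2, s2) = s2 == s1 * n1

function merge_loops(loops::Vector{Tuple{Int,Int}})
    out = Tuple{Int,Int}[]
    for (n, s) in loops
        if !isempty(out) && mergeable(out[end]..., n, s)
            n0, s0 = out[end]
            out[end] = (n0 * n, s0)   # fuse into one longer loop
        else
            push!(out, (n, s))
        end
    end
    return out
end
```

For example, `merge_loops([(3, 1), (4, 3)])` collapses to `[(12, 1)]`, whereas a gap in the strides, as in `[(3, 1), (4, 5)]`, leaves the loops unmerged.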

@timholy (Member) commented Jan 6, 2014

The Julia version is #3778

@stevengj (Member) commented Jan 6, 2014

@timholy, is the Julia version less general than FFTW's algorithm? It looks like it only merges the ranges if all of the ranges are mergeable, whereas FFTW merges any subset(s) of the loops that can be merged.

@lindahua (Contributor, Author) commented Jan 6, 2014

Currently, Julia's subarrays are just a mess. I believe when @StefanKarpinski's array view lands, many of these algorithms can be expressed more elegantly without compromising efficiency.

@timholy (Member) commented Jan 6, 2014

@stevengj, for SubArrays I think it doesn't (or didn't, at the time of that PR) matter; if you can't do them all, you are forced to use some variant of linear indexing. That said, since the function is general it would be better, if we need it at all, to generalize it.

9 participants