WIP: Fast implementation of reduction along dims #5294
Conversation
Totally awesome! |
This is really cool. It would be nice if some of these design rationales could be captured in our documentation rather than staying buried in issues and PRs. |
Why |
Isn't the solution here simpler, and likely quite a bit faster? It doesn't require any recursion. |
Actually, I just tested this; speed-wise they're very similar (if, like in your code, one inserts …). At least in your gist, the timings are biased slightly in your favor, because you do one less iteration for … |
@timholy I just fixed the bug of using one less iteration on my part (it was just a typo or something), and updated the gist to include the Cartesian solution you proposed. Here it is:

```
# et0: original implementation in julia base
# etc: using Tim's Cartesian solution
# et1: the solution proposed here
#
region = 1         : et0 = 0.1867, etc = 0.0927, et1 = 0.0649, et0/et1 = 2.8745x,  etc/et1 = 1.4278x
region = 2         : et0 = 0.3525, etc = 0.0896, et1 = 0.0966, et0/et1 = 3.6511x,  etc/et1 = 0.9278x
region = 3         : et0 = 0.7997, etc = 0.0852, et1 = 0.0750, et0/et1 = 10.6661x, etc/et1 = 1.1365x
region = 4         : et0 = 0.8138, etc = 0.1617, et1 = 0.1426, et0/et1 = 5.7079x,  etc/et1 = 1.1340x
region = (1,2)     : et0 = 0.1689, etc = 0.0937, et1 = 0.0533, et0/et1 = 3.1721x,  etc/et1 = 1.7598x
region = (1,3)     : et0 = 0.2591, etc = 0.0894, et1 = 0.0551, et0/et1 = 4.6984x,  etc/et1 = 1.6208x
region = (1,4)     : et0 = 0.1966, etc = 0.0899, et1 = 0.0557, et0/et1 = 3.5276x,  etc/et1 = 1.6121x
region = (2,3)     : et0 = 0.5810, etc = 0.0894, et1 = 0.0902, et0/et1 = 6.4449x,  etc/et1 = 0.9913x
region = (2,4)     : et0 = 0.4919, etc = 0.0874, et1 = 0.0893, et0/et1 = 5.5066x,  etc/et1 = 0.9781x
region = (3,4)     : et0 = 1.0960, etc = 0.0853, et1 = 0.0715, et0/et1 = 15.3305x, etc/et1 = 1.1930x
region = (1,2,3)   : et0 = 0.1645, etc = 0.0909, et1 = 0.0515, et0/et1 = 3.1954x,  etc/et1 = 1.7667x
region = (1,2,4)   : et0 = 0.1673, etc = 0.0916, et1 = 0.0511, et0/et1 = 3.2743x,  etc/et1 = 1.7926x
region = (1,3,4)   : et0 = 0.2159, etc = 0.0881, et1 = 0.0529, et0/et1 = 4.0791x,  etc/et1 = 1.6637x
region = (2,3,4)   : et0 = 1.2277, etc = 0.0843, et1 = 0.0854, et0/et1 = 14.3812x, etc/et1 = 0.9872x
region = (1,2,3,4) : et0 = 0.1663, etc = 0.0910, et1 = 0.0511, et0/et1 = 3.2542x,  etc/et1 = 1.7809x
```

Overall, the PR here is a little faster than the Cartesian solution, probably because of the use of linear indexing (these functions update the offset incrementally instead of recomputing it in the inner loop). As for recursion, two tricks alleviate its overhead: the recursion is unrolled when the rank drops to two or below, and a second trick called rank compression, which I may explain later in a separate document. The benefit of recursion is that it can truly handle arrays of arbitrary dimension (without a dictionary of functions or anything like that). |
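The rank-compression trick mentioned above is easy to illustrate outside Julia. Below is a minimal Python sketch (my own illustration; the function name and representation are not from the PR): for a contiguous array, adjacent dimensions that are either both reduced or both kept can be fused into one, so the recursion only ever sees a lower-rank problem.

```python
def compress_dims(shape, reduced):
    """Fuse adjacent dims of a contiguous array when both are reduced or
    both are kept. `reduced` is a set of 0-based dimension indices.
    Returns the compressed shape and, per compressed dim, whether it is reduced."""
    new_shape, new_reduced = [], []
    for d, n in enumerate(shape):
        r = d in reduced
        if new_shape and new_reduced[-1] == r:
            new_shape[-1] *= n          # same role as the previous dim: fuse
        else:
            new_shape.append(n)
            new_reduced.append(r)
    return new_shape, new_reduced

# Summing a contiguous (2,3,4,5) array over dims {0,1} touches the same
# memory, in the same order, as summing a (6,20) array over dim {0}:
print(compress_dims((2, 3, 4, 5), {0, 1}))  # ([6, 20], [True, False])
```

Note that the fusion is only valid for contiguous storage, where adjacent dimensions have nested strides; that is exactly why this PR restricts itself to contiguous arrays.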
Those are good tricks, and indeed I really do like the fact that you don't need a dictionary. (Of course Julia builds separate methods for each dimensionality anyway, so there is a dictionary of sorts in the background, but your version makes the dispatch "transparent" and in a way that works at compile time.) So indeed this is great stuff. However, I did find a problem: if you generalize your gist for …, then you get … (In my version, et2 is the Cartesian-based version, and I didn't compute any ratios with it.) When you can't use linear indexing, Cartesian indexing gives you gains that are bigger than anything you see realized here. Since we're all hoping to move towards array views, this is a concern. |
@timholy The current implementation is aimed at contiguous arrays (note my argument annotation as …). For general strided arrays, of course, linear indexing is not the optimal choice, and I am working on a version that uses subscripts (instead of linear indices) for them. At this point, non-contiguous arrays still fall back to the original implementation in Base. |
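The contiguity distinction can be made concrete with a toy offset calculation (a Python sketch of my own, not from the PR): for a contiguous array the element offsets form a single arithmetic progression, so one linear index with an incrementally updated offset suffices, while for a strided view they do not.

```python
def offsets_column_major(shape, strides):
    """Yield the memory offsets of all elements of a strided view,
    first dimension varying fastest (column-major, as in Julia)."""
    if not shape:
        yield 0
        return
    for rest in offsets_column_major(shape[1:], strides[1:]):
        for i in range(shape[0]):
            yield i * strides[0] + rest

# A contiguous 2x3 column-major array has strides (1, 2): a plain linear scan.
print(list(offsets_column_major((2, 3), (1, 2))))  # [0, 1, 2, 3, 4, 5]
# A 2x3 view taking every other column of a 2x6 parent has strides (1, 4):
# the offsets are no longer one arithmetic progression.
print(list(offsets_column_major((2, 3), (1, 4))))  # [0, 1, 4, 5, 8, 9]
```

In the second case a single running offset cannot visit the elements in order without per-dimension bookkeeping, which is why Cartesian (subscript) indexing wins for strided views.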
The overall strategy here is recursive slicing. For example, given an array of size …, … How the computation is conducted/decomposed on each slice is determined by whether there is reduction along the first and last dimensions of that slice -- that's why you see the two internal methods … So, in theory, there's nothing preventing us from using Cartesian indexing instead of linear indexing here when general strided arrays are input. |
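As a rough illustration of the recursive-slicing idea, here is a hedged Python sketch of my own (not the PR's Julia code, and without the unrolling and rank-compression optimizations): peel off the last dimension of a column-major contiguous array, recurse into each slice, and decide at each level whether that dimension is part of the reduction region.

```python
from math import prod

def sum_region(a, shape, reduced, offset=0):
    """Sum the flat column-major list `a` of the given `shape` over the
    dims listed in `reduced` (0-based), by recursively peeling off the
    last dimension. Kept dims produce nested lists."""
    if len(shape) == 1:
        block = a[offset:offset + shape[0]]
        return sum(block) if 0 in reduced else list(block)
    stride = prod(shape[:-1])            # memory stride of the last dimension
    slices = [sum_region(a, shape[:-1], reduced, offset + i * stride)
              for i in range(shape[-1])]
    if len(shape) - 1 in reduced:
        # reduction along the last dim: combine per-slice results elementwise
        out = slices[0]
        for s in slices[1:]:
            out = _combine(out, s)
        return out
    return slices                        # last dim kept: one result per slice

def _combine(x, y):
    return [_combine(u, v) for u, v in zip(x, y)] if isinstance(x, list) else x + y

# 2x3 column-major array with columns [0,1], [2,3], [4,5]:
a = list(range(6))
print(sum_region(a, (2, 3), {0}))     # [1, 5, 9]
print(sum_region(a, (2, 3), {1}))     # [6, 9]
print(sum_region(a, (2, 3), {0, 1}))  # 15
```

The base case where the first (fastest-varying) dimension is reduced is the contiguous-accumulation case; in the real implementation that is the rank-1/rank-2 unrolled kernel.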
I wrote a document to explain the rationales underlying my design: https://gist.github.com/lindahua/8255431. |
Broken link? |
There was a typo in the link, which has been fixed. Would you please try again? |
Now the code-gen allows generating reduction functions that take more than one input array.
I think this PR is ready for merge. |
This PR also fixes #2325. |
I'm happy to merge this. |
This is tangential, but I really would like to start gathering notes such as the one @lindahua wrote in one place. It really shows off the quality of the work, and also makes it easier for newcomers to come up to speed. Perhaps we can have … |
+1 Sounds like a job that's perfect for the Julia Page Program™ |
I think it's a solid blog post. Why not the Julia Blog? |
+1 I also enjoyed the @lindahua post. |
(This is pretty much how multidimensional loops are handled in FFTW: recursively peeling off one dimension at a time, merging of loops that represent a single fixed-stride region, and coarsened rank-1 or rank-2 base cases to amortize the recursion overhead. See the end of section IV-C-5 of our paper on peeling off of one loop at a time, typically choosing the loop with the smallest stride, and IV-B about merging/canonicalization of "vector loops" that represent a constant-offset sequence of memory locations.) |
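The loop-merging/canonicalization step @stevengj describes can be sketched as follows (Python, my own simplification; FFTW's actual code is in C and handles offsets and negative strides more carefully). Each loop is a (length, stride) pair describing the addresses `{i*stride : 0 <= i < length}`; an outer loop can be fused into an inner one whenever the outer stride equals the inner loop's total span.

```python
def merge_loops(loops):
    """Canonicalize a list of (length, stride) loops: sort by |stride| and
    fuse any adjacent pair where outer.stride == inner.stride * inner.length,
    i.e. where the outer loop steps exactly where the inner one leaves off."""
    merged = []
    for n, s in sorted(loops, key=lambda ls: abs(ls[1])):
        if merged and s == merged[-1][1] * merged[-1][0]:
            merged[-1] = (merged[-1][0] * n, merged[-1][1])  # fuse into one loop
        else:
            merged.append((n, s))
    return merged

# The three loops of a contiguous 4x3x5 array collapse to one linear scan:
print(merge_loops([(5, 12), (4, 1), (3, 4)]))  # [(60, 1)]
# A non-contiguous outer loop is left alone, but the inner pair still fuses:
print(merge_loops([(4, 1), (3, 4), (5, 100)]))  # [(12, 1), (5, 100)]
```

Note that, as described for FFTW, this fuses every mergeable adjacent pair rather than requiring all loops to be mergeable at once, which is the subset-merging behavior discussed below.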
FYI, here is the FFTW loop-merging/canonicalization code in case it is of any use. (Here, an FFTW "tensor" represents a set of …) |
The Julia version is #3778 |
@timholy, is the Julia version less general than FFTW's algorithm? It looks like it only merges the ranges if all of the ranges are mergeable, whereas FFTW merges any subset(s) of the loops that can be merged. |
Currently, Julia's subarrays are just a mess. I believe when @StefanKarpinski's array view lands, many of these algorithms can be expressed more elegantly without compromising efficiency. |
@stevengj, for SubArrays I think it doesn't (or didn't, at the time of that PR) matter; if you can't do them all, you are forced to use some variant of linear indexing. That said, since the function is general it would be better, if we need it at all, to generalize it. |
This is a new implementation of reduction along dims, mainly to improve performance.
Currently, I have implemented a generic code-generation function & macro, and a specialized method for `sum` over contiguous arrays. Key features of the implementation: …
This implementation is based on the gist: https://gist.github.com/lindahua/8251556
(with some modification to take care of corner cases, such as empty input).
The design is explained here: https://gist.github.com/lindahua/8255431.
The script in the gist above runs a benchmark comparing the performance of this new implementation with the original one. On my MacBook Pro, it yields a 2x - 16x performance gain (depending on the region):
This is still work in progress. I will continue to add other methods, such as `maximum`, `minimum`, `all`, `any`, `prod`, etc. I may also write a short document to explain the rationale behind the implementation.