Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: ReshapedArrays #10507

Closed
wants to merge 8 commits into from
Closed

WIP: ReshapedArrays #10507

wants to merge 8 commits into from

Conversation

timholy
Copy link
Member

@timholy timholy commented Mar 13, 2015

This one is rather more fun than my last PR. It adds the ability to reshape an arbitrary AbstractArray without making a copy, creating a new type of view. The logic and a demonstration of this approach was described in #9874 (comment). Specifically, reshaping does not compose with subarray-indexing, so you really need two types of views (unfortunately).

To get good performance out of this, I had to optimize the tar out of it. (Honestly, I tried not to.) Any splatting/slurping was deadly, so I had to manually unpack each element of every tuple. I also had to add @inline markers in lots of places. And the performance still isn't great, I think mostly because of tuples (see below). Finally, I stole @simonster's neat work on speeding up div (I credited the commit to him), updating it for julia 0.4. I noted that I got errors if I tried to create a multiplicative inverse for 1, so I had to introduce a whole new type just to handle this special case. If any of the smart number-focused folks around here has a better idea, I'm all ears.

One way in which this is incomplete (i.e., won't pass tests) is that it doesn't allow one to index a ReshapedArray with anything other than (1) a scalar index (linear indexing) or (2) scalar indexing with the full N indexes, if the array is N-dimensional. I could implement that for ReshapedArrays just like I did for SubArrays, but frankly this kind of stuff is no fun to implement time and time again. I'm hoping @mbauman delivers us a general framework (see #10458 (comment)) so that people who implement AbstractArrays only need to worry about those two types of indexing.

This also won't pass the tests because currently it reshapes a Range. Since Ranges are "read-only", I suspect that's not what we want? They are also currently used heavily for creating of arrays in combination with reshape.

Finally, I also took the opportunity to create a performance test suite for our array indexing. The results have revealed a few unexpected trouble spots, particularly for small arrays---the main problems seem to come from extracting tuple fields. Here's the key:

  • sumelt refers to for a in A indexing; sumeach to for I in eachindex(A); sumfast to for i = 1:length(A) if A has a fast linear indexing trait set (otherwise it's redundant with sumeach).
  • I refers to integer, F to float. The latter is better-SIMDable.
  • s means small (3-by-5), b means big (300-by-500)
  • ArrayLF has fast linear indexing, ArrayLS does not. ArrayStrides and ArrayStride1 are two implementations of very simple strided array types, used for comparison against other types that exploit similar operations.
sumeltIs Array        1.777    2.515    1.813    0.067
sumeachIs Array      17.027   23.399   17.356    0.747
sumfastIs Array       1.025    1.919    1.047    0.047
sumeltIs ArrayLS     13.934   16.032   14.202    0.414
sumeachIs ArrayLS     3.281    5.637    3.362    0.182
sumfastIs ArrayLS     3.315    7.288    3.582    0.546
sumeltIs ArrayLF      3.966    7.565    4.070    0.225
sumeachIs ArrayLF     3.281    4.538    3.368    0.156
sumfastIs ArrayLF     4.069    5.183    4.182    0.157
sumeltIs ArrayStrides   13.890   20.870   16.064    1.337
sumeachIs ArrayStrides    2.492    6.038    2.705    0.397
sumfastIs ArrayStrides    2.639    4.689    2.761    0.253
sumeltIs ArrayStrides1   14.280   16.082   14.558    0.417
sumeachIs ArrayStrides1    3.281    7.003    3.445    0.303
sumfastIs ArrayStrides1    3.315    6.232    3.442    0.267
sumeltIs SubArray     3.144    4.949    3.454    0.290
sumeachIs SubArray   19.258   28.084   20.460    1.716
sumfastIs SubArray   19.183   26.505   19.822    0.908
sumeltIs ReshapedArray{ArrayLS}    3.087    4.647    3.337    0.167
sumeachIs ReshapedArray{ArrayLS}   18.944   21.837   19.382    0.588
sumfastIs ReshapedArray{ArrayLS}   19.711   30.648   20.668    1.440
sumeltIs ReshapedArray{ArrayLF}    4.291    8.029    4.548    0.495
sumeachIs ReshapedArray{ArrayLF}   18.952   24.026   19.500    1.000
sumfastIs ReshapedArray{ArrayLF}    4.205    6.382    4.349    0.288
sumeltFs Array        1.538    2.622    1.599    0.115
sumeachFs Array      18.690   25.214   19.747    1.607
sumfastFs Array       1.299    1.802    1.339    0.069
sumeltFs ArrayLS     14.289   21.898   15.281    1.617
sumeachFs ArrayLS     4.171    7.126    4.339    0.378
sumfastFs ArrayLS     4.171    7.311    4.311    0.363
sumeltFs ArrayLF      4.223    8.144    4.537    0.611
sumeachFs ArrayLF     4.171    5.512    4.313    0.249
sumfastFs ArrayLF     4.240    6.903    4.435    0.264
sumeltFs ArrayStrides   13.638   22.034   16.183    1.387
sumeachFs ArrayStrides    2.552    5.009    2.710    0.322
sumfastFs ArrayStrides    2.618    5.898    2.782    0.378
sumeltFs ArrayStrides1   14.293   24.465   15.856    2.400
sumeachFs ArrayStrides1    4.171    6.993    4.454    0.537
sumfastFs ArrayStrides1    4.171    6.126    4.353    0.316
sumeltFs SubArray     3.076    8.096    3.426    0.474
sumeachFs SubArray   19.343   24.725   20.346    1.318
sumfastFs SubArray   19.450   29.048   20.408    1.721
sumeltFs ReshapedArray{ArrayLS}    3.321    4.863    3.437    0.192
sumeachFs ReshapedArray{ArrayLS}   18.963   29.058   19.810    1.659
sumfastFs ReshapedArray{ArrayLS}   19.740   26.780   21.341    1.947
sumeltFs ReshapedArray{ArrayLF}    4.291    7.570    4.559    0.498
sumeachFs ReshapedArray{ArrayLF}   18.937   25.426   19.814    1.363
sumfastFs ReshapedArray{ArrayLF}    4.205    9.719    4.439    0.432
sumeltIb Array       10.536   14.805   11.133    0.936
sumeachIb Array       6.653   10.035    6.895    0.497
sumfastIb Array       6.131    9.889    6.480    0.607
sumeltIb ArrayLS    136.357  156.693  143.836    4.902
sumeachIb ArrayLS     6.594   11.962    6.951    0.696
sumfastIb ArrayLS     6.640   11.981    7.063    0.778
sumeltIb ArrayLF     10.467   18.773   11.143    1.266
sumeachIb ArrayLF     6.635   11.774    7.071    0.835
sumfastIb ArrayLF     6.171    9.665    6.343    0.304
sumeltIb ArrayStrides  140.613  150.800  145.009    3.801
sumeachIb ArrayStrides   11.023   19.091   11.587    0.989
sumfastIb ArrayStrides   10.999   18.700   11.454    0.755
sumeltIb ArrayStrides1  145.468  152.717  147.724    2.491
sumeachIb ArrayStrides1    6.625   10.784    6.906    0.536
sumfastIb ArrayStrides1    6.602    9.302    6.792    0.300
sumeltIb SubArray    26.313   31.715   27.250    1.178
sumeachIb SubArray    7.205   12.633    7.516    0.573
sumfastIb SubArray    7.192   13.702    7.581    0.760
sumeltIb ReshapedArray{ArrayLS}   21.288   27.281   22.266    1.341
sumeachIb ReshapedArray{ArrayLS}    7.048   10.469    7.369    0.526
sumfastIb ReshapedArray{ArrayLS}    7.037   10.476    7.247    0.376
sumeltIb ReshapedArray{ArrayLF}   10.577   14.395   10.843    0.513
sumeachIb ReshapedArray{ArrayLF}    7.083   11.046    7.308    0.450
sumfastIb ReshapedArray{ArrayLF}    6.181    9.886    6.400    0.430
sumeltFb Array       15.433   17.467   15.789    0.473
sumeachFb Array       3.598    7.467    3.761    0.415
sumfastFb Array       3.362    5.245    3.512    0.196
sumeltFb ArrayLS    140.020  154.997  145.069    4.714
sumeachFb ArrayLS     3.576    6.038    3.745    0.269
sumfastFb ArrayLS     3.575    6.646    3.707    0.292
sumeltFb ArrayLF     15.460   19.912   15.784    0.561
sumeachFb ArrayLF     3.574    5.398    3.676    0.191
sumfastFb ArrayLF     3.401    5.646    3.490    0.197
sumeltFb ArrayStrides  136.608  161.737  143.648    6.908
sumeachFb ArrayStrides   15.453   21.897   16.317    1.018
sumfastFb ArrayStrides   15.445   16.973   15.753    0.434
sumeltFb ArrayStrides1  140.737  241.423  195.276   38.955
sumeachFb ArrayStrides1    3.568    9.799    4.632    0.979
sumfastFb ArrayStrides1    3.566    8.079    3.909    0.654
sumeltFb SubArray    29.383   42.928   32.245    2.760
sumeachFb SubArray    6.774   22.297    9.393    4.191
sumfastFb SubArray   16.320   32.008   17.832    2.597
sumeltFb ReshapedArray{ArrayLS}   51.433   60.201   52.709    1.883
sumeachFb ReshapedArray{ArrayLS}   16.672   31.846   20.355    4.379
sumfastFb ReshapedArray{ArrayLS}   16.650   30.555   17.631    2.464
sumeltFb ReshapedArray{ArrayLF}   10.603   20.125   11.748    1.705
sumeachFb ReshapedArray{ArrayLF}    7.016    9.973    7.474    0.555
sumfastFb ReshapedArray{ArrayLF}    6.157    7.850    6.380    0.294

As you can see, indexing with ReshapedArrays has an unfortunate penalty with respect to Arrays.

@timholy
Copy link
Member Author

timholy commented Mar 13, 2015

I should have added, you can see that most of that penalty is accounted for by comparison with ArrayStrides, which reveals a problem when using a stored tuple field to compute indexing. Since ArrayStride1 performs better, one can expect that this will get a lot better once Jeff's tuple overhaul lands.

@ViralBShah
Copy link
Member

Wow!

@Jutho
Copy link
Contributor

Jutho commented Mar 15, 2015

It's great to see MultiplicativeInverse, however I find the name somewhat confusing. It sounds like it is already an inverse, i.e. if b=MultiplicativeInverse(a) and I want to compute div(x,a), it sounds like I would need to type x*b instead of div(x,b). I was thinking something along the lines of FastDividingInteger (but I certainly hope that somebody can come up with a fancier and/or shorter name). Maybe then it could be implement to be a fully supported type of integer (i.e. abstract FastDividingInteger{T} <: Integer), such that it can be used just like any other Integer (with the promotion rule that as soon as it is combined with another integer, the result is the other integer type, such that not every operation requires to compute the multiplicative inverse information).

Linear indexing into an AbstractArray A could then simply be implemented using
ind2sub(map(FastDividingInteger,size(A)),linearindex) (together with maybe the improved versions of ind2sub in #10337).

With respect to ReshapedArray: This is very nice. I was wondering, though, whether all the complexity to keep track if certain groups of sizes in the new array match with certain groups of sizes in the old array is worth the gain in efficiency; especially since this cannot be determined at compile time and thus gives rise to type instability in the construction. With the fast multiplicative inverse, wouldn't a simple wrapper type that implements indexing using ind2sub(fastsizeparent,sub2ind(newsize,index)) [where fastsizeparent=map(FastDividingInteger,size(parent)) was computed at compile time] be almost as efficient (and type stable)?

@timholy
Copy link
Member Author

timholy commented Mar 17, 2015

I agree with the name, will see what I can come up with.

Linear indexing into an AbstractArray A could then simply be implemented using
ind2sub(map(FastDividingInteger,size(A)),linearindex) (together with maybe the improved versions of ind2sub in #10337).

While I haven't timed it myself, I doubt it will be so simple. For a single call, I'm pretty sure div will be considerably faster. The advantages here will manifest only when multiple calls are being made. (@simonster said the crossover point is somewhere around 32.)

I will look into the grouping part, too. But I suspect we want it the way it is, since div with a FastDividingInteger isn't really all that fast. Still, the type instability is a big problem---the grouping is one source, the difficulty in expressing an inverse for 1 is another.

@Jutho
Copy link
Contributor

Jutho commented Mar 17, 2015

While I haven't timed it myself, I doubt it will be so simple. For a single call, I'm pretty sure div will be considerably faster. The advantages here will manifest only when multiple calls are being made. (@simonster said the crossover point is somewhere around 32.)

I should have explaied myself better. I didn't mean to imply that getindex should literally be implemented in that way. That would indeed mean that the multiplicative inverse is calculated at every indexing operation, which will certainly be slow. I just meant that if some new subtype of AbstractArray wants to support 'efficient' linear indexing, then it should compute its size as a tuple of FastDividingIntegers in the constructor, just like ReshapedArray is currently doing.

@timholy
Copy link
Member Author

timholy commented Mar 20, 2015

OK, this is basically working now (the failures are OSX timeouts), with one big omission not caught by our current tests: indexing a ReshapedArray with the "wrong" number of indexes (neither 1 nor ndims(A)). @mbauman, what are your thoughts on this? Any chance we can provide such functionality through your array revamp? Or should I implement it specifically for ReshapedArrays? It does not fit within your "must be of fixed dimensionality" requirement.

# Special type to handle div by 1
immutable FastDivInteger1{T} <: FastDivInteger{T} end

(*)(a::FastDivInteger, b::FastDivInteger) = a.divisor*b.divisor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe when a and b are both FastDivInteger there product should also be one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a reasonable suggestion.

@mbauman
Copy link
Member

mbauman commented Mar 20, 2015

I'll try to spend some more time with my abstract fallbacks this weekend. But there's still some major design issues. First and foremost is how we have array types like this spell their getindex(::T, ::Int, ::Int, #=...ndims(A) indices...=#) fallback method. I suppose one solution would be:

function getindex(A::MyType, I::Int...)
    ndims(A) != length(I) && invoke(getindex, tuple(AbstractArray, typeof(I)...), A, I...)
    
end

@timholy
Copy link
Member Author

timholy commented Mar 22, 2015

I was wondering, though, whether all the complexity to keep track if certain groups of sizes in the new array match with certain groups of sizes in the old array is worth the gain in efficiency; especially since this cannot be determined at compile time and thus gives rise to type instability in the construction.

I had a little more time and finally looked into this. Let me show you the benchmark:

julia> A = rand(5,5,5);

julia> B = sub(A, 1:5, 1:5, 1:5);

julia> mi5 = Base.FastDivInteger(5)
Base.SignedFastDivInteger{Int64}(5,7378697629483820647,0,0x01)

julia> mi = (mi5, mi5, mi5)
(Base.SignedFastDivInteger{Int64}(5,7378697629483820647,0,0x01),Base.SignedFastDivInteger{Int64}(5,7378697629483820647,0,0x01),Base.SignedFastDivInteger{Int64}(5,7378697629483820647,0,0x01))

julia> ind = Base.Reshaped.IndexMD(mi, size(B))
Base.Reshaped.IndexMD{3,3,(Base.SignedFastDivInteger{Int64},Base.SignedFastDivInteger{Int64},Base.SignedFastDivInteger{Int64})}((Base.SignedFastDivInteger{Int64}(5,7378697629483820647,0,0x01),Base.SignedFastDivInteger{Int64}(5,7378697629483820647,0,0x01),Base.SignedFastDivInteger{Int64}(5,7378697629483820647,0,0x01)),(5,5,5))

julia> R1 = Base.Reshaped.ReshapedArray{eltype(B),ndims(B),typeof(B),(Colon,Colon,Colon)}(B, (:,:,:), size(B));

julia> R2 = Base.Reshaped.ReshapedArray{eltype(B),ndims(B),typeof(B),(typeof(ind),)}(B, (ind,), size(B));

julia> function runsum(R, n)
           s = 0.0
           for k = 1:n
               for I in eachindex(R)
                   s += R[I]
               end
           end
           s
       end
runsum (generic function with 1 method)

julia> runsum(R1, 1)
62.84683709443972

julia> runsum(R2, 1)
62.84683709443972

julia> @time runsum(R1, 10^5)
elapsed time: 0.153061342 seconds (32 kB allocated)
6.2846837094109e6

julia> @time runsum(R2, 10^5)
elapsed time: 0.39375505 seconds (112 bytes allocated)
6.2846837094109e6

julia> @time runsum(R1, 10^5)
elapsed time: 0.152238983 seconds (112 bytes allocated)
6.2846837094109e6

julia> @time runsum(R2, 10^5)
elapsed time: 0.401355134 seconds (112 bytes allocated)
6.2846837094109e6

So, the two options appear to be:

  • Make it type stable but exact a ~3x runtime performance penalty
  • Make it faster at runtime but essentially force people to use a function barrier to circumvent problems from type instability.

Thoughts on how to chose between these?

@timholy
Copy link
Member Author

timholy commented Mar 22, 2015

@mbauman, thanks for thinking about it. But I suspect that approach will have an intolerable runtime cost. I'll follow up over in #10525.

@timholy
Copy link
Member Author

timholy commented Dec 24, 2015

For anyone who might be interested: I now suspect the best approach is not to stress too much about making "ordinary" indexing all that fast for a reshaped array. The better approach (consistent, I think, with suggestions that @simonster has made), is to build special iterators for reshaped arrays.

The key question is whether we want to have them "pretend" to the outside world that they have the external shape (in which case they'll need to maintain both an "external" and and "internal" iterator), or whether it's OK for them to just provide the "internal" iterator. In other words, should algorithms that use two arrays look like this:

for I in eachindex(A, B)
    A[I] = B[I]
end

or like this:

for (IA, IB) in zip(eachindex(A), eachindex(B))
    A[IA] = B[IB]
end

or (if you're really worried about performance) like this:

RA = eachindex(A)
RB = eachindex(B)
if RA == RB
    for I in RA
        A[I] = B[I]
    end
else
    for (IA, IB) in zip(RA, RB)
        A[IA] = B[IB]
    end
end

The latter gives you the advantage of sometimes only needing to check/update a single iterator, but using two when the two arrays are not commensurate.

The hard parts: handling reductions & broadcasting the way we do now (see reducedim.jl), handling offsets I+Ioffset (e.g., it would be instructive to try to write a laplacian algorithm).

@timholy
Copy link
Member Author

timholy commented Feb 12, 2016

Since folks seem to be thinking about this a bit, let me point out that the 3x runtime penalty seems like it might not be inevitable. All the operations for fast division should be able to be performed in registers, which makes me think that (even though it's more computations) it should barely be measurable on the scale of, say, cache misses.

If (through better codegen?) we could reduce that penalty to something much smaller, then this PR wouldn't be stuck between unpleasant alternatives.

Also, I'll note it's been a long time since I benchmarked this, and I'm going by memory here.

@timholy
Copy link
Member Author

timholy commented Mar 11, 2016

Superseded by #15449.

@timholy timholy closed this Mar 11, 2016
@DilumAluthge DilumAluthge deleted the teh/reshapedarrays branch March 25, 2021 22:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants