
Try implementing N-dimensional indexing for fast linear SubArrays #30266

Merged
merged 1 commit into from
Dec 17, 2018

Conversation

mbauman
Member

@mbauman mbauman commented Dec 4, 2018

Ref:
#30218 (comment).

This is largely a test balloon to see what Nanosoldier thinks.

@nanosoldier runbenchmarks(ALL, vs = ":master")

@kshyatt kshyatt added the arrays label Dec 4, 2018
@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@mbauman
Member Author

mbauman commented Dec 5, 2018

Well that's interesting. I'll have to dig into this a bit more — that improved a few benchmarks but it didn't improve the ones I was targeting.

@StefanKarpinski
Member

> that improved a few benchmarks but it didn't improve the ones I was targeting.

Win?

@mbauman
Member Author

mbauman commented Dec 5, 2018

Just to expand on this a little bit: this PR only changes things for views of arrays with fast linear indexing: V = view(A, I, J, K). Assuming A uses linear indexing as its canonical lookup, there are three ways to compute the linear index into A for V[i, j, k], depending on whether we cache an offset and stride at construction time.

  1. Convert the tuple (i, j, k) into one linear index into V, then multiply by the stride and add the offset to find the linear index into A. This ends up being:

     # precompute offset and stride at construction time, then
     offset + ((i-1) + (j-1)*size(V, 1) + (k-1)*size(V, 1)*size(V, 2))*stride + 1
    
  2. In cases where we statically know the indices are unit-stride, we can avoid the multiplication and the -1/+1 dance. This is definitely simpler and we do it when we can:

     # precompute offset and stride at construction time, then
     offset + (i + (j-1)*size(V, 1) + (k-1)*size(V, 1)*size(V, 2))
    
  3. Or we could just re-index into the stored indices and then squash them down into a linear index into A. It's not immediately obvious whether this is a win, since it depends on the complexity of the re-indexing:

     I[i] + (J[j]-1)*size(A, 1) + (K[k]-1)*size(A, 1)*size(A, 2)
    

Now, the complexity of re-indexing into our stored indices (I[i] and such) is limited, because only a limited set of indices can be represented with the alternate offset/stride computation. The regressions in #30218 occurred where I, J, and K were all integers, so the indexing operations were totally free. In general we're limited, by definition, to at most one range in our set of indices (with everything else being :s or integers, which are free), so re-indexing is just an addition and maybe a multiplication. So (3) should be no worse than (1) or (2).
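To make the comparison concrete, here is a small sketch (in Python, with the 1-based, column-major arithmetic kept to mirror the formulas above; the 4×5×3 parent shape and the `(:, :, 2:3)` view are invented for illustration) checking that all three computations produce the same linear index into A for a contiguous view:

```python
# Hypothetical check (shape and indices invented) that the three ways of
# computing the parent linear index agree. Everything is 1-based to mirror
# the Julia formulas; the parent A is 4x5x3 and V = view(A, :, :, 2:3).
size_A = (4, 5, 3)
I, J, K = range(1, 5), range(1, 6), range(2, 4)   # the stored view indices
size_V = (len(I), len(J), len(K))                 # (4, 5, 2)

def parent_lin(a, b, c):
    # Column-major linear index of A[a, b, c], 1-based
    return a + (b - 1) * size_A[0] + (c - 1) * size_A[0] * size_A[1]

# Cached at construction time
offset = parent_lin(I[0], J[0], K[0]) - 1   # 20 for this view
stride = 1                                  # this view is contiguous

def way1(i, j, k):
    # (1) linearize (i, j, k) into V, then scale by stride and shift by offset
    v = (i - 1) + (j - 1) * size_V[0] + (k - 1) * size_V[0] * size_V[1]
    return offset + v * stride + 1

def way2(i, j, k):
    # (2) unit-stride special case: no multiply, no -1/+1 dance
    return offset + (i + (j - 1) * size_V[0] + (k - 1) * size_V[0] * size_V[1])

def way3(i, j, k):
    # (3) re-index the stored indices, then linearize into A directly
    return parent_lin(I[i - 1], J[j - 1], K[k - 1])

assert all(way1(i, j, k) == way2(i, j, k) == way3(i, j, k)
           for i in range(1, 5) for j in range(1, 6) for k in range(1, 3))
```

For this contiguous view the cached stride is 1, so (1) and (2) differ only in whether the ±1 bookkeeping is folded away; (3) never touches the cached offset and stride at all.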

But here's the kicker: that's all a red herring. The thing that is foiling LLVM's vectorization is computing the offset and stride at construction time. That's why this PR didn't solve the issue.

@mbauman
Member Author

mbauman commented Dec 6, 2018

There was a 2x regression in LinAlg; I'm not sure whether it's real.

@nanosoldier runbenchmarks("linalg", vs = ":master")

Assuming that comes back clean, I think we should merge this.

I think we'll have to eat the synthetic benchmark regression from #29895, though: this change ends up being an improvement for the usage of 0-d subarrays, but it's their construction that ends up foiling vectorization in these particular cases.

@nanosoldier
Collaborator

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against comparison commit: failed process: Process(`make -j3 USECCACHE=1`, ProcessExited(2)) [2]

Logs and partial data can be found here
cc @ararslan

@mbauman
Member Author

mbauman commented Dec 7, 2018

Try again?

@nanosoldier runbenchmarks("linalg", vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here. cc @ararslan

@@ -222,13 +222,14 @@ end

 # In general, we simply re-index the parent indices by the provided ones
 SlowSubArray{T,N,P,I} = SubArray{T,N,P,I,false}
-function getindex(V::SlowSubArray{T,N}, I::Vararg{Int,N}) where {T,N}
+function getindex(V::SubArray{T,N}, I::Vararg{Int,N}) where {T,N}
Member

Brilliant optimization --- just remove the word "slow"! 😄

@JeffBezanson
Member

I assume this is intended to fix a regression for 1.1?

@mbauman
Copy link
Member Author

mbauman commented Dec 12, 2018

Yeah, that was the goal but I missed and improved performance elsewhere instead. I suppose we could still target it for 1.1 as a mitigating perf improvement.

@JeffBezanson JeffBezanson added this to the 1.1 milestone Dec 12, 2018
@KristofferC KristofferC mentioned this pull request Dec 12, 2018
52 tasks
@JeffBezanson JeffBezanson merged commit 433ba13 into master Dec 17, 2018
@JeffBezanson JeffBezanson deleted the mb/cartesiansubarray branch December 17, 2018 20:51
KristofferC pushed a commit that referenced this pull request Dec 20, 2018
Labels
arrays, performance (Must go faster)

6 participants