Faster view creation #19259
Conversation
May as well see what the soldier reports in any case: @nanosoldier
Your benchmark job has completed, but no benchmarks were actually executed. Perhaps your tag predicate contains misspelled tags? cc @jrevels
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels
Nice! I like the direction, good catch. What kind of speedups are you seeing? Aside from the test failures, you might want to look into the
```julia
parent::P
indexes::I
offset1::L  # for linear indexing and pointer, only stored when LinearFast
stride1::L  # for linear indexing, only stored when LinearFast
```
tabs -> spaces
I didn't push this sooner since I had a hard time constructing benchmarks that actually demonstrated that this was an improvement. The inlining changes were purely based upon simplifications to

The reason I pushed it when I did was because I didn't want others to duplicate this work as they investigated #19257. That result seems to have been somewhat spurious, but we need more benchmarks here in any case. I'm not sure when I'll have time to test this further.
So what's your call, @mbauman, merge or not? If you don't think it's a regression, I'd be in favor. |
Lemme drop the last commit here and propose it separately since it's breaking. |
I'd like to merge this for 0.6, but it'd be good to wait until JuliaCI/BaseBenchmarks.jl#54 makes it onto Nanosoldier for a final perf check.
```julia
# before
compute_offset1(parent, stride1::Integer, dims::Tuple{Int}, inds::Tuple{Colon}, I::Tuple) =
    compute_linindex(parent, I) - stride1*first(indices(parent, dims[1]))  # index-preserving case
compute_offset1(parent, stride1::Integer, dims, inds, I::Tuple) =
    compute_linindex(parent, I) - stride1  # linear indexing starts with 1

# after
compute_offset1(parent, stride1::Integer, dims::Tuple{Int}, inds::Tuple{Colon}, I::Tuple) =
    (@_inline_meta; compute_linindex(parent, I) - stride1*first(indices(parent, dims[1])))  # index-preserving case
compute_offset1(parent, stride1::Integer, dims, inds, I::Tuple) =
    (@_inline_meta; compute_linindex(parent, I) - stride1)  # linear indexing starts with 1
```
bit overly long lines here, wrap?
All it needs is the dimensionality of the indexing result.
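For readers following along, the offset1/stride1 scheme discussed above can be illustrated with a toy wrapper. This is a hedged sketch with made-up names (`FastView`, `fastview`), not Base's actual `SubArray` implementation, which is far more general:

```julia
# Toy illustration of the offset1/stride1 scheme: a fast-linear view stores
# offset1 so that view index i maps to parent index offset1 + stride1*i.
# Names here are invented for illustration only.
struct FastView{T}
    parent::Vector{T}
    offset1::Int   # parent index = offset1 + stride1*i
    stride1::Int
end

# Choosing offset1 = firstidx - step makes index 1 land on firstidx,
# i.e. "linear indexing starts with 1".
fastview(v::Vector, firstidx::Int, step::Int) = FastView(v, firstidx - step, step)

Base.getindex(fv::FastView, i::Int) = fv.parent[fv.offset1 + fv.stride1*i]

v = collect(1:10)
fv = fastview(v, 3, 2)   # conceptually v[3:2:end]
@assert fv[1] == 3 && fv[2] == 5 && fv[3] == 7
```

Precomputing the offset once at construction is exactly what lets indexing later reduce to a single multiply-add.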
@nanosoldier
Had to kick nanosoldier, retriggering: @nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels
Hm, I cannot reproduce any of the largest regressions… in fact

Since I can't reproduce the perf issues, I think this is as good as I can make it.
As another data point, on my laptop I'm getting a ~3x slowdown for the benchmark in the comment above. The LLVM codes are here: https://gist.github.com/KristofferC/1cf9e09d97289b521f494c9c68958043. Edit: Removed some confusion...
Thanks for checking, Kristoffer. I'm seeing the same LLVM IR as you, but 75us on this branch compared to 130us on master. I have an old i5; OpenBLAS reports compiling for Nehalem, LLVM reports Westmere.
> On Jan 3, 2017, at 6:09 PM, Kristoffer Carlsson wrote:
>
> As another data point, on my laptop I'm getting a ~3x slowdown. The LLVM codes are here: https://gist.github.com/KristofferC/1cf9e09d97289b521f494c9c68958043 and what seems suspicious is this call: `%22 = call i64 @jlsys_convert_52679(%jl_value_t* inttoptr (i64 139759635777136 to %jl_value_t*), double %21)` in the loop.
I get 115us on this PR and ~45us on master. On 0.5 it is around 115us as well. This PR improves much more than it regresses, though, so while it is sometimes good to be greedy, perhaps this can just be merged?
The failure is the current ongoing 32-bit Linux problem. I'm with @KristofferC on this.
Sounds good to me too.
@yuyichao postulated that the slowdown is due to LLVM generating bad native code on newer architectures, causing a partial register stall when converting an integer to a double. I confirmed this by recompiling the sysimg for the architecture in question (line 84 in 30bf89f).
In that light I would say we should merge this and tackle the code generation issue later. |
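The suspected codegen issue is easy to poke at in isolation. Below is a hedged sketch, not the exact benchmark from the thread; whether a stall actually occurs depends entirely on the Julia/LLVM version and the target CPU:

```julia
# Minimal Int -> Float64 conversion in a hot loop: the pattern behind the
# quoted jlsys_convert call. The conversion lowers to an LLVM `sitofp`,
# which on some newer CPUs can incur a partial register stall.
function sumconv(n::Int)
    s = 0.0
    for i in 1:n
        s += Float64(i)
    end
    return s
end

@assert sumconv(10) == 55.0

# Inspect the generated code (output varies by version and target CPU):
# @code_llvm sumconv(10)
# @code_native sumconv(10)
```

Comparing `@code_native` output across sysimgs built for different target architectures is how the register-stall hypothesis was checked.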
I wonder if this has regressed?

master:

```julia
julia> A = rand(1000,1000,1);

julia> @benchmark view(A, :, :, 1) seconds=1
BenchmarkTools.Trial:
  memory estimate:  880 bytes
  allocs estimate:  40
  --------------
  minimum time:     14.049 μs (0.00% GC)
  median time:      14.456 μs (0.00% GC)
  mean time:        14.703 μs (0.00% GC)
  maximum time:     64.914 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
```

julia-0.5:

```julia
julia> A = rand(1000,1000,1);

julia> @benchmark view(A, :, :, 1) seconds=1
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     27.119 ns (0.00% GC)
  median time:      28.238 ns (0.00% GC)
  mean time:        35.656 ns (6.56% GC)
  maximum time:     1.478 μs (95.10% GC)
  --------------
  samples:          10000
  evals/sample:     995
  time tolerance:   5.00%
  memory tolerance: 1.00%
```

(I'd fix this myself except I'm up to my neck in another project right now.)
I get:

Looks like the missed interpolation of
You are so right. Newbie error 😄. With it, they're the same speed on 0.5 and 0.6.
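For readers unfamiliar with the pitfall being discussed: BenchmarkTools times expressions in global scope, so a non-interpolated global like `A` adds dynamic-dispatch overhead to every sample. Interpolating with `$` splices the array in as a typed value. A sketch, assuming the BenchmarkTools package is installed:

```julia
using BenchmarkTools

A = rand(1000, 1000, 1)

# Without `$`, A is an untyped global: the timing includes dispatch
# and boxing overhead on top of the view creation itself.
@benchmark view(A, :, :, 1)

# With `$`, only the view construction is measured.
@benchmark view($A, :, :, 1)
```

This is why the apparent 880-byte/40-allocation result above vanished once the benchmark was written with interpolation.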
This is a series of performance patches that I wrote a while ago but never had a chance to fully vet for performance impact. At a minimum, this fixes #19257. We really need better benchmarks for SubArray creation in Nanosoldier; I don't think view construction is exercised there at all.

The fourth commit is the most micro- of micro-optimizations, and it might be a little too cute… but I think there's good reason to do it beyond the micro-optimization. See the commit message for more details.
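As a quick sanity check on what "view creation" buys, independent of any benchmark harness (this is a semantic check, not a timing, and works on any recent Julia version):

```julia
# A view shares storage with its parent instead of copying, which is why
# making its construction cheap matters: views are often created in hot loops.
A = rand(100, 100)
v = view(A, :, 1)

@assert v == A[:, 1]      # same values as the copying slice A[:, 1]
@assert parent(v) === A   # but no copy: v aliases A's storage
A[1, 1] = 0.0
@assert v[1] == 0.0       # mutations of the parent show through the view
```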