Fix @simd for non 1 step CartesianPartition #42736

Merged: 3 commits merged into JuliaLang:master from the SimdCartesianPartition branch on Feb 25, 2022

Conversation

N5N3 (Member) commented Oct 21, 2021

The previous code assumed that the step of each axis is 1, which is no longer valid after #37829.
This PR adds support for CartesianPartition with non-unit steps.
Before

julia> @simd for i in view(CartesianIndices((1:2:3,1:2:3)),1:4)
       println(i)
       end
CartesianIndex(1, 1) # wrong: the steps are assumed to be 1, so all 9 parent indices are visited instead of the view's 4
CartesianIndex(2, 1)
CartesianIndex(3, 1)
CartesianIndex(1, 2)
CartesianIndex(2, 2)
CartesianIndex(3, 2)
CartesianIndex(1, 3)
CartesianIndex(2, 3)
CartesianIndex(3, 3)
julia> @simd for i in view(CartesianIndices((1:2:3,1:2:3,1:2:4)),1:4)
       println(i)
       end
ERROR: MethodError: no method matching LinearIndices(::Tuple{StepRange{Int64, Int64}, StepRange{Int64, Int64}})

After

julia> @simd for i in view(CartesianIndices((1:2:3,1:2:3)),1:4)
       println(i)
       end
CartesianIndex(1, 1)
CartesianIndex(3, 1)
CartesianIndex(1, 3)
CartesianIndex(3, 3)
julia> @simd for i in view(CartesianIndices((1:2:3,1:2:3,1:2:4)),1:4)
       println(i)
       end
CartesianIndex(1, 1, 1)
CartesianIndex(3, 1, 1)
CartesianIndex(1, 3, 1)
CartesianIndex(3, 3, 1)
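
For intuition, a minimal sketch of the arithmetic behind the fix (hypothetical helper names, not the actual Base code): with a non-unit step, the i-th index along an axis is first(ax) + (i - 1)*step(ax), not first(ax) + (i - 1).

julia> ax = 1:2:3;

julia> simd_index_old(ax, i) = first(ax) + (i - 1);             # implicit step == 1 assumption

julia> simd_index_new(ax, i) = first(ax) + (i - 1) * step(ax);  # respects the actual step

julia> simd_index_old.(Ref(ax), 1:2), simd_index_new.(Ref(ax), 1:2)
([1, 2], [1, 3])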

Besides the above fix, this PR also simplifies the inner loop by building the outer range with a Generator. This accelerates 3d/4d CartesianPartition a little on my desktop.
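
A hedged illustration of the Generator idea (just the shape of it, not the PR's actual code): the outer tail indices are produced lazily by a Generator, while the inner first-axis range stays a plain range that @simd can vectorize over.

julia> inner = 1:2:7; outer = CartesianIndices((1:2:3,));

julia> gen = (CartesianIndex(i, Tuple(I)...) for i in inner, I in outer);

julia> foreach(println, gen)
CartesianIndex(1, 1)
CartesianIndex(3, 1)
CartesianIndex(5, 1)
CartesianIndex(7, 1)
CartesianIndex(1, 3)
CartesianIndex(3, 3)
CartesianIndex(5, 3)
CartesianIndex(7, 3)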
Some benchmarks:

julia> using BenchmarkTools

julia> f(a) = @inbounds @simd for i in view(CartesianIndices(a),2:length(a)-1)
       a[i] = 1.0
       end
julia> a = zeros(256,256);
julia> @btime f($a)
  10.700 μs (0 allocations: 0 bytes)  # before: 10.900 μs (0 allocations: 0 bytes)  
julia> a = zeros(64,64,64);
julia> @btime f($a)
  44.500 μs (0 allocations: 0 bytes)  # before: 49.700 μs (0 allocations: 0 bytes)
julia> a = zeros(16,16,16,16);
julia> @btime f($a)
  14.000 μs (0 allocations: 0 bytes)  # before: 15.800 μs (0 allocations: 0 bytes)

@vchuravy requested a review from mbauman (Oct 21, 2021, 13:24)
@vchuravy added the "bugfix", "compiler:simd", "backport 1.6", "backport 1.7", "needs nanosoldier run", and "needs tests" labels (Oct 21, 2021)
@vtjnash removed the "needs tests" label (Oct 21, 2021)
vtjnash (Member) commented Oct 21, 2021

@nanosoldier runbenchmarks("simd", vs=":master")

(Review comment on test/iterators.jl: outdated, resolved)
@vtjnash added the "merge me" label and removed the "needs nanosoldier run" label (Oct 21, 2021)
vtjnash (Member) commented Oct 21, 2021

Looks like the tests mostly failed because OpenBLAS tries to start too many threads and exhausts kernel resources; not related to this PR.

nanosoldier (Collaborator) commented:

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

N5N3 (Member, Author) commented Oct 22, 2021

Strangely enough, the performance of 4D @simd does increase after this modification (the previous slowness looks like an oversight).
The @simd performance of CartesianPartition is now close to that of CartesianIndices (at least when the first dimension is long enough).

Benchmark code:

using BenchmarkTools

# Baseline: iterate the full array with plain CartesianIndices.
bench1(f::OP, a, b = ()) where OP = begin
    @inbounds @simd for i in eachindex(IndexCartesian(), a)
        a[i] = f(map(x -> (@inbounds x[i]), b)...)
    end
end
# Same loop over a view of the indices, i.e. a CartesianPartition.
bench2(f::OP, a, b = ()) where OP = begin
    iter = view(eachindex(IndexCartesian(), a), 2:length(a) - 1)
    @inbounds @simd for i in iter
        a[i] = f(map(x -> (@inbounds x[i]), b)...)
    end
end
println("----------fill!------------")
N = 2^16   # total length is fixed; the first dimension shrinks as n grows
for n = 2:4
    a = zeros(N>>(3n-3), ntuple(_ -> 8, n - 1)...)
    println("N = ", n)
    @btime bench1(Returns(1), $a)   # Returns requires Julia 1.7+
    @btime bench2(Returns(1), $a)
end
println("---------- .* ------------")
N = 2^16
for n = 2:4
    a = zeros(N>>(3n-3), ntuple(_ -> 8, n - 1)...)
    b = randn(size(a))
    c = randn(size(a))
    println("N = ", n)
    @btime bench1(*, $a, ($c, $b))
    @btime bench2(*, $a, ($c, $b))
end

Result:

----------fill!------------
N = 2
  10.800 μs (0 allocations: 0 bytes)
  11.100 μs (0 allocations: 0 bytes)
N = 3
  10.700 μs (0 allocations: 0 bytes)
  10.900 μs (0 allocations: 0 bytes)
N = 4
  11.100 μs (0 allocations: 0 bytes)
  10.800 μs (0 allocations: 0 bytes)
---------- .* ------------
N = 2
  22.700 μs (0 allocations: 0 bytes)
  22.800 μs (0 allocations: 0 bytes)
N = 3
  22.700 μs (0 allocations: 0 bytes)
  22.800 μs (0 allocations: 0 bytes)
N = 4
  22.800 μs (0 allocations: 0 bytes)
  23.300 μs (0 allocations: 0 bytes)

DilumAluthge (Member) commented:
@vtjnash I notice that a few commits were pushed after you added the merge me label, so I'll remove the label until you have a chance to review the new commits.

@DilumAluthge removed the "merge me" label (Oct 22, 2021)
DilumAluthge (Member) commented:
But anyway, all CI is green (except for tester_freebsd64, which is a known issue).

vchuravy (Member) commented:
@nanosoldier runbenchmarks("simd", vs=":master")

vchuravy (Member) commented:
@N5N3 Could you also add some benchmarks for this to BaseBenchmarks.jl?

N5N3 (Member, Author) commented Oct 22, 2021

I’d like to, but I’m quite unfamiliar with BaseBenchmarks.jl.
IIUC, it only tests Vector’s simd performance.
Should we also add benchmarks for CartesianIndices?

vchuravy (Member) commented:
> Should we also add benchmarks for CartesianIndices?

Yes please. This is quite sensitive to future code changes or unrelated changes in other parts of the pipeline.
I think we can land this before adding benchmarks there, but they would be great to have.

nanosoldier (Collaborator) commented:

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

vtjnash (Member) commented Oct 23, 2021

It seems the nanosoldier perf test failed as well, whereas merge-me means merge only once tests are good.

N5N3 (Member, Author) commented Oct 23, 2021

IIUC, nanosoldier's simd benchmark doesn't cover CartesianPartition or CartesianIndices,
so it should be impossible for this PR to cause any "visible" regression?

The worst case, with a 2.00 time ratio, is a summation with boundscheck over -10^5:-1:1, which also shouldn't be affected by this PR.

Anyway, I agree that performance should be tested before we merge this PR. I'll add Cartesian-related simd benchmarks to BaseBenchmarks.jl as soon as possible. (Edit: benchmarks added, @vchuravy @vtjnash.)

N5N3 (Member, Author) commented Oct 26, 2021

It takes about 30 ns to generate a 2d reshaped CartesianIndices on my desktop.
Since we only index into the outer range once, there's no need to make it fast-indexable.

By calling the 3-arg Base.ReshapedArray constructor directly, that overhead is avoided.
This does save some time for 3d/4d cases with small sizes.
(The definition of CartesianPartition is widened a little so that a view of such a "fake" ReshapedArrayLF is still a CartesianPartition.)
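
For reference, a hedged sketch of the trick (relying on Base internals; not the PR's exact code): passing an empty mi tuple to the 3-arg constructor skips the multiplicative-inverse precomputation that reshape performs, yet still yields the ReshapedArrayLF type.

julia> A = CartesianIndices((1:2:7, 1:2:7));

julia> R = Base.ReshapedArray(A, (length(A),), ());  # empty `mi`: no multiplicative-inverse precompute

julia> R isa Base.ReshapedArrayLF  # the "fake" linear-fast alias, though the parent is IndexCartesian
true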

N5N3 (Member, Author) commented Oct 27, 2021

Looks like all the test failures are due to a network problem.

@KristofferC mentioned this pull request (Nov 19, 2021)
N5N3 (Member, Author) commented Nov 21, 2021

Just noticed that the reused simd_index makes the position of @inbounds influence @simd for's performance. (BitArray's broadcast is the only use case I found in Base.)

On the other hand, I still think we'd better make CartesianIndices's simd_index always inbounds, as before, since it is really strange that @inbounds affects the speed of index iteration.
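
For illustration, a hedged sketch of the two placements being compared (hypothetical function names):

julia> g_outside(a) = @inbounds @simd for i in eachindex(IndexCartesian(), a)
           a[i] = 1.0   # index computation and store are both under @inbounds
       end;

julia> g_inside(a) = @simd for i in eachindex(IndexCartesian(), a)
           @inbounds a[i] = 1.0   # only the store is under @inbounds
       end;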

N5N3 (Member, Author) commented Jan 6, 2022

Bump?

@KristofferC mentioned this pull request (Jan 10, 2022)
@KristofferC added the "merge me" label (Feb 25, 2022)
Commits:
revert changes in reshapedarray.jl
use Iterators.rest
move `@inbounds` outside the loop body (see JuliaLang#38086): `first(Base.axes1())` works well, but `length(::StepRange)` won't be omitted by LLVM if `step` is unknown at compile time.
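
A quick REPL illustration of that last note (standard Base functions; a sketch, not the PR's code): the first index of a range's axes is the constant 1, while the length of a StepRange is computed with a division by the step, which LLVM cannot fold away when the step is a runtime value.

julia> r = 1:2:7;

julia> first(Base.axes1(r))   # axes of a range always start at 1
1

julia> length(r)              # involves a division by step(r); not free for a runtime step
4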
KristofferC (Member) commented:
Rebased to get another CI run.

N5N3 (Member, Author) commented Feb 25, 2022

Thanks for picking this up.
The error on "julia-master" should be #42752 (thus unrelated).

@DilumAluthge merged commit 2e2c16a into JuliaLang:master (Feb 25, 2022)
@DilumAluthge removed the "merge me" label (Feb 25, 2022)
@N5N3 deleted the SimdCartesianPartition branch (Feb 26, 2022)
@N5N3 added the "backport 1.8" label (Feb 26, 2022)
staticfloat pushed a commit to JuliaCI/julia-buildkite-testing that referenced this pull request (Mar 2, 2022)
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request (Mar 8, 2022)
KristofferC pushed commits that referenced this pull request (Mar 15 and Mar 16, 2022)
@KristofferC removed the "backport 1.8" label (Mar 21, 2022)
@KristofferC removed the "backport 1.6" label (May 16, 2022)