Fix `@simd` for non-1-step `CartesianPartition` #42736
Conversation
@nanosoldier
Looks like the tests mostly failed because OpenBLAS tried to start too many threads and exhausted kernel resources; this is not related to this PR.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Benchmark code:

```julia
using BenchmarkTools  # provides @btime

bench1(f::OP, a, b = ()) where OP = begin
    @inbounds @simd for i in eachindex(IndexCartesian(), a)
        a[i] = f(map(x -> (@inbounds x[i]), b)...)
    end
end

bench2(f::OP, a, b = ()) where OP = begin
    iter = view(eachindex(IndexCartesian(), a), 2:length(a) - 1)
    @inbounds @simd for i in iter
        a[i] = f(map(x -> (@inbounds x[i]), b)...)
    end
end

println("----------fill!------------")
N = 2^16
for n = 2:4
    a = zeros(N >> (3n - 3), ntuple(_ -> 8, n - 1)...)
    println("N = ", n)
    @btime bench1(Returns(1), $a)
    @btime bench2(Returns(1), $a)
end

println("---------- .* ------------")
N = 2^16
for n = 2:4
    a = zeros(N >> (3n - 3), ntuple(_ -> 8, n - 1)...)
    b = randn(size(a))
    c = randn(size(a))
    println("N = ", n)
    @btime bench1(*, $a, ($c, $b))
    @btime bench2(*, $a, ($c, $b))
end
```

Result:
```
----------fill!------------
N = 2
10.800 μs (0 allocations: 0 bytes)
11.100 μs (0 allocations: 0 bytes)
N = 3
10.700 μs (0 allocations: 0 bytes)
10.900 μs (0 allocations: 0 bytes)
N = 4
11.100 μs (0 allocations: 0 bytes)
10.800 μs (0 allocations: 0 bytes)
---------- .* ------------
N = 2
22.700 μs (0 allocations: 0 bytes)
22.800 μs (0 allocations: 0 bytes)
N = 3
22.700 μs (0 allocations: 0 bytes)
22.800 μs (0 allocations: 0 bytes)
N = 4
22.800 μs (0 allocations: 0 bytes)
23.300 μs (0 allocations: 0 bytes)
```
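For context, here is a hedged sketch (not part of the PR; the grid sizes are illustrative) of the structure `bench2` iterates over, and of the non-unit-step case this PR fixes:

```julia
# A CartesianPartition is the 1-d view obtained by slicing a
# CartesianIndices linearly, as bench2 does with
# view(eachindex(IndexCartesian(), a), 2:length(a) - 1).
# Since JuliaLang/julia#37829, the underlying axes may carry
# non-unit steps, which is the case the @simd lowering must handle.
ci = CartesianIndices((1:2:7, 1:3))   # 4×3 grid, step 2 along dim 1
part = view(ci, 2:length(ci) - 1)     # drop the first and last index
n = 0
@inbounds @simd for I in part
    n += 1                            # count each visited index once
end
@assert n == length(part) == 10       # 12 indices total, 2 dropped
```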
@vtjnash I notice that a few commits were pushed after you added the
But anyway, all CI is green (except for
@nanosoldier
@N5N3 Could you also add some benchmarks for this to BaseBenchmarks.jl?
I'd like to, but I'm quite unfamiliar with BaseBenchmarks.jl.
Yes please. This is quite sensitive to future code changes or unrelated changes in other parts of the pipeline.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
It seems the nanosoldier perf test also failed, whereas merge-me implies that the tests are good.
IIUC,
Anyway, I agree that performance should be tested before we merge this PR.
It took about 30 ns to generate the 2d reshaped By calling the 3-arg
Looks like all tests failed because of a network problem.
Just noticed that reused On the other hand, I still think we'd better make
Bump?
revert changes in reshapedarray.jl; use Iterators.rest
move `@inbounds` outside the loop body. see JuliaLang#38086
`first(Base.axes1())` works well, but `length(::StepRange)` won't be omitted by LLVM if `step` is unknown at compile time.
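The note above about `length(::StepRange)` can be illustrated with a simplified sketch (the formulas below only mirror the idea; the real Base implementations also handle empty ranges and overflow):

```julia
# A UnitRange length is a single subtraction, which LLVM can fold away,
# while a StepRange length also needs an integer division by the step,
# which cannot be removed when the step is only known at run time.
unit_len(r::UnitRange) = last(r) - first(r) + 1
step_len(r::StepRange) = div(last(r) - first(r), step(r)) + 1

@assert unit_len(3:10)  == length(3:10)  == 8
@assert step_len(3:2:9) == length(3:2:9) == 4   # 3, 5, 7, 9
```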
Rebased to get another CI run.
Thanks for picking this up.
The previous code assumed that the step of each axis is 1, which is no longer valid after #37829. This PR adds support for non-1-step `CartesianPartition`.
Before
After
Besides the above fix, this PR also simplifies the inner loop with a `Generator`-based outer range, which accelerates 3d/4d `CartesianPartition` iteration a little on my desktop. Some benchmarks: