Add Cartesian related simd benchmarks #284

Merged: 4 commits merged into JuliaCI:master from N5N3:moresimd on Nov 2, 2021

Conversation

N5N3 (Contributor) commented Oct 23, 2021

Some of the original perf-test functions are extended to bench 2d/3d/4d Cartesian SIMD.
Since the length of the 1st dim definitely influences the performance, I'm not confident about the representativeness of the chosen bench sizes.
Pinging @chriselrod for advice.

see also JuliaLang/julia#42736
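
For reference, here is a minimal sketch of the kind of kernel this extends (my own illustration, not the actual BaseBenchmarks code; the name cartesian_axpy! is hypothetical): an axpy!-style update driven by CartesianIndices, so only the first dimension is contiguous and @simd has to vectorize across it.

function cartesian_axpy!(a, y::AbstractArray{T,N}, x::AbstractArray{T,N}) where {T,N}
    # CartesianIndices iteration: the innermost (first) dimension is the SIMD loop
    @inbounds @simd for I in CartesianIndices(y)
        y[I] = muladd(a, x[I], y[I])
    end
    return y
end

# the 2d/3d/4d cases are the same kernel applied to differently shaped arrays
x2, y2 = rand(Float32, 64, 1024), rand(Float32, 64, 1024)
x4, y4 = rand(Float32, 64, 8, 8, 16), rand(Float32, 64, 8, 8, 16)
cartesian_axpy!(1.5f0, y2, x2)
cartesian_axpy!(1.5f0, y4, x4)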

add Cartesian related benchmarks
chriselrod commented Oct 23, 2021


LLVM is generally going to unroll by 4x the SIMD vector width.
So for Float32, AVX512 would give us 4*16 = 64 iterations at a time.
Testing 31, 32, 63, and 64 should therefore be enough to cover SIMD on all current architectures.
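
Quick arithmetic behind those numbers (a sketch; assumes Float32 with full 512-bit AVX512 vectors):

lanes  = 512 ÷ (8 * sizeof(Float32))   # 16 Float32 lanes per 512-bit vector
unroll = 4                             # LLVM's usual interleave factor
lanes * unroll                         # 64 iterations handled per unrolled block
# so first-dimension lengths of 31, 32, 63, and 64 cover the interesting remainders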

FWIW, the dim1 = 31, 32, and 63 sizes will fail to vectorize.

However, that doesn't actually matter here, because the benchmark is totally dominated by memory bandwidth.
Compare with LoopVectorization, which will not fail to vectorize just because dim1 = 63; see the @pstats runs below.
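
(The functions in that output aren't defined in this thread; the following is a hedged reconstruction of what perf_axpy!, perf_turbo!, and foreachf plausibly look like, not the actual benchmark code.)

using LoopVectorization   # provides @turbo

# plain @simd version; eachindex on these 4d SubArrays yields CartesianIndices
function perf_axpy!(a, y, x)
    @inbounds @simd for I in eachindex(y, x)
        y[I] = muladd(a, x[I], y[I])
    end
    return y
end

# LoopVectorization version; as noted above, it vectorizes this case even when dim1 = 63
function perf_turbo!(a, y, x)
    @turbo for I in CartesianIndices(y)
        y[I] = muladd(a, x[I], y[I])
    end
    return y
end

# run f(args...) N times so the work is long enough to profile
foreachf(f, N, args...) = foreach(_ -> f(args...), 1:N)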

julia> size(v), typeof(v)
((63, 8, 8, 65), SubArray{Float32, 4, Array{Float32, 4}, NTuple{4, Base.OneTo{Int64}}, false})

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(perf_axpy!, 10_000, n, v, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.31e+09   50.0%  #  4.1 cycles per ns
┌ instructions             5.94e+09   75.0%  #  1.8 insns per cycle
│ branch-instructions      9.23e+08   75.0%  # 15.6% of instructions
└ branch-misses            7.38e+06   75.0%  #  0.8% of branch instructions
┌ task-clock               8.08e+08  100.0%  # 808.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    3.32e+08   25.0%  # 20.0% of dcache loads
│ L1-dcache-loads          1.66e+09   25.0%
└ L1-icache-load-misses    2.03e+05   25.0%
┌ dTLB-load-misses         1.96e+02   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.66e+09   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(perf_turbo!, 10_000, n, v, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.05e+09   50.0%  #  3.8 cycles per ns
┌ instructions             9.30e+08   75.0%  #  0.3 insns per cycle
│ branch-instructions      9.70e+07   75.0%  # 10.4% of instructions
└ branch-misses            5.71e+04   75.0%  #  0.1% of branch instructions
┌ task-clock               7.92e+08  100.0%  # 791.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    3.30e+08   25.0%  # 91.1% of dcache loads
│ L1-dcache-loads          3.62e+08   25.0%
└ L1-icache-load-misses    6.61e+04   25.0%
┌ dTLB-load-misses         1.28e+02   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               3.62e+08   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

We see LoopVectorization required less than 1/6 the total number of instructions, but 91.1% of its dcache loads still missed L1.

For comparison, using parent to get the underlying Arrays, and thus linear indexing:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(perf_axpy!, 10_000, n, parent(v), parent(x))
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.07e+09   50.0%  #  3.9 cycles per ns
┌ instructions             6.21e+08   75.0%  #  0.2 insns per cycle
│ branch-instructions      4.23e+07   75.0%  #  6.8% of instructions
└ branch-misses            3.20e+04   75.0%  #  0.1% of branch instructions
┌ task-clock               7.97e+08  100.0%  # 797.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    3.29e+08   25.0%  # 99.6% of dcache loads
│ L1-dcache-loads          3.30e+08   25.0%
└ L1-icache-load-misses    1.32e+05   25.0%
┌ dTLB-load-misses         2.00e+01   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               3.30e+08   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(perf_turbo!, 10_000, n, parent(v), parent(x))
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.07e+09   50.0%  #  3.9 cycles per ns
┌ instructions             6.21e+08   75.0%  #  0.2 insns per cycle
│ branch-instructions      4.22e+07   75.0%  #  6.8% of instructions
└ branch-misses            3.69e+04   75.0%  #  0.1% of branch instructions
┌ task-clock               7.96e+08  100.0%  # 795.8 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    3.29e+08   25.0%  # 99.6% of dcache loads
│ L1-dcache-loads          3.30e+08   25.0%
└ L1-icache-load-misses    5.41e+04   25.0%
┌ dTLB-load-misses         1.60e+01   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               3.30e+08   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

They both now experience 99.6% dcache load misses.

It is 8x faster if I cut nbytes in half, but still quite bad, going from 0.2 instructions/cycle to 0.8.
The computer I'm testing on has a 1 MiB L2 cache (Cascadelake-X).
Thus two arrays of 1 MiB each (the default) require streaming memory from the L3 cache.
Cutting the size in half, making them both 0.5 MiB, thus gives us a 4x improvement in instructions per cycle, thanks to letting the memory fit in this CPU's L2.
If you want to test something other than L2- or L3-to-register memory bandwidth, it may make sense to make the arrays much smaller.
The L1 data cache of most modern CPUs is 32 KiB.
But you could at least shoot for fitting in L2 caches.
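
Back-of-the-envelope sizing under those assumptions (two equally sized Float32 arrays; 32 KiB L1d and a 1 MiB L2 like the Cascadelake-X machine above):

l1d, l2 = 32 * 2^10, 1 * 2^20       # cache sizes in bytes
narrays = 2
l2 ÷ narrays                        # 524288 bytes = 512 KiB per array to stay in L2
(l2 ÷ narrays) ÷ sizeof(Float32)    # 131072 Float32 elements per array
l1d ÷ narrays                       # 16384 bytes = 16 KiB per array for an L1-resident test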

EDIT:
I also totally forgot that I always pass an argument to LLVM to make it use 512-bit vectors on AVX512 CPUs (I start Julia with -C"native,-prefer-256-bit"). By default, LLVM 10 and newer will only use 256-bit vectors.
So with default settings the results would look a little different.
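
One way to check which width your own session actually uses (a sketch; the exact IR depends on your CPU and the -C flag above): inspect the LLVM IR of a trivially vectorizable reduction and look for <8 x float> (256-bit) versus <16 x float> (512-bit) operands.

using InteractiveUtils: code_llvm

function sumvec(x::Vector{Float32})
    s = 0.0f0
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

code_llvm(sumvec, (Vector{Float32},); debuginfo = :none)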

make nbytes smaller to fit in L2 caches
N5N3 (Contributor, Author) commented Oct 23, 2021

Looks like the LinearAlgebra bench died on windows-latest.
The Codecov regression seems to come from lines containing only end.
Is this mergeable? @vchuravy @vtjnash

end
end
const nbytes = 1 << 18
chriselrod commented Oct 23, 2021


Yeah, 512 KiB combined for the two arrays should be good for Intel server processors Skylake and newer, client processors Ice Lake and newer, and for AMD Zen.
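
Sanity check of those numbers (assuming, as elsewhere in the thread, two Float32 arrays of nbytes bytes each; the exact benchmark setup is my guess):

const nbytes = 1 << 18          # 262144 bytes = 256 KiB per array
2 * nbytes                      # 524288 bytes = 512 KiB combined
nbytes ÷ sizeof(Float32)        # 65536 Float32 elements per array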

1. avoid `view` in `CartesianPartition`'s bench.
2. make bench size smaller
3. add a manually partitioned sum bench
N5N3 (Contributor, Author) commented Oct 27, 2021

The maximum benched size is reduced to 32 KiB per array.
On my 9600KF, the median (of the per-benchmark median) time-cost ratio between JuliaLang/julia#42736 and 1.7.0-rc2 is 89% for the 4d case (94% for the 3d case, 100% for the 2d case).

vchuravy merged commit 43db1e4 into JuliaCI:master on Nov 2, 2021.
N5N3 deleted the moresimd branch on November 2, 2021.
vtjnash referenced this pull request in JuliaLang/julia on Nov 16, 2021 (Co-authored-by: Jameson Nash <vtjnash@gmail.com>).