Add Cartesian related simd benchmarks #284

Merged: 4 commits merged into JuliaCI:master from N5N3:moresimd on Nov 2, 2021

Conversation

N5N3 (Contributor) commented Oct 23, 2021

Some of the original perf-test functions are extended to bench 2d/3d/4d Cartesian SIMD.
Since the length of the 1st dim definitely influences the performance, I'm not confident about the representativeness of the chosen bench sizes.
Pinging @chriselrod for advice.

see also JuliaLang/julia#42736
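
For reference, here is a minimal sketch of the kind of kernel this extends (my own illustration, not the actual BaseBenchmarks code; the name cartesian_axpy! is hypothetical): an axpy!-style update driven by CartesianIndices, so only the first dimension is contiguous and @simd has to vectorize across it.

function cartesian_axpy!(a, y::AbstractArray{T,N}, x::AbstractArray{T,N}) where {T,N}
    # CartesianIndices iteration: the innermost (first) dimension is the SIMD loop
    @inbounds @simd for I in CartesianIndices(y)
        y[I] = muladd(a, x[I], y[I])
    end
    return y
end

# the 2d/3d/4d cases are the same kernel applied to differently shaped arrays
x2, y2 = rand(Float32, 64, 1024), rand(Float32, 64, 1024)
x4, y4 = rand(Float32, 64, 8, 8, 16), rand(Float32, 64, 8, 8, 16)
cartesian_axpy!(1.5f0, y2, x2)
cartesian_axpy!(1.5f0, y4, x4)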

add Cartesian related benchmarks
chriselrod commented Oct 23, 2021


LLVM is generally going to unroll by 4x the SIMD vector width.
So for Float32, AVX512 would give us 4*16 = 64 iterations at a time.
Testing 31, 32, 63, and 64 should therefore be enough to cover SIMD on all current architectures.
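
Quick arithmetic behind those numbers (a sketch; assumes Float32 with full 512-bit AVX512 vectors):

lanes  = 512 ÷ (8 * sizeof(Float32))   # 16 Float32 lanes per 512-bit vector
unroll = 4                             # LLVM's usual interleave factor
lanes * unroll                         # 64 iterations handled per unrolled block
# so first-dimension lengths of 31, 32, 63, and 64 cover the interesting remainders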

FWIW, the dim1 = 31, 32, and 63 sizes will fail to vectorize.

However, that doesn't actually matter here, because the benchmark is totally dominated by memory bandwidth.
Compare with LoopVectorization, which will not fail to vectorize just because dim1 = 63; see the @pstats runs below.
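
(The functions in that output aren't defined in this thread; the following is a hedged reconstruction of what perf_axpy!, perf_turbo!, and foreachf plausibly look like, not the actual benchmark code.)

using LoopVectorization   # provides @turbo

# plain @simd version; eachindex on these 4d SubArrays yields CartesianIndices
function perf_axpy!(a, y, x)
    @inbounds @simd for I in eachindex(y, x)
        y[I] = muladd(a, x[I], y[I])
    end
    return y
end

# LoopVectorization version; as noted above, it vectorizes this case even when dim1 = 63
function perf_turbo!(a, y, x)
    @turbo for I in CartesianIndices(y)
        y[I] = muladd(a, x[I], y[I])
    end
    return y
end

# run f(args...) N times so the work is long enough to profile
foreachf(f, N, args...) = foreach(_ -> f(args...), 1:N)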

julia> size(v), typeof(v)
((63, 8, 8, 65), SubArray{Float32, 4, Array{Float32, 4}, NTuple{4, Base.OneTo{Int64}}, false})

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(perf_axpy!, 10_000, n, v, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.31e+09   50.0%  #  4.1 cycles per ns
┌ instructions             5.94e+09   75.0%  #  1.8 insns per cycle
│ branch-instructions      9.23e+08   75.0%  # 15.6% of instructions
└ branch-misses            7.38e+06   75.0%  #  0.8% of branch instructions
┌ task-clock               8.08e+08  100.0%  # 808.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    3.32e+08   25.0%  # 20.0% of dcache loads
│ L1-dcache-loads          1.66e+09   25.0%
└ L1-icache-load-misses    2.03e+05   25.0%
┌ dTLB-load-misses         1.96e+02   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.66e+09   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(perf_turbo!, 10_000, n, v, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.05e+09   50.0%  #  3.8 cycles per ns
┌ instructions             9.30e+08   75.0%  #  0.3 insns per cycle
│ branch-instructions      9.70e+07   75.0%  # 10.4% of instructions
└ branch-misses            5.71e+04   75.0%  #  0.1% of branch instructions
┌ task-clock               7.92e+08  100.0%  # 791.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    3.30e+08   25.0%  # 91.1% of dcache loads
│ L1-dcache-loads          3.62e+08   25.0%
└ L1-icache-load-misses    6.61e+04   25.0%
┌ dTLB-load-misses         1.28e+02   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               3.62e+08   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

We see LoopVectorization required less than 1/6 the total number of instructions, but 91.1% of its dcache loads still missed L1.

For comparison, using parent to get the underlying Arrays, and thus linear indexing:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(perf_axpy!, 10_000, n, parent(v), parent(x))
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.07e+09   50.0%  #  3.9 cycles per ns
┌ instructions             6.21e+08   75.0%  #  0.2 insns per cycle
│ branch-instructions      4.23e+07   75.0%  #  6.8% of instructions
└ branch-misses            3.20e+04   75.0%  #  0.1% of branch instructions
┌ task-clock               7.97e+08  100.0%  # 797.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    3.29e+08   25.0%  # 99.6% of dcache loads
│ L1-dcache-loads          3.30e+08   25.0%
└ L1-icache-load-misses    1.32e+05   25.0%
┌ dTLB-load-misses         2.00e+01   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               3.30e+08   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(perf_turbo!, 10_000, n, parent(v), parent(x))
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.07e+09   50.0%  #  3.9 cycles per ns
┌ instructions             6.21e+08   75.0%  #  0.2 insns per cycle
│ branch-instructions      4.22e+07   75.0%  #  6.8% of instructions
└ branch-misses            3.69e+04   75.0%  #  0.1% of branch instructions
┌ task-clock               7.96e+08  100.0%  # 795.8 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    3.29e+08   25.0%  # 99.6% of dcache loads
│ L1-dcache-loads          3.30e+08   25.0%
└ L1-icache-load-misses    5.41e+04   25.0%
┌ dTLB-load-misses         1.60e+01   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               3.30e+08   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

They both now experience 99.6% dcache load misses.

It is 8x faster if I cut nbytes in half, but still quite bad, going from 0.2 instructions/cycle to 0.8.
The computer I'm testing on has a 1 MiB L2 cache (Cascadelake-X).
Thus two arrays of 1 MiB each (the default) require streaming memory from the L3 cache.
Cutting the size in half, making them both 0.5 MiB, thus gives us a 4x improvement in instructions per cycle, thanks to letting the memory fit in this CPU's L2.
If you want to test something other than L2- or L3-to-register memory bandwidth, it may make sense to make the arrays much smaller.
The L1 data cache of most modern CPUs is 32 KiB.
But you could at least shoot for fitting in L2 caches.
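
Back-of-the-envelope sizing under those assumptions (two equally sized Float32 arrays; 32 KiB L1d and a 1 MiB L2 like the Cascadelake-X machine above):

l1d, l2 = 32 * 2^10, 1 * 2^20       # cache sizes in bytes
narrays = 2
l2 ÷ narrays                        # 524288 bytes = 512 KiB per array to stay in L2
(l2 ÷ narrays) ÷ sizeof(Float32)    # 131072 Float32 elements per array
l1d ÷ narrays                       # 16384 bytes = 16 KiB per array for an L1-resident test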

EDIT:
I also totally forgot that I always pass an argument to LLVM to make it use 512-bit vectors on AVX512 CPUs (I start Julia with -C"native,-prefer-256-bit"). By default, LLVM 10 and newer will only use 256-bit vectors.
So with default settings the results would look a little different.
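
One way to check which width your own session actually uses (a sketch; the exact IR depends on your CPU and the -C flag above): inspect the LLVM IR of a trivially vectorizable reduction and look for <8 x float> (256-bit) versus <16 x float> (512-bit) operands.

using InteractiveUtils: code_llvm

function sumvec(x::Vector{Float32})
    s = 0.0f0
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

code_llvm(sumvec, (Vector{Float32},); debuginfo = :none)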

make nbytes smaller to fit in L2 caches
N5N3 (Contributor, Author) commented Oct 23, 2021

Looks like the LinearAlgebra bench died on windows-latest.
The Codecov regression seems to come from lines containing only end.
Is this mergeable? @vchuravy @vtjnash

end
end
const nbytes = 1 << 18
chriselrod commented Oct 23, 2021


Yeah, 512 KiB combined for the two arrays should be good for Intel server processors Skylake and newer, client processors Ice Lake and newer, and for AMD Zen.
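
Sanity check of those numbers (assuming, as elsewhere in the thread, two Float32 arrays of nbytes bytes each; the exact benchmark setup is my guess):

const nbytes = 1 << 18          # 262144 bytes = 256 KiB per array
2 * nbytes                      # 524288 bytes = 512 KiB combined
nbytes ÷ sizeof(Float32)        # 65536 Float32 elements per array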

1. avoid `view` in `CartesianPartition`'s bench.
2. make bench size smaller
3. add a manually partitioned sum bench
N5N3 (Contributor, Author) commented Oct 27, 2021

The maximum benched size is reduced to 32 KiB per array.
On my 9600KF, the median (of the per-benchmark median) time-cost ratio between JuliaLang/julia#42736 and 1.7.0-rc2 is 89% for the 4d case (94% for the 3d case, 100% for the 2d case).

vchuravy merged commit 43db1e4 into JuliaCI:master on Nov 2, 2021.
N5N3 deleted the moresimd branch on November 2, 2021.
vtjnash referenced this pull request in JuliaLang/julia on Nov 16, 2021 (Co-authored-by: Jameson Nash <vtjnash@gmail.com>).