Use std::arch SIMD and runtime target feature detection #22

Merged: 51 commits into master from std-simd-x86 on Nov 15, 2018
Conversation

bluss (Owner) commented Nov 10, 2018

Improve performance out of the box on x86-64 by using target_feature. 🔢 🔥

Implement a fast 8x8 sgemm kernel for x86-64 using std::arch intrinsics. We maintain portability with other platforms.

  1. Compiling multiple target_feature versions of the same implementation lets us build several copies of the code and jump into the best one at runtime. The same SIMD intrinsics or plain Rust code will, for example, compile to better code inside a #[target_feature(enable = "avx")] function than inside an SSE-only function (see the dispatch sketch below).

  2. Port a BLIS sgemm 8x8 x86 avx kernel to Rust intrinsics, using the same memory layout in the main loop.

  3. Use std::alloc for aligned allocation

  4. Runtime target feature detection and target-feature-specific compilation allow us to offer native performance out of the box on x86-64 (no compiler flags needed).

  5. Test-compile for multiple targets (x86-64, x86, aarch64) and run the x86 ones on Travis.

  6. In the f32 avx kernel, schedule loads one iteration ahead; this improves throughput.

Fixes #14
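For illustration, here is a minimal sketch of the compile-multiple-versions-and-dispatch pattern from points 1 and 4. The function names and the simplified signature are hypothetical, not the crate's actual API:

```rust
// Sketch of the "compile several versions, pick one at runtime" pattern
// (x86-64 only). Names and signature are made up for illustration.

#[target_feature(enable = "avx")]
unsafe fn kernel_avx(k: usize, a: *const f32, b: *const f32, c: *mut f32) {
    kernel_body(k, a, b, c) // compiled here with AVX enabled
}

#[target_feature(enable = "sse2")]
unsafe fn kernel_sse2(k: usize, a: *const f32, b: *const f32, c: *mut f32) {
    kernel_body(k, a, b, c) // compiled here with only SSE2 assumed
}

#[inline(always)]
unsafe fn kernel_body(_k: usize, _a: *const f32, _b: *const f32, _c: *mut f32) {
    // ... the shared micro-kernel goes here ...
}

pub fn kernel(k: usize, a: *const f32, b: *const f32, c: *mut f32) {
    // Runtime dispatch: jump into the best compiled version for this CPU.
    unsafe {
        if is_x86_feature_detected!("avx") {
            kernel_avx(k, a, b, c)
        } else if is_x86_feature_detected!("sse2") {
            kernel_sse2(k, a, b, c)
        } else {
            kernel_body(k, a, b, c)
        }
    }
}
```

The #[target_feature] attribute only affects code generation inside the marked function, so the shared body is marked #[inline(always)] and gets compiled once per feature level.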

bluss force-pushed the std-simd-x86 branch 3 times, most recently from d8b05e5 to 0a8beaf on November 11, 2018
bluss changed the title from "Use std::arch to simdify gemm kernels (x86 and sgemm first)" to "Use std::arch SIMD and runtime target feature detection" on Nov 12, 2018
bluss (Owner, Author) commented Nov 12, 2018

Benchmarks. Comparing a target-feature=+avx build of current master with the "generic" build of this PR, and we still have improvements! This is tested on my AVX-enabled Ivy Bridge laptop.

We see some overhead at the small sizes for sgemm, because sgemm now requests a 32-byte-aligned packing buffer from the system allocator; apparently this goes through posix_memalign and is a bit slow.

before: master native-cpu build
after: this pr generic build

 name               before      after       diff ns/iter   diff %
 mat_mul_f32::m004  138         231                   93   67.39% 
 mat_mul_f32::m007  169         271                  102   60.36% 
 mat_mul_f32::m008  180         218                   38   21.11% 
 mat_mul_f32::m009  371         489                  118   31.81% 
 mat_mul_f32::m012  458         586                  128   27.95% 
 mat_mul_f32::m016  601         544                  -57   -9.48% 
 mat_mul_f32::m032  2,984       2,451               -533  -17.86% 
 mat_mul_f32::m064  18,254      15,384            -2,870  -15.72% 
 mat_mul_f32::m127  123,472     109,885          -13,587  -11.00% 
 mat_mul_f64::m004  134         130                   -4   -2.99% 
 mat_mul_f64::m007  208         210                    2    0.96% 
 mat_mul_f64::m008  220         230                   10    4.55% 
 mat_mul_f64::m009  417         424                    7    1.68% 
 mat_mul_f64::m012  513         550                   37    7.21% 
 mat_mul_f64::m016  789         827                   38    4.82% 
 mat_mul_f64::m032  4,395       4,347                -48   -1.09% 
 mat_mul_f64::m064  28,604      28,099              -505   -1.77% 
 mat_mul_f64::m127  207,470     199,336           -8,134   -3.92% 

The most exciting part is having these gains in a "generic" build where target feature detection takes care of it. The improvement compared with master's generic build is quite big:

before: master generic build
after: this pr generic build

 name               before      after       diff ns/iter   diff %
 mat_mul_f32::m127  200,461     109,885          -90,576  -45.18% 
 mat_mul_f64::m127  370,069     199,336         -170,733  -46.14%

This implements the sgemm kernel in SSE (x86) intrinsics, more or less corresponding to the existing code but in a 4x4 version.
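Roughly, a 4x4 SSE micro-kernel of that shape looks like this (an illustrative sketch with a simplified signature, not the code in this commit):

```rust
use core::arch::x86_64::*;

/// C[i, j] += sum over l of a[l*4 + i] * b[l*4 + j], for a 4x4 block of C.
/// `a` and `b` point to packed panels laid out 4 elements per k-iteration;
/// `rsc`/`csc` are the row/column strides of C in elements.
#[target_feature(enable = "sse2")]
unsafe fn kernel_4x4(k: usize, a: *const f32, b: *const f32,
                     c: *mut f32, rsc: isize, csc: isize) {
    let mut acc = [_mm_setzero_ps(); 4];
    for l in 0..k {
        let bv = _mm_loadu_ps(b.add(4 * l));             // one row of packed B
        for i in 0..4 {
            let ai = _mm_set1_ps(*a.add(4 * l + i));     // broadcast one element of packed A
            acc[i] = _mm_add_ps(acc[i], _mm_mul_ps(ai, bv));
        }
    }
    // Write the 4x4 accumulator block out to C with its general strides.
    for i in 0..4 {
        let mut row = [0.0f32; 4];
        _mm_storeu_ps(row.as_mut_ptr(), acc[i]);
        for j in 0..4 {
            *c.offset(i as isize * rsc + j as isize * csc) += row[j];
        }
    }
}
```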
bluss (Owner, Author) commented Nov 13, 2018

Improved benchmarks for sgemm.

Fixed an inflated allocation size bug and brought down the overhead on small matrices.
Scheduling data loads early got another 10% speedup on the bigger matrices. Benchmarks are still from an AVX Ivy Bridge laptop.

Incremental improvements from the previous benchmark to this state of the PR:

before: this pr generic build  (previously posted benchmark)
after: this pr generic build

 name               before      after       diff ns/iter   diff %
 mat_mul_f32::m004  231         191                  -40  -17.32% 
 mat_mul_f32::m008  218         195                  -23  -10.55% 
 mat_mul_f32::m012  586         602                   16    2.73% 
 mat_mul_f32::m016  544         526                  -18   -3.31% 
 mat_mul_f32::m032  2,451       2,259               -192   -7.83% 
 mat_mul_f32::m064  15,384      13,635            -1,749  -11.37% 
 mat_mul_f32::m127  109,885     97,045           -12,840  -11.68%

Benchmarks compared with master:

before: master native-cpu build
after: this pr generic build  (pr at this time)

 name               before      after       diff ns/iter   diff %
 mat_mul_f32::m004  138         191                   53   38.41% 
 mat_mul_f32::m008  180         195                   15    8.33% 
 mat_mul_f32::m012  458         602                  144   31.44% 
 mat_mul_f32::m016  601         526                  -75  -12.48% 
 mat_mul_f32::m032  2,984       2,259               -725  -24.30% 
 mat_mul_f32::m064  18,254      13,635            -4,619  -25.30% 
 mat_mul_f32::m127  123,472     97,045           -26,427  -21.40%

Switch to an unmasked kernel (no default redirect of the output to a buffer), because we don't get inlining between the masked_kernel function and the actual kernel when we use target_feature-marked functions.

Use _mm_shuffle_ps instead of permute:
Shuffle produces clean code for generic x86-64 (permute produces
a function call).
With thanks to @matematikaadit's example of a Travis file.
Using arrays instead of numbered variables lets us clean up the code (using the "static" loop macros to simplify). Benchmarks show the same performance.
Load the a vector "striped", just like the avx kernel from the BLIS project does.

We de-stripe the vectors with shuffles after the main loop.
This simplifies the packing code by getting a correctly aligned allocation directly.

But since only the System allocator is stable, it also means we switch allocators if System isn't the Rust default. I think that explains the small performance improvement seen in the benchmark after this allocator change.
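A minimal sketch of this kind of aligned allocation with std::alloc, assuming a 32-byte alignment and a length given in elements (the helper names are made up for illustration):

```rust
use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};

// Allocate a packing buffer of `len` f32 *elements* (not bytes), aligned to
// 32 bytes so the kernel can use aligned 256-bit loads.
unsafe fn alloc_packing_buffer(len: usize) -> (*mut f32, Layout) {
    let size = len * std::mem::size_of::<f32>(); // multiply by element size exactly once
    let layout = Layout::from_size_align(size, 32).expect("invalid layout");
    let ptr = alloc(layout) as *mut f32;
    if ptr.is_null() {
        handle_alloc_error(layout);
    }
    (ptr, layout)
}

unsafe fn free_packing_buffer(ptr: *mut f32, layout: Layout) {
    dealloc(ptr as *mut u8, layout);
}
```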
Use permute and movss to store, and unroll the loop. Using a for loop
can cause the compiler to store the array (c) on the stack.
This allows us to test/benchmark various implementations even in the
presence of runtime target detection.

Set MMNO_avx=1 or MMNO_sse=1 to disable avx or sse respectively.
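Roughly how such an environment override can combine with runtime detection (a sketch; only the MMNO_avx / MMNO_sse names come from this commit, the rest is illustrative):

```rust
enum Selected {
    Avx,
    Sse2,
    Fallback,
}

// Returns true if the user disabled a feature via e.g. MMNO_avx=1.
fn masked(feature: &str) -> bool {
    std::env::var(format!("MMNO_{}", feature))
        .map(|v| v == "1")
        .unwrap_or(false)
}

fn select() -> Selected {
    if !masked("avx") && is_x86_feature_detected!("avx") {
        Selected::Avx
    } else if !masked("sse") && is_x86_feature_detected!("sse2") {
        Selected::Sse2
    } else {
        Selected::Fallback
    }
}
```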
Runtime target feature detection takes the place of this.
This preserves its performance even with my local target-cpu=native settings (otherwise it would regress), while still inlining the kernel into the main gemm loop.
Explain the element layout in the vectors properly and remove stray
comments from the implementation.
This code makes no difference in this commit, but it saves the
information that we can transpose the operation (and prefer column major
C instead if we want to).
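For reference, the identity behind that remark (standard linear algebra, not code from the PR):

$$C = AB \iff C^{\mathsf{T}} = B^{\mathsf{T}} A^{\mathsf{T}}$$

so a kernel that writes a row-major C block can produce a column-major C by swapping the A and B operands and exchanging the row and column strides.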
This is apparently the first version where the crate builds (due to std::alloc)
Comment out the benchmarks we seldom compare. Also don't run the "reference implementation" benchmarks; those are seldom compared.
Allocating 32-byte-aligned buffers costs some overhead that is noticeable for small matrix multiplication problems, while it is hard to detect any benchmark change from using unaligned loads.
This uses the default Rust global allocator.
This should make it possible to run the α ab multiplication while we are
waiting for C to load.
As the bug showed, it must be clear that this parameter is the number of elements (not bytes).
There was a bug earlier in this same pull request: we were allocating a far too large packing buffer because we multiplied by the element size twice.
This improves performance by a massive 10% on the 127×127 benchmark.

We can assume the kernel is called with k > 0, so we split off the last
iteration of the loop and make sure we don't load in the last iteration
(as it would go out of bounds of the packing area).

There was an alternative: we could have extended the packing buffer by 8 dummy elements to allow loading up to one more vector, but it made no difference in benchmarks.
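In sketch form, simplified to a single accumulator and a hypothetical signature (the real kernel keeps a whole 8x8 block of accumulators):

```rust
use core::arch::x86_64::*;

// Schedule the loads for iteration i+1 before the arithmetic of iteration i,
// and peel the last iteration so we never read past the end of the packed
// buffers. Assumes k > 0 and 32-byte-aligned packing buffers.
#[target_feature(enable = "avx")]
unsafe fn dot_loop(k: usize, a: *const f32, b: *const f32) -> __m256 {
    debug_assert!(k > 0);
    let mut acc = _mm256_setzero_ps();
    let mut av = _mm256_load_ps(a);
    let mut bv = _mm256_load_ps(b);
    for i in 1..k {
        // Issue the next iteration's loads early ...
        let av_next = _mm256_load_ps(a.add(8 * i));
        let bv_next = _mm256_load_ps(b.add(8 * i));
        // ... then do this iteration's multiply-add.
        acc = _mm256_add_ps(acc, _mm256_mul_ps(av, bv));
        av = av_next;
        bv = bv_next;
    }
    // Peeled final iteration: no loads beyond the buffer.
    _mm256_add_ps(acc, _mm256_mul_ps(av, bv))
}
```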
Aligned loads are needed (at the moment) to not regress performance for
-Ctarget-cpu=native compilation on x86-64 with avx.
This is a later Rust feature.
This will become an error in the future, so fix it now.
These tests are rudimentary, but they cover all the kernels (avx, sse2, fallback) we have so far.
bluss (Owner, Author) commented Nov 15, 2018

Benchmarks vs OpenBLAS's single-threaded mode, and the new sgemm (f32) kernel wins, which is pretty shocking! 🔥 For the (autovectorized) f64 kernel we "compete". Using libopenblas-base version 0.2.19 in Debian.

test blas_mat_mul_f32::m064 ... bench:      17,496 ns/iter (+/- 322)
test blas_mat_mul_f32::m127 ... bench:     126,223 ns/iter (+/- 4,609)
test blas_mat_mul_f64::m064 ... bench:      25,466 ns/iter (+/- 319)
test blas_mat_mul_f64::m127 ... bench:     188,728 ns/iter (+/- 24,764)
test mat_mul_f32::m064      ... bench:      14,021 ns/iter (+/- 49)
test mat_mul_f32::m127      ... bench:      98,657 ns/iter (+/- 2,477)
test mat_mul_f64::m064      ... bench:      27,382 ns/iter (+/- 2,574)
test mat_mul_f64::m127      ... bench:     204,120 ns/iter (+/- 4,243)

"m064" corresponds to an all row major 64×64×64 matrix product and "m127" to 127×127×127

There are several caveats, and I'm sure an OpenBLAS author would understand the causes much better than me: for example, the choice of blocking strategy versus the available cache sizes, or the size of matrix multiplication problem you optimize for. I'm not testing on a beefy machine but on a laptop.

An internal crate, so we don't expose testing features or testing deps in the main crate that we publish.
Just to make sure the benchmarks are correct, use the default BLAS so it can have its preferred column-major layout and everything. But the performance is the same (unsurprisingly).
@bluss bluss merged commit 437132f into master Nov 15, 2018
@bluss bluss deleted the std-simd-x86 branch November 15, 2018 21:29
bluss (Owner, Author) commented Nov 15, 2018

💯
