Use std::arch SIMD and runtime target feature detection #22
Conversation
Force-pushed from d8b05e5 to 0a8beaf
Benchmarks: comparing a target-feature=+avx build of current master with a "generic" build of this PR, and we still have improvements! This is tested on my AVX-enabled Ivy Bridge laptop. We see some overhead at small sizes for sgemm, because sgemm now requests a 32-byte-aligned packing buffer from the system allocator; apparently this goes through posix_memalign and is a bit slow.
The most exciting part is having these gains in a "generic" build where target feature detection takes care of it. The improvement compared with master's generic build is quite big:
This implements the sgemm kernel in sse (x86) intrinsics; more or less corresponding to the existing code but in a 4x4 version.
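For orientation, here is a minimal sketch of what a 4x4 f32 micro-kernel in SSE intrinsics can look like; the function name, signature, and the simplified C update (no alpha/beta, unit column stride) are illustrative assumptions, not the exact kernel in this PR.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

// Illustrative 4x4 sgemm micro-kernel: C[0..4][0..4] = sum over p of an
// a-column (broadcast) times a b-row (vector). Assumes packed, contiguous
// panels for a and b, and a row stride `rsc` with unit column stride for c.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse")]
unsafe fn kernel_4x4_sse(k: usize, a: *const f32, b: *const f32, c: *mut f32, rsc: isize) {
    // One SSE register per row of the 4x4 accumulator tile.
    let mut ab = [_mm_setzero_ps(); 4];
    for p in 0..k {
        let bv = _mm_loadu_ps(b.add(4 * p)); // one packed row of B
        for i in 0..4 {
            let ai = _mm_set1_ps(*a.add(4 * p + i)); // broadcast one A element
            // ab[i] += ai * bv (separate mul and add; plain SSE has no fma)
            ab[i] = _mm_add_ps(ab[i], _mm_mul_ps(ai, bv));
        }
    }
    for i in 0..4 {
        // The real kernel also applies alpha/beta and handles general C strides.
        _mm_storeu_ps(c.offset(rsc * i as isize), ab[i]);
    }
}
```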
Improved benchmarks for sgemm. Fixed an inflated-allocation-size bug and brought down the overhead on small matrices. Incremental improvements from the previous benchmark to this state of the PR:
Benchmarks compared with master:
Switch to an unmasked kernel (no default redirect of the output to a buffer), because we don't get inlining between the masked_kernel function and the actual kernel when we use target_feature-marked functions. Use _mm_shuffle_ps instead of permute: shuffle produces clean code for generic x86-64, while permute produces a function call.
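As a small illustration of the shuffle form (a made-up example, not code from this PR): _mm_shuffle_ps takes a compile-time lane-selection mask, picks two lanes from each operand, and lowers to a single shufps instruction on generic x86-64.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

// Build [a[1], a[0], b[3], b[2]]: the low two result lanes come from `a`,
// the high two from `b`, selected by the 2-bit fields of the constant mask.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse")]
unsafe fn swap_pairs(a: __m128, b: __m128) -> __m128 {
    _mm_shuffle_ps::<0b10_11_00_01>(a, b)
}
```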
With thanks to @matematikaadit's example of a travis file
Using arrays instead of numbered variables lets us clean up the code (using the "static" loop macros to simplify). Benchmarks show the same performance.
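For context, a "static" loop macro is along these lines (a sketch; the exact macro in the crate may differ): it expands the body once per fixed index, so accumulator arrays are indexed by constants and the unrolling does not depend on the optimizer.

```rust
// Expand `$e` once for each of the indices 0..4 with `$i` bound to a constant.
macro_rules! loop4 {
    ($i:ident, $e:expr) => {{
        let $i = 0; $e;
        let $i = 1; $e;
        let $i = 2; $e;
        let $i = 3; $e;
    }};
}

// Example use: update four SIMD accumulators with statically indexed code.
// loop4!(i, ab[i] = _mm_add_ps(ab[i], _mm_mul_ps(av[i], bv)));
```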
Load the vector of A elements "striped", just like the AVX kernel from the BLIS project does. We de-stripe the vectors after the main loop with shuffles.
This simplifies the packing code by getting a correctly aligned allocation directly. But since only the System allocator is stable, it also means we switch allocator if System isn't the Rust default. I suspect this allocator change explains the small performance improvement in the benchmark.
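A minimal sketch of the aligned-allocation idea using std::alloc (the function name and simplified error handling are my own, not the PR's code):

```rust
use std::alloc::{alloc, handle_alloc_error, Layout};
use std::mem;

// Request a 32-byte aligned buffer for `len` f32 elements directly from the
// global allocator. The caller must later free it with
// std::alloc::dealloc(ptr as *mut u8, layout).
unsafe fn alloc_packing_buffer(len: usize) -> (*mut f32, Layout) {
    let layout = Layout::from_size_align(len * mem::size_of::<f32>(), 32).unwrap();
    let ptr = alloc(layout);
    if ptr.is_null() {
        handle_alloc_error(layout);
    }
    (ptr as *mut f32, layout)
}
```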
Use permute and movss to store, and unroll the loop. Using a for loop can cause the compiler to store the array (c) on the stack.
This allows us to test/benchmark various implementations even in the presence of runtime target detection. Set MMNO_avx=1 or MMNO_sse=1 to disable avx or sse respectively.
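A sketch of how such an override can sit in front of runtime detection (the helper functions are illustrative; the environment variable names are the ones mentioned above):

```rust
// Only report a feature as usable if it is detected at runtime *and* not
// disabled via the corresponding MMNO_* environment variable.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn avx_usable() -> bool {
    std::env::var_os("MMNO_avx").is_none() && is_x86_feature_detected!("avx")
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn sse_usable() -> bool {
    std::env::var_os("MMNO_sse").is_none() && is_x86_feature_detected!("sse")
}
```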
We don't need this anymore.
Runtime target feature detection takes the place of this
This preserves its performance even with my local target-cpu=native settings (otherwise it would regress). This is with the kernel being inlined into the main gemm loop.
Explain the element layout in the vectors properly and remove stray comments from the implementation.
This code makes no difference in this commit, but it saves the information that we can transpose the operation (and prefer column major C instead if we want to).
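The underlying identity is C = A·B ⇔ Cᵀ = Bᵀ·Aᵀ, so a column-major C can be handled by swapping the operands and exchanging each matrix's row and column strides. A hedged sketch (the function and parameter names are illustrative, not the crate's internals):

```rust
// General strided sgemm entry point (sketch). If the kernel prefers row-major
// C (unit column stride) but C is column major, transpose the whole problem:
// compute C^T = B^T * A^T, which is row major again.
unsafe fn sgemm_strided(
    m: usize, k: usize, n: usize,
    a: *const f32, rsa: isize, csa: isize,
    b: *const f32, rsb: isize, csb: isize,
    c: *mut f32, rsc: isize, csc: isize,
) {
    if csc != 1 && rsc == 1 {
        // B^T is n x k, A^T is k x m, C^T is n x m; transposing a matrix
        // swaps its row and column strides.
        return sgemm_strided(n, k, m, b, csb, rsb, a, csa, rsa, c, csc, rsc);
    }
    // ... pack panels and run the kernel, which prefers csc == 1 ...
}
```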
This is apparently the first Rust version where the crate builds (due to the std::alloc requirement).
Comment out the benchmarks we seldom compare. Also don't run the "reference implementation" benchmarks; those are seldom compared.
Allocating 32-byte-aligned buffers costs some overhead that is noticeable for small matrix multiplication problems, while it is hard to detect any benchmark change from using unaligned loads.
This uses the default Rust global allocator.
This should make it possible to run the α·(a·b) multiplication while we are waiting for C to load.
As the bug showed, it must be clear this parameter is the number of elements (not bytes).
There was a bug earlier in this same pull request: we were allocating a far too large packing buffer because we multiplied by the element size twice.
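A small sketch of the intent (names hypothetical): keep the parameter in elements and convert to bytes in exactly one place, so the element size cannot be applied twice.

```rust
use std::alloc::Layout;
use std::mem;

// `len` counts elements of T, not bytes; the byte conversion happens only here.
fn packing_layout<T>(len: usize) -> Layout {
    Layout::from_size_align(len * mem::size_of::<T>(), 32).unwrap()
}
```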
This improves performance by a massive 10% on the 127×127 benchmark. We can assume the kernel is called with k > 0, so we split off the last iteration of the loop and make sure we don't load in the last iteration (it would go out of bounds of the packing area). There was an alternative: we could have extended the packing buffer by 8 dummy elements to allow loading up to one more vector, but it made no difference in benchmarks.
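A sketch of the loop-peeling shape (a generic placeholder; the real kernel does this with its packed-panel pointers and SIMD loads): loads happen one step ahead inside the main loop, and the final iteration is split off so nothing is loaded past the end of the packing area.

```rust
// Assumes k > 0. `load(i)` fetches packed data for k-index i; `update(v)`
// stands in for the multiply-accumulate step of the kernel.
fn peeled_loop<T>(k: usize, mut load: impl FnMut(usize) -> T, mut update: impl FnMut(T)) {
    debug_assert!(k > 0);
    let mut v = load(0);
    for i in 1..k {
        update(v);
        v = load(i); // still inside the packed panel, since i < k
    }
    update(v); // peeled last iteration: no load beyond the panel
}
```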
Aligned loads are needed (at the moment) to not regress performance for -Ctarget-cpu=native compilation on x86-64 with avx.
This is a later Rust feature.
This will become an error in the future, so fix it now.
These tests are rudimentary so far, but they cover all the possible kernels (avx, sse2, fallback) we have so far.
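A test along these lines (illustrative, not the exact test from the PR) compares the public sgemm entry point against a naive triple loop; running the suite with MMNO_avx=1 and/or MMNO_sse=1 set then exercises the sse2 and fallback kernels as well.

```rust
#[test]
fn sgemm_matches_naive_reference() {
    let (m, k, n) = (7, 5, 9); // deliberately not multiples of the kernel tile
    let a: Vec<f32> = (0..m * k).map(|x| x as f32 * 0.5).collect();
    let b: Vec<f32> = (0..k * n).map(|x| (x as f32).sin()).collect();
    let mut c = vec![0.0f32; m * n];

    // Naive row-major reference: c[i][j] = sum_p a[i][p] * b[p][j]
    let mut c_ref = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                c_ref[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }

    // All three matrices row major: row stride = number of columns, column stride = 1.
    unsafe {
        matrixmultiply::sgemm(
            m, k, n,
            1.0, a.as_ptr(), k as isize, 1,
            b.as_ptr(), n as isize, 1,
            0.0, c.as_mut_ptr(), n as isize, 1,
        );
    }

    for (x, y) in c.iter().zip(&c_ref) {
        assert!((x - y).abs() <= 1e-4 * y.abs().max(1.0));
    }
}
```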
Benchmarks vs OpenBLAS's single-threaded mode, and the new sgemm (f32) kernel wins, which is pretty shocking! 🔥 For the (autovectorized) f64 kernel we "compete". Using libopenblas-base version 0.2.19 from Debian.
"m064" corresponds to an all row major 64×64×64 matrix product and "m127" to 127×127×127 There should be several caveats, and I'm sure an openBlas author would know the cause much better than me. For example, the choice of blocking strategy vs available cache sizes or the size of matrix multiplication problem you optimize for. I'm not testing on a beefy machine but a laptop. |
Internal crate to not expose testing features or testing deps in the main crate that we publish.
Just to make sure the benchmarks are fair, use the BLAS defaults so it can have its preferred column-major layout and everything. The performance is the same (unsurprisingly).
💯
Improve performance out of the box on x86-64 by using target_feature. 🔢 🔥
Implement a fast 8x8 sgemm kernel for x86-64 using std::arch intrinsics. We maintain portability with other platforms.
Multiple target_feature invocations of the same implementation allow us to compile multiple versions of the code and jump into the best one at runtime. The same SIMD intrinsics or Rust code will, for example, compile to better code inside a target_feature(enable = "avx") function than inside an sse function (see the dispatch sketch after this description).
Port a BLIS sgemm 8x8 x86 avx kernel to Rust intrinsics. Use the same memory layout in the main loop.
Use std::alloc for aligned allocation.
Runtime target feature detection and target-feature-specific compilation allow us to offer native performance out of the box on x86-64 (no compiler flags needed).
Test compile to multiple targets (x86-64, x86, aarch64) and run the x86 ones in Travis.
In the f32 avx kernel, schedule loads one iteration ahead; this improves throughput.
Fixes #14
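A minimal sketch of the multiple-target_feature-copies-plus-runtime-dispatch pattern described above (module and function names are illustrative, and the inner computation is a stand-in for the shared kernel code):

```rust
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
mod dispatch {
    // Shared implementation, small enough to always inline; once inlined into
    // a target_feature function it is compiled with that function's features.
    #[inline(always)]
    fn kernel_impl(out: &mut [f32], x: &[f32], y: &[f32]) {
        for ((o, &a), &b) in out.iter_mut().zip(x).zip(y) {
            *o = a * b;
        }
    }

    // The same body, compiled once per feature level.
    #[target_feature(enable = "avx")]
    unsafe fn kernel_avx(out: &mut [f32], x: &[f32], y: &[f32]) {
        kernel_impl(out, x, y)
    }

    #[target_feature(enable = "sse2")]
    unsafe fn kernel_sse2(out: &mut [f32], x: &[f32], y: &[f32]) {
        kernel_impl(out, x, y)
    }

    // Detect at runtime and jump into the best compiled copy.
    pub fn kernel(out: &mut [f32], x: &[f32], y: &[f32]) {
        if is_x86_feature_detected!("avx") {
            unsafe { kernel_avx(out, x, y) }
        } else if is_x86_feature_detected!("sse2") {
            unsafe { kernel_sse2(out, x, y) }
        } else {
            kernel_impl(out, x, y) // portable fallback
        }
    }
}
```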