Use std::arch SIMD and runtime target feature detection #22
Conversation
Force-pushed from d8b05e5 to 0a8beaf
Benchmarks: comparing a target-feature=+avx build of current master with a "generic" build of this PR, and we still have improvements! This is tested on my AVX-enabled Ivy Bridge laptop. We see some overhead at small sizes for sgemm, because sgemm now requests a 32-byte-aligned packing buffer from the system allocator; apparently this goes through posix_memalign and is a bit slow.
The most exciting part is having these gains in a "generic" build where target feature detection takes care of it. The improvement compared with master's generic build is quite big:
This implements the sgemm kernel in sse (x86) intrinsics; more or less corresponding to the existing code but in a 4x4 version.
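For orientation, here is a minimal sketch of what a 4x4 f32 micro-kernel in SSE intrinsics can look like; the function name, signature, and the simplified C update (no alpha/beta, unit column stride) are illustrative assumptions, not the exact kernel in this PR.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

// Illustrative 4x4 sgemm micro-kernel: C[0..4][0..4] = sum over p of an
// a-column (broadcast) times a b-row (vector). Assumes packed, contiguous
// panels for a and b, and a row stride `rsc` with unit column stride for c.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse")]
unsafe fn kernel_4x4_sse(k: usize, a: *const f32, b: *const f32, c: *mut f32, rsc: isize) {
    // One SSE register per row of the 4x4 accumulator tile.
    let mut ab = [_mm_setzero_ps(); 4];
    for p in 0..k {
        let bv = _mm_loadu_ps(b.add(4 * p)); // one packed row of B
        for i in 0..4 {
            let ai = _mm_set1_ps(*a.add(4 * p + i)); // broadcast one A element
            // ab[i] += ai * bv (separate mul and add; plain SSE has no fma)
            ab[i] = _mm_add_ps(ab[i], _mm_mul_ps(ai, bv));
        }
    }
    for i in 0..4 {
        // The real kernel also applies alpha/beta and handles general C strides.
        _mm_storeu_ps(c.offset(rsc * i as isize), ab[i]);
    }
}
```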
Improved benchmarks for sgemm. Fixed an inflated-allocation-size bug and brought down the overhead on small matrices. Incremental improvements from the previous benchmark to this state of the PR:
Benchmarks compared with master:
Switch to an unmasked kernel (no default redirect of the output to a buffer), because we don't get inlining between the masked_kernel function and the actual kernel when we use target_feature-marked functions. Use _mm_shuffle_ps instead of permute: shuffle produces clean code for generic x86-64, while permute produces a function call.
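As a small illustration of the shuffle form (a made-up example, not code from this PR): _mm_shuffle_ps takes a compile-time lane-selection mask, picks two lanes from each operand, and lowers to a single shufps instruction on generic x86-64.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

// Build [a[1], a[0], b[3], b[2]]: the low two result lanes come from `a`,
// the high two from `b`, selected by the 2-bit fields of the constant mask.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse")]
unsafe fn swap_pairs(a: __m128, b: __m128) -> __m128 {
    _mm_shuffle_ps::<0b10_11_00_01>(a, b)
}
```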
With thanks to @matematikaadit's example of a travis file
Using arrays instead of numbered variables lets us clean up the code (using the "static" loop macros to simplify). Benchmarks show the same performance.
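For context, a "static" loop macro is along these lines (a sketch; the exact macro in the crate may differ): it expands the body once per fixed index, so accumulator arrays are indexed by constants and the unrolling does not depend on the optimizer.

```rust
// Expand `$e` once for each of the indices 0..4 with `$i` bound to a constant.
macro_rules! loop4 {
    ($i:ident, $e:expr) => {{
        let $i = 0; $e;
        let $i = 1; $e;
        let $i = 2; $e;
        let $i = 3; $e;
    }};
}

// Example use: update four SIMD accumulators with statically indexed code.
// loop4!(i, ab[i] = _mm_add_ps(ab[i], _mm_mul_ps(av[i], bv)));
```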
Load the vector of A elements "striped", just like the AVX kernel from the BLIS project does. We de-stripe the vectors after the main loop with shuffles.
This simplifies the packing code by getting a correctly aligned allocation directly. But since only the System allocator is stable, it also means we switch allocator if System isn't the Rust default. I suspect this allocator change explains the small performance improvement in the benchmark.
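A minimal sketch of the aligned-allocation idea using std::alloc (the function name and simplified error handling are my own, not the PR's code):

```rust
use std::alloc::{alloc, handle_alloc_error, Layout};
use std::mem;

// Request a 32-byte aligned buffer for `len` f32 elements directly from the
// global allocator. The caller must later free it with
// std::alloc::dealloc(ptr as *mut u8, layout).
unsafe fn alloc_packing_buffer(len: usize) -> (*mut f32, Layout) {
    let layout = Layout::from_size_align(len * mem::size_of::<f32>(), 32).unwrap();
    let ptr = alloc(layout);
    if ptr.is_null() {
        handle_alloc_error(layout);
    }
    (ptr as *mut f32, layout)
}
```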
Use permute and movss to store, and unroll the loop. Using a for loop can cause the compiler to store the array (c) on the stack.
This allows us to test/benchmark various implementations even in the presence of runtime target detection. Set MMNO_avx=1 or MMNO_sse=1 to disable avx or sse respectively.
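A sketch of how such an override can sit in front of runtime detection (the helper functions are illustrative; the environment variable names are the ones mentioned above):

```rust
// Only report a feature as usable if it is detected at runtime *and* not
// disabled via the corresponding MMNO_* environment variable.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn avx_usable() -> bool {
    std::env::var_os("MMNO_avx").is_none() && is_x86_feature_detected!("avx")
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn sse_usable() -> bool {
    std::env::var_os("MMNO_sse").is_none() && is_x86_feature_detected!("sse")
}
```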
We don't need this anymore.
Runtime target feature detection takes the place of this
This preserves its performance even with my local target-cpu=native settings (otherwise it would regress). This is with the kernel being inlined into the main gemm loop.
Explain the element layout in the vectors properly and remove stray comments from the implementation.
This code makes no difference in this commit, but it saves the information that we can transpose the operation (and prefer column major C instead if we want to).
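The underlying identity is C = A·B ⇔ Cᵀ = Bᵀ·Aᵀ, so a column-major C can be handled by swapping the operands and exchanging each matrix's row and column strides. A hedged sketch (the function and parameter names are illustrative, not the crate's internals):

```rust
// General strided sgemm entry point (sketch). If the kernel prefers row-major
// C (unit column stride) but C is column major, transpose the whole problem:
// compute C^T = B^T * A^T, which is row major again.
unsafe fn sgemm_strided(
    m: usize, k: usize, n: usize,
    a: *const f32, rsa: isize, csa: isize,
    b: *const f32, rsb: isize, csb: isize,
    c: *mut f32, rsc: isize, csc: isize,
) {
    if csc != 1 && rsc == 1 {
        // B^T is n x k, A^T is k x m, C^T is n x m; transposing a matrix
        // swaps its row and column strides.
        return sgemm_strided(n, k, m, b, csb, rsb, a, csa, rsa, c, csc, rsc);
    }
    // ... pack panels and run the kernel, which prefers csc == 1 ...
}
```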
This is apparently the first Rust version where the crate builds (due to the std::alloc requirement).
Comment out the benchmarks we seldom compare. Also don't run the "reference implementation" benchmarks; those are seldom compared.
Allocating 32-byte-aligned buffers costs some overhead that is noticeable for small matrix multiplication problems, while it is hard to detect any benchmark change from using unaligned loads.
This uses the default Rust global allocator.
This should make it possible to run the α·(a·b) multiplication while we are waiting for C to load.
As the bug showed, it must be clear this parameter is the number of elements (not bytes).
There was a bug earlier in this same pull request: we were allocating a far too large packing buffer because we multiplied by the element size twice.
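A small sketch of the intent (names hypothetical): keep the parameter in elements and convert to bytes in exactly one place, so the element size cannot be applied twice.

```rust
use std::alloc::Layout;
use std::mem;

// `len` counts elements of T, not bytes; the byte conversion happens only here.
fn packing_layout<T>(len: usize) -> Layout {
    Layout::from_size_align(len * mem::size_of::<T>(), 32).unwrap()
}
```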
This improves performance by a massive 10% on the 127×127 benchmark. We can assume the kernel is called with k > 0, so we split off the last iteration of the loop and make sure we don't load in the last iteration (it would go out of bounds of the packing area). There was an alternative: we could have extended the packing buffer by 8 dummy elements to allow loading up to one more vector, but it made no difference in benchmarks.
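A sketch of the loop-peeling shape (a generic placeholder; the real kernel does this with its packed-panel pointers and SIMD loads): loads happen one step ahead inside the main loop, and the final iteration is split off so nothing is loaded past the end of the packing area.

```rust
// Assumes k > 0. `load(i)` fetches packed data for k-index i; `update(v)`
// stands in for the multiply-accumulate step of the kernel.
fn peeled_loop<T>(k: usize, mut load: impl FnMut(usize) -> T, mut update: impl FnMut(T)) {
    debug_assert!(k > 0);
    let mut v = load(0);
    for i in 1..k {
        update(v);
        v = load(i); // still inside the packed panel, since i < k
    }
    update(v); // peeled last iteration: no load beyond the panel
}
```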
Aligned loads are needed (at the moment) to not regress performance for -Ctarget-cpu=native compilation on x86-64 with avx.
This is a later Rust feature.
This will become an error in the future, so fix it now.
These tests are rudimentary so far, but they cover all the possible kernels (avx, sse2, fallback) we have so far.
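A test along these lines (illustrative, not the exact test from the PR) compares the public sgemm entry point against a naive triple loop; running the suite with MMNO_avx=1 and/or MMNO_sse=1 set then exercises the sse2 and fallback kernels as well.

```rust
#[test]
fn sgemm_matches_naive_reference() {
    let (m, k, n) = (7, 5, 9); // deliberately not multiples of the kernel tile
    let a: Vec<f32> = (0..m * k).map(|x| x as f32 * 0.5).collect();
    let b: Vec<f32> = (0..k * n).map(|x| (x as f32).sin()).collect();
    let mut c = vec![0.0f32; m * n];

    // Naive row-major reference: c[i][j] = sum_p a[i][p] * b[p][j]
    let mut c_ref = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                c_ref[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }

    // All three matrices row major: row stride = number of columns, column stride = 1.
    unsafe {
        matrixmultiply::sgemm(
            m, k, n,
            1.0, a.as_ptr(), k as isize, 1,
            b.as_ptr(), n as isize, 1,
            0.0, c.as_mut_ptr(), n as isize, 1,
        );
    }

    for (x, y) in c.iter().zip(&c_ref) {
        assert!((x - y).abs() <= 1e-4 * y.abs().max(1.0));
    }
}
```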
Benchmarks vs OpenBLAS's single-threaded mode, and the new sgemm (f32) kernel wins, which is pretty shocking! 🔥 For the (autovectorized) f64 kernel we "compete". Using libopenblas-base version 0.2.19 from Debian.
"m064" corresponds to an all row major 64×64×64 matrix product and "m127" to 127×127×127 There should be several caveats, and I'm sure an openBlas author would know the cause much better than me. For example, the choice of blocking strategy vs available cache sizes or the size of matrix multiplication problem you optimize for. I'm not testing on a beefy machine but a laptop. |
Internal crate to not expose testing features or testing deps in the main crate that we publish.
Just to make sure the benchmarks are fair, use the BLAS defaults so it can have its preferred column-major layout and everything. The performance is the same (unsurprisingly).
💯
Improve performance out of the box on x86-64 by using target_feature. 🔢 🔥
Implement a fast 8x8 sgemm kernel for x86-64 using std::arch intrinsics. We maintain portability with other platforms.
Multiple target_feature invocations of the same implementation allow us to compile multiple versions of the code and jump into the best one at runtime. The same SIMD intrinsics or Rust code will, for example, compile to better code inside a target_feature(enable = "avx") function than inside an sse function (see the dispatch sketch after this description).
Port a BLIS sgemm 8x8 x86 avx kernel to Rust intrinsics. Use the same memory layout in the main loop.
Use std::alloc for aligned allocation.
Runtime target feature detection and target-feature-specific compilation allow us to offer native performance out of the box on x86-64 (no compiler flags needed).
Test compile to multiple targets (x86-64, x86, aarch64) and run the x86 ones in Travis.
In the f32 avx kernel, schedule loads one iteration ahead; this improves throughput.
Fixes #14
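A minimal sketch of the multiple-target_feature-copies-plus-runtime-dispatch pattern described above (module and function names are illustrative, and the inner computation is a stand-in for the shared kernel code):

```rust
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
mod dispatch {
    // Shared implementation, small enough to always inline; once inlined into
    // a target_feature function it is compiled with that function's features.
    #[inline(always)]
    fn kernel_impl(out: &mut [f32], x: &[f32], y: &[f32]) {
        for ((o, &a), &b) in out.iter_mut().zip(x).zip(y) {
            *o = a * b;
        }
    }

    // The same body, compiled once per feature level.
    #[target_feature(enable = "avx")]
    unsafe fn kernel_avx(out: &mut [f32], x: &[f32], y: &[f32]) {
        kernel_impl(out, x, y)
    }

    #[target_feature(enable = "sse2")]
    unsafe fn kernel_sse2(out: &mut [f32], x: &[f32], y: &[f32]) {
        kernel_impl(out, x, y)
    }

    // Detect at runtime and jump into the best compiled copy.
    pub fn kernel(out: &mut [f32], x: &[f32], y: &[f32]) {
        if is_x86_feature_detected!("avx") {
            unsafe { kernel_avx(out, x, y) }
        } else if is_x86_feature_detected!("sse2") {
            unsafe { kernel_sse2(out, x, y) }
        } else {
            kernel_impl(out, x, y) // portable fallback
        }
    }
}
```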