Add support for using VAES instructions for NI parallel operations. #396
@silvanshade I'd definitely recommend trying to get Structurally it'd look pretty much like what you have, but you'd have both the You'd need to add detection for VAES, and a branch to use it if available. If that's not something you're particularly interested in, we can work with this and @newpavlov or myself can complete it. Either way, thanks! |
Yeah, I had created the PR that added that, since I was originally going to try and use
I can create a branch like this. That's not the main difficulty, as I understand it. Rather, the I could change it to where there is a I think maybe the only way to add support for this without some sort of feature gating would be to locally define stable versions of the What do you think? |
That sounds great! |
I think it will be better to write a separate implementation in the vaes module instead of piggybacking on the ni module. It's also probably worth increasing the number of blocks processed in parallel for the VAES backend. Right now, you call only two aesdec/aesenc functions per round, thus potentially losing additional ILP-based throughput (the instructions have a latency of 3 cycles and a reciprocal throughput of 1 cycle). Additionally, with AVX-512 you have 32 ZMM registers, so there is less register pressure.
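The latency/throughput argument above can be put into numbers with a rough sketch. This is a deliberately simplified pipeline model (one execution port, no other bottlenecks), and the function names are hypothetical, not part of the crate:

```rust
/// Minimum number of independent instructions that must be in flight to keep
/// a pipelined unit busy: latency divided by the issue interval, rounded up.
/// (Simplified model: a single port and no memory or front-end limits.)
pub fn min_independent_ops(latency_cycles: u32, issue_interval_cycles: u32) -> u32 {
    assert!(issue_interval_cycles > 0, "issue interval must be non-zero");
    latency_cycles.div_ceil(issue_interval_cycles)
}

/// Fraction of peak throughput reached with `in_flight` independent
/// dependency chains under the same simplified model.
pub fn utilization(in_flight: u32, latency_cycles: u32, issue_interval_cycles: u32) -> f64 {
    let needed = min_independent_ops(latency_cycles, issue_interval_cycles);
    (in_flight.min(needed) as f64) / (needed as f64)
}
```

Under this model, two independent aesenc/aesdec calls per round with a 3-cycle latency leave the unit idle a third of the time, while three or more chains saturate it. Real CPUs may have more than one AES-capable port, so the actual break-even point can be higher.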
So it is possible to write local versions of these intrinsics, e.g.,
#[inline]
#[target_feature(enable = "avx512f")]
pub(super) unsafe fn pf_mm512_aesdec_epi128(data: __m512i, round_key: __m512i) -> __m512i {
    let result: __m512i;
    asm!(
        "vaesdec {result}, {data}, {round_key}",
        data = in(zmm_reg) data,
        round_key = in(zmm_reg) round_key,
        result = out(zmm_reg) result,
        options(pure, nomem, nostack, preserves_flags)
    );
    result
}
But what I didn't realize is that it's still necessary to have the If that seems reasonable, I will add that feature (
Initially I was planning to do that but the reason I opted not to is because, for the single block case, we still basically want to fall back to the I did try increasing the parallel blocks to 32 (calling 8 of the respective instructions) but didn't notice a performance difference in the benchmarks here, although I only have the one system to test on. But I agree it probably makes sense in general, and especially for a separate backend. |
Yes, also key expansion code will be the same. But I think that the parallel processing function definitely should live in the So I think we should define separate backends (i.e. structs which implement the |
@silvanshade aah yes, that's unfortunate. I've run into similar issues in the past and the only way I solved it was opening a stabilization PR for the relevant target features (I did manage to get one merged in the past), although offhand I'm not sure what the blockers are. Not seeing much discussion here either: rust-lang/rust#44839
@tarcieri @newpavlov I’ve updated the PR and tried to address prior feedback. There’s also a companion PR for an action here This version uses separate backend definitions in order to avoid rebroadcasting the key from 128b -> 512b for each call to encrypt/decrypt. We still have to do the broadcasting at least once, but now we can limit that to just the key schedule functions and avoid the additional overhead for parallel encrypt/decrypt. One thing I did not address is trying to merge the VAES backend into the autodetect framework. The reason for this basically is that: although we can dynamically select the algorithm at runtime using cpu features, we are still (with the current type structure) limited by the types we can use, fixed at compile time. This is a problem specifically having to do with the key size. For instance, if we wanted to have a backend which dynamically selected between AESNI or VAES, we have to compromise on either using Both are problematic. Going from Given that, I thought it would be best to just keep the backends separate for the time being. In order to use the VAES backend, the Also, I increased the block size for VAES to 64. Going from 32 to 64 doesn’t seem to make any difference on my system, but then neither did going from 8 to 32. But potentially it could make a difference somewhere. I suspect the reason a difference isn’t noticeable though is because the compiler is probably doing a decent job of unrolling the loops already, at least for these tests. |
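The 128b -> 512b key broadcast mentioned above can be illustrated portably. This is only a sketch: the byte-array types are hypothetical stand-ins for `__m128i` and `__m512i`, and the function names are not taken from the PR.

```rust
/// Stand-in for a 128-bit round key (hypothetical; real code uses `__m128i`).
pub type Key128 = [u8; 16];
/// Stand-in for a 512-bit register holding four 128-bit lanes.
pub type Key512 = [u8; 64];

/// Copy one 128-bit round key into each of the four lanes of a 512-bit value.
pub fn broadcast_128_to_512(key: &Key128) -> Key512 {
    let mut out = [0u8; 64];
    for lane in out.chunks_exact_mut(16) {
        lane.copy_from_slice(key);
    }
    out
}

/// Broadcast a whole key schedule once, so that parallel encrypt/decrypt
/// calls do not have to repeat the work.
pub fn broadcast_schedule<const ROUNDS: usize>(keys: &[Key128; ROUNDS]) -> [Key512; ROUNDS] {
    core::array::from_fn(|i| broadcast_128_to_512(&keys[i]))
}
```

Doing this in the key schedule moves the cost out of the per-call path, at the price of a schedule that is four times larger.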
I don't think it's worth to store broadcasted keys as part of
I think the only way for working around this is instead of using polyfills to implement encrypt/decrypt functions as one |
Can you elaborate on this? I'm not entirely sure I understand what this would look like or why this would be beneficial. Is the idea that this would make it to where the non-broadcasted round keys are still available for |
The main reason is that it would quadruple the size of
Yes. Instead of this:
struct $name_enc {
    round_keys: [__m512i; $rounds],
}
struct $name_back_enc<'a>(&'a $name_enc);
It would be better to write this:
struct $name_enc {
    round_keys: [__m128i; $rounds],
}
struct $name_back_enc<'a> {
    // Owned copy of broadcasted round keys
    k1: [__m512i; $rounds],
    // References $name_enc
    k2: &'a [__m128i; $rounds],
}
During parallel block processing the broadcasted round keys are likely to stay in registers and may not even be spilled to the stack (assuming you will use an appropriate value for
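A small sketch of the size trade-off behind this suggestion, using byte arrays as hypothetical stand-ins for the SIMD types and a fixed round-key count of 11 (AES-128). The struct names are illustrative only:

```rust
use core::mem::size_of;

type Key128 = [u8; 16];
type Key512 = [u8; 64];
const ROUNDS: usize = 11;

/// Long-lived cipher state keeping only the 128-bit round keys.
pub struct CipherNarrow {
    pub round_keys: [Key128; ROUNDS],
}

/// Long-lived cipher state storing broadcast 512-bit keys directly.
pub struct CipherWide {
    pub round_keys: [Key512; ROUNDS],
}

/// Short-lived backend: owns a broadcast copy and borrows the narrow keys.
pub struct Backend<'a> {
    pub k1: [Key512; ROUNDS],
    pub k2: &'a [Key128; ROUNDS],
}

impl<'a> Backend<'a> {
    pub fn new(cipher: &'a CipherNarrow) -> Self {
        let k2 = &cipher.round_keys;
        // Replicate each 128-bit key into four lanes of the wide copy.
        let k1 = core::array::from_fn(|i| {
            let mut wide = [0u8; 64];
            for lane in wide.chunks_exact_mut(16) {
                lane.copy_from_slice(&k2[i]);
            }
            wide
        });
        Self { k1, k2 }
    }
}

/// Size in bytes of the narrow and wide cipher states.
pub fn state_sizes() -> (usize, usize) {
    (size_of::<CipherNarrow>(), size_of::<CipherWide>())
}
```

With these stand-ins the narrow state is 176 bytes and the wide one 704 bytes, so only the temporary backend pays for the broadcast copy.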
Okay, I tried refactoring how you suggested. Initially, the results were a little surprising, because the single block case was suddenly far slower than before the refactoring. Before splitting the key representation:
After splitting the key representation:
The only thing that really changed was that I moved the call to This made me suspect maybe the compiler was able to optimize the previous case better. I tried switching to a slightly different representation where the parallel keys are lazily initialized, only if
aes/src/lib.rs
Outdated
any(target_arch = "x86", target_arch = "x86_64"),
not(aes_force_soft)
))] {
mod x86_64;
The module is gated on target_arch = "x86" in addition to x86_64, so perhaps the module name should just be x86?
That said, I'm not sure we've actually successfully tested using AES-NI via a 32-bit build.
Yeah, good question, I wasn't sure about that either.
The reason I opted for x86_64 though is because I was under the impression that the AES-NI extension was only actually available for x86_64 architecture CPUs.
But from a brief Google search it seems that it should be possible to get AES-NI to work for 32-bit targets (based on some old Intel sample code).
I believe it should still work when targeting those CPUs with a 32-bit binary, even if they're natively 64-bit
Right.
I think I’d opt to keep it as x86_64 since that’s the ISA where it’s available (even if running 32-bit target binaries), and also because 32-bit is slowly disappearing pretty much everywhere anyway.
But I don’t feel particularly strongly about it.
Would you like me to change it?
I think it's a bit weird to say something is x86_64 if it still works on a 32-bit target
Could you check whether a 32 bit binary which uses VAES instructions works or not?
Could you check whether a 32 bit binary which uses VAES instructions works or not?
I just tested and it does work.
I used the following .cargo/config.toml:
[build]
target = "i686-unknown-linux-gnu"
[target.i686-unknown-linux-gnu]
linker = "clang-17"
rustflags = [
    "-C", "link-arg=-fuse-ld=lld-17",
    "-C", "link-arg=--target=i686-unknown-linux-gnu",
    "-C", "target-feature=+aes,+sse3,+vaes",
]
This is with nightly-2024-07-02 on Ubuntu Mantic 23.10 using Debian multiarch configuration (with i386 architecture added).
I also renamed the module to x86.
I should note that I can't seem to actually test the 32-bit VAES target on CI though since I can't quite figure out how to get intel SDE to run those 32-bit binaries on the 64-bit host. It looks like it should be possible for it to run those but everything I've tried has resulted in it refusing to run them or just crashing (before getting to the tests).
Compiler also can sometimes have difficulties with optimizing |
I've added This required refactoring some parts of the autodetection code to handle in a cleaner way. In order to handle To work around that I just feature gated the I didn't change the I think this addresses basically all of the feedback now? |
Two more small changes:
@tarcieri @newpavlov Do you intend to merge this? |
I'd generally be in favor but it's definitely a large PR. Sorry it's gone by the wayside. I will hopefully have time to review soon. Also curious to know what @newpavlov thinks. |
Thanks. I would like to resume working the RISC-V and ARMv9 PRs (especially the latter will be relevant soon since Apple Silicon M4 is ARMv9 with SVE2/SME) but prefer to see how this one lands first before putting a lot more effort into those. |
Sorry for the late review!
"vmovdqu ymm14, [{iptr} + 13 * 32]",
"vmovdqu ymm15, [{iptr} + 14 * 32]",
// aes-128 round 0 encrypt
"vmovdqu ymm0 , [{simd_256_keys} + 0 * 32]",
Ideally, we would keep the round keys in registers (and adjust the number of blocks processed in parallel accordingly). We can probably do it in this case since __m256i is available on stable.
aes/src/x86/vaes256/aes128.rs
Outdated
pub(crate) unsafe fn parallelize_keys(keys: &RoundKeys<11>) -> Simd256RoundKeys<11> {
    let mut v256: [MaybeUninit<__m256i>; 11] = MaybeUninit::uninit().assume_init();
    asm! {
        "vbroadcastf128 ymm0 , [{keys} + 0 * 16]",
Have you tried using vbroadcastf128 instead of vmovdqu ymm0 , [{simd_256_keys} + i * 32] in the encrypt/decrypt functions? If performance is comparable, I would prefer vbroadcastf128, since it does not spill round keys to the stack.
In the original version I was using broadcasts (in Rust IIRC, but essentially the same) in the encrypt/decrypt functions.
But in the discussion (starting back from here) I refactored the code to store the broadcasts since I understood you wanted to eliminate those as unnecessary operations.
Neither approach made a difference to performance in the included benchmarks.
Are you saying you would prefer to go back to the broadcast on demand approach?
I suggest pre-broadcasting round keys for vaes256 and broadcasting before each round for vaes512.
With vaes256 you can keep the broadcasted keys in the backend state as [__m256i; N] and pass them into inline assembly blocks as values. Hopefully, the compiler will be able to eliminate stack spilling and the broadcasted round keys will stay strictly in registers (but we would need to inspect the generated assembly to check it).
If the above proposal is too complex for your liking, then you can broadcast round keys before each round for vaes256 as well.
I suggest pre-broadcasting round keys for vaes256 and broadcasting before each round for vaes512.
With vaes256 you can keep the broadcasted keys in the backend state as [__m256i; N] and pass them into inline assembly blocks as values. Hopefully, the compiler will be able to eliminate stack spilling and the broadcasted round keys will stay strictly in registers (but we would need to inspect the generated assembly to check it).
I'm not sure if this makes sense because for the VAES 256 case, there aren't enough registers to hold all the keys while also processing the data (which is why I did the interleaving).
Note also that the keys are already stored like this in the backends:
type RoundKeys<const ROUNDS: usize> = [__m128i; ROUNDS];
#[cfg(target_arch = "x86_64")]
type Simd256RoundKeys<const ROUNDS: usize> = [__m256i; ROUNDS];
#[cfg(target_arch = "x86_64")]
type Simd512RoundKeys<const ROUNDS: usize> = [__m512i; ROUNDS];
They are also only populated on-demand for the first VAES parallel block processing call (which eliminates the overhead in case encrypt/decrypt is only used for a single block):
#[cfg(target_arch = "x86_64")]
impl<'a> BlockBackend for $name_backend::Vaes512<'a, self::$name_backend::mode::Encrypt> {
#[inline]
fn proc_block(&mut self, block: InOut<'_, '_, Block>) {
unsafe {
self::ni::$module::encrypt1(self.keys, block);
}
}
#[inline]
fn proc_par_blocks(&mut self, blocks: InOut<'_, '_, Block64>) {
unsafe {
let simd_512_keys = self.simd_512_keys.get_or_insert_with(|| {
self::vaes512::$module::parallelize_keys(&self.keys)
});
self::vaes512::$module::encrypt64(simd_512_keys, blocks);
}
}
}
I believe this latter point is relevant with regard to whether or not we should pre-broadcast vs. broadcast before each round: the way it is now does eliminate overhead from creating all this data (as measured here).
I can still check whether or not interleaving the loads for VAES 512 (and increasing block count) makes a difference. I suspect it won't on my system, since most of the modifications I've made to the algorithm seem to have no noticeable impact, but it could be better in theory for other systems perhaps.
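The on-demand population described above can be sketched without any SIMD types. The names below (`LazyBackend`, the `broadcasts` counter, the byte-array key types) are hypothetical and exist only to make the behaviour observable; the real backends use `__m128i`/`__m512i` and call into the ni and vaes512 modules.

```rust
pub type Key128 = [u8; 16];
pub type Key512 = [u8; 64];

/// Backend that borrows the narrow keys and builds the wide copy lazily.
pub struct LazyBackend<'a, const ROUNDS: usize> {
    keys: &'a [Key128; ROUNDS],
    wide_keys: Option<[Key512; ROUNDS]>,
    /// Counts how often the broadcast actually ran (for illustration only).
    pub broadcasts: usize,
}

impl<'a, const ROUNDS: usize> LazyBackend<'a, ROUNDS> {
    pub fn new(keys: &'a [Key128; ROUNDS]) -> Self {
        Self { keys, wide_keys: None, broadcasts: 0 }
    }

    /// Single-block path: uses the narrow keys and never broadcasts.
    pub fn proc_block(&mut self) -> &[Key128; ROUNDS] {
        self.keys
    }

    /// Parallel path: broadcasts on first use, then reuses the cached copy.
    pub fn proc_par_blocks(&mut self) -> &[Key512; ROUNDS] {
        let keys = self.keys;
        let broadcasts = &mut self.broadcasts;
        self.wide_keys.get_or_insert_with(|| {
            *broadcasts += 1;
            core::array::from_fn(|i| {
                let mut wide = [0u8; 64];
                for lane in wide.chunks_exact_mut(16) {
                    lane.copy_from_slice(&keys[i]);
                }
                wide
            })
        })
    }
}
```

A caller that only ever processes single blocks pays nothing for the wide keys, and repeated parallel calls on the same backend broadcast once.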
I am afraid you misunderstood me regarding implementation of backends. I am a bit short on time right now, so I will try to explain myself better in the following days.
I'm not sure if this makes sense because for the VAES 256 case, there aren't enough registers to hold all the keys while also processing the data (which is why I did the interleaving).
There are enough registers for AES-128, i.e. we have 16 registers and 11 round keys, thus we can use the remaining 5 to process 10 blocks in parallel.
For AES-192 (13 round keys) we could keep everything in registers, but at the cost of processing only 6 blocks in parallel. We could also load+broadcast round keys on each round and process 30 blocks in parallel. And, of course, we could have solutions in between (e.g. process 16 blocks in parallel and keep 7 round keys in registers). Processing fewer blocks in parallel means less exploitation of ILP, but processing more blocks requires extra loads. Round keys are likely to be in L1 cache, but a load still takes several cycles. But superscalar processors have their own bag of tricks to deal with such situations... To summarize: if there is no measurable performance difference, we should probably prefer "prettier" code.
Finally, for AES-256 (15 round keys) we have no choice but to load+broadcast round keys on each round.
I can still check whether or not interleaving the loads for VAES 512 (and increasing block count) makes a difference.
It's likely your bottleneck here is memory throughput, this is why such minor changes don't make any difference.
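The register budget from this comment can be checked with a short calculation. This is a sketch under one assumption that the thread does not spell out: when round keys are streamed from memory, one register is reserved as scratch for the key currently in use. The function name is hypothetical.

```rust
/// Blocks processed per iteration when `resident_keys` round keys stay in
/// registers and `scratch` registers are reserved for streamed keys.
/// Returns `None` if the keys alone do not fit in the register file.
pub fn parallel_blocks(
    total_regs: u32,
    resident_keys: u32,
    scratch: u32,
    blocks_per_reg: u32,
) -> Option<u32> {
    let used = resident_keys.checked_add(scratch)?;
    let data_regs = total_regs.checked_sub(used)?;
    Some(data_regs * blocks_per_reg)
}
```

With 16 YMM registers holding two blocks each, `parallel_blocks(16, 11, 0, 2)` gives the 10 blocks quoted for AES-128, `parallel_blocks(16, 13, 0, 2)` gives 6 for AES-192, `parallel_blocks(16, 0, 1, 2)` gives 30 for the load-each-round variant, and `parallel_blocks(16, 7, 1, 2)` gives the in-between 16. The same formula with 32 ZMM registers and four blocks each, `parallel_blocks(32, 0, 1, 4)`, yields 124.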
I am afraid you misunderstood me regarding implementation of backends. I am a bit short on time right now, so I will try to explain myself better in the following days.
I just refactored the backends to support separate block sizes, as you requested.
Now they look like this:
// Backend structures
mod $name_backend {
use super::*;
pub(crate) mod mode {
pub(crate) struct Encrypt;
pub(crate) struct Decrypt;
}
#[derive(Clone)]
pub(crate) struct Ni<'a, Mode> {
pub(crate) mode: core::marker::PhantomData<Mode>,
pub(crate) keys: &'a RoundKeys<$rounds>,
}
#[derive(Clone)]
#[cfg(target_arch = "x86_64")]
pub(crate) struct Vaes256<'a, Mode> {
pub(crate) mode: core::marker::PhantomData<Mode>,
pub(crate) keys: &'a RoundKeys<$rounds>,
pub(crate) simd_256_keys: Option<Simd256RoundKeys<$rounds>>,
}
#[cfg(target_arch = "x86_64")]
pub(crate) struct Vaes512<'a, Mode> {
pub(crate) mode: core::marker::PhantomData<Mode>,
pub(crate) keys: &'a RoundKeys<$rounds>,
pub(crate) simd_512_keys: Option<Simd512RoundKeys<$rounds>>,
}
}
// For dispatching on the correct backend
#[derive(Clone)]
enum Backend {
Ni,
#[cfg(target_arch = "x86_64")]
Vaes256,
#[cfg(target_arch = "x86_64")]
Vaes512,
}
// For detecting which backend to select
#[derive(Clone)]
struct Features {
#[cfg(target_arch = "x86_64")]
avx: self::features::avx::InitToken,
#[cfg(target_arch = "x86_64")]
avx512f: self::features::avx512f::InitToken,
#[cfg(target_arch = "x86_64")]
vaes: self::features::vaes::InitToken,
}
impl Features {
fn new() -> Self {
Self {
#[cfg(target_arch = "x86_64")]
avx: self::features::avx::init(),
#[cfg(target_arch = "x86_64")]
avx512f: self::features::avx512f::init(),
#[cfg(target_arch = "x86_64")]
vaes: self::features::vaes::init(),
}
}
fn backend(&self) -> Backend {
#[allow(unused_mut)]
let mut backend = Backend::Ni;
#[cfg(target_arch = "x86_64")]
if !cfg!(disable_avx512) && self.avx512f.get() && self.vaes.get() {
backend = self::Backend::Vaes512;
}
#[cfg(target_arch = "x86_64")]
if !cfg!(disable_avx256) && self.avx.get() && self.vaes.get() {
backend = self::Backend::Vaes256;
}
backend
}
}
#[doc=$doc]
#[doc = "block cipher (decrypt-only)"]
#[derive(Clone)]
pub struct $name_dec {
round_keys: RoundKeys<$rounds>,
features: Features,
}
impl BlockCipherDecrypt for $name_dec {
#[inline]
fn decrypt_with_backend(&self, f: impl BlockClosure<BlockSize = U16>) {
let mode = core::marker::PhantomData::<self::$name_backend::mode::Decrypt>;
let keys = &self.round_keys;
match self.features.backend() {
self::Backend::Ni => f.call(&mut $name_backend::Ni { mode, keys }),
#[cfg(target_arch = "x86_64")]
self::Backend::Vaes256 => f.call(&mut $name_backend::Vaes256 {
mode,
keys,
simd_256_keys: None,
}),
#[cfg(target_arch = "x86_64")]
self::Backend::Vaes512 => f.call(&mut $name_backend::Vaes512 {
mode,
keys,
simd_512_keys: None,
}),
}
}
}
#[cfg(target_arch = "x86_64")]
impl<'a> BlockBackend for $name_backend::Vaes512<'a, self::$name_backend::mode::Decrypt> {
#[inline]
fn proc_block(&mut self, block: InOut<'_, '_, Block>) {
unsafe {
self::ni::$module::decrypt1(self.keys, block);
}
}
#[inline]
fn proc_par_blocks(&mut self, blocks: InOut<'_, '_, Block64>) {
unsafe {
let simd_512_keys = self.simd_512_keys.get_or_insert_with(|| {
self::vaes512::$module::parallelize_keys(&self.keys)
});
self::vaes512::$module::decrypt64(simd_512_keys, blocks);
}
}
}
The reason I didn't do this originally was because I was trying to avoid an explosion of complexity in autodetect, since it's not really organized in such a way that works well with fine-grained feature selection within specific architectures (without adding a bunch of noisy conditionals everywhere).
I realized this was going to be an issue once I started working on the RISC-V and ARMv9 backends which also have multiple features to detect.
What I did to work around this was to remove the get_{enc,dec}_backend functions from the architecture modules, since exposing them to autodetect made it impossible to dispatch on different backends within the architecture module (since the type distinctions would escape), at least unless the return type for those methods was changed into a trait object.
This allowed me to refactor the x86 module to handle finer-grained backend dispatching without polluting autodetect further.
If there's further misunderstanding, you'll have to be more specific.
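As a minimal sketch of dispatching inside the architecture module, the selection can be written as a pure function over detected features. The plain `bool` fields stand in for the cpufeatures init tokens, and the priority order (widest backend first) is an assumption, not something the thread states:

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend {
    Ni,
    Vaes256,
    Vaes512,
}

/// Hypothetical stand-in for the runtime feature tokens.
#[derive(Clone, Copy, Debug, Default)]
pub struct Features {
    pub avx: bool,
    pub avx512f: bool,
    pub vaes: bool,
}

/// Pick the widest backend whose required features are all present,
/// falling back to plain AES-NI otherwise.
pub fn select_backend(f: Features) -> Backend {
    if f.vaes && f.avx512f {
        Backend::Vaes512
    } else if f.vaes && f.avx {
        Backend::Vaes256
    } else {
        Backend::Ni
    }
}
```

Keeping the choice in one `if`/`else if` chain makes the precedence explicit, which matters when a CPU reports both AVX and AVX-512F.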
There are enough registers for AES-128, i.e. we have 16 registers and 11 round keys, thus we can use the remaining 5 to process 10 blocks in parallel.
For AES-192 (13 round keys) we could keep everything in registers, but at the cost of processing only 6 blocks in parallel. We could also load+broadcast round keys on each round and process 30 blocks in parallel. And, of course, we could have solutions in between (e.g. process 16 blocks in parallel and keep 7 round keys in registers). Processing fewer blocks in parallel means less exploitation of ILP, but processing more blocks requires extra loads. Round keys are likely to be in L1 cache, but a load still takes several cycles. But superscalar processors have their own bag of tricks to deal with such situations... To summarize: if there is no measurable performance difference, we should probably prefer "prettier" code.
Okay I see what you mean.
I will experiment with trying to keep the keys in registers.
In agreement with your last point, I'm hesitant about making the implementation more complex than it already is though, especially since I can't measure most of these differences (it would help perhaps to see the benchmarks run on intel since I am only testing on Zen4).
let (iptr, optr) = blocks.into_raw();
asm! {
    // load keys
    "vmovdqu32 zmm0 , [{keys} + 0 * 64]",
Since we reload the keys on each call either way, it may be worth increasing the number of blocks processed in parallel to 124 and interleaving the loads like you did for vaes256.
pub(crate) unsafe fn parallelize_keys(keys: &RoundKeys<15>) -> Simd512RoundKeys<15> {
    let mut v512: [MaybeUninit<__m512i>; 15] = MaybeUninit::uninit().assume_init();
    asm! {
        "vbroadcasti32x4 zmm0 , [{keys} + 0 * 16]",
Ditto.
I've made several changes since the recent feedback:
At this point I would actually prefer not to focus much more on refactoring the algorithms (re: experimenting with block counts, broadcasting, etc). I've put quite a lot of time into this PR already and the performance gains are pretty reasonable I think. There's always room in the future for more fine-tuning. I'm still willing to address remaining design issues though. |
@silvanshade |
@tarcieri @newpavlov Any updates on this? |
@silvanshade why did you close this? It seemed pretty close to complete. |
I closed it because I still haven't gotten a thorough review and discussion about the implementation, even though I've repeatedly addressed all of the smaller feedback to the best of my ability. From my perspective, there is no real evidence that this PR is "close to complete". I thought it was basically complete months ago and asked for feedback then, and waited, and nothing happened. I realize that maintainers are often very busy with other things but I think that it should have been possible by now to get a more concrete idea about whether this is ever likely to be merged and if not, what are the blockers. The last substantive exchange with @newpavlov suggested I fundamentally misunderstood something about the implementation, and that was never clarified. So I just don't think it's a good use of time to continue. If you think otherwise, what would you suggest? |
@newpavlov's last comment, as of two weeks ago, was:
It sounds like he wanted to just do one final pass before merging. @silvanshade can you please reopen and we can get this merged? |
I think it would be more productive to re-open it if or when there's a final review. |
Sorry for the delay! I couldn't find enough time during the previous weekend, so I will try again on this one. Closing PR makes it less visible and increases chances of forgetting about it, so I will reopen. |
@newpavlov Thanks for the update. Unfortunately I've deleted the branch and no longer wish to contribute to this project. |
This PR adds support for using VAES intrinsics for the ni backend for the aes 8-fold operations.
The change shows a nice speed up on Zen4 CPUs at least.
Benchmarks (Ryzen 7950x):
RUSTFLAGS="-C target-cpu=native" cargo bench:
RUSTFLAGS="-C target-cpu=native" cargo bench --features vaes:
I experimented with changing ParBlocksSize to 32 and unfolding the loop more for the VAES case to see if it made a difference, but at least on Zen4 it didn't seem to matter.
One thing I noticed is that it is quite important that the target-cpu is set correctly, otherwise the performance can be bad:
cargo bench --features vaes:
Regarding adding the vaes feature to the Cargo.toml, rather than using cpufeatures, I couldn't figure out a way to structure the addition of this functionality cleanly otherwise.
This is partly due to the fact that some of the instructions are gated behind stdsimd.
Also, as noted in another thread on the Rust forums, there isn't really a way to handle the negation of a case for target_feature, so it would be difficult to figure out how to override the selection of the usual ni 8-fold operations with the vaes versions.
But if anyone has suggestions on how to structure this better I'd be happy to make those changes.