Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vendor rust-crypto aes/soft implementation #2024

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

rohhamh
Copy link

@rohhamh rohhamh commented Apr 10, 2024

I've tried following the instructions at #1886 (comment)

I ran the tests for fixslice64 and they all seem to pass.
However I'm very reluctant about some of the changes made here,
like increasing rd_key's size to 240 and having u32-u64 conversion functions
or using self.inner.rounds to differentiate between the target functions
or sending less than four actual blocks for encryption to the vendored code.


This should fail for fixslice32, I would like it if I could have your opinion on how you think it is best to handle fixslice32 as it takes u32 parameters.

The code in ctr32_encrypt_within is basically my attempt at replicating what aes_nohw_ctr32_encrypt_blocks in aes_nohw.c does, I imagine it's accurate as the tests did pass.

There's also the contribution guide.
I'm not sure if the statement should be added to every commit message or one would suffice,
or how the license would apply to the vendored code.

And finally, this is basically the first time I've done anything in rust or cryptography, so I imagine there would be many ways to improve the code. I'd be very happy to know about them.

rohhamh added 4 commits March 31, 2024 17:32
RustCrypto/block-ciphers commit: 6b263c07436a9861c8456bc73f81ff075673f7a5
allow deadcode for decrypt and 192bit functions
@rohhamh rohhamh changed the title vendor rust-crypto aes/soft implementation https://github.com/briansmith/ring/issues/1886 vendor rust-crypto aes/soft implementation Apr 10, 2024
Copy link
Owner

@briansmith briansmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for doing this. I am sorry to admit that I overlooked that the Rust Crypto key representation was so much larger when I originally suggested this plan. I think it's important for us to find a way to minimize (ideally eliminate) the size overhead, at least for targets that are likely to have 128-bit SIMD and/or hardware instructions available.

@@ -60,7 +60,7 @@
// aes_key_st should be an opaque type, but EVP requires that the size be
// known.
struct aes_key_st {
uint32_t rd_key[4 * (AES_MAXNR + 1)];
uint32_t rd_key[240];
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, most fundamentally, we cannot change the AES_KEY structure because the assembly language code needs exactly this layout. So, we'd have to use an enum/union of AES_KEY and another key type, if we need to have more round keys for the new fallback implementation.

4 * (AES_MAXNR + 1) is 60, so this is making the keys 4x the size they were previously. We should spend effort to reduce this. When I originally suggested we use the rust-crypto implementation as the fallback, I did not realize that it had this huge space overhead.

IIRC, in their blog post, the Rust Crypto team said that this implementation is faster than bitslicing when SIMD isn't available, but slower than bitslicing when (128-bit) SIMD is available. Well, in practice almost all targets should be able to use a bitsliced SIMD implementation, and only fall back to this when SIMD isn't available. That means that any significant space overhead for this fallback implementation is very hard to accept.

What's your opinion on this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry for the delayed response.
in response to your comment I must say that I would personally enjoy it very much if I can help with ring and would be happy to try to fix the issues you find in the PR.

I imagine you mean something like the following by having an enum/union of AES_KEY.
To my understanding, something like this would still cause an overhead as one of the variants would still require allocating a [u64; 120]

// Keep this in sync with `AES_MAXNR` in aes.h.
const MAX_ROUNDS: usize = 14;

// Keep this in sync with AES_KEY in aes.h.
#[repr(C)]
#[derive(Clone)]
pub(super) struct AES_KEY_HW {
    pub rd_key: [u32; 4 * (MAX_ROUNDS + 1)],
    pub rounds: c::uint,
}

#[derive(Clone)]
pub(super) enum AES_NOHW_ROUND_KEY_VARIANT {
    v128([u64; 88]),
    v256([u64; 120]),
}

#[repr(C)]
#[derive(Clone)]
pub(super) struct AES_KEY_NOHW {
    pub rd_key: AES_NOHW_ROUND_KEY_VARIANT,
    pub rounds: u8,
}

#[derive(Clone)]
pub(super) enum AES_KEY {
    HW(AES_KEY_HW),
    NOHW(AES_KEY_NOHW),
}

I'm not sure what other solutions are available,
Do you think using Vec or Box would be valid options for us?

And about this being slower than bitslicing with SIMD, are suggesting to keep the vendored code only as fallback and for when SIMD instructions aren't available?

@@ -267,7 +313,76 @@ impl Key {
}

Implementation::NOHW => {
ctr32_encrypt_blocks!(aes_nohw_ctr32_encrypt_blocks, in_out, src, &self.inner, ctr)
let in_out_len = in_out[src.clone()].len();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the logic is this large, then it should be split out into a separate standalone function that is called from here, much like bsaes_ctr32_encrypt_blocks_with_vpaes_key just was.

}

xor_columns(&mut rkeys, rk_off, 8, ror_distance(1, 3));
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like maybe we could split this into two functions so that the above logic initializes an AES_KEY structure as it exists today, which would be stored in the aes::Key. Then whenever we do an encryption/decryption operation, we would first execute the below logic to expand the round keys into a temporary buffer that contains the fixedslice representation. Of course, this would add some overhead to each encryption batch. Then we'd need to measure to see if it would still be faster than the non-SIMD bitsliced implementation.

WDYT?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it so we won't change the AES_KEY structure of today's and/or have smaller changes to the current codebase?

I'm afraid I'm not well-versed enough in the subject to have a definitive answer, normally, I'd say that if we can avoid the overhead and have the expansion happen only once then we could do just that.
But I imagine it wouldn't matter much if the difference in performance is negligible when compared to what rust-crypto provides at best (and is still faster than what's currently available).

Do you think this could be the best approach?


/// Fully bitsliced AES-192 key schedule to match the fully-fixsliced representation.
#[allow(dead_code)]
pub(crate) fn aes192_key_schedule(key: &[u8; 24]) -> FixsliceKeys192 {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can add a commit to just delete the aes192-specific code.

inv_shift_rows_1(&mut rkeys[i..(i + 8)]);
}
}
#[cfg(not(aes_compact))]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And/or we should consider deleting all the not(aes_compact) code paths and reduce the key size accordingly.

inv_shift_rows_2(&mut state);
}

let mut rk_off = 72;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me think that many of the round keys in rkeys are only needed for decryption and not encryption? If so, we could reduce the size of rkeys by half by just supporting only encryption?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might have to check more accurately, but I think that's not the case as the following is part of the code of encryption

    let mut rk_off = 8;
    loop {
        sub_bytes(&mut state);
        mix_columns_1(&mut state);
        add_round_key(&mut state, &rkeys[rk_off..(rk_off + 8)]);
        rk_off += 8;

        #[cfg(aes_compact)]
        {
            shift_rows_2(&mut state);
        }

        if rk_off == 80 {
            break;
        }

@rohhamh
Copy link
Author

rohhamh commented Jul 8, 2024

Is the PR not relevant anymore? @briansmith

@briansmith
Copy link
Owner

Is the PR not relevant anymore? @briansmith

PR #2070 is somewhat of an alternative to this PR that is more conservative because it doesn't increase the size of the key. I think especially in the case where it is more common to have a hardware implementation, avoiding increasing the key size is important.

In the case where there are vector instructions available, my understanding is that the 128-bit SIMD version of the code in PR #2070 (to be done later) will be faster. So I think that it makes sense to merge PR #2070 and also to port the 128-bit SIMD version from BoringSSL to Rust.

I think this PR is still potentially useful for platforms where we're always going to use the fallback implementation and there are no vector instructions available.

And/or, like I mentioned before, if we can figure out a way to keep the key size the same as it is now, and "expand" the key schedule on-demand during each encryption operation, then this would be useful in the case where vector instructions cannot be used, even as the fallback implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants