Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

One-pass encryption/decryption #74

Open
1 of 5 tasks
tarcieri opened this issue Jan 26, 2020 · 13 comments
Open
1 of 5 tasks

One-pass encryption/decryption #74

tarcieri opened this issue Jan 26, 2020 · 13 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@tarcieri
Copy link
Member

tarcieri commented Jan 26, 2020

Currently all of the AEAD implementations do two passes over the plaintext/ciphertext when encrypting/decrypting respectively: for encryption, they encrypt the plaintext in the first pass, and authenticate it in the second pass. For decryption, it's vice versa.

A better approach is to pick a number of blocks to operate on in parallel and encrypt/authenticate or authenticate/decrypt in a single pass. This has better cache locality, e.g. when we encrypt data, store the resulting ciphertext, then load it again to do authentication, that is pretty much guaranteed to hit L1 cache when doing it in a single pass (and ideally we could hand off values still stored in e.g. SIMD registers)

This is a tracking issue for converting the implementations of these respective algorithms to be one pass. It also might be good to discuss ways we could have a generic implementation of one pass encryption/decryption in the aead crate (especially one specialized for the non-SIV stream-cipher + universal-hash use case) which can be reused across different algorithm implementations.

  • aes-gcm
  • aes-gcm-siv
  • aes-siv
  • chacha20poly1305
  • xsalsa20poly1305

†NOTE: SIV modes by definition cannot support 1-pass encryption (because the first pass generates the synthetic IV, which must be known in advance before encryption can be performed). However, they can support 1-pass decryption, since the IV is known in advance in that case.

@tarcieri tarcieri added enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed labels Jan 26, 2020
@tarcieri
Copy link
Member Author

Related issue: generalized AEAD implementations based on stream ciphers RustCrypto/traits#45

@nico-abram
Copy link

As a complete crypto noob who got here from github explore (But not a complete rust noob), would this be feasible? (Or even just a small part, like trying to make decryption for chacha20poly1305 single pass) How robust are existing tests?

Would the change mostly be changing implementations like https://github.com/RustCrypto/AEADs/blob/master/chacha20poly1305/src/cipher.rs#L66-L91 into something more like https://github.com/RustCrypto/AEADs/blob/master/aes-gcm-siv/src/lib.rs#L317-L347 ?

@tarcieri
Copy link
Member Author

tarcieri commented Apr 17, 2020

Yes, but it also needs to be done in a way that actually improves performance. I've tried to do this change naively a few times (to aes-gcm, mainly, I might still have the code around) and it decreased performance.

I think doing it properly might require keeping the data flowing through XMM registers... at the very least it needs to all stay in L1 cache.

ChaChaPoly is even trickier because this issue is a micro-optimization and so far we don't have an AVX2 backend for Poly1305 (see RustCrypto/universal-hashes#49)

@nico-abram
Copy link

Thanks for the response!

I think I'll give it a shot. Do not let that stop anyone else from trying it since I don't have much hope I'll be able to do much.

How important is performance compiling with avx support vs without? (i.e, would you mostly care about speed when compiling with simd extensions or does the "default" cargo build configuration also matter a lot?)

@newpavlov
Copy link
Member

newpavlov commented Jul 2, 2020

The ccm crate could also use a single pass encryption/decryption.

@str4d
Copy link

str4d commented Oct 19, 2021

Let's take chacha20poly1305 in its current form and look at the AVX2 hot path (ignoring all the autodetect code in chacha20 and poly1305).

chacha20poly1305::cipher:

impl<C> Cipher<C> where C: StreamCipher + StreamCipherSeek,
{
    pub(crate) fn encrypt_in_place_detached(
        mut self,
        associated_data: &[u8],
        buffer: &mut [u8],
    ) -> Result<Tag, Error> {
        // ...

        // Not currently implemented, but imagine we did this:
        for chunk in buffer.chunks_mut(BLOCK_SIZE * 4) {
            self.cipher.apply_keystream(chunk);
            self.mac.update_padded(chunk);
        }

        // ...
    }
}

chacha20::backend::autodetect:

pub(crate) const BUFFER_SIZE: usize = BLOCK_SIZE * 4;

chacha20::chacha:

impl<R: Rounds, MC: MaxCounter> StreamCipher for ChaCha<R, MC> {
    fn try_apply_keystream(&mut self, mut data: &mut [u8]) -> Result<(), LoopError> {
        // ...

        let mut chunks = data.chunks_exact_mut(BUFFER_SIZE);
        for chunk in &mut chunks {
            let counter_with_offset = self.counter_offset.checked_add(counter).unwrap();
            self.block.apply_keystream(counter_with_offset, chunk);
            counter = counter.checked_add(COUNTER_INCR).unwrap();
        }

        // ...
    }
}

chacha20::backend::avx2:

const BLOCKS: usize = 4;

impl<R: Rounds> Core<R> {
    pub fn apply_keystream(&self, counter: u64, output: &mut [u8]) {
        debug_assert_eq!(output.len(), BUFFER_SIZE);

        unsafe {
            let state = State {
                a: self.v0,
                b: self.v1,
                c: self.v2,
                d: iv_setup(self.iv, counter),
            };
            let state = self.rounds(state);

            for i in 0..BLOCKS {
                for (chunk, a) in output[i * BLOCK_SIZE..(i + 1) * BLOCK_SIZE]
                    .chunks_mut(0x10)
                    .zip(
                        [state.a, state.b, state.c, state.d]
                            .iter()
                            .map(|s| s.blocks[i]),
                    )
                {
                    let b = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
                    let out = _mm_xor_si128(a, b);
                    _mm_storeu_si128(chunk.as_mut_ptr() as *mut __m128i, out);
                }
            }
        }
    }
}

universal_hash:

pub trait UniversalHash {
    fn update_padded(&mut self, data: &[u8]) {
        let mut chunks = data.chunks_exact(Self::BlockSize::to_usize());

        for chunk in &mut chunks {
            self.update(GenericArray::from_slice(chunk));
        }

        // ...
    }
}

poly1305::backend::avx2:

impl State {
    pub(crate) unsafe fn compute_block(&mut self, block: &Block, partial: bool) {
        // ...

        self.cached_blocks[self.num_cached_blocks].copy_from_slice(block);
        if self.num_cached_blocks < 3 {
            self.num_cached_blocks += 1;
            return;
        } else {
            self.num_cached_blocks = 0;
        }

        let p = Aligned4x130::from_blocks(&self.cached_blocks);
        // ...
}

poly1305::backend::avx2::helpers:

impl Aligned4x130 {
    pub(super) unsafe fn from_blocks(src: &[Block; 4]) -> Self {
        // 26-bit mask on each 32-bit word.
        let mask_26 = _mm256_set1_epi32(0x3ffffff);
        // Sets bit 24 of each 32-bit word.
        let set_hibit = _mm256_set1_epi32(1 << 24);

        // - Load the four blocks into the following 32-bit word layout:
        //      [b33, b32, b31, b30, b23, b22, b21, b20]
        //      [b13, b12, b11, b10, b03, b02, b01, b00]
        //
        // - Unpack the upper and lower 64 bits:
        //      [b33, b32, b13, b12, b23, b22, b03, b02]
        //      [b31, b30, b11, b10, b21, b20, b01, b00]
        //
        // - Swap the middle two 64-bit words:
        // a0 = [b33, b32, b23, b22, b13, b12, b03, b02]
        // a1 = [b31, b30, b21, b20, b11, b10, b01, b00]
        let (lo, hi) = src.split_at(2);
        let blocks_23 = _mm256_loadu_si256(hi.as_ptr() as *const _);
        let blocks_01 = _mm256_loadu_si256(lo.as_ptr() as *const _);
        // ...
    }
}

So, the hot path above:

  • Splits the plaintext into 4-block byte chunks.
  • Passes the 4-block byte chunk to chacha20, which:
    • Chunks it into 4-block chunks (no-op).
    • Splits the chunk into individual blocks and:
      • Calls _mm_loadu_si128 on the block to load it into a __m128i.
      • XORs the stream into the __m128i.
      • Stores the __m128i back into the block.
  • Passes the (now-encrypted) 4-block byte chunk to poly1305, which:
    • Splits the chunk into individual blocks.
    • Copies each block into a 4-block cache (reconstructing the chunk).
    • Calls _mm256_loadu_si256 on each half of the 4-block cache.
    • Draws the rest of the polynomiOwl.

So the immediate blocker is that the UniversalHash trait doesn't provide any API to process multiple blocks at a time (StreamCipher::try_apply_keystream allows the implementor to choose the chunking, whereas UniversalHash::update is typed on a single block).

Once that is addressed, poly1305 could directly consume 4-block chunks without using its cache, at which point we would be consistently passing around a 4-block chunk size. Then the question becomes the form in which we pass the chunk around. Sketching out two possible directions:

  • Add an associated type for the chunk to an aead trait, which is constrained to equal an equivalent associated type in cipher and universal-hash. Then chacha20 and poly1305 would separately set it to the same concrete type.
  • Add an an aead::AeadChunk trait, and implement it in chacha20poly1305. Have some way to map it to the chunk inputs of cipher and universal-hash.

@tarcieri
Copy link
Member Author

tarcieri commented Oct 19, 2021

Yeah, it's definitely a drawback that the UniversalHash trait doesn't provide a multi-block API. It's also problematic that data in AEADs is round-tripping through calls like _mm_storeu_si128/_mm_loadu_si128, especially along crate boundaries where it's pretty much guaranteed not to get optimized away even in cases where it's aligned.

These sorts of optimization problems for passing data between stream ciphers and universal hash functions were the impetus for the simd-buffers work I was attempting here:

RustCrypto/utils#221

I abandoned that, but now I wonder if maybe crypto_bigint::UInt might make a reasonable replacement buffer type for these use cases, especially if we ensured they were properly aligned such that they could be safely converted to similarly-structured SIMD types. Then a multi-block API could operate over slices of those structured SIMD buffers with guaranteed alignment.

@newpavlov
Copy link
Member

newpavlov commented Oct 20, 2021

The fundamental issue here is runtime detection. It not only means that optimal number of blocks processed in parallel can change depending on CPU capabilities (and in some cases even on CPU family!), but also that during combination of primitives we need a way to automatically generate a matrix of possible capability combinations. It means that if algorithm 1 is able to process 3 blocks by default and 8 blocks with feature A and algorithm 2 is able to process 2 blocks by default and 6 with feature B, then ideally when combining them we should generate 3 code paths: by default processing 6 blocks, for feature A processing 8 blocks, for feature B processing 6 blocks, and for feature A and B processing 24 blocks. And if algorithms have different block sizes, problems becomes even harder.

Rust does not have good tools for solving this problem and likely will not have them anytime soon. At the very least we would need some kind of function multi-versioning (i.e. an ability to define different function implementations for different target features) with an ability to query available versions at compile time. And ideally we would need trait multi-versioning as well since it's preferable to store chunk size as an associated constant, but allowing public API (via associated constants and types) to change depending on available target features is a sizable can of worms with potentially non-trivial implications.

Defining those combinations manually could work to some extent, but it will be hard to maintain and I don't think compiler will be able to optimize out our cpufeatures-based code even if method is used in a context with enabled target features.

Round-tripping _mm_storeu_si128/_mm_loadu_si128 should not be a big issue since for compiler it's a trivial optimization assuming code gets properly inlined. The issue here is again runtime feature detection, since branching inside MAC/universal hash block processing method acts as an optimization barrier. Without proper inlining and removing the optimization barrier I highly doubt crypto-bigint will have any measurable effect. By caching blocks into stack you would be able to use aligned loads, but the main improvement is to keep data in registers without spilling it anywhere and you would not achieve it using this approach.

I hope to alleviate some issues in the new trait versions. It introduces slice-based block-level traits for hashes/MACs/universal hashes, hides chunk size from public API and instead uses callback-based methods. Not only should it help with inlinining, but also effectively inverses control over iteration. In other words, iteration over blocks is controlled not by higher-level code which combines primitives, but at the cipher level. It means that we can branch once per loop, instead of doing it every chunk (compiler currently is unable to optimize it automatically). Also it means that callbacks (which are used for passing blocks to MAC) are executed in the context with enabled target features and known chunk size.

Unfortunately this approach is still far from ideal. Roughly it results in the following code:

if is_aesni_available() {
    for chunk in blocks.chunks_exact_mut(AESNI_CHUNK) {
        aesni_encrypt(chunk);
        if is_pclmul_available() {
            pclumul_mac(chunk)
        } else {
            default_mac(chunk)
        }
    }
} else {
    for chunk in blocks.chunks_exact_mut(DEFAULT_CHUNK) {
        default_encrypt(chunk);
        if is_pclmul_available() {
            pclumul_mac(chunk)
        } else {
            default_mac(chunk)
        }
    }
}

In other words, if cipher backend does not cover required features for MAC backend we still have the optimization barrier on our hands.

@tarcieri
Copy link
Member Author

tarcieri commented Oct 20, 2021

@newpavlov have you actually tested that the optimizations you expect actually work out in practice, especially considering things like traits defined in two crates, being consumed by a third, where the first crate is using _mm_storeu_si128 and the second is using _mm_loadu_si128? I haven't myself, but I'm skeptical about the degree of inlining which can occur in that sort of 3-crate scenario (which really ends up being more like 6 when you add in the trait crates), especially with a crate as Rust's unit of compilation.

To set a baseline for maximum performance, I think we could move things like the CPU feature tests into the AEAD crates like aes-gcm/chacha20poly1305, and expose a couple/few sets of x86-64 and ARM-specific APIs in crates like aes, chacha20, ghash/polyval, and poly1305 which operate in terms of SIMD registers. We don't even have to ship that, we just need to see what the performance difference is.

Once we're reasonably certain of what a performant implementation looks like, we can experiment with various abstractions, although I'm still a bit unsold on the changes in RustCrypto/traits#727, or at the very least they seem complicated and unclear to me.

I feel like there are slice-based abstractions missing from universal-hash, similar to the ones I added to address RustCrypto/traits#332, which would also address the problem. I'm not sure we really need any sort of automatic constraint solver to pass around appropriately-sized chunks. We're already writing code which is explicit about the various platforms we support and backends which are detected at runtime, so we can program in an appropriate size for each of those scenarios since they're already factored into relevant modules explicitly. And really, in practice that size is dictated by the stream cipher, at least in the cases we currently care about.

Glossing over a few things, in practice I think the optimal block sizes look like the following:

AES-GCM / AES-GCM-SIV

  • x86/x86_64 w\ AES-NI + CLMUL: 8 x 128-bit blocks
  • ARMv8 w\ crypto extensions: 8 x 128-bit blocks
  • 64-bit portable: 4 x 128-bit blocks
  • 32-bit portable: 2 x 128-bit blocks

ChaCha20Poly1305

  • x86/x86_64 w\ AVX2: 4 x 128-bit blocks
  • x86/x86_64 w\ SSE: 2 x 128-bit blocks
  • ARM w\ NEON: 2 x 128-bit blocks(?)
  • Portable: 1 x 128-bit block

@newpavlov
Copy link
Member

@tarcieri

have you actually tested that the optimizations you expect actually work out in practice

No, I only played a bit with small snippets in godbolt. We may need to abuse #[inline(always)] to achieve this optimization in practice.

I'm still a bit unsold on the changes in RustCrypto/traits#727, or at the very least they seem complicated and unclear to me.

I am myself far from 100% happy with the result, but right now I don't see a better path forward and, compared to the current design, I think it's definitely an improvement. Could you please comment in the PR on elements which you don't like or do not fully understand? I would appreciate your feedback sooner than later, since I hope to finalize it in the near future.

in practice that size is dictated by the stream cipher, at least in the cases we currently care about.

I agree, this is why the callbacks in my PR are only done on the cipher side, while MACs and universal hashes are left with the slice-based methods. But we are still left with the problem of target feature branching inside chunk iteration. Even if we are to check redundant features such as CLMUL in aes, I don't think that compiler will be able to remove branches in our cpufeatures-based code.

Also do not forget that code with enabled target features can not be currently inlined at all, so we definitely should strive to have chunk processing inside context with same target features.

@tarcieri
Copy link
Member Author

tarcieri commented Oct 20, 2021

Could you please comment in the PR on elements which you don't like or do not fully understand?

Left a comment on the PR. Just generally I'm confused what is happening there.

But we are still left with the problem of target feature branching inside chunk iteration.

That's why I was suggesting exposing low-level architecture-specific APIs to optimize passing data between ciphers and UHFs.

Then the check can be performed at the level of the entire AEAD, once, at the time the AEAD is initialized, and branched upon at the granularity of large AEAD operations.

The fast path for the entire core can occur within #[target_feature(...)] annotated code which is amenable to inlining, with data flowing in the form of SIMD register types which don't need to rely on inlining for performance, since they're the desired type to begin with and there's no type conversions that need to be optimized away.

@newpavlov
Copy link
Member

That's why I was suggesting exposing low-level architecture-specific APIs to optimize passing data between ciphers and UHFs.

Such API would have to be unsafe and represented as free-standing methods. It could work, but I think such solution is quite ad hoc and will be hard to extend, i.e. for each somewhat relevant combination we would have to manually write loops for all combinations. Adding a new backend would mean that we will need to update all combinations using this primitive manually.

I guess it could be a practical stop-gap solution and baseline for comparing generic solutions.

@newpavlov
Copy link
Member

After numerous experiments, I think I've found a good solution to this problem could look like, but, unfortunately, it's blocked on lack of rank-2 polymorphism in Rust. I wrote about it here: https://internals.rust-lang.org/t/15875 So I think the callback-based solution explored in the cipher v0.4 PRs is the best option which we have right now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

5 participants
@tarcieri @newpavlov @str4d @nico-abram and others