Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix performance issue introduced by vendor code #288

Merged
merged 3 commits into from
Jun 29, 2023
Merged

Conversation

Taowyoo
Copy link
Collaborator

@Taowyoo Taowyoo commented Jun 28, 2023

Performance issue about unaligned memory access

In mbedtls 3.4.0, upstream refactor the code of accessing unaligned address from using bit calculation to use pointer + memcpy + clang's bitswap.

This is fine in no-sgx environment, but become very slow in fortanix SGX environment.
I think the most possible reason here is call of memcpy.

This PR mainly replace all these new functions with old implementation in https://github.com/Mbed-TLS/mbedtls/blob/v3.3.0/library/common.h

Performance tests result:
Command used to run bench:
cargo +stable bench --no-default-features --features dsa,force_aesni_support,mpi_force_c_code,rdrand,std,time --target=x86_64-fortanix-unknown-sgx

Branch for testing: `yx/mbedtls-9_bench` at e51d4d17591817922928b3f23b6a85de4c3b2501 : 
Benchmarking pbkdf2_hmac
Benchmarking pbkdf2_hmac: Warming up for 3.0000 s
Benchmarking pbkdf2_hmac: Collecting 100 samples in estimated 7.6594 s (200 iterations)
Benchmarking pbkdf2_hmac: Analyzing
pbkdf2_hmac             time:   [38.324 ms 38.332 ms 38.339 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Branch for testing: `yx/mbedtls-11_bench` at 489cad5a6d8dc1042329de5aaf9d0a801a7ba802 : 
Benchmarking pbkdf2_hmac
Benchmarking pbkdf2_hmac: Warming up for 3.0000 s
Benchmarking pbkdf2_hmac: Collecting 100 samples in estimated 14.664 s (100 iterations)
Benchmarking pbkdf2_hmac: Analyzing
pbkdf2_hmac             time:   [146.82 ms 146.86 ms 146.91 ms]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

After this PR, the performance number back to be as same as yx/mbedtls-9_bench one

Performance issue about explicit_bzero (debug mode only)

When providing the explicit_bzero function through rust, there is a big performance downgrade in SGX

For example, when running the pbkdf :
Command: cargo +stable nextest run --test pbkdf --no-default-features --features dsa,force_aesni_support,mpi_force_c_code,rdrand,std,time --target=x86_64-fortanix-unknown-sgx

# debug
PASS [   7.689s] mbedtls::pbkdf test_pbkdf2
# release
PASS [   0.233s] mbedtls::pbkdf test_pbkdf2

After ensure C mbedtls side to not call explicit_bzero, the time reduce to

# debug
PASS [   1.785s] mbedtls::pbkdf test_pbkdf2
# release
PASS [   0.234s] mbedtls::pbkdf test_pbkdf2

This performance issue should be not related to usage of Zeroize crate, because above tests results are reproduced by following minimal implementation of explicit_bzero:

#[no_mangle]
pub unsafe extern "C" fn explicit_bzero(buf: *mut std::ffi::c_void, len: alloc::size_t) {
    // TODO: use `volatile_set_memory` when stabilized
    for i in 0..len {
        let ptr = (buf as *mut u8).add(i);
        core::ptr::write_volatile(ptr, 0u8);
    }
}

@Taowyoo
Copy link
Collaborator Author

Taowyoo commented Jun 28, 2023

Maybe I need to export a C define from mbedtls-sys, such as __SGX__. And use this to conditional compile the change above

@Taowyoo Taowyoo force-pushed the yx/perf-fix branch 2 times, most recently from 2977bc8 to 09720b4 Compare June 28, 2023 19:23
Copy link

@xinyufort xinyufort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I don't know if changing the underlying implementation has any nontrivial repercussions...

mbedtls-sys/build/build.rs Outdated Show resolved Hide resolved
@arai-fortanix
Copy link

I haven't looked at this in full detail yet, but one thing I'm not thrilled about is directly conditioning the code on __SGX__ being defined. The existing code seems to be pretty good at defining macros for features/properties like MBEDTLS_EFFICIENT_UNALIGNED_ACCESS and then conditionalizing the code on those macros. The feature macros get set based on things like the defines for the CPU architecture, so there's a level of indirection there.

This is a nice thing to do because at some point in the future there might be some other non-SGX platform that wants to use these alternatives for memory access. And it would be weird if they had to define SGX to get the behavior they want, since since that platform wouldn't have anything to do with SGX.

So I think you want to conditionalize the code on a macro named something like MEMCPY_IS_SLOW or something like that, and we can set the MEMCPY_IS_SLOW macro if __SGX__ is set. Then later someone else can compile with MEMCPY_IS_SLOW or make their own change to the header to automatically turn on MEMCPY_IS_SLOW for their platform as appropriate.

Minor C language thing: identifiers starting with double underscores are reserved for use by the platform. So things like the compiler or the system headers can define idenfiers that start with double underscores. Since this is something that we're defining, it shouldn't use double underscores. The standard library also uses some identifiers with single underscores, so those should be avoided as well.

@Taowyoo
Copy link
Collaborator Author

Taowyoo commented Jun 28, 2023

HI @arai-fortanix ,
Yes, it seems the upstream code added the MBEDTLS_EFFICIENT_UNALIGNED_ACCESS and apply such check on mbedtls_xor but not other macros in alignments.h,

One possible solution here is to also add similar checks on MBEDTLS_EFFICIENT_UNALIGNED_ACCESS for family of MBEDTLS_PUT_UINT32_BE macros. And this changes are reasonable and could be up-streamed.

About conditional compiling, I have updated code to always keep these mbedtls_XX_unaligned_XX functions

@jethrogb
Copy link
Member

This PR doesn't make sense to me. The code is clearly intended to get optimized so that there are no actual calls to memcpy. If our generated code still has calls to memcpy, we need to fix how the code is compiled.

Having compiled both Linux and SGX in release mode:

$ objdump -j .text.mbedtls_internal_sha256_process -r target/x86_64-fortanix-unknown-sgx/release/build/mbedtls-sys-auto-f19242cb09f8b77a/out/build/library/CMakeFiles/mbedcrypto.dir/sha256.c.obj

target/x86_64-fortanix-unknown-sgx/release/build/mbedtls-sys-auto-f19242cb09f8b77a/out/build/library/CMakeFiles/mbedcrypto.dir/sha256.c.obj:     file format elf64-x86-64

RELOCATION RECORDS FOR [.text.mbedtls_internal_sha256_process]:
OFFSET           TYPE              VALUE 
0000000000000080 R_X86_64_PLT32    memcpy-0x0000000000000004
000000000000009a R_X86_64_PC32     .rodata.K-0x0000000000000004
0000000000000b1e R_X86_64_PLT32    mbedtls_platform_zeroize-0x0000000000000004
$ objdump -j .text.mbedtls_internal_sha256_process -r target/release/build/mbedtls-sys-auto-030256f45744a275/out/build/library/CMakeFiles/mbedcrypto.dir/sha256.c.o

target/release/build/mbedtls-sys-auto-030256f45744a275/out/build/library/CMakeFiles/mbedcrypto.dir/sha256.c.o:     file format elf64-x86-64

RELOCATION RECORDS FOR [.text.mbedtls_internal_sha256_process]:
OFFSET           TYPE              VALUE 
0000000000000082 R_X86_64_PC32     .rodata.K-0x0000000000000004
0000000000000b06 R_X86_64_PLT32    mbedtls_platform_zeroize-0x0000000000000004
0000000000000b32 R_X86_64_PLT32    __stack_chk_fail-0x0000000000000004

So it's clear that the compiler fails to properly optimize out the call to memcpy. The fix should be in the compilation, not the code.

@Taowyoo
Copy link
Collaborator Author

Taowyoo commented Jun 29, 2023

So the reason why memcpy is not been optimized is:
When compiling, if build script of mbedtls-sys detect the environment is freestanding , see mbedtls-sys/build/features.rs , it will add -fno-builtin to cflags.

Then with -fno-builtin , clang will Disable implicit builtin knowledge of functions which cause memcpy and other builtin functions not been optimized out.

If assuming -fno-builtin is necessary , then there are two solutions:

  1. We should define exactly which builtin functions should be disable one by one through -fno-builtin-<value> to ensure only necessary functions are not been optimized.

  2. According to https://man7.org/linux/man-pages/man1/gcc.1.html:

    With the -fno-builtin-function option only the built-in
    function function is disabled. function must not begin with
    _builtin. If a function is named that is not built-in in
    this version of GCC, this option is ignored. There is no
    corresponding -fbuiltin-function option; if you wish to
    enable built-in functions selectively when using -fno-builtin
    or -ffreestanding, you may define macros such as:

        #define abs(n)          __builtin_abs ((n))
        #define strcpy(d, s)    __builtin_strcpy ((d), (s))
    

    We need to add #define memcpy(dst, src, sz) __builtin_memcpy((dst), (src), (sz)) in mbedtls-sys/vendor/library/alignment.h after #include <string.h> to ensure all calls to memcpy in this file will be optimized.

The second solution is easy to do but need to change vendor code.

CC: @jethrogb @arai-fortanix

@Taowyoo Taowyoo force-pushed the yx/perf-fix branch 2 times, most recently from 7df0f58 to ab4fd4d Compare June 29, 2023 07:28
@jethrogb
Copy link
Member

The second solution is easy to do but need to change vendor code.

We can just add this as a separate include file that we pass to CMake, right? Or, similary, put it in the config.h.

@jethrogb
Copy link
Member

jethrogb commented Jun 29, 2023

Ideally we don't disable builtins that we provide in strcpy. But there's no way to instruct the compiler, "ignore all builtins, except these specific ones"?

@Taowyoo
Copy link
Collaborator Author

Taowyoo commented Jun 29, 2023

Hi @jethrogb ,

But there's no way to instruct the compiler, "ignore all builtins, except these specific ones"?

As far as I know after some search on internet, the only way is to add something like #define abs(n) __builtin_abs ((n)) in source code where you want to except some functions.

And as you can see this PR, I just updated, because of the limitation of C macro, this kind of C macros need to:

  1. Add after include header file, so add this in config.h will cause compile error.
  2. Such C macro cannot dealing with C code, if they call memcpy like this: memcpy(a, A_MACRO_EXPAND_TO_TWO_ARGUMENTS(b))

@arai-fortanix
Copy link

We definitely need to disable some builtins, since we don't have a complete standard C library when we're compiling for SGX, and some of the builtins assume C library specifics. I think there's a way around the macro expansion issue, but I'll need to look into some specifics before I can offer a solution to that.

@Taowyoo
Copy link
Collaborator Author

Taowyoo commented Jun 29, 2023

Hi @arai-fortanix , @jethrogb
Update to add some code in config.h and add one more cflag to ensure memcpy is always calling builtin functions.
See commit: 48a8d6a

@Taowyoo
Copy link
Collaborator Author

Taowyoo commented Jun 29, 2023

Github is break now, https://www.githubstatus.com/

@arai-fortanix
Copy link

I'm not thrilled by forcing an inclusion of string.h. But that makes me wonder, which string.h are we including, and are we including the appropriate support functions for that string.h in the build environment?

The reason I'm asking is that we had a bug in zircon where we were compiling code using glibc's ctype.h header, but we were linking in the MUSL code for ctype. This led to bugs where the glibc code in the ctype header was calling a MUSL function with the same name, but different semantics, which led to code not operating correctly. I want to make sure we don't have a similar problem here.

@Taowyoo
Copy link
Collaborator Author

Taowyoo commented Jun 29, 2023

Just fixed the CI.

@Taowyoo
Copy link
Collaborator Author

Taowyoo commented Jun 29, 2023

I'm not thrilled by forcing an inclusion of string.h.

Me too. But this is only way how we could avoid changing vendor code.

I am thinking, is it possible to have a CMakeList.txt on top of mbedtls 's CMakeList.txt to overrride some vendor code without touching the vendor folder

@Taowyoo Taowyoo enabled auto-merge June 29, 2023 20:23
@Taowyoo Taowyoo added this pull request to the merge queue Jun 29, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 29, 2023
@Taowyoo Taowyoo added this pull request to the merge queue Jun 29, 2023
Merged via the queue into master with commit f9fefe0 Jun 29, 2023
@Taowyoo Taowyoo deleted the yx/perf-fix branch June 29, 2023 21:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants