Fix performance issue introduced by vendor code #288

Taowyoo · 2023-06-28T02:49:34Z

Performance issue about unaligned memory access

In mbedtls 3.4.0, upstream refactor the code of accessing unaligned address from using bit calculation to use pointer + memcpy + clang's bitswap.

This is fine in no-sgx environment, but become very slow in fortanix SGX environment.
I think the most possible reason here is call of memcpy.

This PR mainly replace all these new functions with old implementation in https://github.com/Mbed-TLS/mbedtls/blob/v3.3.0/library/common.h

Performance tests result:
Command used to run bench:
cargo +stable bench --no-default-features --features dsa,force_aesni_support,mpi_force_c_code,rdrand,std,time --target=x86_64-fortanix-unknown-sgx

Branch for testing: `yx/mbedtls-9_bench` at e51d4d17591817922928b3f23b6a85de4c3b2501 : 
Benchmarking pbkdf2_hmac
Benchmarking pbkdf2_hmac: Warming up for 3.0000 s
Benchmarking pbkdf2_hmac: Collecting 100 samples in estimated 7.6594 s (200 iterations)
Benchmarking pbkdf2_hmac: Analyzing
pbkdf2_hmac             time:   [38.324 ms 38.332 ms 38.339 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Branch for testing: `yx/mbedtls-11_bench` at 489cad5a6d8dc1042329de5aaf9d0a801a7ba802 : 
Benchmarking pbkdf2_hmac
Benchmarking pbkdf2_hmac: Warming up for 3.0000 s
Benchmarking pbkdf2_hmac: Collecting 100 samples in estimated 14.664 s (100 iterations)
Benchmarking pbkdf2_hmac: Analyzing
pbkdf2_hmac             time:   [146.82 ms 146.86 ms 146.91 ms]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

After this PR, the performance number back to be as same as yx/mbedtls-9_bench one

Performance issue about `explicit_bzero` (debug mode only)

When providing the explicit_bzero function through rust, there is a big performance downgrade in SGX

For example, when running the pbkdf :
Command: cargo +stable nextest run --test pbkdf --no-default-features --features dsa,force_aesni_support,mpi_force_c_code,rdrand,std,time --target=x86_64-fortanix-unknown-sgx

# debug
PASS [   7.689s] mbedtls::pbkdf test_pbkdf2
# release
PASS [   0.233s] mbedtls::pbkdf test_pbkdf2

After ensure C mbedtls side to not call explicit_bzero, the time reduce to

# debug
PASS [   1.785s] mbedtls::pbkdf test_pbkdf2
# release
PASS [   0.234s] mbedtls::pbkdf test_pbkdf2

This performance issue should be not related to usage of Zeroize crate, because above tests results are reproduced by following minimal implementation of explicit_bzero:

#[no_mangle]
pub unsafe extern "C" fn explicit_bzero(buf: *mut std::ffi::c_void, len: alloc::size_t) {
    // TODO: use `volatile_set_memory` when stabilized
    for i in 0..len {
        let ptr = (buf as *mut u8).add(i);
        core::ptr::write_volatile(ptr, 0u8);
    }
}

Taowyoo · 2023-06-28T17:02:14Z

Maybe I need to export a C define from mbedtls-sys, such as __SGX__. And use this to conditional compile the change above

xinyufort

LGTM, but I don't know if changing the underlying implementation has any nontrivial repercussions...

mbedtls-sys/build/build.rs

arai-fortanix · 2023-06-28T22:11:55Z

I haven't looked at this in full detail yet, but one thing I'm not thrilled about is directly conditioning the code on __SGX__ being defined. The existing code seems to be pretty good at defining macros for features/properties like MBEDTLS_EFFICIENT_UNALIGNED_ACCESS and then conditionalizing the code on those macros. The feature macros get set based on things like the defines for the CPU architecture, so there's a level of indirection there.

This is a nice thing to do because at some point in the future there might be some other non-SGX platform that wants to use these alternatives for memory access. And it would be weird if they had to define SGX to get the behavior they want, since since that platform wouldn't have anything to do with SGX.

So I think you want to conditionalize the code on a macro named something like MEMCPY_IS_SLOW or something like that, and we can set the MEMCPY_IS_SLOW macro if __SGX__ is set. Then later someone else can compile with MEMCPY_IS_SLOW or make their own change to the header to automatically turn on MEMCPY_IS_SLOW for their platform as appropriate.

Minor C language thing: identifiers starting with double underscores are reserved for use by the platform. So things like the compiler or the system headers can define idenfiers that start with double underscores. Since this is something that we're defining, it shouldn't use double underscores. The standard library also uses some identifiers with single underscores, so those should be avoided as well.

Taowyoo · 2023-06-28T23:25:55Z

HI @arai-fortanix ,
Yes, it seems the upstream code added the MBEDTLS_EFFICIENT_UNALIGNED_ACCESS and apply such check on mbedtls_xor but not other macros in alignments.h,

One possible solution here is to also add similar checks on MBEDTLS_EFFICIENT_UNALIGNED_ACCESS for family of MBEDTLS_PUT_UINT32_BE macros. And this changes are reasonable and could be up-streamed.

About conditional compiling, I have updated code to always keep these mbedtls_XX_unaligned_XX functions

jethrogb · 2023-06-28T23:53:08Z

This PR doesn't make sense to me. The code is clearly intended to get optimized so that there are no actual calls to memcpy. If our generated code still has calls to memcpy, we need to fix how the code is compiled.

Having compiled both Linux and SGX in release mode:

$ objdump -j .text.mbedtls_internal_sha256_process -r target/x86_64-fortanix-unknown-sgx/release/build/mbedtls-sys-auto-f19242cb09f8b77a/out/build/library/CMakeFiles/mbedcrypto.dir/sha256.c.obj

target/x86_64-fortanix-unknown-sgx/release/build/mbedtls-sys-auto-f19242cb09f8b77a/out/build/library/CMakeFiles/mbedcrypto.dir/sha256.c.obj:     file format elf64-x86-64

RELOCATION RECORDS FOR [.text.mbedtls_internal_sha256_process]:
OFFSET           TYPE              VALUE 
0000000000000080 R_X86_64_PLT32    memcpy-0x0000000000000004
000000000000009a R_X86_64_PC32     .rodata.K-0x0000000000000004
0000000000000b1e R_X86_64_PLT32    mbedtls_platform_zeroize-0x0000000000000004

$ objdump -j .text.mbedtls_internal_sha256_process -r target/release/build/mbedtls-sys-auto-030256f45744a275/out/build/library/CMakeFiles/mbedcrypto.dir/sha256.c.o

target/release/build/mbedtls-sys-auto-030256f45744a275/out/build/library/CMakeFiles/mbedcrypto.dir/sha256.c.o:     file format elf64-x86-64

RELOCATION RECORDS FOR [.text.mbedtls_internal_sha256_process]:
OFFSET           TYPE              VALUE 
0000000000000082 R_X86_64_PC32     .rodata.K-0x0000000000000004
0000000000000b06 R_X86_64_PLT32    mbedtls_platform_zeroize-0x0000000000000004
0000000000000b32 R_X86_64_PLT32    __stack_chk_fail-0x0000000000000004

So it's clear that the compiler fails to properly optimize out the call to memcpy. The fix should be in the compilation, not the code.

Taowyoo · 2023-06-29T05:39:40Z

So the reason why memcpy is not been optimized is:
When compiling, if build script of mbedtls-sys detect the environment is freestanding , see mbedtls-sys/build/features.rs , it will add -fno-builtin to cflags.

Then with -fno-builtin , clang will Disable implicit builtin knowledge of functions which cause memcpy and other builtin functions not been optimized out.

If assuming -fno-builtin is necessary , then there are two solutions:

We should define exactly which builtin functions should be disable one by one through -fno-builtin-<value> to ensure only necessary functions are not been optimized.
According to https://man7.org/linux/man-pages/man1/gcc.1.html:
With the -fno-builtin-function option only the built-in
function function is disabled. function must not begin with
_builtin. If a function is named that is not built-in in
this version of GCC, this option is ignored. There is no
corresponding -fbuiltin-function option; if you wish to
enable built-in functions selectively when using -fno-builtin
or -ffreestanding, you may define macros such as:
```
    #define abs(n)          __builtin_abs ((n))
    #define strcpy(d, s)    __builtin_strcpy ((d), (s))
```
We need to add #define memcpy(dst, src, sz) __builtin_memcpy((dst), (src), (sz)) in mbedtls-sys/vendor/library/alignment.h after #include <string.h> to ensure all calls to memcpy in this file will be optimized.

The second solution is easy to do but need to change vendor code.

CC: @jethrogb @arai-fortanix

jethrogb · 2023-06-29T14:10:19Z

The second solution is easy to do but need to change vendor code.

We can just add this as a separate include file that we pass to CMake, right? Or, similary, put it in the config.h.

jethrogb · 2023-06-29T14:11:19Z

Ideally we don't disable builtins that we provide in strcpy. But there's no way to instruct the compiler, "ignore all builtins, except these specific ones"?

Taowyoo · 2023-06-29T16:15:04Z

Hi @jethrogb ,

But there's no way to instruct the compiler, "ignore all builtins, except these specific ones"?

As far as I know after some search on internet, the only way is to add something like #define abs(n) __builtin_abs ((n)) in source code where you want to except some functions.

And as you can see this PR, I just updated, because of the limitation of C macro, this kind of C macros need to:

Add after include header file, so add this in config.h will cause compile error.
Such C macro cannot dealing with C code, if they call memcpy like this: memcpy(a, A_MACRO_EXPAND_TO_TWO_ARGUMENTS(b))

arai-fortanix · 2023-06-29T17:35:22Z

We definitely need to disable some builtins, since we don't have a complete standard C library when we're compiling for SGX, and some of the builtins assume C library specifics. I think there's a way around the macro expansion issue, but I'll need to look into some specifics before I can offer a solution to that.

Taowyoo · 2023-06-29T17:58:00Z

Hi @arai-fortanix , @jethrogb
Update to add some code in config.h and add one more cflag to ensure memcpy is always calling builtin functions.
See commit: 48a8d6a

Taowyoo · 2023-06-29T18:14:15Z

Github is break now, https://www.githubstatus.com/

arai-fortanix · 2023-06-29T19:01:31Z

I'm not thrilled by forcing an inclusion of string.h. But that makes me wonder, which string.h are we including, and are we including the appropriate support functions for that string.h in the build environment?

The reason I'm asking is that we had a bug in zircon where we were compiling code using glibc's ctype.h header, but we were linking in the MUSL code for ctype. This led to bugs where the glibc code in the ctype header was calling a MUSL function with the same name, but different semantics, which led to code not operating correctly. I want to make sure we don't have a similar problem here.

Taowyoo · 2023-06-29T19:18:09Z

Just fixed the CI.

Taowyoo · 2023-06-29T19:23:09Z

I'm not thrilled by forcing an inclusion of string.h.

Me too. But this is only way how we could avoid changing vendor code.

I am thinking, is it possible to have a CMakeList.txt on top of mbedtls 's CMakeList.txt to overrride some vendor code without touching the vendor folder

Taowyoo requested review from jethrogb, MihirLuthra, raoulstrackx, arai-fortanix and xinyufort June 28, 2023 02:49

Taowyoo self-assigned this Jun 28, 2023

Taowyoo force-pushed the yx/perf-fix branch 2 times, most recently from 9342154 to 7da3fd3 Compare June 28, 2023 16:22

Taowyoo force-pushed the yx/perf-fix branch 2 times, most recently from 2977bc8 to 09720b4 Compare June 28, 2023 19:23

xinyufort reviewed Jun 28, 2023

View reviewed changes

mbedtls-sys/build/build.rs Outdated Show resolved Hide resolved

Taowyoo force-pushed the yx/perf-fix branch from 09720b4 to 81297a4 Compare June 28, 2023 22:42

Taowyoo force-pushed the yx/perf-fix branch 2 times, most recently from 7df0f58 to ab4fd4d Compare June 29, 2023 07:28

Taowyoo force-pushed the yx/perf-fix branch from ab4fd4d to 48a8d6a Compare June 29, 2023 17:55

arai-fortanix approved these changes Jun 29, 2023

View reviewed changes

Taowyoo enabled auto-merge June 29, 2023 20:23

Taowyoo added 3 commits June 29, 2023 13:35

refactor: update bench test to same as test_pbkdf2

6ba19ce

build: ensure memcpy is optimzed in SGX

6a377f1

build: bump version

cc68a0d

Taowyoo force-pushed the yx/perf-fix branch from 4f14375 to cc68a0d Compare June 29, 2023 20:35

Taowyoo mentioned this pull request Jun 29, 2023

Totally fix performance issue related to cflag -f-no-builtin #290

Closed

Taowyoo added this pull request to the merge queue Jun 29, 2023

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 29, 2023

Taowyoo added this pull request to the merge queue Jun 29, 2023

Merged via the queue into master with commit f9fefe0 Jun 29, 2023

Taowyoo deleted the yx/perf-fix branch June 29, 2023 21:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix performance issue introduced by vendor code #288

Fix performance issue introduced by vendor code #288

Taowyoo commented Jun 28, 2023 •

edited

Loading

Taowyoo commented Jun 28, 2023 •

edited

Loading

xinyufort left a comment

arai-fortanix commented Jun 28, 2023

Taowyoo commented Jun 28, 2023 •

edited

Loading

jethrogb commented Jun 28, 2023

Taowyoo commented Jun 29, 2023 •

edited

Loading

jethrogb commented Jun 29, 2023

jethrogb commented Jun 29, 2023 •

edited

Loading

Taowyoo commented Jun 29, 2023 •

edited

Loading

arai-fortanix commented Jun 29, 2023

Taowyoo commented Jun 29, 2023

Taowyoo commented Jun 29, 2023 •

edited

Loading

arai-fortanix commented Jun 29, 2023

Taowyoo commented Jun 29, 2023

Taowyoo commented Jun 29, 2023

Fix performance issue introduced by vendor code #288

Fix performance issue introduced by vendor code #288

Conversation

Taowyoo commented Jun 28, 2023 • edited Loading

Performance issue about unaligned memory access

Performance issue about explicit_bzero (debug mode only)

Taowyoo commented Jun 28, 2023 • edited Loading

xinyufort left a comment

Choose a reason for hiding this comment

arai-fortanix commented Jun 28, 2023

Taowyoo commented Jun 28, 2023 • edited Loading

jethrogb commented Jun 28, 2023

Taowyoo commented Jun 29, 2023 • edited Loading

jethrogb commented Jun 29, 2023

jethrogb commented Jun 29, 2023 • edited Loading

Taowyoo commented Jun 29, 2023 • edited Loading

arai-fortanix commented Jun 29, 2023

Taowyoo commented Jun 29, 2023

Taowyoo commented Jun 29, 2023 • edited Loading

arai-fortanix commented Jun 29, 2023

Taowyoo commented Jun 29, 2023

Taowyoo commented Jun 29, 2023

Taowyoo commented Jun 28, 2023 •

edited

Loading

Performance issue about `explicit_bzero` (debug mode only)

Taowyoo commented Jun 28, 2023 •

edited

Loading

Taowyoo commented Jun 28, 2023 •

edited

Loading

Taowyoo commented Jun 29, 2023 •

edited

Loading

jethrogb commented Jun 29, 2023 •

edited

Loading

Taowyoo commented Jun 29, 2023 •

edited

Loading

Taowyoo commented Jun 29, 2023 •

edited

Loading