TLS: further performance improvements and cleanups #1064

Open
10 of 14 tasks
krizhanovsky opened this issue Sep 9, 2018 · 6 comments
Labels: enhancement, performance, TLS

krizhanovsky commented Sep 9, 2018

Changes for #614 have grown significantly, so the following tasks are moved out of the #614 scope:

  • Introduce MPI crypto profiles to do only one memory allocation, one copy (instead of a bunch of MPI initializations), and one zeroing per handshake.
  • bignum, ECDSA, (EC)DHE assembly implementations and optimizations
  • Cleanup list of supported curves and algorithms (see and update tls/test_tls_cert.py if necessary).
  • remove dummy_headers and replace GCC SIMD intrinsics with assembly code
  • adjust code according to RFC 8422 which obsoletes RFC 4492
  • remove ifdef (w/ or w/o content) for TTLS_PK_PARSE_EC_EXTENDED, SECG SEC 1.
  • test and probably replace the standard memcpy() calls made from skb_copy_*() with fast_memcpy(). Other standard functions that use memcpy(), memset() or memcmp() can probably be accelerated as well.
  • Evaluate the Linux crypto API memory allocations and reduce all the extra allocations
  • Reuse Karatsuba precomputations for AES in the same TLS connection, TLS performance characterization on modern x86 CPUs (moved to "Crypto extensions and performance", #1335).
  • Security sensitive branching must be analyzed and protected against Meltdown/Spectre vulnerabilities (e.g. array offsets by a secret).
  • Check all the places requiring explicit memory zeroing, verify that memset()s aren't optimized out
  • Measure the current performance of TLS handshakes in web cache and pure LB/proxy modes; update "Redesign of TCP synchronous sending and data caching" (#391) and related issues
  • Update https://github.com/tempesta-tech/tempesta/wiki/Tempesta-TLS with supported algorithms, ciphersuites and curves.
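
The task about memset()s being optimized out can be illustrated with a minimal sketch. It assumes GCC or Clang and uses a hypothetical helper name, `zero_secret()`; it is not Tempesta's `ttls_bzero_safe()`. The idea is the same one the kernel's `memzero_explicit()` relies on: follow the memset() with a compiler barrier so the store cannot be treated as dead.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper for illustration only.
 * A plain memset() on a buffer that is never read again may be removed
 * by the optimizer as a dead store. The empty asm statement below takes
 * the pointer as an input and clobbers "memory", so the compiler has to
 * assume the zeroed bytes are observed and must keep the memset().
 */
void zero_secret(void *buf, size_t len)
{
    memset(buf, 0, len);
    __asm__ __volatile__("" : : "r"(buf) : "memory");
}
```

Checking the generated assembly (for example with `objdump -d`) for the zeroing store is still the only way to confirm that a particular call site survived optimization.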

Not part of this issue, but the Linux crypto API also has performance issues, so our effort to get zero-copy TLS is held back by the underlying API. E.g. gcmaes_encrypt() in most cases goes through the kmalloc() path with 2(!) data copies. See the comment #1064 (comment): by fixing this Linux crypto API issue we can improve the performance of large data transfers.

Testing

The new crypto routines must be unit tested (see mbedtls/crypto/tests/suites/).

@krizhanovsky krizhanovsky added this to the 0.7 HTTP/2 milestone Sep 9, 2018
@krizhanovsky krizhanovsky self-assigned this Sep 9, 2018
krizhanovsky added a commit that referenced this issue Jan 26, 2019
firstly in DEFINE_TLS_TEST()->kernel_fpu_begin() and secondly in
ttls_ecp_group_free()->ttls_bzero_safe()->kernel_fpu_begin().

The fix moves all the TLS unit tests to test_tls.c from tls/ and
make each test responsible for calling kernel_fpu_{begin,end}().
The crypto routines can be split into 2 groups: called from process
context of Tempesta FW initialization and called in run-time, softirq
context. Only the second group must be called with saved FPU context.
In fact, current crypto routines (covered by the test) don't use SIMD
much and this is going to change in #1064.
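
The (truncated) commit message above appears to describe kernel_fpu_begin() being entered twice on one call path. A userspace sketch of the rule the fix adopts, where the caller owns the FPU section and the crypto routine never opens its own, might look like this. The two `kernel_fpu_*` functions here are stand-in stubs, and `crypto_routine()` and `test_crypto_routine()` are hypothetical names, not the real kernel or Tempesta code.

```c
#include <assert.h>

/* Userspace stand-ins for the kernel API, for illustration only. */
static int fpu_depth;

static void kernel_fpu_begin(void)
{
    assert(fpu_depth == 0);   /* a second, nested begin would trip here */
    fpu_depth++;
}

static void kernel_fpu_end(void)
{
    assert(fpu_depth == 1);
    fpu_depth--;
}

/* Hypothetical run-time crypto routine: relies on the caller's FPU section. */
static int crypto_routine(void)
{
    assert(fpu_depth == 1);
    return 42;                /* placeholder result */
}

/* Each test brackets its own body, as the commit message describes. */
int test_crypto_routine(void)
{
    int r;

    kernel_fpu_begin();
    r = crypto_routine();
    kernel_fpu_end();
    return r;
}
```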
@krizhanovsky

Current profile under

# ./src/thc-ssl-dos --accept -c ECDHE-ECDSA-AES128-GCM-SHA256 -l 1000 127.0.0.1 443

with configuration

listen 443 proto=https;
server 127.0.0.1:8080;

cache 1;
cache_fulfill * *;

tls_certificate /root/tempesta/etc/tfw-root.crt;
tls_certificate_key /root/tempesta/etc/tfw-root.key;
# wrk sends IP address in SNI, so we test the option here.
tls_match_any_server_name;

    12.12%  [kernel]        [k] memset_erms
    10.69%  [tempesta_tls]  [k] ecp_mod_p256
     7.10%  [kernel]        [k] __kmalloc
     5.58%  [kernel]        [k] kfree
     5.08%  [kernel]        [k] memcpy_erms
     4.99%  [tempesta_tls]  [k] mpi_mul_hlp
     4.48%  [tempesta_tls]  [k] ttls_mpi_copy
     4.44%  [tempesta_tls]  [k] ttls_mpi_cmp_abs
     3.91%  [tempesta_tls]  [k] ttls_mpi_sub_abs
     3.68%  [tempesta_tls]  [k] ttls_mpi_cmp_mpi
     3.25%  [tempesta_tls]  [k] mpi_sub_hlp
     2.71%  [tempesta_tls]  [k] ttls_mpi_free
     2.62%  [tempesta_tls]  [k] ttls_mpi_shift_r
     2.55%  [tempesta_tls]  [k] ttls_mpi_mul_mpi
     2.52%  [kernel]        [k] ___cache_free
     1.81%  [tempesta_tls]  [k] ttls_mpi_bitlen
     1.69%  [tempesta_tls]  [k] ttls_mpi_grow.part.0
     1.21%  [tempesta_tls]  [k] ttls_mpi_shift_l
     0.88%  [tempesta_tls]  [k] ttls_mpi_add_abs
     0.88%  [tempesta_tls]  [k] ecp_modp
     0.83%  [tempesta_tls]  [k] ttls_mpi_lset

At least 15x better handshake performance is required.

@krizhanovsky

With 8177b43, memcpy(), memset() and the allocation routines are gone from the profile, leaving only math at the top:

    16.27%  [tempesta_tls]  [k] ecp_mod_p256
     7.70%  [tempesta_tls]  [k] __mpi_mul
     7.27%  [tempesta_tls]  [k] __mpi_sub
     5.39%  [tempesta_tls]  [k] ttls_mpi_sub_abs
     4.92%  [tempesta_tls]  [k] ttls_mpi_shift_r
     3.92%  [tempesta_tls]  [k] ttls_mpi_mul_mpi
     3.61%  [tempesta_tls]  [k] ttls_mpi_cmp_abs
     2.07%  [tempesta_tls]  [k] ttls_mpi_safe_cond_assign
     1.76%  [tempesta_tls]  [k] ttls_mpi_cmp_mpi
     1.48%  [tempesta_tls]  [k] ttls_mpi_add_abs
     1.38%  [tempesta_tls]  [k] ttls_mpi_sub_mpi
     1.36%  [tempesta_tls]  [k] __mpi_alloc
     1.22%  [tempesta_tls]  [k] ttls_mpi_shift_l
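
As a rough illustration of how allocation, copy and zeroing costs can collapse into single calls, here is a sketch of the "MPI crypto profile" idea from the task list: all big numbers for a handshake live in one block that is allocated once, initialized by one copy from a template, and wiped once. Every name and size here (`mpi_profile_t`, `N_MPI`, `N_LIMBS`, `handshake_ctx_new()`) is an assumption for illustration, and malloc()/free() stand in for kernel allocators; this is not the code from 8177b43.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define N_MPI   4   /* hypothetical: big numbers needed per handshake */
#define N_LIMBS 8   /* hypothetical: 64-bit limbs reserved for each */

/* All per-handshake big numbers packed into one contiguous block. */
typedef struct {
    uint64_t limbs[N_MPI][N_LIMBS];
} mpi_profile_t;

/* Filled once at start-up, e.g. with preloaded curve constants. */
static mpi_profile_t profile_template;

/* One allocation and one copy per handshake. */
mpi_profile_t *handshake_ctx_new(void)
{
    mpi_profile_t *ctx = malloc(sizeof(*ctx));

    if (ctx)
        memcpy(ctx, &profile_template, sizeof(*ctx));
    return ctx;
}

/* One zeroing pass and one free per handshake. */
void handshake_ctx_free(mpi_profile_t *ctx)
{
    if (!ctx)
        return;
    memset(ctx, 0, sizeof(*ctx));
    /* Compiler barrier (GCC/Clang) so the wipe is not dropped before free(). */
    __asm__ __volatile__("" : : "r"(ctx) : "memory");
    free(ctx);
}
```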

@krizhanovsky

Current perf profile with the FIPS algorithm for modulo reduction implemented in assembly:

     8.57%  [tempesta_tls]  [k] ecp_mod_p256_x86_64
     4.55%  [tempesta_tls]  [k] ttls_mpi_shift_r
     4.04%  [tempesta_tls]  [k] ttls_mpi_sub_abs
     2.63%  [tempesta_tls]  [k] ttls_mpi_safe_cond_assign
     2.45%  [tempesta_tls]  [k] ttls_mpi_sub_mpi
     2.30%  [tempesta_tls]  [k] ttls_mpi_inv_mod
     2.28%  [tempesta_tls]  [k] ttls_mpi_cmp_mpi
     2.22%  [tempesta_tls]  [k] mpi_mul_x86_64_4
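
For readers unfamiliar with the reduction being profiled, the following is a portable C sketch of the well-known NIST fast reduction modulo the P-256 prime, p = 2^256 - 2^224 + 2^192 + 2^96 - 1, assuming that is the "FIPS algorithm" meant above. It is not the assembly behind `ecp_mod_p256_x86_64`: the function names are hypothetical, it works on 32-bit words, and its final correction loops are not constant-time.

```c
#include <stdint.h>

/* p = 2^256 - 2^224 + 2^192 + 2^96 - 1, least significant word first. */
static const uint32_t P256[8] = {
    0xffffffffu, 0xffffffffu, 0xffffffffu, 0x00000000u,
    0x00000000u, 0x00000000u, 0x00000001u, 0xffffffffu
};

/* Terms s1..s9 of the NIST formula: S[k][i] is the index of the input
 * word that lands in output word i of term k, or -1 for a zero word. */
static const signed char S[9][8] = {
    {  0,  1,  2,  3,  4,  5,  6,  7 },
    { -1, -1, -1, 11, 12, 13, 14, 15 },
    { -1, -1, -1, 12, 13, 14, 15, -1 },
    {  8,  9, 10, -1, -1, -1, 14, 15 },
    {  9, 10, 11, 13, 14, 15, 13,  8 },
    { 11, 12, 13, -1, -1, -1,  8, 10 },
    { 12, 13, 14, 15, -1, -1,  9, 11 },
    { 13, 14, 15,  8,  9, 10, -1, 12 },
    { 14, 15, -1,  9, 10, 11, -1, 13 }
};
/* result = s1 + 2*s2 + 2*s3 + s4 + s5 - s6 - s7 - s8 - s9 (mod p) */
static const int COEF[9] = { 1, 2, 2, 1, 1, -1, -1, -1, -1 };

/* r += sign * p; returns the carry (+1) or borrow (-1) out of the top word. */
static int add_p(uint32_t r[8], int sign)
{
    int64_t carry = 0;

    for (int i = 0; i < 8; i++) {
        int64_t t = (int64_t)r[i] + sign * (int64_t)P256[i] + carry;

        r[i] = (uint32_t)((uint64_t)t & 0xffffffffu);
        carry = (t < 0) ? -1 : (t >> 32);
    }
    return (int)carry;
}

/* Returns 1 if r >= p. Not constant-time. */
static int ge_p(const uint32_t r[8])
{
    for (int i = 7; i >= 0; i--)
        if (r[i] != P256[i])
            return r[i] > P256[i];
    return 1;
}

/* Reduce the 512-bit value c (least significant word first) modulo p. */
void p256_reduce(uint32_t r[8], const uint32_t c[16])
{
    int64_t carry = 0;

    for (int i = 0; i < 8; i++) {
        int64_t acc = carry;

        for (int k = 0; k < 9; k++)
            if (S[k][i] >= 0)
                acc += COEF[k] * (int64_t)c[S[k][i]];
        r[i] = (uint32_t)((uint64_t)acc & 0xffffffffu);
        /* floor(acc / 2^32) without right-shifting a negative value */
        carry = (acc >= 0) ? (acc >> 32)
                           : -((-acc + INT64_C(0xffffffff)) >> 32);
    }
    /* Now c == carry * 2^256 + r (mod p) with a small signed carry. */
    while (carry < 0)
        carry += add_p(r, +1);
    while (carry > 0 || ge_p(r))
        carry += add_p(r, -1);
}
```

The appeal of this form is that reduction needs only word additions and subtractions, with no division or multiplication, which is what makes it a natural target for a hand-written assembly routine.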

@krizhanovsky krizhanovsky added the TLS Tempesta TLS module and related issues label Apr 27, 2020
krizhanovsky added a commit that referenced this issue Apr 28, 2020
krizhanovsky commented Aug 10, 2020

With #1405 we outperform Nginx/OpenSSL by about 50%. Tested against a 1-CPU KVM virtual machine, with the https://github.com/tempesta-tech/tls-perf benchmark run from the host as

$ ./tls-perf -l 1000 -t 2 -T 10 192.168.100.4 443

Nginx/OpenSSL:

 TOTAL:           SECONDS 9; HANDSHAKES 13849
 HANDSHAKES/sec:  MAX 1955; AVG 1410; 95P 993; MIN 993
 LATENCY (ms):    MIN 334; AVG 1323; 95P 1445; MAX 1546

Tempesta FW:

 TOTAL:           SECONDS 10; HANDSHAKES 25781
 HANDSHAKES/sec:  MAX 3038; AVG 2575; 95P 2092; MIN 2092
 LATENCY (ms):    MIN 118; AVG 310; 95P 577; MAX 2155
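
To make the comparison concrete, the relative gain can be computed directly from the two reports. The helper below is a hypothetical one-liner, shown only to make the arithmetic explicit.

```c
/* Relative gain of `ours` over `baseline`, in percent. */
double gain_percent(double baseline, double ours)
{
    return (ours - baseline) / baseline * 100.0;
}
```

With the figures above, `gain_percent(1955, 3038)` is roughly 55% for peak handshakes/sec, while `gain_percent(1410, 2575)` is roughly 83% for the average, so the "about 50%" estimate is closest to the peak figure and is conservative with respect to the average.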

krizhanovsky commented Aug 11, 2020

Profiled web cache performance through TLS against 19KB data (index.html of tempesta-tech.com):

    11.70%  [kernel]        [k] _encrypt_by_8_new8
     5.21%  [kernel]        [k] scatterwalk_copychunks
     2.38%  [kernel]        [k] skb_release_data
     1.54%  [kernel]        [k] get_page_from_freelist
     1.47%  [kernel]        [k] free_hot_cold_page
     1.46%  [kernel]        [k] __alloc_skb
     1.45%  [kernel]        [k] __kmalloc
     1.31%  [tempesta_fw]   [k] tfw_tls_encrypt
     1.26%  [kernel]        [k] aesni_gcm_precomp_avx_gen2

Pure proxying of a 2-byte file doesn't expose any copying issues (note that the backend Apache HTTPD was also running on the same VM as Tempesta FW):

    11.39%  [kernel]               [k] irq_entries_start
     1.30%  [kernel]               [k] thread_group_cputime
     1.29%  [tempesta_fw]          [k] tfw_http_parse_resp
     0.91%  [kernel]               [k] syscall_return_via_sysret
     0.86%  libapr-1.so.0.6.5      [.] apr_palloc
     0.77%  libc-2.28.so           [.] __strlen_avx2
     0.68%  [kernel]               [k] kmem_cache_alloc
     0.66%  libc-2.28.so           [.] __memmove_avx_unaligned_erms
     0.64%  [tempesta_fw]          [k] __parse_http_date
     0.59%  libpcre.so.3.13.3      [.] pcre_exec
     0.57%  [unknown]              [k] 0xfffffe000000601e
     0.53%  [tempesta_fw]          [k] tfw_http_parse_req

krizhanovsky added a commit that referenced this issue Jan 1, 2021
krizhanovsky added a commit that referenced this issue Jan 1, 2021
krizhanovsky commented Jan 18, 2021

Still in progress. Current benchmarks against Nginx with OpenSSL and WolfSSL (1-CPU VM, tls-perf running from the host):

Nginx-1.14.2/OpenSSL-1.1.1d

$ ./tls-perf -l 1000 -t 2 -T 10 -q 192.168.100.4 8443
( All peers are active, start to gather statistics )
========================================
 TOTAL:           SECONDS 9; HANDSHAKES 16405
 HANDSHAKES/sec:  MAX 1957; AVG 1696; 95P 1064; MIN 1064
 LATENCY (ms):    MIN 1050; AVG 1336; 95P 1465; MAX 2080

Nginx-1.17.8/WolfSSL

$ ./tls-perf -l 1000 -t 2 -T 10 -q 192.168.100.4 9443
( All peers are active, start to gather statistics )
========================================
 TOTAL:           SECONDS 9; HANDSHAKES 21902
 HANDSHAKES/sec:  MAX 3319; AVG 2240; 95P 1830; MIN 1830
 LATENCY (ms):    MIN 440; AVG 707; 95P 910; MAX 1641

Tempesta TLS

$ ./tls-perf -l 1000 -t 2 -T 10 -q 192.168.100.4 443
( All peers are active, start to gather statistics )
========================================
 TOTAL:           SECONDS 10; HANDSHAKES 33404
 HANDSHAKES/sec:  MAX 3488; AVG 3337; 95P 3092; MIN 3092
 LATENCY (ms):    MIN 76; AVG 241; 95P 656; MAX 1209
