Alexander K edited this page Apr 17, 2022 · 7 revisions

The following document describes HTTPS performance analysis and benchmark.

NOTE: TLS performance optimization work is still in progress, so this page is under construction. Check our FOSDEM'21 talk for a demo and a performance comparison with Nginx/OpenSSL and Nginx/WolfSSL.

Hardware

We used 3 servers, all of them connected together with 10Gbps links:

  1. Device Under Test: Intel Xeon E3-1240v5 (4 cores, 8 ht), Mellanox MT26448 (10Gbps), 32GB RAM;

  2. Load generator 1: Intel Xeon E5-1650v3 (6 cores, 12 ht), Mellanox MT26448 (10Gbps), 64GB RAM;

  3. Load generator 2: Intel Xeon E5-2670 (8 cores, 16 ht), Mellanox MT26448 (10Gbps), 32GB RAM;

Both load generators play the same role and can be replaced with a single server if it is powerful enough to saturate the DUT.

Software

Device Under Test

  • Debian 9 Stretch

The web server stack is one of the following:

  • Linux kernel 4.16 (from the official debian-backports repository)
  • Nginx Mainline (from the official Nginx repository for Debian 9), widely regarded as the fastest userspace web server
  • OpenSSL 1.1.0f

or

  • Tempesta's patched Linux kernel 4.14
  • Tempesta FW

Load Generators

  • Debian 9 Stretch
  • Linux kernel 4.16 (from official debian-backports repository)
  • Slightly patched thc-ssl-dos utility for performing TLS DoS attacks. A single-line patch makes the tool perform exactly one TLS handshake per TCP connection; another five-line patch allows the cipher to be specified from the command line.
  • Yandex Tank
  • Wrk

Linux Configuration

Linux is configured in the same way for all the servers.

echo never > /sys/kernel/mm/transparent_hugepage/enabled
sysctl -w fs.file-max=5000000
sysctl -w net.core.netdev_max_backlog=1000000
sysctl -w net.core.somaxconn=131072
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.default.rp_filter=0
sysctl -w net.ipv4.ip_local_port_range='1024 65535'
sysctl -w net.ipv4.tcp_congestion_control=highspeed
sysctl -w net.ipv4.tcp_ecn=0
sysctl -w net.ipv4.tcp_fastopen=1
sysctl -w net.ipv4.tcp_fin_timeout=10
sysctl -w net.ipv4.tcp_low_latency=1
sysctl -w net.ipv4.tcp_max_orphans=1000000
sysctl -w net.ipv4.tcp_max_syn_backlog=131072
sysctl -w net.ipv4.tcp_max_tw_buckets=2000000
sysctl -w net.ipv4.tcp_sack=0
sysctl -w net.ipv4.tcp_timestamps=0
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_window_scaling=1
sysctl -w vm.percpu_pagelist_fraction=8
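The sysctl settings above are applied at runtime and are lost on reboot. A sketch of making them persistent via a sysctl drop-in file follows; the file name and staging path are assumptions, and only a few representative keys are shown:

```shell
#!/bin/bash
# Stage a sysctl drop-in file; installing it under /etc/sysctl.d/ and
# running `sysctl --system` (as root) makes the tuning persistent.
CONF="${CONF:-/tmp/90-tls-bench.conf}"
cat > "${CONF}" <<'EOF'
fs.file-max = 5000000
net.core.somaxconn = 131072
net.ipv4.tcp_max_syn_backlog = 131072
net.ipv4.tcp_congestion_control = highspeed
EOF
# install -m 0644 "${CONF}" /etc/sysctl.d/ && sysctl --system
```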

Load generation scripts

Unfortunately, thc-ssl-dos doesn't support multithreading, so a simple script is used to spawn one process per CPU core:

#!/bin/bash

procs=`nproc`
for i in `seq 1 ${procs}`;
	do
		../thc-tls-dos/src/thc-ssl-dos -l 200 		\
			-c "ECDHE-%KEY%-AES256-GCM-SHA384"	\
			--accept %DUT_IP% %DUT_PORT_TEST% > dos_log_${i} &
	done

The -l 200 option limits each process to 200 concurrent connections. Not a big number, but it's sufficient, since the tool closes the TCP connection immediately after the TLS handshake completes.

-c "ECDHE-%KEY%-AES256-GCM-SHA384" sets the cipher used to establish the TLS connection. The %KEY% parameter is replaced by one of two possible values: RSA or ECDSA.

%DUT_IP% and %DUT_PORT_TEST% parameters are replaced with the DUT IP address and web-server port respectively.
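The placeholder substitution itself can be scripted. The helper below is hypothetical (not part of the original tooling); it generates a one-line demo template if none exists so the sketch stays self-contained:

```shell
#!/bin/bash
# Hypothetical helper (not part of the original tooling): fill the %KEY%,
# %DUT_IP% and %DUT_PORT_TEST% placeholders before a test run.
KEY="${KEY:-ECDSA}"                      # RSA or ECDSA
DUT_IP="${DUT_IP:-192.168.0.10}"         # sample values, not from the testbed
DUT_PORT_TEST="${DUT_PORT_TEST:-443}"
TEMPLATE="${TEMPLATE:-run_dos.sh.in}"
OUT="${OUT:-run_dos.sh}"

# Create a demo template if none exists; in practice the template would
# hold the full load-generation script shown above.
[ -f "${TEMPLATE}" ] || echo 'thc-ssl-dos -l 200 -c "ECDHE-%KEY%-AES256-GCM-SHA384" --accept %DUT_IP% %DUT_PORT_TEST%' > "${TEMPLATE}"

sed -e "s/%KEY%/${KEY}/g" \
    -e "s/%DUT_IP%/${DUT_IP}/g" \
    -e "s/%DUT_PORT_TEST%/${DUT_PORT_TEST}/g" \
    "${TEMPLATE}" > "${OUT}"
```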

thc-ssl-dos prints statistics to the standard output stream every second; a simple script collects the stats across all processes:

#!/bin/bash

procs=`nproc`
rm dos_log_all
for i in `seq 1 ${procs}`;
do
	tail -n 2 dos_log_${i} | head -n 1 >> dos_log_all
done

res=`perl -ne '/\[(\d+)/ && print "$1 + ";' dos_log_all`
echo "Total handshakes per second: `perl -e "print $res 0;"`"
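Alternatively, the aggregation can be done in one pass without the intermediate dos_log_all file. A sketch assuming the same output format (the current handshake rate printed in square brackets on the second-to-last line of each log):

```shell
#!/bin/bash
# Sum the bracketed handshakes/second figures across all per-process logs.
sum_handshakes() {
    local total=0 f n
    for f in dos_log_*; do
        [ -f "$f" ] || continue
        # Same extraction as above: second-to-last line, number after '['.
        n=$(tail -n 2 "$f" | head -n 1 | grep -oE '\[[0-9]+' | head -n 1 | tr -d '[')
        total=$((total + ${n:-0}))
    done
    echo "${total}"
}
echo "Total handshakes per second: $(sum_handshakes)"
```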

The following configuration is used to generate load with Yandex Tank:

phantom:
  address: %DUT_IP%:%DUT_PORT_TEST%
  ssl: true
  headers:
    - "[Connection: close]"
  uris:
    - /%URI%
  load_profile:
    load_type: rps
    schedule: const(6000, 10m)
console:
  enabled: true
telegraf:
  enabled: false

%DUT_IP% and %DUT_PORT_TEST% are replaced with the DUT IP address and web server port respectively; %URI% is the requested URI. The Connection: close header is added to close the connection right after the response is sent.

Alternatively, wrk can be used:

wrk -c 1000 -t `nproc` -d 10m -H "Connection: close" https://%DUT_IP%:%DUT_PORT_TEST%/%URI%

The main difference between wrk and Yandex Tank is the constant-load feature. While wrk tries to load the web server as fast as possible, Yandex Tank generates a constant load, which can give more predictable results.

The Nginx configuration is as follows:

pid /tmp/nginx_tls_test.pid ;

worker_processes auto;

events {
    multi_accept on;
    worker_connections 65535;
    use epoll;
}
worker_rlimit_nofile 65535;

http {
    keepalive_timeout 65;
    keepalive_requests 100000;

    sendfile         on;
    tcp_nopush       on;
    tcp_nodelay      on;

    open_file_cache max=1000;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors off;

    # [ debug | info | notice | warn | error | crit | alert | emerg ]
    # Fully disable log messages.
    error_log /dev/null emerg;

    # Disable access log altogether.
    access_log off;

    server {
        listen %DUT_PORT_N%;
        listen %DUT_PORT_N_SSL% ssl;

        location / {
            root /var/www/html/ ;
        }
    }

    # SSL configuration.

    ssl_certificate /tmp/tfw-root.crt;
    ssl_certificate_key /tmp/tfw-root.key;

    ssl_session_timeout 5m;

    # Disable old protocols.
    ssl_protocols TLSv1.2;
    # Use only modern ciphers.
    ssl_ciphers 'ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256';
    ssl_prefer_server_ciphers on;

    # Not used: OCSP stapling. It affects handshake time, but the benchmark
    # clients don't validate the certificate anyway.

    # Not used: ssl_session_cache, ssl_session_tickets. New TCP connection
    # - new handshake.
    # Simulate tons of unique clients to stress test the handshakes.
    ssl_session_cache off;
    ssl_session_tickets off;
}

Tempesta FW configuration:

server 127.0.0.1:%DUT_PORT_N%;
vhost default {
    proxy_pass default;
}

cache 2;
cache_fulfill * *;

listen %DUT_PORT_T%;
listen  %DUT_PORT_T_SSL% proto=https;


tls_certificate /tmp/tfw-root.crt;
tls_certificate_key /tmp/tfw-root.key;

Resources on the servers:

  • /0 - a file with a single '0' character; the shortest possible response body for a GET request
  • / - a 20KB text file, an ordinary HTML page.
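These resources can be created with a short script. In this sketch the document root defaults to a local directory (on the DUT it would be /var/www/html, matching the Nginx root directive), and the 20KB file content is arbitrary text:

```shell
#!/bin/bash
# Create the two test bodies under the web server's document root.
DOCROOT="${DOCROOT:-./www}"    # /var/www/html on the DUT
mkdir -p "${DOCROOT}"
printf '0' > "${DOCROOT}/0"    # the shortest possible response body
# ~20KB of printable text standing in for an ordinary HTML page:
base64 /dev/urandom | head -c 20480 > "${DOCROOT}/index.html"
```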

TLS Certificates

The same certificates and keys are used for both Tempesta and Nginx. Since the clients don't check the authority of the certificate, self-signed certificates are used.

The certificates can be generated in the following way:

#!/bin/bash

SUBJ="/C=US/ST=Washington/L=Seattle/O=Tempesta Technologies Inc./OU=Testing/CN=tempesta-tech.com/emailAddress=info@tempesta-tech.com"
KEY_NAME="tfw-root.key"
CERT_NAME="tfw-root.crt"

echo Generating RSA key...

mkdir -p RSA
cd RSA
openssl req -new -days 365 -nodes -x509					\
	-newkey rsa:2048						\
	-subj "${SUBJ}" -keyout ${KEY_NAME} -out ${CERT_NAME}
cd ..

echo Generating ECDSA key...

mkdir -p ECDSA
cd ECDSA
openssl req -new -days 365 -nodes -x509					\
	-newkey ec -pkeyopt ec_paramgen_curve:prime256v1		\
	-subj "${SUBJ}" -keyout ${KEY_NAME} -out ${CERT_NAME}
cd ..

echo Done.
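A generated certificate can be sanity-checked with openssl x509. The self-contained check below regenerates an ECDSA pair the same way as above and confirms the curve (the temporary paths are assumptions):

```shell
#!/bin/bash
# Generate an ECDSA certificate as above, then inspect it:
openssl req -new -days 365 -nodes -x509 \
    -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 \
    -subj "/CN=tempesta-tech.com" \
    -keyout /tmp/check.key -out /tmp/check.crt 2>/dev/null
# The output should report an ECDSA signature and the prime256v1 curve:
openssl x509 -in /tmp/check.crt -noout -text | grep -E 'Signature Algorithm|ASN1 OID'
```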

Tests description

TLS performance is the only subject of the current research, so both servers are configured to serve requests from cache. The cache is populated by preliminary curl requests. Target scenario: each new TCP connection is a new HTTPS session from a new client, a full handshake is performed, and no session resumption is allowed. Thus the options normally used to optimize TLS handshake performance with known clients - the SSL session cache, SSL tickets and OCSP stapling - are disabled.

In all tests the perf report is analyzed to find possible performance bottlenecks.

TLS Handshakes DoS

Load generator: thc-ssl-dos. A client opens a connection to the server, performs a TLS handshake and closes the connection. No data is sent over the TLS connection.

Expected behaviour: the DUT is overloaded with expensive TLS handshake operations; the number of established TLS connections (handshakes) per second is relatively small. The same server with the same configuration can handle many more requests on active (established) connections.

Test aims: analyse performance issues during handshake handling; estimate the DoS throughput required to overload the server.

Continuous TLS load

Load generator: wrk or Yandex Tank.

Expected behaviour: the server with in-place encryption/decryption has better throughput.

Test aims: analyse performance issues while serving persistent connections with multiple requests per connection; determine performance in responses per second.

Results

TLS Handshakes DoS

Nginx

We achieved 15'000 handshakes/second on our servers using the ECDHE-ECDSA-AES256-GCM-SHA384 cipher, with prime256v1 (secp256r1) as the curve for both ECDHE and ECDSA. The RSA authentication algorithm shows much lower performance: 4'900 handshakes/second.

Perf top:

Overhead  Shared Object             Symbol
   9.11%  libcrypto.so.1.1          [.] __ecp_nistz256_mul_montx
   7.80%  libc-2.24.so              [.] _int_malloc
   7.03%  libcrypto.so.1.1          [.] __ecp_nistz256_sqr_montx
   3.54%  libcrypto.so.1.1          [.] sha512_block_data_order_avx2
   3.05%  libcrypto.so.1.1          [.] BN_div
   2.43%  libc-2.24.so              [.] _int_free
   1.89%  libcrypto.so.1.1          [.] OPENSSL_cleanse
   1.61%  libc-2.24.so              [.] malloc_consolidate
   1.49%  libcrypto.so.1.1          [.] ecp_nistz256_avx2_gather_w7
   1.41%  libc-2.24.so              [.] malloc
   1.24%  libcrypto.so.1.1          [.] ecp_nistz256_point_doublex
   1.20%  libcrypto.so.1.1          [.] ecp_nistz256_ord_sqr_montx
   1.01%  libcrypto.so.1.1          [.] __ecp_nistz256_sub_fromx
   1.00%  libcrypto.so.1.1          [.] BN_lshift
   0.87%  libcrypto.so.1.1          [.] BN_num_bits_word
   0.86%  libcrypto.so.1.1          [.] bn_correct_top
   0.84%  libcrypto.so.1.1          [.] BN_CTX_get
   0.81%  libc-2.24.so              [.] __memset_avx2_unaligned_erms
   0.77%  libc-2.24.so              [.] free
   0.74%  libcrypto.so.1.1          [.] __ecp_nistz256_mul_by_2x
   0.71%  libcrypto.so.1.1          [.] BN_rshift
   0.59%  libcrypto.so.1.1          [.] BN_uadd
   0.59%  libcrypto.so.1.1          [.] int_bn_mod_inverse
   0.54%  libc-2.24.so              [.] __memmove_avx_unaligned_erms
   0.53%  libcrypto.so.1.1          [.] aesni_ecb_encrypt
   0.53%  libcrypto.so.1.1          [.] BN_num_bits
   0.52%  [mlx4_core]               [k] mlx4_eq_int
   0.52%  libcrypto.so.1.1          [.] ecp_nistz256_point_addx
   0.51%  libcrypto.so.1.1          [.] ecp_nistz256_point_add_affinex
   0.51%  [mlx4_en]                 [k] mlx4_en_process_rx_cq
   0.50%  libcrypto.so.1.1          [.] BN_set_word
   0.47%  libcrypto.so.1.1          [.] ecp_nistz256_sqr_mont
   0.45%  libcrypto.so.1.1          [.] bn_mul_words
   0.44%  libcrypto.so.1.1          [.] BN_CTX_end
   0.40%  libcrypto.so.1.1          [.] __ecp_nistz256_add_tox
   0.40%  libcrypto.so.1.1          [.] ecp_nistz256_avx2_gather_w5
   0.39%  libcrypto.so.1.1          [.] BN_mul
   0.38%  libcrypto.so.1.1          [.] EVP_MD_CTX_reset
   0.38%  libcrypto.so.1.1          [.] CRYPTO_zalloc
   0.38%  libcrypto.so.1.1          [.] ecp_nistz256_points_mul
   0.38%  libc-2.24.so              [.] __memset_avx2_erms
   0.36%  libcrypto.so.1.1          [.] BN_CTX_start
   0.35%  libcrypto.so.1.1          [.] bn_wexpand
   0.34%  libcrypto.so.1.1          [.] bn_expand2
   0.34%  libcrypto.so.1.1          [.] bn_sub_words
   0.33%  libcrypto.so.1.1          [.] __ecp_nistz256_subx
   0.30%  libcrypto.so.1.1          [.] CRYPTO_free
   0.28%  libcrypto.so.1.1          [.] bn_add_words
   0.28%  libcrypto.so.1.1          [.] EVP_MD_CTX_copy_ex
   0.28%  libcrypto.so.1.1          [.] BN_rshift1
   0.28%  libcrypto.so.1.1          [.] CRYPTO_malloc
   0.27%  libcrypto.so.1.1          [.] SHA512_Final
   0.27%  libssl.so.1.1             [.] SSL3_RECORD_clear
   0.25%  libssl.so.1.1             [.] tls12_shared_sigalgs
   0.25%  libssl.so.1.1             [.] state_machine
   0.25%  libcrypto.so.1.1          [.] EVP_EncryptUpdate

Flamegraph:

Nginx and OpenSSL Handshakes handling

Full SVG download

Tempesta FW

TODO: a new revision of Tempesta TLS is to be tested, but it is not released yet.

Continuous TLS load

  • Concurrent connections: 16384
  • File size: 8 bytes or 20KB
  • DUT load: 100%

Nginx

For the 8-byte file: ~327'000 requests/sec

Perf top:

Overhead  Shared Object             Symbol
   6.24%  libc-2.24.so              [.] _int_malloc
   1.64%  [kernel]                  [k] syscall_return_via_sysret
   1.59%  libssl.so.1.1             [.] ssl3_get_record
   1.41%  [mlx4_core]               [k] mlx4_eq_int
   1.38%  [mlx4_en]                 [k] mlx4_en_process_rx_cq
   1.30%  [kernel]                  [k] __fget_light
   1.23%  [kernel]                  [k] tcp_recvmsg
   0.99%  libssl.so.1.1             [.] do_ssl3_write
   0.84%  [kernel]                  [k] sock_poll
   0.84%  libcrypto.so.1.1          [.] EVP_MD_CTX_md
   0.83%  libc-2.24.so              [.] _int_free
   0.81%  [kernel]                  [k] tcp_ack
   0.79%  nginx                     [.] ngx_rbtree_insert_timer_value
   0.79%  nginx                     [.] ngx_vslprintf
   0.77%  libcrypto.so.1.1          [.] aesni_encrypt
   0.77%  [mlx4_en]                 [k] mlx4_en_xmit
   0.77%  [kernel]                  [k] copy_user_enhanced_fast_string
   0.76%  nginx                     [.] ngx_http_create_request
   0.75%  libc-2.24.so              [.] __memset_avx2_unaligned_erms
   0.71%  nginx                     [.] ngx_ssl_send_chain
   0.69%  [kernel]                  [k] __inet_lookup_established
   0.68%  libssl.so.1.1             [.] ssl_read_internal
   0.68%  nginx                     [.] ngx_open_cached_file
   0.68%  nginx                     [.] ngx_epoll_process_events
   0.67%  [mlx4_en]                 [k] mlx4_en_process_tx_cq
   0.65%  libc-2.24.so              [.] malloc_consolidate
   0.65%  nginx                     [.] ngx_http_header_filter
   0.65%  [kernel]                  [k] tcp_transmit_skb
   0.64%  libcrypto.so.1.1          [.] aes_gcm_cipher
   0.64%  nginx                     [.] ngx_http_keepalive_handler
   0.63%  [kernel]                  [k] tcp_write_xmit
   0.63%  nginx                     [.] ngx_http_parse_request_line
   0.62%  nginx                     [.] ngx_output_chain
   0.61%  [kernel]                  [k] native_irq_return_iret
   0.61%  [kernel]                  [k] copy_page_to_iter
   0.60%  [kernel]                  [k] _raw_spin_lock
   0.58%  libc-2.24.so              [.] __memmove_avx_unaligned_erms
   0.56%  [kernel]                  [k] __qdisc_run
   0.56%  libssl.so.1.1             [.] tls1_enc
   0.55%  libcrypto.so.1.1          [.] bio_read_intern
   0.55%  libcrypto.so.1.1          [.] gcm_ghash_avx
   0.54%  [kernel]                  [k] __x86_indirect_thunk_rax
   0.54%  libcrypto.so.1.1          [.] ERR_clear_error
   0.52%  [kernel]                  [k] inet_recvmsg
   0.51%  [kernel]                  [k] tcp_sendmsg_locked
   0.51%  nginx                     [.] ngx_http_parse_header_line
   0.50%  nginx                     [.] ngx_reusable_connection
   0.49%  nginx                     [.] ngx_http_write_filter
   0.49%  [kernel]                  [k] fsnotify
   0.47%  libcrypto.so.1.1          [.] aes_gcm_ctrl
   0.47%  libcrypto.so.1.1          [.] EVP_CIPHER_CTX_cipher
   0.45%  [kernel]                  [k] __list_del_entry_valid
   0.45%  nginx                     [.] ngx_ssl_recv
   0.43%  libcrypto.so.1.1          [.] aesni_ctr32_encrypt_blocks
   0.43%  [kernel]                  [k] pfifo_fast_dequeue
   0.43%  [kernel]                  [k] rw_verify_area

Flamegraph:

Nginx and OpenSSL 8b response handling

Full SVG download

For the 20KB file: ~55'000 requests/sec

Perf top:

Overhead  Shared Object             Symbol
   6.48%  libcrypto.so.1.1          [.] _aesni_ctr32_ghash_6x
   3.05%  libc-2.24.so              [.] __memmove_avx_unaligned_erms
   2.42%  [kernel]                  [k] copy_user_enhanced_fast_string
   2.39%  [mlx4_en]                 [k] mlx4_en_process_rx_cq
   2.03%  [kernel]                  [k] pfifo_fast_dequeue
   1.89%  [mlx4_core]               [k] mlx4_eq_int
   1.77%  libc-2.24.so              [.] _int_malloc
   1.76%  [kernel]                  [k] tcp_ack
   1.68%  [kernel]                  [k] skb_release_data
   1.63%  [kernel]                  [k] __inet_lookup_established
   1.52%  [kernel]                  [k] tcp_transmit_skb
   1.29%  [kernel]                  [k] tcp_wfree
   1.16%  [mlx4_en]                 [k] mlx4_en_xmit
   1.11%  [kernel]                  [k] tcp_write_xmit
   1.08%  [mlx4_en]                 [k] mlx4_en_process_tx_cq
   1.08%  [kernel]                  [k] kmem_cache_free
   1.08%  [kernel]                  [k] _raw_spin_lock
   0.85%  [kernel]                  [k] kfree
   0.83%  [kernel]                  [k] napi_consume_skb
   0.80%  [kernel]                  [k] skb_split
   0.80%  [kernel]                  [k] tcp_check_space
   0.79%  [kernel]                  [k] tcp_v4_rcv
   0.76%  [kernel]                  [k] native_irq_return_iret
   0.76%  [kernel]                  [k] mod_timer
   0.71%  [kernel]                  [k] memcpy_erms
   0.67%  [kernel]                  [k] __qdisc_run
   0.64%  [kernel]                  [k] __x86_indirect_thunk_rax
   0.63%  [kernel]                  [k] tcp_rcv_established
   0.63%  [kernel]                  [k] ip_queue_xmit
   0.63%  [kernel]                  [k] ___slab_alloc
   0.61%  [kernel]                  [k] netif_skb_features
   0.59%  [kernel]                  [k] __alloc_skb
   0.55%  [kernel]                  [k] tcp_md5_do_lookup
   0.54%  [mlx4_en]                 [k] mlx4_en_free_tx_desc
   0.53%  [kernel]                  [k] syscall_return_via_sysret
   0.53%  [kernel]                  [k] _raw_spin_lock_irqsave
   0.52%  [kernel]                  [k] tcp_init_tso_segs
   0.51%  libssl.so.1.1             [.] ssl3_get_record
   0.50%  [kernel]                  [k] tcp_v4_early_demux
   0.49%  [kernel]                  [k] ip_finish_output2
   0.48%  [kernel]                  [k] tcp_write_timer_handler
   0.47%  [kernel]                  [k] __fget_light
   0.47%  [kernel]                  [k] __kmalloc_node_track_caller
   0.46%  [kernel]                  [k] inet_gro_receive
   0.44%  [kernel]                  [k] __skb_clone
   0.44%  libssl.so.1.1             [.] do_ssl3_write
   0.44%  [kernel]                  [k] __slab_free
   0.41%  [kernel]                  [k] __list_del_entry_valid
   0.41%  [kernel]                  [k] __netdev_pick_tx
   0.41%  [kernel]                  [k] tcp_sendmsg_locked
   0.39%  [kernel]                  [k] __dev_queue_xmit
   0.38%  [kernel]                  [k] skb_clone
   0.38%  [kernel]                  [k] ipv4_dst_check
   0.37%  [kernel]                  [k] pfifo_fast_enqueue
   0.36%  [kernel]                  [k] ipv4_mtu
   0.36%  [kernel]                  [k] __netif_receive_skb_core

Flamegraph:

Nginx and OpenSSL 20kb response handling

Full SVG download

Tempesta FW

TODO: a new revision of Tempesta TLS is to be tested, but it is not released yet.
