HTTPS performance
This document describes HTTPS performance analysis and benchmarks.
NOTE: since the TLS performance optimization work is still in progress, this page is under construction. Check our FOSDEM'21 talk for a demo and a performance comparison with Nginx/OpenSSL and Nginx/WolfSSL.
We used 3 servers, all connected with 10Gbps links:
- Device Under Test (DUT): Intel Xeon E3-1240v5 (4 cores, 8 ht), Mellanox MT26448 (10Gbps), 32GB RAM;
- Load generator 1: Intel Xeon E5-1650v3 (6 cores, 12 ht), Mellanox MT26448 (10Gbps), 64GB RAM;
- Load generator 2: Intel Xeon E5-2670 (8 cores, 16 ht), Mellanox MT26448 (10Gbps), 32GB RAM.
Both load generators play the same role; they could be replaced with a single server if it were powerful enough to load the DUT.
Software on the DUT:
- Debian 9 Stretch
- The web server is one of the following:
  - Linux kernel 4.16 (from the official debian-backports repository) with Nginx Mainline (from the official Nginx repository for Debian 9), widely regarded as the fastest userspace web server, and OpenSSL 1.1.0f;
  or
  - Tempesta's patched Linux kernel 4.14 with Tempesta FW.

Software on the load generators:
- Debian 9 Stretch
- Linux kernel 4.16 (from the official debian-backports repository)
- Slightly patched thc-ssl-dos utility for performing TLS DoS attacks: a single-line patch makes the tool perform a single TLS handshake per TCP connection, and another five-line patch allows choosing the cipher from the command line.
- Yandex Tank
- Wrk
Linux is configured in the same way for all the servers.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
sysctl -w fs.file-max=5000000
sysctl -w net.core.netdev_max_backlog=1000000
sysctl -w net.core.somaxconn=131072
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.default.rp_filter=0
sysctl -w net.ipv4.ip_local_port_range='1024 65535'
sysctl -w net.ipv4.tcp_congestion_control=highspeed
sysctl -w net.ipv4.tcp_ecn=0
sysctl -w net.ipv4.tcp_fastopen=1
sysctl -w net.ipv4.tcp_fin_timeout=10
sysctl -w net.ipv4.tcp_low_latency=1
sysctl -w net.ipv4.tcp_max_orphans=1000000
sysctl -w net.ipv4.tcp_max_syn_backlog=131072
sysctl -w net.ipv4.tcp_max_tw_buckets=2000000
sysctl -w net.ipv4.tcp_sack=0
sysctl -w net.ipv4.tcp_timestamps=0
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_window_scaling=1
sysctl -w vm.percpu_pagelist_fraction=8
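These settings don't survive a reboot. A minimal sketch to persist them, assuming a sysctl.d-based setup (the file name is arbitrary and only a few keys are shown):
cat > /etc/sysctl.d/90-tls-bench.conf <<'EOF'
fs.file-max = 5000000
net.core.somaxconn = 131072
net.ipv4.tcp_max_syn_backlog = 131072
EOF
sysctl --system    # reload all sysctl configuration files
# Note: the transparent_hugepage knob is not a sysctl; re-apply it at boot,
# e.g. with the transparent_hugepage=never kernel command-line parameter.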
Unfortunately, thc-ssl-dos doesn't support multithreading, so a simple script is used to spawn one process per CPU core:
#!/bin/bash
procs=`nproc`
for i in `seq 1 ${procs}`; do
    ../thc-tls-dos/src/thc-ssl-dos -l 200 \
        -c "ECDHE-%KEY%-AES256-GCM-SHA384" \
        --accept %DUT_IP% %DUT_PORT_TEST% > dos_log_${i} &
done
The -l 200 option sets 200 concurrent connections for each process. This is not a big number, but it's sufficient, since the tool closes the TCP connection immediately after the TLS handshake completes. The -c "ECDHE-%KEY%-AES256-GCM-SHA384" option sets the cipher used to establish the TLS connection; the %KEY% parameter is replaced by one of two values, RSA or ECDSA. The %DUT_IP% and %DUT_PORT_TEST% parameters are replaced with the DUT IP address and the web-server port, respectively.
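For example, an ECDSA run against a hypothetical DUT at 192.168.100.4, port 443 (standing in for %DUT_IP% and %DUT_PORT_TEST%), would look like:
../thc-tls-dos/src/thc-ssl-dos -l 200 \
    -c "ECDHE-ECDSA-AES256-GCM-SHA384" \
    --accept 192.168.100.4 443 > dos_log_1 &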
thc-ssl-dos prints statistics to the standard output stream every second; a simple script is used to collect the stats across all the processes:
#!/bin/bash
procs=`nproc`
rm -f dos_log_all
for i in `seq 1 ${procs}`; do
    # Take the second-to-last line of each per-process log.
    tail -n 2 dos_log_${i} | head -n 1 >> dos_log_all
done
res=`perl -ne '/\[(\d+)/ && print "$1 + ";' dos_log_all`
echo "Total handshakes per second: `perl -e "print $res 0;"`"
The following configuration is used to generate load with Yandex Tank:
phantom:
  address: %DUT_IP%:%DUT_PORT_TEST%
  ssl: true
  headers:
    - "[Connection: close]"
  uris:
    - /%URI%
  load_profile:
    load_type: rps
    schedule: const(6000, 10m)
console:
  enabled: true
telegraf:
  enabled: false
The %DUT_IP% and %DUT_PORT_TEST% parameters are replaced with the DUT IP address and the web-server port, respectively; %URI% is the requested URI. The Connection: close header is added to close the connection right after the response is sent.
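Assuming the configuration above is saved as load.yaml (the file name is arbitrary), the test is started with:
yandex-tank -c load.yaml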
Alternatively, wrk can be used:
wrk -c 1000 -t `nproc` -d 10m -H "Connection: close" https://%DUT_IP%:%DUT_PORT_TEST%/%URI%
The main difference between wrk and Yandex Tank is the constant-load feature: wrk tries to load the web server as fast as possible, while Yandex Tank generates a constant load, which gives more predictable results.
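If a constant request rate is desired with a wrk-style tool, the wrk2 fork (https://github.com/giltene/wrk2, whose binary is also named wrk and must be built separately) adds a -R option for a fixed request rate; a sketch with the same placeholders:
wrk -c 1000 -t `nproc` -d 10m -R 6000 -H "Connection: close" https://%DUT_IP%:%DUT_PORT_TEST%/%URI%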
The Nginx configuration is as follows:
pid /tmp/nginx_tls_test.pid;
worker_processes auto;

events {
    multi_accept on;
    worker_connections 65535;
    use epoll;
}

worker_rlimit_nofile 65535;

http {
    keepalive_timeout 65;
    keepalive_requests 100000;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;

    open_file_cache max=1000;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors off;

    # [ debug | info | notice | warn | error | crit | alert | emerg ]
    # Fully disable log messages.
    error_log /dev/null emerg;
    # Disable the access log altogether.
    access_log off;

    server {
        listen %DUT_PORT_N%;
        listen %DUT_PORT_N_SSL% ssl;
        location / {
            root /var/www/html/;
        }
    }

    # SSL configuration.
    ssl_certificate /tmp/tfw-root.crt;
    ssl_certificate_key /tmp/tfw-root.key;
    ssl_session_timeout 5m;
    # Disable old protocols.
    ssl_protocols TLSv1.2;
    # Use only modern ciphers.
    ssl_ciphers 'ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256';
    ssl_prefer_server_ciphers on;
    # Not used: OCSP stapling. It affects handshake time, but the benchmark
    # clients don't validate the certificate anyway.
    # Not used: ssl_session_cache, ssl_session_tickets. A new TCP connection
    # means a new handshake: simulate tons of unique clients to stress test
    # the handshakes.
    ssl_session_cache off;
    ssl_session_tickets off;
}
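Assuming the configuration is saved as /tmp/nginx_tls_test.conf (the path is arbitrary), it can be validated and started with:
nginx -t -c /tmp/nginx_tls_test.conf    # validate the configuration
nginx -c /tmp/nginx_tls_test.conf       # start with this configuration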
Tempesta FW configuration:
server 127.0.0.1:%DUT_PORT_N%;
vhost default {
    proxy_pass default;
}
cache 2;
cache_fulfill * *;
listen %DUT_PORT_T%;
listen %DUT_PORT_T_SSL% proto=https;
tls_certificate /tmp/tfw-root.crt;
tls_certificate_key /tmp/tfw-root.key;
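Tempesta FW is then started with the helper script from its source tree; a sketch, assuming the configuration above is saved as /tmp/tfw_test.conf:
# TFW_CFG_PATH points the start script at the benchmark configuration.
TFW_CFG_PATH=/tmp/tfw_test.conf ./scripts/tempesta.sh --start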
Resources on the servers:
- /0: a file with a single '0' character, the shortest response body possible for a GET request;
- /: a 20Kb text file, an ordinary HTML page.
The same certificates and keys are used for both Tempesta and Nginx. Since the clients don't check the certificate authority, self-signed certificates are used.
The certificates can be generated in the following way:
#!/bin/bash
SUBJ="/C=US/ST=Washington/L=Seattle/O=Tempesta Technologies Inc./OU=Testing/CN=tempesta-tech.com/emailAddress=info@tempesta-tech.com"
KEY_NAME="tfw-root.key"
CERT_NAME="tfw-root.crt"
echo Generating RSA key...
mkdir -p RSA
cd RSA
openssl req -new -days 365 -nodes -x509 \
-newkey rsa:2048 \
-subj "${SUBJ}" -keyout ${KEY_NAME} -out ${CERT_NAME}
cd ..
echo Generating ECDSA key...
mkdir -p ECDSA
cd ECDSA
openssl req -new -days 365 -nodes -x509 \
-newkey ec -pkeyopt ec_paramgen_curve:prime256v1 \
-subj "${SUBJ}" -keyout ${KEY_NAME} -out ${CERT_NAME}
cd ..
echo Done.
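The generated certificates can be sanity-checked with openssl before the run:
# Print the subject and validity period of each generated certificate.
openssl x509 -in RSA/tfw-root.crt -noout -subject -dates
openssl x509 -in ECDSA/tfw-root.crt -noout -subject -dates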
TLS performance is the only subject of the current research, so both servers are configured to serve requests from cache. The cache is populated by preliminary curl requests. Target scenario: a single TCP connection means a new HTTPS session from a new client, a full handshake is performed, and no session resumption is allowed. Thus the options normally used to optimize TLS handshake performance with known clients (SSL session cache, SSL tickets, and OCSP stapling) are disabled.
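A minimal way to populate the cache, given the resources above (-k skips verification of the self-signed certificate):
curl -k https://%DUT_IP%:%DUT_PORT_TEST%/0 > /dev/null
curl -k https://%DUT_IP%:%DUT_PORT_TEST%/ > /dev/null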
In all tests the perf report is analyzed to find possible performance bottlenecks.
Load generator: thc-ssl-dos. A client opens a connection to the server, performs a TLS handshake, and closes the connection. No data is sent over the TLS connection.
Expected behaviour: the DUT is overloaded with expensive TLS handshake operations, so the number of established TLS connections (handshakes) per second is relatively small. The same server with the same configuration can handle many more requests on active (established) connections.
Test aims: analyze performance issues during handshake handling; estimate the DoS throughput required to overload the server.
Load generator: wrk or Yandex Tank.
Expected behaviour: the server with in-place encryption and decryption has better throughput.
Test aims: analyze performance issues while serving persistent connections with multiple requests per connection; determine performance in responses-per-second terms.
We've achieved 15'000 handshakes/second on our servers using the ECDHE-ECDSA-AES256-GCM-SHA384 cipher; the curve chosen for both ECDHE and ECDSA is prime256v1 (secp256r1). The RSA authentication algorithm shows much lower performance: 4'900 handshakes/second.
Perf top:
Overhead Shared Object Symbol
9.11% libcrypto.so.1.1 [.] __ecp_nistz256_mul_montx
7.80% libc-2.24.so [.] _int_malloc
7.03% libcrypto.so.1.1 [.] __ecp_nistz256_sqr_montx
3.54% libcrypto.so.1.1 [.] sha512_block_data_order_avx2
3.05% libcrypto.so.1.1 [.] BN_div
2.43% libc-2.24.so [.] _int_free
1.89% libcrypto.so.1.1 [.] OPENSSL_cleanse
1.61% libc-2.24.so [.] malloc_consolidate
1.49% libcrypto.so.1.1 [.] ecp_nistz256_avx2_gather_w7
1.41% libc-2.24.so [.] malloc
1.24% libcrypto.so.1.1 [.] ecp_nistz256_point_doublex
1.20% libcrypto.so.1.1 [.] ecp_nistz256_ord_sqr_montx
1.01% libcrypto.so.1.1 [.] __ecp_nistz256_sub_fromx
1.00% libcrypto.so.1.1 [.] BN_lshift
0.87% libcrypto.so.1.1 [.] BN_num_bits_word
0.86% libcrypto.so.1.1 [.] bn_correct_top
0.84% libcrypto.so.1.1 [.] BN_CTX_get
0.81% libc-2.24.so [.] __memset_avx2_unaligned_erms
0.77% libc-2.24.so [.] free
0.74% libcrypto.so.1.1 [.] __ecp_nistz256_mul_by_2x
0.71% libcrypto.so.1.1 [.] BN_rshift
0.59% libcrypto.so.1.1 [.] BN_uadd
0.59% libcrypto.so.1.1 [.] int_bn_mod_inverse
0.54% libc-2.24.so [.] __memmove_avx_unaligned_erms
0.53% libcrypto.so.1.1 [.] aesni_ecb_encrypt
0.53% libcrypto.so.1.1 [.] BN_num_bits
0.52% [mlx4_core] [k] mlx4_eq_int
0.52% libcrypto.so.1.1 [.] ecp_nistz256_point_addx
0.51% libcrypto.so.1.1 [.] ecp_nistz256_point_add_affinex
0.51% [mlx4_en] [k] mlx4_en_process_rx_cq
0.50% libcrypto.so.1.1 [.] BN_set_word
0.47% libcrypto.so.1.1 [.] ecp_nistz256_sqr_mont
0.45% libcrypto.so.1.1 [.] bn_mul_words
0.44% libcrypto.so.1.1 [.] BN_CTX_end
0.40% libcrypto.so.1.1 [.] __ecp_nistz256_add_tox
0.40% libcrypto.so.1.1 [.] ecp_nistz256_avx2_gather_w5
0.39% libcrypto.so.1.1 [.] BN_mul
0.38% libcrypto.so.1.1 [.] EVP_MD_CTX_reset
0.38% libcrypto.so.1.1 [.] CRYPTO_zalloc
0.38% libcrypto.so.1.1 [.] ecp_nistz256_points_mul
0.38% libc-2.24.so [.] __memset_avx2_erms
0.36% libcrypto.so.1.1 [.] BN_CTX_start
0.35% libcrypto.so.1.1 [.] bn_wexpand
0.34% libcrypto.so.1.1 [.] bn_expand2
0.34% libcrypto.so.1.1 [.] bn_sub_words
0.33% libcrypto.so.1.1 [.] __ecp_nistz256_subx
0.30% libcrypto.so.1.1 [.] CRYPTO_free
0.28% libcrypto.so.1.1 [.] bn_add_words
0.28% libcrypto.so.1.1 [.] EVP_MD_CTX_copy_ex
0.28% libcrypto.so.1.1 [.] BN_rshift1
0.28% libcrypto.so.1.1 [.] CRYPTO_malloc
0.27% libcrypto.so.1.1 [.] SHA512_Final
0.27% libssl.so.1.1 [.] SSL3_RECORD_clear
0.25% libssl.so.1.1 [.] tls12_shared_sigalgs
0.25% libssl.so.1.1 [.] state_machine
0.25% libcrypto.so.1.1 [.] EVP_EncryptUpdate
Flamegraph:
TODO: a new revision of Tempesta TLS is to be tested, but it is not released yet.
- Concurrent connections: 16384
- File size: 8 bytes or 20Kb
- DUT load: 100%
For the 8-byte file: ~327'000 requests/sec
Perf top:
Overhead Shared Object Symbol
6.24% libc-2.24.so [.] _int_malloc
1.64% [kernel] [k] syscall_return_via_sysret
1.59% libssl.so.1.1 [.] ssl3_get_record
1.41% [mlx4_core] [k] mlx4_eq_int
1.38% [mlx4_en] [k] mlx4_en_process_rx_cq
1.30% [kernel] [k] __fget_light
1.23% [kernel] [k] tcp_recvmsg
0.99% libssl.so.1.1 [.] do_ssl3_write
0.84% [kernel] [k] sock_poll
0.84% libcrypto.so.1.1 [.] EVP_MD_CTX_md
0.83% libc-2.24.so [.] _int_free
0.81% [kernel] [k] tcp_ack
0.79% nginx [.] ngx_rbtree_insert_timer_value
0.79% nginx [.] ngx_vslprintf
0.77% libcrypto.so.1.1 [.] aesni_encrypt
0.77% [mlx4_en] [k] mlx4_en_xmit
0.77% [kernel] [k] copy_user_enhanced_fast_string
0.76% nginx [.] ngx_http_create_request
0.75% libc-2.24.so [.] __memset_avx2_unaligned_erms
0.71% nginx [.] ngx_ssl_send_chain
0.69% [kernel] [k] __inet_lookup_established
0.68% libssl.so.1.1 [.] ssl_read_internal
0.68% nginx [.] ngx_open_cached_file
0.68% nginx [.] ngx_epoll_process_events
0.67% [mlx4_en] [k] mlx4_en_process_tx_cq
0.65% libc-2.24.so [.] malloc_consolidate
0.65% nginx [.] ngx_http_header_filter
0.65% [kernel] [k] tcp_transmit_skb
0.64% libcrypto.so.1.1 [.] aes_gcm_cipher
0.64% nginx [.] ngx_http_keepalive_handler
0.63% [kernel] [k] tcp_write_xmit
0.63% nginx [.] ngx_http_parse_request_line
0.62% nginx [.] ngx_output_chain
0.61% [kernel] [k] native_irq_return_iret
0.61% [kernel] [k] copy_page_to_iter
0.60% [kernel] [k] _raw_spin_lock
0.58% libc-2.24.so [.] __memmove_avx_unaligned_erms
0.56% [kernel] [k] __qdisc_run
0.56% libssl.so.1.1 [.] tls1_enc
0.55% libcrypto.so.1.1 [.] bio_read_intern
0.55% libcrypto.so.1.1 [.] gcm_ghash_avx
0.54% [kernel] [k] __x86_indirect_thunk_rax
0.54% libcrypto.so.1.1 [.] ERR_clear_error
0.52% [kernel] [k] inet_recvmsg
0.51% [kernel] [k] tcp_sendmsg_locked
0.51% nginx [.] ngx_http_parse_header_line
0.50% nginx [.] ngx_reusable_connection
0.49% nginx [.] ngx_http_write_filter
0.49% [kernel] [k] fsnotify
0.47% libcrypto.so.1.1 [.] aes_gcm_ctrl
0.47% libcrypto.so.1.1 [.] EVP_CIPHER_CTX_cipher
0.45% [kernel] [k] __list_del_entry_valid
0.45% nginx [.] ngx_ssl_recv
0.43% libcrypto.so.1.1 [.] aesni_ctr32_encrypt_blocks
0.43% [kernel] [k] pfifo_fast_dequeue
0.43% [kernel] [k] rw_verify_area
Flamegraph:
For the 20Kb file: ~55'000 requests/sec
Perf top:
Overhead Shared Object Symbol
6.48% libcrypto.so.1.1 [.] _aesni_ctr32_ghash_6x
3.05% libc-2.24.so [.] __memmove_avx_unaligned_erms
2.42% [kernel] [k] copy_user_enhanced_fast_string
2.39% [mlx4_en] [k] mlx4_en_process_rx_cq
2.03% [kernel] [k] pfifo_fast_dequeue
1.89% [mlx4_core] [k] mlx4_eq_int
1.77% libc-2.24.so [.] _int_malloc
1.76% [kernel] [k] tcp_ack
1.68% [kernel] [k] skb_release_data
1.63% [kernel] [k] __inet_lookup_established
1.52% [kernel] [k] tcp_transmit_skb
1.29% [kernel] [k] tcp_wfree
1.16% [mlx4_en] [k] mlx4_en_xmit
1.11% [kernel] [k] tcp_write_xmit
1.08% [mlx4_en] [k] mlx4_en_process_tx_cq
1.08% [kernel] [k] kmem_cache_free
1.08% [kernel] [k] _raw_spin_lock
0.85% [kernel] [k] kfree
0.83% [kernel] [k] napi_consume_skb
0.80% [kernel] [k] skb_split
0.80% [kernel] [k] tcp_check_space
0.79% [kernel] [k] tcp_v4_rcv
0.76% [kernel] [k] native_irq_return_iret
0.76% [kernel] [k] mod_timer
0.71% [kernel] [k] memcpy_erms
0.67% [kernel] [k] __qdisc_run
0.64% [kernel] [k] __x86_indirect_thunk_rax
0.63% [kernel] [k] tcp_rcv_established
0.63% [kernel] [k] ip_queue_xmit
0.63% [kernel] [k] ___slab_alloc
0.61% [kernel] [k] netif_skb_features
0.59% [kernel] [k] __alloc_skb
0.55% [kernel] [k] tcp_md5_do_lookup
0.54% [mlx4_en] [k] mlx4_en_free_tx_desc
0.53% [kernel] [k] syscall_return_via_sysret
0.53% [kernel] [k] _raw_spin_lock_irqsave
0.52% [kernel] [k] tcp_init_tso_segs
0.51% libssl.so.1.1 [.] ssl3_get_record
0.50% [kernel] [k] tcp_v4_early_demux
0.49% [kernel] [k] ip_finish_output2
0.48% [kernel] [k] tcp_write_timer_handler
0.47% [kernel] [k] __fget_light
0.47% [kernel] [k] __kmalloc_node_track_caller
0.46% [kernel] [k] inet_gro_receive
0.44% [kernel] [k] __skb_clone
0.44% libssl.so.1.1 [.] do_ssl3_write
0.44% [kernel] [k] __slab_free
0.41% [kernel] [k] __list_del_entry_valid
0.41% [kernel] [k] __netdev_pick_tx
0.41% [kernel] [k] tcp_sendmsg_locked
0.39% [kernel] [k] __dev_queue_xmit
0.38% [kernel] [k] skb_clone
0.38% [kernel] [k] ipv4_dst_check
0.37% [kernel] [k] pfifo_fast_enqueue
0.36% [kernel] [k] ipv4_mtu
0.36% [kernel] [k] __netif_receive_skb_core
Flamegraph:
TODO: a new revision of Tempesta TLS is to be tested, but it is not released yet.