
HTTP transactions performance


In this test a load generator establishes a new TCP connection with the HTTP server for each HTTP request, i.e. all the HTTP requests carry the header

Connection: close
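
For reference, each request sent by the load generator looks roughly like this (a minimal sketch; the exact header set depends on the benchmarking tool, and the Host value here is the SUT address used in the VM-to-VM tests below):

    GET / HTTP/1.1
    Host: 192.168.200.80
    Connection: close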

The main purpose of the test is to put significant load on the Linux TCP/IP stack and check how the stack and Tempesta FW scale. In this test case we run Tempesta FW and Nginx inside a virtual machine.

While the Linux TCP/IP stack does scale, it is a real challenge to get an appropriate hardware setup that delivers low enough overhead for a small-packet workload.

Virtualization

A single step of the benchmark involves processing only 2 HTTP messages (a request and a response, respectively), while, generally speaking, there are also 3 TCP connection handshake segments, 4 connection closing segments, and 2 ACKnowledgements of the data segments (though the TCP/IP stack coalesces some of the segments). The main property of the benchmark is processing many small packets.
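
A rough per-transaction packet flow (before any coalescing) looks like this; the exact picture depends on delayed ACKs and offload settings:

    client -> server : SYN              (handshake 1/3)
    server -> client : SYN+ACK          (handshake 2/3)
    client -> server : ACK              (handshake 3/3)
    client -> server : HTTP request     (data)
    server -> client : ACK              (data ACK 1/2)
    server -> client : HTTP response    (data)
    client -> server : ACK              (data ACK 2/2)
    server -> client : FIN              (close 1/4)
    client -> server : ACK              (close 2/4)
    client -> server : FIN              (close 3/4)
    server -> client : ACK              (close 4/4)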

While even the most basic VM setup can easily deliver 10Gbps throughput, many small packets are a well-known problem for modern virtualization solutions. See the Hardware virtualization performance wiki page for recommendations on how to efficiently set up a virtual machine for such workloads.

The misbeliefs

There is a misbelief that the Linux kernel TCP/IP stack does not scale, sometimes supposedly even beyond as few as 4 CPUs. For example, watch this video from F5 or see the F-Stack benchmarks, which claim that Nginx gains only 40% when going from 1 CPU to 12!

There are several issues with those benchmarks worth discussing:

Our main concern was the absence of data about the test environment and, in general, the inability to reproduce the benchmarks. On this page we do our best to provide as much data as possible to make the results reproducible. We would appreciate it if you file an issue in case you are unable to reproduce the results.

Testing environment

The hardware

We used two servers. Server 1:

  • Intel Xeon CPU E3-1240v5 (4 cores, 8 hyperthreads)
  • 32GB RAM
  • Mellanox ConnectX-2 Ethernet 10Gbps network adapter
  • Debian 9.12 (Linux 4.19.0-0.bpo.8-amd64)

Server 2:

  • Intel Xeon CPU E3-1240v5 (4 cores, 8 hyperthreads)
  • 32GB RAM
  • Mellanox ConnectX-2 Ethernet 10Gbps network adapter
  • Ubuntu 16.04.6 LTS (Linux 4.4.0-97-generic)

sysctl settings

In all the tests we used the same sysctl settings for the SUT (system under test). We used hardware and VM-based load generators, which also used similar sysctl settings.

    sysctl -w net.ipv4.tcp_max_tw_buckets=32
    sysctl -w net.ipv4.tcp_max_orphans=32
    sysctl -w net.ipv4.tcp_tw_reuse=1
    sysctl -w net.ipv4.tcp_fin_timeout=1

Since the hardware load generator has Linux 4.4, we also used

sysctl -w net.ipv4.tcp_tw_recycle=1

on it. These settings are required to release sockets faster. Otherwise too many TIME-WAIT sockets are produced and the system (either server or client) spends a lot of time looping in __inet_check_established() (see more in the bug report Poor __inet_check_established() implementation).
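
On an untuned system the effect is easy to observe during the test with the standard ss tool, e.g.:

    # count TIME-WAIT sockets
    ss -tan state time-wait | wc -l
    # or look at the per-state summary
    ss -s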

In addition to these settings, Tempesta FW's start script applies the following sysctl settings:

# Tempesta builds socket buffers by itself, don't cork TCP segments.
sysctl -w net.ipv4.tcp_autocorking=0 >/dev/null
# Softirqs are doing more work, so increase the input queues.
sysctl -w net.core.netdev_max_backlog=10000 >/dev/null
sysctl -w net.core.somaxconn=131072 >/dev/null
sysctl -w net.ipv4.tcp_max_syn_backlog=131072 >/dev/null

Since these sysctl settings are system-wide, Nginx also benefits from them.
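
The applied values can be double-checked after Tempesta FW is started, e.g.:

    sysctl net.ipv4.tcp_autocorking net.core.netdev_max_backlog \
           net.core.somaxconn net.ipv4.tcp_max_syn_backlog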

The system under test (SUT) VM

In all the tests we used the same VM running on the 1st server with Debian 9.12 and the Tempesta kernel 4.14.32-tfw (we also tried Nginx with the native Debian kernel and it showed no performance difference).

The load generator

The load generator is either a 4 vCPU VM with Debian 9.12 (Linux kernel 4.19.0) running on the same server 1 (VM-to-VM tests), or the separate hardware server 2 with Ubuntu 16.04.6 LTS (Linux 4.4.0-97-generic) for the Hardware-to-VM tests.

Benchmark tool

The Apache HTTP server benchmarking tool, ab, is still a single-threaded tool, which is not suitable for benchmarking high performance multi-process/multi-thread HTTP servers. Moreover, ab -n 100000 -c 10000 can't efficiently handle 10K and more connections, so Nginx and Tempesta FW stay underutilized.

For the non-keepalive test we used wrk with the -H 'Connection: close' command line option.
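
All the wrk runs below follow the same shape; only the connection count, thread count and target address vary per test:

    wrk --latency -H 'Connection: close' -c <connections> -d 30 -t <threads> http://<SUT address>/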

Nginx

In all the tests we used the same Nginx 1.18.0 configuration. It is essentially the stock configuration with several well-known performance tuning options applied, as verified by the Nginx development team:

user www-data;
worker_processes auto;
worker_cpu_affinity auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;

events {
    worker_connections   65536;
    use epoll;
    multi_accept on;
    accept_mutex off;
}
worker_rlimit_nofile    1000000;

http {
    keepalive_timeout 600;
    keepalive_requests 10000000;
    sendfile         on;
    tcp_nopush       on;
    tcp_nodelay      on;

    open_file_cache max=1000 inactive=3600s;
    open_file_cache_valid 3600s;
    open_file_cache_min_uses 2;
    open_file_cache_errors off;

    error_log /dev/null emerg;
    access_log off;

    server {
        listen 9090 backlog=131072 deferred reuseport fastopen=4096;

        location / {
            root /var/www/html;
        }
    }
}

The data files:

# ls -l /var/www/html/
total 4
-rw-r--r-- 1 root root 600 Jun 10 15:54 index.html

We use the same file size of 600 bytes as F-Stack uses in their most impressive test against Nginx and the Linux TCP/IP stack.
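
The content of the file doesn't matter for the test; a 600-byte index.html can be recreated with, for example:

    # create a 600-byte static file for Nginx to serve
    head -c 600 /dev/zero | tr '\0' 'a' > /var/www/html/index.html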

Tempesta FW

We used the current master version (commit f6946bdefc016944216e297d94e216baab84bf98, which includes an HTTP/2 performance regression and performs about 20% worse than a normal Tempesta FW build). Configuration file:

listen 80;

srv_group default {
    server 127.0.0.1:9090;
}
vhost default {
    proxy_pass default;
}

cache 1;
cache_fulfill * *;

http_chain {
    -> default;
}

Tempesta FW fetches the data file from Nginx and stores it in its cache.
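
Whether the response is served from the cache can be spot-checked from a client with curl (the address here is the SUT address used in the VM-to-VM tests below):

    # the first request populates the cache, the following ones are served from it
    curl -si http://192.168.200.80/ | head -n 10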

VM-to-VM on the same host

Two virtual KVM machines were deployed on the 1st server.

    +----------------------------------------------------+
    |                    [Server 1]                      |
    |                    v                               |
    | +--------------+   i                               |
    | |   [SUT VM]   |   r      +----------------------+ |
    | |              |   t      | [Load generation VM] | |
    | |   Nginx -----*-- i      |                      | |
    | |     ^        |   o <=== *         wrk          | |
    | |     |        |   -      |                      | |
    | | Tempesta FW -*-- n      +----------------------+ |
    | +--------------+   e                               |
    |                    t                               |
    +----------------------------------------------------+

The number of virtual CPUs for them was changed between the tests using the libvirt interface, e.g. for 4 CPUs:

    <cputune>
      <vcpupin vcpu='0' cpuset='0'/>
      <vcpupin vcpu='1' cpuset='1'/>
      <vcpupin vcpu='2' cpuset='2'/>
      <vcpupin vcpu='3' cpuset='3'/>
    </cputune>
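
The same vCPU count and pinning changes can also be made with standard virsh commands instead of editing the domain XML; a sketch for the SUT domain (TempestaPerfTest, see the ps output below):

    # give the domain 4 vCPUs and pin them 1:1 to host CPUs 0-3
    virsh setvcpus TempestaPerfTest 4 --config
    for i in 0 1 2 3; do
        virsh vcpupin TempestaPerfTest $i $i --config
    done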

Both the VMs use virtio-net NICs:

# ethtool -i ens2|grep driver
driver: virtio_net

The libvirt configuration:

    <interface type='network'>
      <mac address='52:54:00:ea:4b:97'/>
      <source network='routed'/>
      <model type='virtio'/>
      <driver name='vhost' queues='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>
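
Inside the guest the multiqueue setup can be verified with ethtool (ens2 is the guest interface name from the ethtool output above); by default virtio-net may keep only one queue active:

    # show the configured and currently active channels
    ethtool -l ens2
    # enable all 4 combined queues to match the vCPU count
    ethtool -L ens2 combined 4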

To be more precise, the machines run with the following options (Debian9 is the load generator and TempestaPerfTest is the system under test):

# ps -waef|grep 'Debian9\|TempestaPerfTest'
libvirt+  3545     1 13 17:07 ?        00:42:02 qemu-system-x86_64 -enable-kvm -name guest=Debian9,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-12-Debian9/master-key.aes -machine pc-i440fx-2.8,accel=kvm,usb=off,dump-guest-core=off -cpu Skylake-Client,ss=on,hypervisor=on,tsc_adjust=on,clflushopt=on,umip=on,xsaves=on,pdpe1gb=on -m 2048 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid bca6ec61-316d-464e-8fc4-b67054bb26ba -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=29,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x3.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x3 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x3.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x3.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=/var/lib/libvirt/images/tempesta-perf-test-clone.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -fsdev local,security_model=passthrough,id=fsdev-fs0,path=/opt/tempesta/tempesta-vm -device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=tempesta,bus=pci.0,addr=0x7 -netdev tap,fds=31:32:33:34,id=hostnet0,vhost=on,vhostfds=35:36:37:38 -device virtio-net-pci,mq=on,vectors=10,netdev=hostnet0,id=net0,mac=52:54:00:ea:4b:97,bus=pci.0,addr=0x2 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -add-fd set=9,fd=40 -chardev file,id=charserial1,path=/dev/fdset/9,append=on -device isa-serial,chardev=charserial1,id=serial1 -chardev socket,id=charchannel0,fd=39,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
libvirt+  4131     1 18 20:13 ?        00:24:17 qemu-system-x86_64 -enable-kvm -name guest=TempestaPerfTest,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-14-TempestaPerfTest/master-key.aes -machine pc-i440fx-2.8,accel=kvm,usb=off,dump-guest-core=off -cpu Skylake-Client,ss=on,hypervisor=on,tsc_adjust=on,clflushopt=on,umip=on,xsaves=on,pdpe1gb=on -m 2048 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid ead2372f-af65-47d2-9d1a-e83e286c159f -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=28,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x3.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x3 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x3.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x3.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=/var/lib/libvirt/images/tempesta-perf-test.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -fsdev local,security_model=passthrough,id=fsdev-fs0,path=/opt/tempesta/tempesta-vm -device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=tempesta,bus=pci.0,addr=0x7 -netdev tap,fds=31:32:33:34,id=hostnet0,vhost=on,vhostfds=35:36:37:38 -device virtio-net-pci,mq=on,vectors=10,netdev=hostnet0,id=net0,mac=52:54:00:07:88:51,bus=pci.0,addr=0x2 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -add-fd set=9,fd=40 -chardev file,id=charserial1,path=/dev/fdset/9,append=on -device isa-serial,chardev=charserial1,id=serial1 -chardev socket,id=charchannel0,fd=39,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on

VM-to-VM: 1 CPU SUT

Nginx:

# wrk --latency -H 'Connection: close' -c 8192 -d 30 -t 8 http://192.168.200.80:9090/
Running 30s test @ http://192.168.200.80:9090/
  8 threads and 8192 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    49.46ms  113.73ms   1.88s    89.24%
    Req/Sec     1.70k   726.97     6.68k    71.25%
  Latency Distribution
     50%   17.98ms
     75%   20.20ms
     90%  234.95ms
     99%  465.98ms
  405901 requests in 30.09s, 322.45MB read
  Socket errors: connect 0, read 0, write 0, timeout 579
Requests/sec:  13489.36
Transfer/sec:     10.72MB

Tempesta FW:

# wrk --latency -H 'Connection: close' -c 8192 -d 30 -t 8 http://192.168.200.80/
Running 30s test @ http://192.168.200.80/
  8 threads and 8192 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    37.40ms   88.27ms   1.81s    91.36%
    Req/Sec     1.76k   823.02     9.36k    71.88%
  Latency Distribution
     50%   14.31ms
     75%   15.73ms
     90%   19.14ms
     99%  446.43ms
  419715 requests in 30.09s, 356.24MB read
  Socket errors: connect 0, read 0, write 0, timeout 1
Requests/sec:  13947.89
Transfer/sec:     11.84MB

VM-to-VM: 2 CPUs SUT

The load generation VM has 4 virtual CPUs and the SUT VM has 2 virtual CPUs.

Nginx:

# wrk --latency -H 'Connection: close' -c 8192 -d 30 -t 8 http://192.168.200.80:9090/
Running 30s test @ http://192.168.200.80:9090/
  8 threads and 8192 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    63.85ms  162.82ms   1.98s    91.65%
    Req/Sec     2.37k   584.55     5.34k    68.52%
  Latency Distribution
     50%   24.56ms
     75%   28.57ms
     90%   39.13ms
     99%  861.46ms
  563342 requests in 30.08s, 447.52MB read
  Socket errors: connect 0, read 1, write 0, timeout 2215
Requests/sec:  18727.89
Transfer/sec:     14.88MB

Tempesta FW:

# wrk --latency -H 'Connection: close' -c 8192 -d 30 -t 8 http://192.168.200.80/
Running 30s test @ http://192.168.200.80/
  8 threads and 8192 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.37ms   17.84ms 465.51ms   99.42%
    Req/Sec     6.01k     4.49k   14.92k    46.67%
  Latency Distribution
     50%    6.80ms
     75%   11.09ms
     90%   14.04ms
     99%   19.59ms
  1256004 requests in 30.07s, 1.04GB read
  Socket errors: connect 7179, read 60, write 0, timeout 0
Requests/sec:  41775.59
Transfer/sec:     35.54MB

VM-to-VM: 4 CPUs SUT

The 4 vCPU setup looks very similar to the F5 and F-Stack tests.

The results for Nginx are:

# wrk --latency -H 'Connection: close' -c 8192 -d 30 -t 8 http://192.168.200.80:9090/
Running 30s test @ http://192.168.200.80:9090/
  8 threads and 8192 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.02ms   48.66ms   1.74s    99.12%
    Req/Sec     7.69k     2.05k   16.31k    71.06%
  Latency Distribution
     50%    6.16ms
     75%   10.63ms
     90%   16.31ms
     99%   49.30ms
  1832094 requests in 30.10s, 1.52GB read
  Socket errors: connect 0, read 13133, write 0, timeout 13
Requests/sec:  60865.96
Transfer/sec:     51.72MB

Tempesta FW:

# wrk --latency -H 'Connection: close' -c 8192 -d 30 -t 8 http://192.168.200.80/
Running 30s test @ http://192.168.200.80/
  8 threads and 8192 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.89ms   15.06ms 924.32ms   95.50%
    Req/Sec     8.48k     2.07k   18.37k    71.04%
  Latency Distribution
     50%    8.38ms
     75%   13.60ms
     90%   20.25ms
     99%   40.00ms
  2015012 requests in 30.10s, 1.56GB read
  Socket errors: connect 0, read 11684, write 0, timeout 23
Requests/sec:  66954.88
Transfer/sec:     53.19MB

Note that the load generator and the SUT now have the same CPU power, so wrk can't produce enough load; we can check this with top on the host: the host server is fully loaded. The load generation VM (PID 3545) uses the same amount of CPU as the SUT VM running Nginx (PID 4131):

# top -b
top - 20:39:57 up 1 day, 22:13,  4 users,  load average: 9.03, 4.39, 2.64
Tasks: 188 total,   8 running, 180 sleeping,   0 stopped,   0 zombie
%Cpu(s): 65.6 us, 25.4 sy,  0.0 ni,  9.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  32058.4 total,  24400.9 free,   6056.4 used,   1601.1 buff/cache
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used.  25510.1 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 3545 libvirt+  20   0 3724052 526820  21000 S 275.0   1.6  27:15.28 qemu-system-x86
 4131 libvirt+  20   0 3262420   1.6g  21132 S 275.0   5.2  16:32.54 qemu-system-x86
 3551 root      20   0       0      0      0 R  25.0   0.0   1:41.21 vhost-3545
 4136 root      20   0       0      0      0 R  25.0   0.0   1:20.74 vhost-4131
 4138 root      20   0       0      0      0 R  25.0   0.0   1:19.12 vhost-4131
 3549 root      20   0       0      0      0 R  18.8   0.0   1:41.26 vhost-3545
 3550 root      20   0       0      0      0 S  18.8   0.0   1:39.53 vhost-3545
 3552 root      20   0       0      0      0 R  18.8   0.0   1:43.35 vhost-3545
 4135 root      20   0       0      0      0 R  18.8   0.0   1:19.22 vhost-4131
 4137 root      20   0       0      0      0 R  18.8   0.0   1:19.67 vhost-4131

Or a little less for Tempesta FW:

# top -b
top - 20:55:01 up 1 day, 22:28,  4 users,  load average: 4.07, 2.21, 2.53
Tasks: 195 total,   8 running, 187 sleeping,   0 stopped,   0 zombie
%Cpu(s): 48.0 us, 37.0 sy,  0.0 ni, 14.2 id,  0.0 wa,  0.0 hi,  0.8 si,  0.0 st
MiB Mem :  32058.4 total,  24358.7 free,   6097.1 used,   1602.6 buff/cache
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used.  25469.4 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 3545 libvirt+  20   0 3724052 566296  21000 S 268.8   1.7  38:26.09 qemu-system-x86
 4131 libvirt+  20   0 3460056   1.6g  21132 S 231.2   5.2  23:53.14 qemu-system-x86
 3549 root      20   0       0      0      0 R  25.0   0.0   2:19.51 vhost-3545
 4135 root      20   0       0      0      0 R  25.0   0.0   1:59.48 vhost-4131
 4137 root      20   0       0      0      0 S  25.0   0.0   2:00.91 vhost-4131
 4138 root      20   0       0      0      0 R  25.0   0.0   1:59.67 vhost-4131
 3550 root      20   0       0      0      0 R  18.8   0.0   2:16.52 vhost-3545
 3552 root      20   0       0      0      0 R  18.8   0.0   2:22.02 vhost-3545
 4136 root      20   0       0      0      0 R  18.8   0.0   2:02.14 vhost-4131
 3551 root      20   0       0      0      0 R  12.5   0.0   2:19.66 vhost-3545

Hardware-to-VM

In this test case we run wrk on a separate server. Nginx and Tempesta FW are listening on separate sockets inside a VM on the first server.

    +-----------------------+      +--------------+
    |    [Server 1]     NIC |      |  [Server 2]  |
    | +--------------+   +--* <=== * NIC          |
    | |   [SUT VM]   |   m  |      |              |
    | |              |   a  |      |      wrk     |
    | |   Nginx -----*-- c  |      +--------------+
    | |     ^        |   v  |
    | |     |        |   t  |
    | | Tempesta FW -*-- a  |
    | +--------------+   p  |
    +-----------------------+

A macvtap virtual interface is used to attach the VM to the server NIC directly. In each test we configure the number of interface queues to match the number of virtual CPUs:

    <interface type='direct'>
      <mac address='52:54:00:07:88:51'/>
      <source dev='eth2' mode='private'/>
      <model type='virtio'/>
      <driver name='vhost' queues='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>
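
On the host, the macvtap device that libvirt creates for the VM and its mode can be inspected with iproute2:

    # list macvtap links and their details (mode, parent device)
    ip -d link show type macvtap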

Ping time from server 2 to the VM (we used a small network load instead of disabling C-states):

# ping -qc 5 172.16.0.200
PING 172.16.0.200 (172.16.0.200) 56(84) bytes of data.

--- 172.16.0.200 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3998ms
rtt min/avg/max/mdev = 0.043/0.060/0.068/0.012 ms

The network throughput:

# iperf3 -c 172.16.0.200 -p 5000
Connecting to host 172.16.0.200, port 5000
[  4] local 172.16.0.101 port 46758 connected to 172.16.0.200 port 5000
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.06 GBytes  9.14 Gbits/sec   20    684 KBytes       
[  4]   1.00-2.00   sec  1.09 GBytes  9.32 Gbits/sec    0    686 KBytes       
[  4]   2.00-3.00   sec  1.09 GBytes  9.35 Gbits/sec    0    686 KBytes       
[  4]   3.00-4.00   sec  1.09 GBytes  9.33 Gbits/sec    0    687 KBytes       
[  4]   4.00-5.00   sec  1.09 GBytes  9.36 Gbits/sec    0    689 KBytes       
[  4]   5.00-6.00   sec  1.09 GBytes  9.37 Gbits/sec    0    689 KBytes       
[  4]   6.00-7.00   sec  1.09 GBytes  9.36 Gbits/sec    0    724 KBytes       
[  4]   7.00-8.00   sec  1.09 GBytes  9.36 Gbits/sec    0    752 KBytes       
[  4]   8.00-9.00   sec  1.09 GBytes  9.34 Gbits/sec    0    764 KBytes       
[  4]   9.00-10.00  sec  1.09 GBytes  9.34 Gbits/sec    0    773 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  10.9 GBytes  9.33 Gbits/sec   20             sender
[  4]   0.00-10.00  sec  10.9 GBytes  9.33 Gbits/sec                  receiver

Now we use the same SUT VM on the 1st server, but generate the workload with wrk from the 2nd server instead of a VM:

1 CPU VM

Nginx:

# wrk --latency -H 'Connection: close' -c 16384 -d 30 -t 16 http://172.16.0.200:9090/
Running 30s test @ http://172.16.0.200:9090/
  16 threads and 8192 connections
^C  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    85.96ms  219.14ms   1.73s    91.60%
    Req/Sec     1.25k   786.01     5.83k    73.10%
  Latency Distribution
     50%    7.38ms
     75%   10.40ms
     90%  218.75ms
     99%    1.10s 
  39835 requests in 2.03s, 31.65MB read
Requests/sec:  19644.42
Transfer/sec:     15.61MB

Tempesta FW:

# wrk --latency -H 'Connection: close' -c 16384 -d 30 -t 16 http://172.16.0.200/
Running 30s test @ http://172.16.0.200/
  16 threads and 16384 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    44.68ms  123.44ms   1.70s    85.79%
    Req/Sec     1.58k   613.33    11.94k    76.34%
  Latency Distribution
     50%    5.80ms
     75%    6.90ms
     90%  208.40ms
     99%  421.52ms
  755829 requests in 30.06s, 641.53MB read
  Socket errors: connect 0, read 0, write 0, timeout 1338
Requests/sec:  25140.11
Transfer/sec:     21.34MB

2 CPUs VM

Nginx:

# wrk --latency -H 'Connection: close' -c 16384 -d 30 -t 16 http://172.16.0.200:9090/
Running 30s test @ http://172.16.0.200:9090/
  16 threads and 16384 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    94.23ms  153.62ms   2.00s    87.33%
    Req/Sec     2.22k   475.19    10.61k    79.10%
  Latency Distribution
     50%   23.06ms
     75%  121.31ms
     90%  279.46ms
     99%  601.62ms
  1063509 requests in 30.10s, 844.86MB read
  Socket errors: connect 0, read 568, write 0, timeout 315
Requests/sec:  35332.82
Transfer/sec:     28.07MB

Tempesta FW:

# wrk --latency -H 'Connection: close' -c 16384 -d 30 -t 16 http://172.16.0.200/
Running 30s test @ http://172.16.0.200/
  16 threads and 16384 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    35.47ms  106.19ms   2.00s    88.99%
    Req/Sec     3.51k     0.86k   10.87k    74.44%
  Latency Distribution
     50%    4.41ms
     75%    8.07ms
     90%  204.60ms
     99%  438.66ms
  1677128 requests in 30.10s, 1.39GB read
  Socket errors: connect 0, read 708, write 0, timeout 4388
Requests/sec:  55719.40
Transfer/sec:     47.29MB

4 CPUs VM

Nginx:

# wrk --latency -H 'Connection: close' -c 16384 -d 30 -t 16 http://172.16.0.200:9090/
Running 30s test @ http://172.16.0.200:9090/
  16 threads and 16384 connections
^C  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    33.59ms  101.61ms   1.92s    91.52%
    Req/Sec     5.18k     1.71k   11.37k    62.63%
  Latency Distribution
     50%    6.82ms
     75%   14.34ms
     90%   23.79ms
     99%  490.17ms
  1374094 requests in 16.71s, 1.07GB read
  Socket errors: connect 0, read 0, write 0, timeout 317
Requests/sec:  82227.96
Transfer/sec:     65.32MB

Tempesta FW:

# wrk --latency -H 'Connection: close' -c 16384 -d 30 -t 16 http://172.16.0.200/
Running 30s test @ http://172.16.0.200/
  16 threads and 16384 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    27.60ms   88.15ms   1.82s    90.86%
    Req/Sec     7.07k     1.16k   22.23k    73.76%
  Latency Distribution
     50%    3.59ms
     75%    5.45ms
     90%   11.89ms
     99%  411.26ms
  3378447 requests in 30.10s, 2.81GB read
  Socket errors: connect 0, read 47, write 0, timeout 72
Requests/sec: 112244.48
Transfer/sec:     95.48MB

And here we hit the KVM interrupt bottleneck. Since our hardware doesn't support vAPIC, we had to stop our tests there. The perf kvm stat report (see the Interruptions & network performance wiki page for the problem description) is:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

  EXTERNAL_INTERRUPT    5073570    75.37%     2.02%      0.22us   3066.52us      0.96us ( +-   0.49% )
       EPT_MISCONFIG    1029496    15.29%     1.78%      0.34us   1795.92us      4.19us ( +-   0.35% )
           MSR_WRITE     279208     4.15%     0.16%      0.28us   5695.07us      1.36us ( +-   2.52% )
                 HLT     194422     2.89%    95.74%      0.30us 1504068.39us   1192.90us ( +-   4.03% )
   PENDING_INTERRUPT      89818     1.33%     0.03%      0.32us    189.53us      0.70us ( +-   0.83% )
   PAUSE_INSTRUCTION      40905     0.61%     0.26%      0.26us   1390.91us     15.39us ( +-   1.82% )
    PREEMPTION_TIMER      17384     0.26%     0.01%      0.44us    183.21us      1.49us ( +-   1.47% )
      IO_INSTRUCTION       5482     0.08%     0.01%      1.75us    186.08us      3.26us ( +-   1.19% )
               CPUID        972     0.01%     0.00%      0.30us      5.29us      0.66us ( +-   1.94% )
            MSR_READ        104     0.00%     0.00%      0.49us      2.54us      0.94us ( +-   3.29% )
       EXCEPTION_NMI          6     0.00%     0.00%      0.37us      0.78us      0.59us ( +-   9.68% )
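
A report like the one above can be collected on the host with the standard perf kvm workflow (a sketch; the exact recording interval used here isn't essential):

    # record KVM events system-wide for 30 seconds, then summarize VM exits
    perf kvm stat record -a sleep 30
    perf kvm stat report --event=vmexit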