
Nginx benchmark results in the CPS test #519

Open
krizhanovsky opened this issue Jun 9, 2020 · 13 comments


krizhanovsky commented Jun 9, 2020

Hi,

we're testing our in-kernel HTTPS proxy against Nginx and comparing our results with kernel-bypass proxies, which is how I came to your project.

I noticed in your performance data https://github.com/F-Stack/f-stack/blob/dev/CPS.png that Nginx on top of the Linux TCP/IP stack doesn't scale at all as the CPU count grows - why? Even with some hard lock contention, I would not expect an absolutely flat performance curve for the Linux kernel and Nginx. To me it looks like an Nginx misconfiguration... Could you please share the Nginx configuration file used for the test? I would also much appreciate it if you could show perf top output for Nginx/Linux.

Also, we found it quite problematic to generate enough load to test a high-performance HTTP server. In our case we needed more than 40 cores and two 10G NICs running wrk to generate enough load to push our server to 100% resource usage on 4 cores. What did you use to get the maximum results for 20 cores?

Thanks in advance!

@krizhanovsky

Well, I did notice a difference in the throughput numbers: 0.34 for 1 CPU and up to 0.48 on 12 CPUs, so the difference is about 40%. Assuming the RPS ratio is the same, the curve is still far too flat...

Also, https://github.com/F-Stack/f-stack#nginx-testing-result says that you used Linux 3.10.104, which was released in October 2016 and is just a patch release of the original 3.10 from 2013. Given that there have been a lot of scalability improvements in the Linux TCP/IP stack during these 7 years, I'm wondering whether you have a performance comparison against newer Linux TCP/IP stacks?

@vincentmli

Hi,

we're testing our in-kernel HTTPS proxy against Nginx and comparing our results with kernel-bypass proxies, which is how I came to your project.

I noticed in your data https://github.com/F-Stack/f-stack/blob/dev/CPS.png that Nginx on top of the Linux TCP/IP stack doesn't scale at all as the CPU count grows - why? Even with some hard lock contention, I would not expect an absolutely flat performance curve for the Linux kernel and Nginx.

It is most likely an interrupt bottleneck, since the driver in the Linux kernel runs in combined interrupt and poll mode (NAPI). I have a video that shows this:

https://youtu.be/d0vPUwJT1mw - at 1:34, ksoftirqd is at 100% CPU usage under load, yet the Nginx CPU usage is still fine.

Also, we found it quite problematic to generate enough load to test a high-performance HTTP server. In our case we needed more than 40 cores and two 10G NICs running wrk to generate enough load to push our server to 100% resource usage on 4 cores. What did you use to get the maximum results for 20 cores?

The DPDK/mTCP project ported a multithreaded version of Apache Bench that can drive a high-performance HTTP server load test. I also made a PR for HTTPS/SSL load testing, mtcp-stack/mtcp#285; the Apache Bench statistics seem broken, but it does the job of generating high, fast load against a web server.

@krizhanovsky

Hi @vincentmli ,

thank you very much for sharing the video - I really enjoyed watching it (it was also quite interesting to learn more about BIG-IP traffic handling).

Now I see that the benchmark I cared about, CPS, is for Nginx running without listen reuseport, which is an Nginx misconfiguration if one benchmarks connections per second. See the Nginx post and an LWN article:

The first of the traditional approaches is to have a single listener thread that accepts all incoming connections and then passes these off to other threads for processing. The problem with this approach is that the listening thread can become a bottleneck in extreme cases. In early discussions on SO_REUSEPORT, Tom noted that he was dealing with applications that accepted 40,000 connections per second.

The right benchmark result is CPS_Reuseport, where Nginx does scale on the Linux TCP/IP stack.
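
For reference, enabling it is a one-line change to the listen directive. A minimal illustrative config (just a sketch to show the knob, not the config used in the benchmark) would be:

    worker_processes auto;
    events {}

    http {
        server {
            # "reuseport" makes each worker open its own listening socket
            # (SO_REUSEPORT), so the kernel spreads new connections across
            # workers instead of all workers contending on one accept queue.
            listen 80 reuseport;
            return 200 "ok\n";
        }
    }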

The next question is about the Nginx configuration files for the F-Stack and Linux TCP/IP stack cases. I had a look at https://github.com/F-Stack/f-stack/blob/dev/app/nginx-1.16.1/conf/nginx.conf and there is an issue with the filesystem. You have switched sendfile() and access_log off and use a static string for the 600-byte response instead of a file. Do you know of any real Nginx setup that doesn't use the filesystem at all? At worst I'd expect to see the Nginx files on tmpfs - a more or less usable case. But I reckon the numbers won't be so nice for F-Stack if it uses real filesystem access.

Which configuration files were used for the benchmark? What were the Linux sysctl settings? Which steps were taken to optimize Nginx and the Linux TCP/IP stack to make the comparison fair? Was virtio-net multiqueue used for the Linux TCP/IP stack benchmarks?

@vincentmli

The next question is about the Nginx configuration files for the F-Stack and Linux TCP/IP stack cases. I had a look at https://github.com/F-Stack/f-stack/blob/dev/app/nginx-1.16.1/conf/nginx.conf and there is an issue with the filesystem. You have switched sendfile() and access_log off and use a static string for the 600-byte response instead of a file. Do you know of any real Nginx setup that doesn't use the filesystem at all? At worst I'd expect to see the Nginx files on tmpfs - a more or less usable case. But I reckon the numbers won't be so nice for F-Stack if it uses real filesystem access.

F-Stack's improvements are at the NIC driver level - a userspace DPDK poll-mode driver vs. the Linux kernel's interrupt/poll (NAPI) model - so with F-Stack you get the DPDK benefit. The problem with DPDK is the lack of a mature TCP/IP stack; F-Stack glues the FreeBSD TCP/IP stack together with DPDK to solve that (F-Stack has also done some custom work in the FreeBSD TCP/IP stack to fit the DPDK model, as I understand it).

The sendfile and access_log settings are Nginx configuration that should be irrelevant to F-Stack; F-Stack improves network I/O, not filesystem I/O like sendfile/access_log. Though it would be interesting to test with and without sendfile/access_log - slow filesystem I/O could potentially affect network I/O if the network is waiting for data from the filesystem to transmit.

Which configuration files were used for the benchmark? What were the Linux sysctl settings? Which steps were taken to optimize Nginx and the Linux TCP/IP stack to make the comparison fair? Was virtio-net multiqueue used for the Linux TCP/IP stack benchmarks?

I can't speak for the F-Stack guys since I am just an observer. The Linux TCP/IP stack is a very complex stack and kind of bloated (in my opinion :)). I believe the F-Stack benchmark was run on physical hardware, not in a KVM/QEMU VM with virtio-net; virtio-net does not support RSS, so you can only run F-Stack on a single core with a single queue. You could use an SR-IOV VF on a hardware NIC that supports RSS offload to scale in a multi-core VM with multiple queues.


krizhanovsky commented Jun 17, 2020

Hi @vincentmli ,

(F-Stack has also done some custom work in the FreeBSD TCP/IP stack to fit the DPDK model, as I understand it)

Well, I made some observations. E.g., there is a problem with the sockets hash table in Linux. I checked F-Stack and it seems the same problem exists there. The hash is struct inpcbinfo, declared in freebsd/netinet/in_pcb.h and scanned, for example, by an in_pcblookup_mbuf() call from the tcp_input() function. We see quite a similar read lock as in Linux in the hash lookup function:

        static struct inpcb *
        in_pcblookup_hash(...)
        {
            struct inpcb *inp;

            /* Global read lock over the whole PCB hash table. */
            INP_HASH_RLOCK(pcbinfo);
            inp = in_pcblookup_hash_locked(...);
            ...
The sendfile and access_log settings are Nginx configuration that should be irrelevant to F-Stack; F-Stack improves network I/O

Agreed. I mentioned filesystem I/O because an Nginx setup with no filesystem access at all is practically unusable, even in pure non-caching proxy mode, so the benchmarks are somewhat theoretical.

You could use an SR-IOV VF on a hardware NIC that supports RSS offload to scale in a multi-core VM with multiple queues.

Unfortunately, at the moment I have no SR-IOV capable NICs to test with, but I'm wondering whether SR-IOV can be used in a VM the same way as a physical NIC on a hardware server? I.e., it seems that with SR-IOV we can coalesce interrupts on the NIC inside a VM and tune the ksoftirqd threads towards polling mode. This way we get very close to a DPDK-like solution, but one which doesn't burn power while the system is idle.
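
To make the idea concrete, the kind of tuning I have in mind looks roughly like this (illustrative only - the interface name and the values are placeholders, and the available knobs depend on the NIC driver):

    # Coalesce RX interrupts so each IRQ (and ksoftirqd wakeup) batches more packets
    ethtool -C eth0 adaptive-rx off rx-usecs 200 rx-frames 64

    # Let sockets busy-poll the driver queues instead of sleeping until the next IRQ
    sysctl -w net.core.busy_poll=50
    sysctl -w net.core.busy_read=50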

@vincentmli

Unfortunately, at the moment I have no SR-IOV capable NICs to test with, but I'm wondering whether SR-IOV can be used in a VM the same way as a physical NIC on a hardware server? I.e., it seems that with SR-IOV we can coalesce interrupts on the NIC inside a VM and tune the ksoftirqd threads towards polling mode. This way we get very close to a DPDK-like solution, but one which doesn't burn power while the system is idle.

DPDK also supports interrupts; there is an example:
https://github.com/DPDK/dpdk/tree/master/examples/l3fwd-power

@krizhanovsky

That's not real hardware interrupt handling - the example uses the epoll(7) system call (on Linux) for polling.

@vincentmli

That's not real hardware interrupt handling - the example uses the epoll(7) system call (on Linux) for polling.

https://github.com/DPDK/dpdk/blob/master/examples/l3fwd-power/main.c#L860
turns the hardware interrupt on/off

@krizhanovsky

Yeah, I meant that you still need to go into the kernel and back (i.e. make two context switches) if you use epoll(). There are no interrupt handlers in user space.

@vincentmli

Yeah, I meant that you still need to go into the kernel and back (i.e. make two context switches) if you use epoll(). There are no interrupt handlers in user space.

As far as I understand from reading the code, DPDK handles the interrupt from userspace; epoll is just used for event polling, not interrupt handling: https://github.com/DPDK/dpdk/blob/master/lib/librte_eal/linux/eal_interrupts.c#L1167
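
For reference, the pattern the l3fwd-power example follows looks roughly like this (a simplified sketch based on my reading of the example, not the actual code; port/queue setup and the rte_eth_dev_rx_intr_ctl_q() registration of the queue with the per-thread epoll fd are assumed to have been done during init):

    #include <rte_ethdev.h>
    #include <rte_interrupts.h>
    #include <rte_mbuf.h>

    /* Simplified receive loop: poll while busy, sleep on the RX interrupt when idle. */
    static void rx_loop(uint16_t port, uint16_t queue)
    {
        struct rte_mbuf *pkts[32];

        for (;;) {
            /* Busy-poll the RX queue while traffic is flowing. */
            uint16_t n = rte_eth_rx_burst(port, queue, pkts, 32);
            if (n > 0) {
                /* ... process the packets, then free them ... */
                for (uint16_t i = 0; i < n; i++)
                    rte_pktmbuf_free(pkts[i]);
                continue;
            }

            /* Queue went idle: arm the RX interrupt and block in epoll until
             * the eventfd bound to this queue fires, then go back to polling. */
            rte_eth_dev_rx_intr_enable(port, queue);
            struct rte_epoll_event ev;
            rte_epoll_wait(RTE_EPOLL_PER_THREAD, &ev, 1, -1);
            rte_eth_dev_rx_intr_disable(port, queue);
        }
    }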

@krizhanovsky

By chance I came back to this thread with the question about interrupt handling in DPDK, and I found the answer in a Stack Overflow discussion: https://stackoverflow.com/questions/53892565/dpdk-interrupts-rather-than-polling . So there is no real interrupt handling in DPDK.


osevan commented Sep 16, 2021

Hi,

we're testing our in-kernel HTTPS proxy against Nginx and comparing our results with kernel-bypass proxies, which is how I came to your project.

I noticed in your performance data https://github.com/F-Stack/f-stack/blob/dev/CPS.png that Nginx on top of the Linux TCP/IP stack doesn't scale at all as the CPU count grows - why? Even with some hard lock contention, I would not expect an absolutely flat performance curve for the Linux kernel and Nginx. To me it looks like an Nginx misconfiguration... Could you please share the Nginx configuration file used for the test? I would also much appreciate it if you could show perf top output for Nginx/Linux.

Also, we found it quite problematic to generate enough load to test a high-performance HTTP server. In our case we needed more than 40 cores and two 10G NICs running wrk to generate enough load to push our server to 100% resource usage on 4 cores. What did you use to get the maximum results for 20 cores?

Thanks in advance!

Could you show me the price for compatibility of the Tempesta patches with the newest kernel and the latest GCC and LLVM compilers, for the one web server that I have?

I couldn't find any prices.

I tested Tempesta 2 years ago and it was cool, with module compiling and so on.

Please share the price with us.

@krizhanovsky

Hi @osevan ,

thank you for the request! Could you please drop me a message at ak@tempesta-tech.com or, better, schedule a call at https://calendly.com/tempesta-tech/30min , so we can discuss your scenario and talk about Tempesta FW's capabilities.
