Performance degradation on Fedora CoreOS after upgrade from F32 to F33 #755
Comments
Thanks for the report. Does the network degradation affect host-to-host communications too, or is it only visible on the overlay networks? From a quick look at the report, I have a few doubts:
Making a random guess, DNS is the usual suspect when there is a sudden noticeable increase in latency. The common scenario is that, without aggressive connection reuse, something in the stack tries to perform a DNS resolution on each new connection. It may not be noticeable in most cases, but it may suddenly spike due to changes in how the stack handles caching / search-domains / retries / negative responses / etc.
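One rough way to check the DNS hypothesis is to time repeated lookups of the same name on an F32 node and on an F33 node and compare the numbers. The snippet below is only a minimal sketch: `time_lookups` and `summarize` are hypothetical helper names, not part of any tool mentioned in this issue, and the `resolver` parameter exists only so that the lookup function can be swapped out.

```python
import socket
import statistics
import time


def time_lookups(host, count=20, resolver=socket.getaddrinfo):
    """Return per-call resolution times in milliseconds for `host`."""
    timings_ms = []
    for _ in range(count):
        start = time.perf_counter()
        try:
            resolver(host, None)
        except socket.gaierror:
            # Failed (negative) lookups are timed as well, since retries and
            # search-domain expansion can make them the slow case.
            pass
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def summarize(timings_ms):
    """Median and worst-case latency; the maximum exposes occasional stalls."""
    return {
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }
```

Calling `summarize(time_lookups(name))` with a service name that the workload actually resolves, from inside a pod on each OS version, should show whether per-lookup latency differs. Similar medians with a much larger maximum on one side would point at intermittent retries rather than a uniformly slower resolver.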
Thank you for the quick response. We didn't test host-to-host, as according to
Regarding DNS, we are looking into it. Our applications are actually geared towards aggressive connection reuse via connection pooling. We will try to collect some metrics to prove it. We also collected some flame graphs on F32 and F33 using perf tools to detect deviations between the OS versions, but we didn't find significant differences. To us it looks like everything runs a bit slower on F33.
We enabled DNS metrics and compared them between F32 and F33 without finding any significant differences, especially concerning the number of packets and response times.

To illustrate the gap between F32 and F33, have a look at the CPU load (yellow is F33, blue is F32) of two workers processing the same type of jobs, and at the throughput (6x workers, 5x workers running F32, 1x worker running F33, each pair running the same type of job).

But maybe we are onto something (which could also explain the performance drop as we moved from CoreOS to Fedora CoreOS #542 - not the network but the overall performance part). During our investigation of low-level OS settings and metrics (scheduling, executed time-slices per CPU, system and user load etc.) we discovered that

Could it be that the CoreOS defaults were optimized for max. server performance and Fedora CoreOS is geared more towards desktop workloads and energy efficiency? Do you have any recommendations to optimize for max. server performance and minimal latencies?
Re-running our load and performance tests several times with
Describe the bug
We are running load and performance tests with our Java application in a test Kubernetes cluster regularly. After upgrading from 32.20201104.3.0 to 33.20201201.3.0 (and up to 33.20210201.3.0) we discovered a severe performance degradation in application response times. As we were upgrading Kubernetes (https://typhoon.psdn.io) along the way, we weren't sure if it was related to the Kubernetes version. We re-provisioned 32.20201104.3.0 with Kubernetes 1.20.2 and the performance degradation went away. We also tested 33.20210117.3.2 with Kubernetes 1.20.2 and different network fabrics (Calico, Flannel) without any positive effect on the performance degradation. So, it looks like something in F33 breaks it for us. It also looks like the CPU load increased in F33.
We would be happy to run further tests to narrow down the issue, but we are out of ideas how to proceed. Any hints/ideas would be highly appreciated.
System details
Ignition config
https://gist.github.com/wkruse/107e3be2ffb2ead7c26a955fe8f0b0e8
Additional infos
Some iperf3 test results
We also tested with different MTU sizes (1440, 1460, 1480, 1500) without any positive effect.