Accurate measurement of pod startup latency #31143
I think option (1) is sufficient for node performance tests, since the conservative QPS limit is usually set for larger clusters and is not that meaningful in a node-centric test.
@coufon a couple of questions:
@huang195 @coufon is working on a node performance benchmark, which includes a group of node e2e performance tests and a performance dashboard that analyzes resource usage and operation latency summaries across builds as well as details within a specific build. We've already added a continuously running node performance test suite, and will start a node performance dashboard to analyze and publish the data. The node performance dashboard runs as a pod. FYI, related feature: kubernetes/enhancements#83.
@huang195 The answers are:
@Random-Liu @coufon thanks. It's great to have such a tool to visualize end-to-end operation time on a node.
Closing this issue, as we already support different QPS limits in the node e2e density test; see PR #32250.
In the current density tests, pod startup latency is measured as the duration from creating the pods in the test to observing the pods running at the apiserver. The problem is that when we create a large number of pods, the additional latency caused by the kubelet QPS limit (default 5) is very large. As a result, we cannot see the actual kubelet performance from the test, and we report underestimated node performance (e.g. latency, throughput).
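For reference, below is a minimal sketch of the create-to-running measurement described above, written against a recent client-go (the actual density test code differs, and the pod name, namespace, and image are placeholders):

```go
// Sketch only: create one pod and measure the time until the apiserver
// reports it Running. The real density test does this for many pods at once.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder pod spec for illustration.
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "density-test-pod-0", Namespace: "default"},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{Name: "pause", Image: "k8s.gcr.io/pause:3.1"}},
		},
	}

	start := time.Now()
	if _, err := client.CoreV1().Pods(pod.Namespace).Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}

	// Poll the apiserver until the pod phase is Running, then report the
	// create-to-running latency as observed from the test side.
	err = wait.PollImmediate(time.Second, 5*time.Minute, func() (bool, error) {
		p, err := client.CoreV1().Pods(pod.Namespace).Get(context.TODO(), pod.Name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		return p.Status.Phase == v1.PodRunning, nil
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("e2e startup latency: %v\n", time.Since(start))
}
```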
In order to have a better test focused on node performance, the options include:
I tried (1) and (2), and here are some results:
The density test runs on a GCE n1-standard-1 node. It creates 105 pods and measures the e2e latency. Before build 60 the QPS limit is 5; after that it is 60. As shown in Figure 1, the e2e latency drops from ~110s to ~60s:
<img src="https://cloud.githubusercontent.com/assets/11655397/17865944/7ced6248-6859-11e6-9755-aeae166bf905.png" width="70%", height="70%">
Figure 1. Pod creation latency
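For intuition on how much of that gap is pure throttling: the kubelet rate-limits its API calls with a token-bucket limiter, so the calls for 105 pods queue up behind it at QPS 5. A rough illustration using client-go's flowcontrol package (this is not the kubelet's actual code path, and the two-calls-per-pod figure is an assumption for illustration only):

```go
// Rough illustration: time how long a token-bucket limiter takes to admit the
// API calls for 105 pods at QPS 5 versus QPS 60.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

// drain issues `calls` requests through a token-bucket limiter and returns
// how long the limiter alone makes them take.
func drain(qps float32, burst, calls int) time.Duration {
	limiter := flowcontrol.NewTokenBucketRateLimiter(qps, burst)
	start := time.Now()
	for i := 0; i < calls; i++ {
		limiter.Accept() // blocks until a token is available
	}
	return time.Since(start)
}

func main() {
	const pods, callsPerPod = 105, 2 // callsPerPod is a made-up simplification
	fmt.Println("QPS 5: ", drain(5, 10, pods*callsPerPod))  // roughly (210-10)/5 ≈ 40s
	fmt.Println("QPS 60:", drain(60, 10, pods*callsPerPod)) // roughly (210-10)/60 ≈ 3s
}
```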
We also observe that CPU usage increases due to the larger QPS, as shown in Figure 2 (a) and (b):
<img src="https://cloud.githubusercontent.com/assets/11655397/17866101/199e92ce-685a-11e6-8e39-9fc3d699e248.png" width="70%", height="70%">
(a) Kubelet
<img src="https://cloud.githubusercontent.com/assets/11655397/17866128/34eba4cc-685a-11e6-96ee-6c3d162e46c6.png" width="70%", height="70%">
(b) Docker
Figure 3. CPU usage
Another problem with the QPS limit is fluctuation in the measurement. We can see large latency fluctuations in Figure 1 before build 60; for example, build 54 is ~35s slower than build 55. But if we look into the time series data, the main cause is the delay in observing the pod status becoming running, not a bottleneck in kubelet or Docker.
This additional latency can be removed by increasing the QPS limit in the test, as shown in Figure 3 (c) for build 68.
<img src="https://cloud.githubusercontent.com/assets/11655397/17866387/3250b38c-685b-11e6-87e2-18ce94fc9e1c.png" width="70%", height="70%">
(a) build 54
<img src="https://cloud.githubusercontent.com/assets/11655397/17866445/605aeef0-685b-11e6-8f78-da9f3c9019e7.png" width="70%", height="70%">
(b) build 55
<img src="https://cloud.githubusercontent.com/assets/11655397/17867084/db50fe4a-685d-11e6-8e0b-b2e4b41a3c7c.png" width="70%", height="70%">
(c) build 68
Figure 4. Time series data of creating pods.
Each curve gives the number of pods that have reached a certain probe ('create_test': pod created in the test; 'running_test': pod observed running in the test; 'firstSeen': pod configuration arrives at the kubelet SyncLoop; 'container': pod reaches the container manager in kubelet syncPod; 'running': pod observed running in the kubelet SyncLoop).
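A minimal sketch of how such cumulative per-probe curves can be built from recorded timestamps (the data layout, probe samples, and values below are hypothetical; the real test collects the timestamps from the test process and the kubelet):

```go
// Sketch: for each probe, sort the per-pod timestamps and emit
// (elapsed-since-start, number-of-pods-arrived) pairs, which is what the
// curves in Figure 3 plot.
package main

import (
	"fmt"
	"sort"
	"time"
)

// point is one sample on a cumulative curve: after Elapsed since test start,
// Count pods have reached the probe.
type point struct {
	Elapsed time.Duration
	Count   int
}

// cumulativeCurve turns per-pod timestamps for one probe into its cumulative
// arrival curve.
func cumulativeCurve(samples map[string]time.Time, origin time.Time) []point {
	ts := make([]time.Time, 0, len(samples))
	for _, t := range samples {
		ts = append(ts, t)
	}
	sort.Slice(ts, func(i, j int) bool { return ts[i].Before(ts[j]) })

	curve := make([]point, len(ts))
	for i, t := range ts {
		curve[i] = point{Elapsed: t.Sub(origin), Count: i + 1}
	}
	return curve
}

func main() {
	// Hypothetical timestamps for two pods at two of the probes.
	origin := time.Now()
	probes := map[string]map[string]time.Time{
		"create_test": {
			"pod-0": origin,
			"pod-1": origin.Add(200 * time.Millisecond),
		},
		"running": {
			"pod-0": origin.Add(8 * time.Second),
			"pod-1": origin.Add(9 * time.Second),
		},
	}
	for probe, samples := range probes {
		fmt.Println(probe, cumulativeCurve(samples, origin))
	}
}
```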