Accurate measurement of pod startup latency #31143

Closed
coufon opened this issue Aug 22, 2016 · 7 comments
Labels
area/kubelet sig/node sig/scalability

Comments

@coufon
Contributor

coufon commented Aug 22, 2016

In the current density tests, pod startup latency is measured as the duration from creating a pod in the test to observing that pod running at the apiserver. The problem is that when we create a large number of pods, the additional latency caused by the kubelet QPS limit (default 5) is very large. As a result, we cannot see the actual kubelet performance from the test, and we report under-estimated node performance (e.g. latency, throughput).
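Roughly, the measurement looks like the sketch below: record when the test creates each pod, then watch the apiserver until the pod reports Running. This is only a minimal illustration written against today's client-go, not the actual density test code in test/e2e_node/density_test.go; the namespace and client setup are made up for the example.

```go
package main

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Time at which the test issued the Create() call for each pod.
	created := map[string]time.Time{}
	// ... create the batch of test pods here, recording created[name] = time.Now() ...

	w, err := client.CoreV1().Pods("density-test").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*v1.Pod)
		if !ok || pod.Status.Phase != v1.PodRunning {
			continue
		}
		if start, seen := created[pod.Name]; seen {
			// e2e latency: from "create pod in test" to "observe pod running at apiserver".
			fmt.Printf("%s startup latency: %v\n", pod.Name, time.Since(start))
			delete(created, pod.Name)
		}
	}
}
```

The extra latency from the QPS limit accumulates on the last leg of this path: the kubelet's status updates queue behind its client-side rate limiter before the apiserver, and hence the test, can observe the pod as running.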

In order to have a better test focused on node performance, the options include:

  1. Increase the QPS limit for the node performance test;
  2. Add probes to report latency inside the kubelet;
  3. Use the pod startup latency monitored by Prometheus inside the kubelet. The problem is that the Prometheus metrics are not reset at the beginning of a test, so we would need to add a reset HTTP handler to the kubelet metrics endpoint (a rough sketch follows this list).
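For option 3, the missing piece is just a handler that clears the collected samples between test runs. A minimal sketch is below; the metric definition and the mux wiring are illustrative stand-ins, not the kubelet's actual metrics code in pkg/kubelet/metrics.

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

// podStartLatency stands in for the kubelet's pod startup latency summary.
var podStartLatency = prometheus.NewSummaryVec(
	prometheus.SummaryOpts{
		Name: "pod_start_latency_microseconds",
		Help: "Pod startup latency as observed by the kubelet.",
	},
	[]string{"pod"},
)

func init() {
	prometheus.MustRegister(podStartLatency)
}

// InstallResetHandler registers an endpoint that drops all recorded samples,
// so a test can reset the metric right before it starts creating pods.
func InstallResetHandler(mux *http.ServeMux) {
	mux.HandleFunc("/metrics/reset", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST only", http.StatusMethodNotAllowed)
			return
		}
		podStartLatency.Reset() // removes every per-label child and its samples
		w.WriteHeader(http.StatusOK)
	})
}
```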

I tried (1) and (2), and here are some results:

The density test runs on a GCE n1-standard-1 node. It creates 105 pods and measures the e2e latency. Before build 60 the QPS limit is 5; from build 60 on it is 60. As shown in Figure 1, the e2e latency drops from ~110s to ~60s:

<img src="https://cloud.githubusercontent.com/assets/11655397/17865944/7ced6248-6859-11e6-9755-aeae166bf905.png" width="70%" height="70%">
Figure 1. Pod creation latency
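For context, the limit being raised here is the kubelet's client-side rate limit towards the apiserver (the --kube-api-qps flag, default 5, with --kube-api-burst controlling the burst size). Under the hood this maps onto the QPS and Burst fields of the client REST config; the sketch below is only illustrative (the function name and values are made up, not the kubelet's actual wiring).

```go
package nodeperf

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// buildNodeTestClient returns an apiserver client whose client-side rate
// limiter is raised well above the default QPS of 5, so the limiter itself
// no longer dominates the measured pod startup latency.
func buildNodeTestClient(cfg *rest.Config) (*kubernetes.Clientset, error) {
	cfg.QPS = 60    // steady-state requests per second allowed by the token bucket
	cfg.Burst = 100 // short bursts may exceed QPS up to this many requests
	return kubernetes.NewForConfig(cfg)
}
```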

We also observe that CPU usage increases due to the larger QPS, as shown in Figure 2(a) and (b):

<img src="https://cloud.githubusercontent.com/assets/11655397/17866101/199e92ce-685a-11e6-8e39-9fc3d699e248.png" width="70%" height="70%">
(a) Kubelet
<img src="https://cloud.githubusercontent.com/assets/11655397/17866128/34eba4cc-685a-11e6-96ee-6c3d162e46c6.png" width="70%" height="70%">
(b) Docker
Figure 2. CPU usage

Another problem with the QPS limit is measurement fluctuation. We can see large latency fluctuations in Figure 1 before build 60; for example, build 54 is ~35s slower than build 55. But if we look into the time series data, the main cause is the delay in observing the pod status becoming running, not a bottleneck in the kubelet or Docker.

This additional latency can be removed by increasing the QPS limit in the test, as shown in Figure 3(c) for build 68.

<img src="https://cloud.githubusercontent.com/assets/11655397/17866387/3250b38c-685b-11e6-87e2-18ce94fc9e1c.png" width="70%" height="70%">
(a) build 54
<img src="https://cloud.githubusercontent.com/assets/11655397/17866445/605aeef0-685b-11e6-8f78-da9f3c9019e7.png" width="70%" height="70%">
(b) build 55
<img src="https://cloud.githubusercontent.com/assets/11655397/17867084/db50fe4a-685d-11e6-8e0b-b2e4b41a3c7c.png" width="70%" height="70%">
(c) build 68
Figure 3. Time series data of pod creation

Each curve gives the number of pods that have arrived at a given probe:

  * 'create_test': the pod is created in the test;
  * 'running_test': the pod is observed running in the test;
  * 'firstSeen': the pod configuration arrives at the kubelet SyncLoop;
  * 'container': the pod reaches the container manager in the kubelet's syncPod;
  * 'running': the pod is observed running in the kubelet SyncLoop.
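To make the probe data concrete, here is a minimal sketch (hypothetical types and function names, not the tracing code itself) of stamping each pod's arrival at each probe and then deriving the latency between two probes across all pods:

```go
package tracing

import (
	"sort"
	"time"
)

// podProbes records when each pod reached each probe
// (create_test, firstSeen, container, running, running_test).
type podProbes map[string]map[string]time.Time // pod name -> probe -> timestamp

// Observe stamps a pod's arrival at a probe; only the first arrival is kept.
func (p podProbes) Observe(pod, probe string, t time.Time) {
	if p[pod] == nil {
		p[pod] = map[string]time.Time{}
	}
	if _, seen := p[pod][probe]; !seen {
		p[pod][probe] = t
	}
}

// Percentile returns the q-th percentile (0 < q <= 100) of the latency
// between two probes, over all pods that reached both of them.
func (p podProbes) Percentile(from, to string, q int) time.Duration {
	var ds []time.Duration
	for _, probes := range p {
		a, okA := probes[from]
		b, okB := probes[to]
		if okA && okB {
			ds = append(ds, b.Sub(a))
		}
	}
	if len(ds) == 0 {
		return 0
	}
	sort.Slice(ds, func(i, j int) bool { return ds[i] < ds[j] })
	idx := (len(ds)*q + 99) / 100 // ceil(len * q / 100), 1-based rank
	return ds[idx-1]
}
```

Comparing, say, Percentile("create_test", "firstSeen", 99) against Percentile("firstSeen", "running", 99) is what separates queueing in front of the kubelet (the QPS limit) from time actually spent inside the kubelet and Docker.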

@k8s-github-robot added the area/kubelet and sig/node labels Aug 22, 2016
@coufon
Contributor Author

coufon commented Aug 22, 2016

@dchen1107

@yujuhong
Contributor

I think option (1) is sufficient for node performance tests, since the conservative QPS limit is usually set for larger clusters, and is not that meaningful in a node-centric test.
On the other hand, option (2) or (3) will be more useful in a cluster performance test.

@huang195
Contributor

@coufon a couple of questions:

  1. what pods are you using that take 60-110s to create and start up?
  2. are you using a tool to collect the latency data and plot them or is that from some kind of dashboard?

@Random-Liu
Member

Random-Liu commented Aug 30, 2016

are you using a tool to collect the latency data and plot them or is that from some kind of dashboard?

@huang195 @coufon is working on a node performance benchmark, which includes a group of node e2e performance tests and a performance dashboard that analyzes resource usage and operation latency summaries across different builds, as well as the details of a specific build.

We've already added a continuously running node performance test suite, and will start a node performance dashboard to analyze and publish the data.

The node performance dashboard runs as a pod. We'll later put it into the contrib/ repo and integrate it with our containerized node-level test #30122. Users should be able to run it in their own environment easily in the future. :)

FYI, related feature kubernetes/enhancements#83.

@coufon
Contributor Author

coufon commented Aug 30, 2016

@coufon a couple of questions:

  1. what pods are you using that take 60-110s to create and start up?
  2. are you using a tool to collect the latency data and plot them or is that from some kind of dashboard?

@huang195 The answers are:

  1. The pods just run the pause image, which does nothing;
  2. The tools are the node e2e density test (test/e2e_node/density_test.go) and kubelet tracing (not yet merged into the main branch). The visualization is the node performance dashboard. We will release these tools with documentation soon.

@Random-Liu added the sig/scalability label Aug 30, 2016
@huang195
Contributor

@Random-Liu @coufon thanks. It's great to have such a tool to visualize end-to-end operation time on a node.

@coufon
Contributor Author

coufon commented Sep 13, 2016

Closing this issue, as we already support different QPS limits in the node e2e density test; see PR #32250.
