
metrics-server 0.6.x - regular container restarts linked to err="metric collection didn't finish on time" #983

Closed
grunlab opened this issue Mar 16, 2022 · 25 comments · Fixed by #1009
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@grunlab

grunlab commented Mar 16, 2022

Since I upgraded metrics-server from 0.5.x to 0.6.x, the metrics-server container has been restarting frequently (this was not the case with 0.5.x):

Kubernetes version:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:38:05Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:32:02Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}

Metrics-server version:

$ kubectl exec metrics-server-7997cd54bc-vmvv2 -- /metrics-server --version
v0.6.1

For example, 15 restarts in 25 days:

$ kubectl get pod -n metrics-server
NAME                              READY   STATUS    RESTARTS       AGE
metrics-server-7997cd54bc-vmvv2   1/1     Running   15 (34h ago)   25d

metrics-server container logs just before the restart:

I0315 19:51:11.737837       1 server.go:175] "Failed probe" probe="metric-collection-timely" err="metric collection didn't finish on time" duration="31.374773495s" maxDuration="22.5s"
I0315 19:51:21.740150       1 server.go:175] "Failed probe" probe="metric-collection-timely" err="metric collection didn't finish on time" duration="41.377078294s" maxDuration="22.5s"
I0315 19:51:31.737233       1 server.go:175] "Failed probe" probe="metric-collection-timely" err="metric collection didn't finish on time" duration="51.374183641s" maxDuration="22.5s"
I0315 19:51:31.759734       1 object_count_tracker.go:84] "StorageObjectCountTracker pruner is exiting"
I0315 19:51:31.760192       1 configmap_cafile_content.go:222] "Shutting down controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0315 19:51:31.760274       1 configmap_cafile_content.go:222] "Shutting down controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0315 19:51:31.760312       1 requestheader_controller.go:183] Shutting down RequestHeaderAuthRequestController
I0315 19:51:31.760422       1 secure_serving.go:311] Stopped listening on [::]:4443
I0315 19:51:31.760528       1 tlsconfig.go:255] "Shutting down DynamicServingCertificateController"
I0315 19:51:31.760617       1 dynamic_serving_content.go:145] "Shutting down controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"

Is the error "metric collection didn't finish on time" normal?
Can I increase the maxDuration of 22.5s?

Thank you for your support.

@serathius
Contributor

serathius commented Mar 16, 2022

This is the first time I've seen an issue with these probes, so it's very unusual. The probe checks whether the metric collection pipeline is working: Metrics Server runs a dedicated goroutine that is expected to collect and store metrics every 15s, and the probe fails if the collection loop hasn't finished within the last 22.5s (1.5 times 15s). There is no way to increase maxDuration directly, as it's proportional to metric-resolution.
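
For illustration, the check amounts to comparing the time since the last completed collection against 1.5 × metric-resolution. A minimal sketch of that idea (names here are illustrative, not the actual metrics-server source):

package main

import (
	"fmt"
	"sync"
	"time"
)

// collectionProbe mimics the idea behind the "metric-collection-timely" probe:
// it fails once the collection loop hasn't completed within 1.5x the resolution.
type collectionProbe struct {
	mu           sync.Mutex
	lastComplete time.Time
	resolution   time.Duration
}

func (p *collectionProbe) markComplete() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.lastComplete = time.Now()
}

func (p *collectionProbe) check() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	maxDuration := p.resolution * 3 / 2 // 15s resolution -> 22.5s
	if d := time.Since(p.lastComplete); d > maxDuration {
		return fmt.Errorf("metric collection didn't finish on time (duration %s, maxDuration %s)", d, maxDuration)
	}
	return nil
}

func main() {
	p := &collectionProbe{lastComplete: time.Now(), resolution: 15 * time.Second}
	p.markComplete()       // the collection loop would call this after every cycle
	fmt.Println(p.check()) // <nil> while the loop keeps up; an error once it falls behind
}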

This is meant to allow recovery from a deadlock in the metric collection code by restarting metrics-server. There should be no big impact on your cluster, as 15 restarts in 25 days is pretty rare (availability >99.5%); however, it does mean there is a bug in the code, so I would appreciate it if you could help us find the cause.

Could you enable higher log verbosity so we can get a fuller view of the events happening before the restart? Just add -v=6 to the command line flags. I'm looking for the logs Scraping metrics, Storing metrics, and Scraping cycle complete; the last of those that appears should indicate which part of the scraping is deadlocking.
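
For example, assuming the Deployment is named metrics-server and runs in the metrics-server namespace (matching the pod listing above), the flag can be appended to the container args with a JSON patch:

$ kubectl -n metrics-server patch deployment metrics-server --type=json \
    -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "-v=6"}]'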

serathius added the kind/bug label Mar 16, 2022
@serathius
Contributor

Looks like an interesting bug
/cc @yangjunmyfm192085 @dgrisonnet
cc @shuaich

@grunlab
Author

grunlab commented Mar 16, 2022

OK, I've increased the log verbosity to 6.
I'm now waiting for the next container restart.

@grunlab
Author

grunlab commented Mar 16, 2022

OK, the container has restarted 3 times in the meantime:

$ kubectl get pod
NAME                              READY   STATUS    RESTARTS      AGE
metrics-server-685cfd8bdb-tzv9l   1/1     Running   3 (58m ago)   139m

Container start: Wed, 16 Mar 2022 15:06:39
1st restart: Wed, 16 Mar 2022 16:23:47
2nd restart: Wed, 16 Mar 2022 16:26:26
3rd restart: Wed, 16 Mar 2022 16:27:36

Please find attached the logs (including the 3 restarts)
metrics-server.zip

Hope this will help.

@yangjunmyfm192085
Contributor

Looks like an interesting bug /cc @yangjunmyfm192085 @dgrisonnet cc @shuaich

Yeah, I will also try to analyze it

@yangjunmyfm192085
Contributor

yangjunmyfm192085 commented Mar 17, 2022

From the log, it prints:

I0316 16:22:54.629852 1 round_trippers.go:553] GET https://10.125.49.156:10250/metrics/resource 200 OK in 21 milliseconds
I0316 16:22:54.609715 1 round_trippers.go:553] GET https://10.125.57.29:10250/metrics/resource 200 OK in 41 milliseconds
I0316 16:22:54.609545 1 round_trippers.go:553] GET https://10.125.49.145:10250/metrics/resource 200 OK in 32 milliseconds
I0316 16:22:54.609317 1 round_trippers.go:553] GET https://10.125.57.31:10250/metrics/resource 200 OK in 27 milliseconds
I0316 16:22:54.608613 1 scraper.go:137] "Scraping node" node="fr0-vsiaas-8687"
I0316 16:22:54.584143 1 round_trippers.go:553] GET https://10.125.49.253:10250/metrics/resource 200 OK in 9 milliseconds
I0316 16:22:54.582050 1 scraper.go:137] "Scraping node" node="fr0-vsiaas-8328"
I0316 16:22:54.580512 1 round_trippers.go:553] GET https://10.125.49.254:10250/metrics/resource 200 OK in 13 milliseconds
I0316 16:22:54.577120 1 scraper.go:137] "Scraping node" node="fr0-vsiaas-8674"
I0316 16:22:54.574790 1 scraper.go:137] "Scraping node" node="fr0-vsiaas-9363"
I0316 16:22:54.568485 1 scraper.go:137] "Scraping node" node="fr0-vsiaas-8325"
I0316 16:22:54.567136 1 scraper.go:137] "Scraping node" node="fr0-vsiaas-9361"
I0316 16:22:54.562893 1 scraper.go:115] "Scraping metrics from nodes" nodeCount=6

But Scrape finished is not printed.

Most likely the responseChannel channel did not receive all six results, which suggests something is blocking in the collectNode function. What caused it has not been fully analyzed yet.
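
For context, the scrape fan-in works roughly like this: one goroutine per node sends its result on a shared channel, and the collector waits for nodeCount results, so a single worker that never sends keeps "Scrape finished" from ever being logged. A simplified sketch of that pattern, with illustrative names rather than the actual scraper code:

package main

import (
	"context"
	"fmt"
	"time"
)

type nodeResult struct {
	node string
	err  error
}

// scrapeAll mirrors the fan-in pattern under suspicion: if one collectNode-style
// worker never sends on responseChannel, the receive loop below waits forever
// unless the whole cycle is bounded by ctx.
func scrapeAll(ctx context.Context, nodes []string) {
	responseChannel := make(chan nodeResult, len(nodes))
	for _, n := range nodes {
		go func(node string) {
			// Simulate a per-node scrape; a deadlocked worker would simply never send.
			responseChannel <- nodeResult{node: node}
		}(n)
	}
	for range nodes {
		select {
		case res := <-responseChannel:
			fmt.Println("scraped", res.node)
		case <-ctx.Done():
			fmt.Println("scrape cycle aborted:", ctx.Err())
			return
		}
	}
	fmt.Println("Scrape finished")
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	scrapeAll(ctx, []string{"fr0-vsiaas-8325", "fr0-vsiaas-9361"})
}
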
There is also a strange phenomenon: every time the problem occurs, it is accompanied by a large number of log entries like the following:

I0316 16:22:59.043319 1 handler.go:143] metrics-server: GET "/apis/metrics.k8s.io/v1beta1" satisfied by gorestful with webservice /apis/metrics.k8s.io/v1beta1
I0316 16:22:59.025958 1 httplog.go:129] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1" latency="197.592µs" userAgent="Go-http-client/ 2.0" audit-ID="7422b428-bb6e-4ec7-a364-54abbe36efe8" srcIP="10.121.194.65:34646" resp=200
I0316 16:22:59.025891 1 handler.go:143] metrics-server: GET "/apis/metrics.k8s.io/v1beta1" satisfied by gorestful with webservice /apis/metrics.k8s.io/v1beta1
I0316 16:22:59.024333 1 httplog.go:129] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1" latency="194.62µs" userAgent="Go-http-client/ 2.0" audit-ID="eeecd65f-f3e4-43e7-9172-af5cae1d5463" srcIP="10.121.194.65:34646" resp=200
I0316 16:22:59.024271 1 handler.go:143] metrics-server: GET "/apis/metrics.k8s.io/v1beta1" satisfied by gorestful with webservice /apis/metrics.k8s.io/v1beta1
I0316 16:22:59.024021 1 httplog.go:129] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1" latency="280.324µs" userAgent="Go-http-client/ 2.0" audit-ID="b75ca5f9-9678-472a-9530-bb27229b6ffd" srcIP="10.121.194.65:34646" resp=200
I0316 16:22:59.023881 1 handler.go:143] metrics-server: GET "/apis/metrics.k8s.io/v1beta1" satisfied by gorestful with webservice /apis/metrics.k8s.io/v1beta1
I0316 16:22:59.023487 1 httplog.go:129] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1" latency="719.679µs" userAgent="Go-http-client/ 2.0" audit-ID="ed4a3f0f-d3d1-4fcd-b34b-b1f7df522557" srcIP="10.121.194.65:34646" resp=200

I don't know if it's related.

I will continue to analyze in depth.
@grunlab, I wonder whether your environment is otherwise normal. Is the CPU and memory utilization not high?

@grunlab
Author

grunlab commented Mar 17, 2022

@yangjunmyfm192085 : I'm currently testing metrics-server 0.6.x on a staging cluster where there is no real activity. I use this cluster to test upgrades before upgrading my production environment.

The test cluster is composed of 3 masters and 3 workers hosted on small VMs (2 cores / 4GB RAM)

Cluster nodes:

$ kubectl get node -o wide
NAME              STATUS   ROLES                  AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION                CONTAINER-RUNTIME
fr0-vsiaas-8325   Ready    control-plane,master   300d   v1.23.4   10.125.57.29    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.59.1.el7.x86_64   cri-o://1.23.2
fr0-vsiaas-8328   Ready    control-plane,master   300d   v1.23.4   10.125.57.31    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.59.1.el7.x86_64   cri-o://1.23.2
fr0-vsiaas-8674   Ready    control-plane,master   300d   v1.23.4   10.125.49.145   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.59.1.el7.x86_64   cri-o://1.23.2
fr0-vsiaas-8687   Ready    <none>                 300d   v1.23.4   10.125.49.156   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.59.1.el7.x86_64   cri-o://1.23.2
fr0-vsiaas-9361   Ready    <none>                 300d   v1.23.4   10.125.49.254   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.59.1.el7.x86_64   cri-o://1.23.2
fr0-vsiaas-9363   Ready    <none>                 292d   v1.23.4   10.125.49.253   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.59.1.el7.x86_64   cri-o://1.23.2

Nodes current load:

$ kubectl top node
NAME              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
fr0-vsiaas-8325   286m         14%    2342Mi          63%
fr0-vsiaas-8328   274m         13%    2374Mi          64%
fr0-vsiaas-8674   267m         13%    2295Mi          62%
fr0-vsiaas-8687   139m         6%     1920Mi          52%
fr0-vsiaas-9361   102m         5%     1756Mi          47%
fr0-vsiaas-9363   117m         5%     1915Mi          52%

Metrics-server load:

$ kubectl top pod
NAME                              CPU(cores)   MEMORY(bytes)
metrics-server-7997cd54bc-hhh45   6m           17Mi

I noticed that metrics-server container restarts happen more or less when surrounding pods (in other namespaces) are switching to Terminating. Maybe metrics-server is trying to get metrics from containers that no longer exist, which could potentially explain the "metric collection didn't finish on time"?

If you need more logs (maybe at the cri-o or kubelet level), don't hesitate to ask.

@yangjunmyfm192085
Contributor

@grunlab Thank you for your information and analysis.
I will continue to analyze whether the pod restarts cause this problem.
Thank you.

@yangjunmyfm192085
Contributor

One thing that doesn't match your analysis is that the requests to https://x.x.x.x:10250/metrics/resource in the log are all OK, so there should be no scrape timeout problem:

I0316 16:22:54.629852 1 round_trippers.go:553] GET https://10.125.49.156:10250/metrics/resource 200 OK in 21 milliseconds
I0316 16:22:54.609715 1 round_trippers.go:553] GET https://10.125.57.29:10250/metrics/resource 200 OK in 41 milliseconds
I0316 16:22:54.609545 1 round_trippers.go:553] GET https://10.125.49.145:10250/metrics/resource 200 OK in 32 milliseconds
I0316 16:22:54.609317 1 round_trippers.go:553] GET https://10.125.57.31:10250/metrics/resource 200 OK in 27 milliseconds
I0316 16:22:54.584143 1 round_trippers.go:553] GET https://10.125.49.253:10250/metrics/resource 200 OK in 9 milliseconds
I0316 16:22:54.580512 1 round_trippers.go:553] GET https://10.125.49.254:10250/metrics/resource 200 OK in 13 milliseconds

@yangjunmyfm192085
Contributor

After further analysis, it is most likely that the operation _, err = io.Copy(buf, response.Body) is blocking. Can you keep watching the I/O for a while? @grunlab @serathius
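
To illustrate the suspicion: reading the body is a separate step from sending the request, so if that read were not bounded by the request's context it could block indefinitely. A minimal, self-contained sketch (hypothetical stalling server, not the metrics-server client) showing that a deadline attached via req.WithContext also aborts the io.Copy over response.Body:

package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// A server that sends headers immediately but then stalls the body.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.(http.Flusher).Flush() // headers go out right away
		select {
		case <-time.After(10 * time.Second): // the body never arrives in time
		case <-r.Context().Done(): // stop once the client gives up
		}
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	req, _ := http.NewRequest(http.MethodGet, srv.URL, nil)
	response, err := http.DefaultClient.Do(req.WithContext(ctx))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer response.Body.Close()

	buf := &bytes.Buffer{}
	_, err = io.Copy(buf, response.Body) // returns once the 1s deadline fires
	fmt.Println("body read error:", err) // expected: context deadline exceeded
}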

@serathius
Contributor

I expect that reading response.Body can block until the whole response has been returned; however, this call should still return if the request times out. @yangjunmyfm192085 can you double-check that we correctly pass a timeout and/or a context with a timeout when making the request?

@yangjunmyfm192085
Contributor

Yeah. To verify this scenario, I set up an unreachable URL to test and sent two requests in total, one with req.WithContext(ctx) and one without WithContext(ctx). The one with it returns normally, and the one without it gets stuck. This proves that response, err := kc.client.Do(req.WithContext(ctx)) can still return if the request times out.
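
A rough reconstruction of that experiment (the unreachable address and the 2s timeout are placeholders, not the values actually used):

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// 203.0.113.1 is a documentation address that should not answer, so the
	// connection attempt hangs rather than failing fast.
	req, _ := http.NewRequest(http.MethodGet, "https://203.0.113.1:10250/metrics/resource", nil)
	client := &http.Client{} // no Client.Timeout set

	// With a request context carrying a deadline, Do returns once the deadline fires.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	_, err := client.Do(req.WithContext(ctx))
	fmt.Println("with context:", err) // typically "context deadline exceeded" after ~2s

	// Without a deadline, Do blocks until the OS-level connect attempt gives up,
	// which can take minutes: the "stuck" case.
	// _, err = client.Do(req)
}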

@yangjunmyfm192085
Contributor

The response.Body case is not validated yet. I will design a test scenario and verify it tomorrow.

@yangjunmyfm192085
Contributor

@serathius, I did the following verification today:

  • Ran the command tc qdisc del dev ens3 root netem loss xx% on node 175.1.1.7 to set the network packet loss rate
  • Observed the following types of log output; all of them detect the timeout

Type 1 (the type of scenario our analysis focuses on):


I0322 06:18:54.731831       1 server.go:136] "Scraping metrics"
I0322 06:18:54.731944       1 scraper.go:115] "Scraping metrics from nodes" nodeCount=6
I0322 06:18:54.736179       1 scraper.go:137] "Scraping node" node="175.1.1.4"
I0322 06:18:54.743209       1 scraper.go:137] "Scraping node" node="175.1.1.6"
I0322 06:18:54.747620       1 round_trippers.go:553] GET https://175.1.1.6:10250/metrics/resource 200 OK in 4 milliseconds
I0322 06:18:54.749455       1 scraper.go:137] "Scraping node" node="175.1.1.3"
I0322 06:18:54.751015       1 scraper.go:137] "Scraping node" node="175.1.1.2"
I0322 06:18:54.755402       1 round_trippers.go:553] GET https://175.1.1.3:10250/metrics/resource 200 OK in 5 milliseconds
I0322 06:18:54.756747       1 round_trippers.go:553] GET https://175.1.1.4:10250/metrics/resource 200 OK in 20 milliseconds
I0322 06:18:54.760521       1 round_trippers.go:553] GET https://175.1.1.2:10250/metrics/resource 200 OK in 9 milliseconds
I0322 06:18:54.765058       1 scraper.go:137] "Scraping node" node="175.1.1.7"
I0322 06:18:54.767635       1 scraper.go:137] "Scraping node" node="175.1.1.5"
I0322 06:18:54.832036       1 round_trippers.go:553] GET https://175.1.1.5:10250/metrics/resource 200 OK in 64 milliseconds
I0322 06:19:01.156008       1 round_trippers.go:553] GET https://175.1.1.7:10250/metrics/resource 200 OK in 6390 milliseconds
I0322 06:19:03.492500       1 handler.go:153] metrics-server: GET "/readyz" satisfied by nonGoRestful
I0322 06:19:03.492545       1 pathrecorder.go:240] metrics-server: "/readyz" satisfied by exact match
I0322 06:19:03.492664       1 shared_informer.go:270] caches populated
I0322 06:19:03.492846       1 httplog.go:129] "HTTP" verb="GET" URI="/readyz" latency="1.028498ms" userAgent="kube-probe/1.22" audit-ID="8a194b7b-5933-4ece-8a98-97e48b4b4824" srcIP="175.1.1.2:43136" resp=200
I0322 06:19:03.497867       1 handler.go:153] metrics-server: GET "/livez" satisfied by nonGoRestful
I0322 06:19:03.497891       1 pathrecorder.go:240] metrics-server: "/livez" satisfied by exact match
I0322 06:19:03.498026       1 httplog.go:129] "HTTP" verb="GET" URI="/livez" latency="310.031µs" userAgent="kube-probe/1.22" audit-ID="79afaef7-cfa9-4b09-a40b-b03e82f98f5c" srcIP="175.1.1.2:43138" resp=200
E0322 06:19:08.233861       1 scraper.go:140] "Failed to scrape node" err="failed to read response body - context deadline exceeded" node="175.1.1.7"
I0322 06:19:08.233965       1 scraper.go:172] "Scrape finished" duration="13.501951875s" nodeCount=5 podCount=55
I0322 06:19:08.233990       1 server.go:139] "Storing metrics"

Type 2:


I0322 06:05:54.732545       1 server.go:136] "Scraping metrics"
I0322 06:05:54.732692       1 scraper.go:115] "Scraping metrics from nodes" nodeCount=6
I0322 06:05:54.737022       1 scraper.go:137] "Scraping node" node="175.1.1.2"
I0322 06:05:54.749244       1 round_trippers.go:553] GET https://175.1.1.2:10250/metrics/resource 200 OK in 11 milliseconds
I0322 06:05:54.753256       1 scraper.go:137] "Scraping node" node="175.1.1.5"
I0322 06:05:54.754775       1 scraper.go:137] "Scraping node" node="175.1.1.7"
I0322 06:05:54.760902       1 scraper.go:137] "Scraping node" node="175.1.1.6"
I0322 06:05:54.762340       1 scraper.go:137] "Scraping node" node="175.1.1.4"
I0322 06:05:54.764666       1 round_trippers.go:553] GET https://175.1.1.6:10250/metrics/resource 200 OK in 3 milliseconds
I0322 06:05:54.770056       1 scraper.go:137] "Scraping node" node="175.1.1.3"
I0322 06:05:54.777329       1 round_trippers.go:553] GET https://175.1.1.4:10250/metrics/resource 200 OK in 14 milliseconds
I0322 06:05:54.787128       1 round_trippers.go:553] GET https://175.1.1.3:10250/metrics/resource 200 OK in 16 milliseconds
I0322 06:05:54.830223       1 round_trippers.go:553] GET https://175.1.1.5:10250/metrics/resource 200 OK in 76 milliseconds
I0322 06:05:54.831374       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:00:48.001162496 +0000 UTC Timestamp:2022-03-22 06:05:53.414 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:05:54.831421       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-svd7x"
I0322 06:05:54.831499       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:00:52.994269696 +0000 UTC Timestamp:2022-03-22 06:05:35.57 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:05:54.831519       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-wtkkr"
I0322 06:05:54.831537       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:00:51.989779712 +0000 UTC Timestamp:2022-03-22 06:05:31.442 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:05:54.831553       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-979rh"
I0322 06:06:03.489525       1 handler.go:153] metrics-server: GET "/readyz" satisfied by nonGoRestful
I0322 06:06:03.489611       1 pathrecorder.go:240] metrics-server: "/readyz" satisfied by exact match
I0322 06:06:03.489767       1 shared_informer.go:270] caches populated
I0322 06:06:03.490503       1 httplog.go:129] "HTTP" verb="GET" URI="/readyz" latency="1.626018ms" userAgent="kube-probe/1.22" audit-ID="5571f982-c327-466f-a5e3-45651a15eb1b" srcIP="175.1.1.2:40562" resp=200
I0322 06:06:03.500569       1 handler.go:153] metrics-server: GET "/livez" satisfied by nonGoRestful
I0322 06:06:03.500606       1 pathrecorder.go:240] metrics-server: "/livez" satisfied by exact match
I0322 06:06:03.500968       1 httplog.go:129] "HTTP" verb="GET" URI="/livez" latency="409.68µs" userAgent="kube-probe/1.22" audit-ID="4bf8a843-331a-4a33-959c-40698d48e17f" srcIP="175.1.1.2:40564" resp=200
I0322 06:06:08.235057       1 round_trippers.go:553] GET https://175.1.1.7:10250/metrics/resource  in 13479 milliseconds
E0322 06:06:08.235405       1 scraper.go:140] "Failed to scrape node" err="Get \"https://175.1.1.7:10250/metrics/resource\": context deadline exceeded" node="175.1.1.7"
I0322 06:06:08.235524       1 scraper.go:172] "Scrape finished" duration="13.502779912s" nodeCount=5 podCount=62
I0322 06:06:08.235565       1 server.go:139] "Storing metrics"
I0322 06:06:08.236173       1 server.go:144] "Scraping cycle complete"
I0322 06:06:09.732621       1 server.go:136] "Scraping metrics"


Type 3:

I0322 06:09:24.732053       1 server.go:136] "Scraping metrics"
I0322 06:09:24.732202       1 scraper.go:115] "Scraping metrics from nodes" nodeCount=6
I0322 06:09:24.734745       1 scraper.go:137] "Scraping node" node="175.1.1.7"
I0322 06:09:24.752423       1 scraper.go:137] "Scraping node" node="175.1.1.2"
I0322 06:09:24.758202       1 scraper.go:137] "Scraping node" node="175.1.1.5"
I0322 06:09:24.767895       1 scraper.go:137] "Scraping node" node="175.1.1.4"
I0322 06:09:24.778185       1 scraper.go:137] "Scraping node" node="175.1.1.6"
I0322 06:09:24.778997       1 round_trippers.go:553] GET https://175.1.1.5:10250/metrics/resource 200 OK in 20 milliseconds
I0322 06:09:24.780647       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:03:54.989614848 +0000 UTC Timestamp:2022-03-22 06:08:55.403 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.780703       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-ctqwf"
I0322 06:09:24.780726       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:04:19.002719488 +0000 UTC Timestamp:2022-03-22 06:09:09.554 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.780753       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-87kg5"
I0322 06:09:24.780976       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:04:23.059934464 +0000 UTC Timestamp:2022-03-22 06:09:12.852 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.781017       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-zghvd"
I0322 06:09:24.781034       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:04:09.99024384 +0000 UTC Timestamp:2022-03-22 06:09:02.62 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.781048       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-pkbcz"
I0322 06:09:24.781064       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:04:21.99191424 +0000 UTC Timestamp:2022-03-22 06:09:16.611 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.781152       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-9nl4v"
I0322 06:09:24.781170       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:04:07.988933888 +0000 UTC Timestamp:2022-03-22 06:09:07.516 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.781222       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-j9sf2"
I0322 06:09:24.781239       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:03:51.99124352 +0000 UTC Timestamp:2022-03-22 06:09:14.357 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.781253       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-2hbqz"
I0322 06:09:24.781265       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:04:18.992357888 +0000 UTC Timestamp:2022-03-22 06:08:51.78 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.781278       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-wg4t6"
I0322 06:09:24.781308       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:03:51.990928896 +0000 UTC Timestamp:2022-03-22 06:09:08.047 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.781322       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-b9c5z"
I0322 06:09:24.781465       1 scraper.go:137] "Scraping node" node="175.1.1.3"
I0322 06:09:24.790980       1 round_trippers.go:553] GET https://175.1.1.4:10250/metrics/resource 200 OK in 22 milliseconds
I0322 06:09:24.799395       1 round_trippers.go:553] GET https://175.1.1.3:10250/metrics/resource 200 OK in 17 milliseconds
I0322 06:09:24.807997       1 round_trippers.go:553] GET https://175.1.1.6:10250/metrics/resource 200 OK in 29 milliseconds
I0322 06:09:24.810657       1 round_trippers.go:553] GET https://175.1.1.2:10250/metrics/resource 200 OK in 58 milliseconds
I0322 06:09:33.493199       1 handler.go:153] metrics-server: GET "/livez" satisfied by nonGoRestful
I0322 06:09:33.493339       1 pathrecorder.go:240] metrics-server: "/livez" satisfied by exact match
I0322 06:09:33.493696       1 httplog.go:129] "HTTP" verb="GET" URI="/livez" latency="1.25371ms" userAgent="kube-probe/1.22" audit-ID="387b5689-d686-4891-b8be-9ccb6ac89b18" srcIP="175.1.1.2:41258" resp=200
I0322 06:09:33.507234       1 handler.go:153] metrics-server: GET "/readyz" satisfied by nonGoRestful
I0322 06:09:33.507298       1 pathrecorder.go:240] metrics-server: "/readyz" satisfied by exact match
I0322 06:09:33.507419       1 shared_informer.go:270] caches populated
I0322 06:09:33.507601       1 httplog.go:129] "HTTP" verb="GET" URI="/readyz" latency="590.117µs" userAgent="kube-probe/1.22" audit-ID="bd3829b7-f9c5-48c9-a775-02a4b85f3d9e" srcIP="175.1.1.2:41256" resp=200
I0322 06:09:37.896336       1 round_trippers.go:553] GET https://175.1.1.7:10250/metrics/resource 401 Unauthorized in 13161 milliseconds
E0322 06:09:37.896905       1 scraper.go:140] "Failed to scrape node" err="request failed, status: \"401 Unauthorized\"" node="175.1.1.7"
I0322 06:09:37.897026       1 scraper.go:172] "Scrape finished" duration="13.164760946s" nodeCount=5 podCount=52
I0322 06:09:37.897055       1 server.go:139] "Storing metrics"
I0322 06:09:37.897675       1 server.go:144] "Scraping cycle complete"

@yangjunmyfm192085
Contributor

Hi @grunlab, @serathius, I haven't made any progress on this issue lately.
Does the problem still exist in the environment? Can metrics-server get metrics normally when no pod is restarting? @grunlab

@grunlab
Author

grunlab commented Mar 28, 2022

@yangjunmyfm192085
The problem is still present.
The metrics-server container mainly restarts when other pods are deployed or stopped on the cluster, but it also happens (not as often) when nothing special is going on.
I suspect the issue could be linked to occasional disk I/O performance problems on the VMs hosting the cluster (the VMs are hosted on a shared ESX infrastructure).
In the meantime, I've deployed disk I/O monitoring on each node to see whether there is a relation between poor disk I/O performance and metrics-server container restarts ... we will see.
I'll keep you informed when I have some figures to share with you.

@evheniyt

I also have a problem with metrics-server restarts.
Kubernetes version: v1.23.1
metrics-server version: 0.6.1

I0328 18:33:58.224925       1 handler.go:153] metrics-server: GET "/livez" satisfied by nonGoRestful
I0328 18:33:58.224942       1 pathrecorder.go:240] metrics-server: "/livez" satisfied by exact match
I0328 18:33:58.224987       1 server.go:175] "Failed probe" probe="metric-collection-timely" err="metric collection didn't finish on time" duration="1m57.873242235s" maxDuration="1m30s"
I0328 18:33:58.225000       1 healthz.go:257] metric-collection-timely check failed: livez
[-]metric-collection-timely failed: metric collection didn't finish on time
I0328 18:33:58.225046       1 httplog.go:129] "HTTP" verb="GET" URI="/livez" latency="171.062µs" userAgent="kube-probe/1.23" audit-ID="fee3cd19-4487-4afc-b8fb-74a171c29206" srcIP="10.244.0.1:42424" resp=0
I0328 18:33:58.225138       1 handler.go:153] metrics-server: GET "/readyz" satisfied by nonGoRestful
I0328 18:33:58.225150       1 pathrecorder.go:240] metrics-server: "/readyz" satisfied by exact match
I0328 18:33:58.225195       1 shared_informer.go:270] caches populated
I0328 18:33:58.225243       1 httplog.go:129] "HTTP" verb="GET" URI="/readyz" latency="740.728µs" userAgent="kube-probe/1.23" audit-ID="ffdc70c8-b9c5-4d48-8ba9-a34829778622" srcIP="10.244.0.1:42422" resp=200
I0328 18:33:58.233404       1 genericapiserver.go:470] [graceful-termination] RunPreShutdownHooks has completed
I0328 18:33:58.233407       1 genericapiserver.go:419] "[graceful-termination] shutdown event" name="ShutdownInitiated"
I0328 18:33:58.233423       1 object_count_tracker.go:84] "StorageObjectCountTracker pruner is exiting"
I0328 18:33:58.233430       1 genericapiserver.go:422] "[graceful-termination] shutdown event" name="AfterShutdownDelayDuration"
I0328 18:33:58.233438       1 reflector.go:225] Stopping reflector *v1.Node (0s) from pkg/mod/k8s.io/client-go@v0.23.2/tools/cache/reflector.go:167
I0328 18:33:58.233453       1 genericapiserver.go:460] "[graceful-termination] shutdown event" name="InFlightRequestsDrained"
I0328 18:33:58.233500       1 tlsconfig.go:255] "Shutting down DynamicServingCertificateController"
I0328 18:33:58.233574       1 reflector.go:225] Stopping reflector *v1.ConfigMap (12h0m0s) from pkg/mod/k8s.io/client-go@v0.23.2/tools/cache/reflector.go:167
I0328 18:33:58.233506       1 requestheader_controller.go:183] Shutting down RequestHeaderAuthRequestController
I0328 18:33:58.233513       1 configmap_cafile_content.go:222] "Shutting down controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0328 18:33:58.233518       1 secure_serving.go:311] Stopped listening on [::]:8443
I0328 18:33:58.233520       1 configmap_cafile_content.go:222] "Shutting down controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0328 18:33:58.233581       1 dynamic_serving_content.go:145] "Shutting down controller" name="serving-cert::apiserver.local.config/certificates/apiserver.crt::apiserver.local.config/certificates/apiserver.key"
I0328 18:33:58.233588       1 reflector.go:225] Stopping reflector *v1.PartialObjectMetadata (0s) from pkg/mod/k8s.io/client-go@v0.23.2/tools/cache/reflector.go:167
I0328 18:33:58.233622       1 reflector.go:225] Stopping reflector *v1.ConfigMap (12h0m0s) from pkg/mod/k8s.io/client-go@v0.23.2/tools/cache/reflector.go:167
I0328 18:33:58.233676       1 genericapiserver.go:448] "[graceful-termination] shutdown event" name="HTTPServerStoppedListening"
I0328 18:33:58.233679       1 reflector.go:225] Stopping reflector *v1.ConfigMap (12h0m0s) from pkg/mod/k8s.io/client-go@v0.23.2/tools/cache/reflector.go:167
I0328 18:33:59.309314       1 genericapiserver.go:477] [graceful-termination] apiserver is exiting

Attaching full logs
metrics.logs.zip

@yangjunmyfm192085
Contributor

The logs show similar failures.

@grunlab
Author

grunlab commented Apr 14, 2022

FYI, since I deployed the disk I/O monitoring on all the cluster nodes, I haven't found any correlation between disk I/O spikes and metrics-server container restarts ... :-(

@yangjunmyfm192085
Contributor

Thanks for sharing the information

@yangjunmyfm192085
Contributor

Hi @grunlab, could you help confirm whether the issue is similar to the following?
#907 (comment)

@grunlab
Author

grunlab commented Apr 20, 2022

@yangjunmyfm192085

The issue does not look similar to #907.

I'm getting the container resources instantly on each cluster node when running this:

kubectl proxy
kubectl get --raw /api/v1/nodes/<cluster_nodes>/proxy/metrics/resource

I've also tried increasing --metric-resolution from 15s to 30s ... same issue, the metrics-server container is still restarting frequently :-(

@yangjunmyfm192085
Contributor

OK, so it looks like it might not be the same issue.

@AndrewSav

Good to know that the fix is released, but do we know what was causing the issue? Specifically, why were only some people observing the issue while others were not?

@grunlab
Author

grunlab commented Dec 13, 2022

FYI, no more container restarts since upgrading to 0.6.2 :-)
