
metrics-server 0.6.x - regular container restarts linked to err="metric collection didn't finish on time" #983

Closed
grunlab opened this issue Mar 16, 2022 · 25 comments · Fixed by #1009
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@grunlab

grunlab commented Mar 16, 2022

Since I upgraded metrics-server from 0.5.x to 0.6.x, the metrics-server container has been restarting frequently (this was not the case with 0.5.x):

Kubernetes version:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:38:05Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:32:02Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}

Metrics-server version:

$ kubectl exec metrics-server-7997cd54bc-vmvv2 -- /metrics-server --version
v0.6.1

For example, 15 restarts in 25 days:

$ kubectl get pod -n metrics-server
NAME                              READY   STATUS    RESTARTS       AGE
metrics-server-7997cd54bc-vmvv2   1/1     Running   15 (34h ago)   25d

metrics-server container logs just before the restart:

I0315 19:51:11.737837       1 server.go:175] "Failed probe" probe="metric-collection-timely" err="metric collection didn't finish on time" duration="31.374773495s" maxDuration="22.5s"
I0315 19:51:21.740150       1 server.go:175] "Failed probe" probe="metric-collection-timely" err="metric collection didn't finish on time" duration="41.377078294s" maxDuration="22.5s"
I0315 19:51:31.737233       1 server.go:175] "Failed probe" probe="metric-collection-timely" err="metric collection didn't finish on time" duration="51.374183641s" maxDuration="22.5s"
I0315 19:51:31.759734       1 object_count_tracker.go:84] "StorageObjectCountTracker pruner is exiting"
I0315 19:51:31.760192       1 configmap_cafile_content.go:222] "Shutting down controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0315 19:51:31.760274       1 configmap_cafile_content.go:222] "Shutting down controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0315 19:51:31.760312       1 requestheader_controller.go:183] Shutting down RequestHeaderAuthRequestController
I0315 19:51:31.760422       1 secure_serving.go:311] Stopped listening on [::]:4443
I0315 19:51:31.760528       1 tlsconfig.go:255] "Shutting down DynamicServingCertificateController"
I0315 19:51:31.760617       1 dynamic_serving_content.go:145] "Shutting down controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"

Is the error "metric collection didn't finish on time" normal?
Can I increase the maxDuration of 22.5s?

Thank you for your support.

@serathius
Contributor

serathius commented Mar 16, 2022

This is the first time I've seen an issue with these probes, so it's very unusual. The probe checks whether the metric collection pipeline is working: Metrics Server runs a dedicated goroutine that is expected to collect and store metrics every 15s, and the probe fails if the collection loop hasn't finished within the last 22.5s (1.5 times 15s). There is no way to increase maxDuration directly, as it's proportional to metric-resolution.
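
For illustration, the check amounts to comparing the time since the last completed collection against 1.5 × metric-resolution. A minimal sketch of that idea (names here are illustrative, not the actual metrics-server source):

package main

import (
	"fmt"
	"sync"
	"time"
)

// collectionProbe mimics the idea behind the "metric-collection-timely" probe:
// it fails once the collection loop hasn't completed within 1.5x the resolution.
type collectionProbe struct {
	mu           sync.Mutex
	lastComplete time.Time
	resolution   time.Duration
}

func (p *collectionProbe) markComplete() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.lastComplete = time.Now()
}

func (p *collectionProbe) check() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	maxDuration := p.resolution * 3 / 2 // 15s resolution -> 22.5s
	if d := time.Since(p.lastComplete); d > maxDuration {
		return fmt.Errorf("metric collection didn't finish on time (duration %s, maxDuration %s)", d, maxDuration)
	}
	return nil
}

func main() {
	p := &collectionProbe{lastComplete: time.Now(), resolution: 15 * time.Second}
	p.markComplete()       // the collection loop would call this after every cycle
	fmt.Println(p.check()) // <nil> while the loop keeps up; an error once it falls behind
}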

This is meant to allow recovery from a deadlock in the metric collection code by restarting metrics-server. There should be no big impact on your cluster, as 15 restarts in 25 days is pretty rare (availability >99.5%); however, it does mean there is a bug in the code, so I would appreciate it if you could help us find the cause.

Could you enable higher log verbosity so we can get a fuller view of the events happening before the restart? Just add -v=6 to the command line flags. I'm looking for the logs Scraping metrics, Storing metrics, and Scraping cycle complete; the last of those that appears should indicate which part of the scraping is deadlocking.
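
For example, assuming the Deployment is named metrics-server and runs in the metrics-server namespace (matching the pod listing above), the flag can be appended to the container args with a JSON patch:

$ kubectl -n metrics-server patch deployment metrics-server --type=json \
    -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "-v=6"}]'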

serathius added the kind/bug label Mar 16, 2022
@serathius
Contributor

Looks like an interesting bug
/cc @yangjunmyfm192085 @dgrisonnet
cc @shuaich

@grunlab
Author

grunlab commented Mar 16, 2022

OK, I've increased the log verbosity to 6.
I'm now waiting for the next container restart.

@grunlab
Author

grunlab commented Mar 16, 2022

OK, the container has restarted 3 times in the meantime:

$ kubectl get pod
NAME                              READY   STATUS    RESTARTS      AGE
metrics-server-685cfd8bdb-tzv9l   1/1     Running   3 (58m ago)   139m

Container start: Wed, 16 Mar 2022 15:06:39
1st restart: Wed, 16 Mar 2022 16:23:47
2nd restart: Wed, 16 Mar 2022 16:26:26
3rd restart: Wed, 16 Mar 2022 16:27:36

Please find attached the logs (including the 3 restarts)
metrics-server.zip

Hope this will help.

@yangjunmyfm192085
Contributor

Looks like an interesting bug /cc @yangjunmyfm192085 @dgrisonnet cc @shuaich

Yeah, I will also try to analyze it

@yangjunmyfm192085
Contributor

yangjunmyfm192085 commented Mar 17, 2022

From the log, it prints:

I0316 16:22:54.629852 1 round_trippers.go:553] GET https://10.125.49.156:10250/metrics/resource 200 OK in 21 milliseconds
I0316 16:22:54.609715 1 round_trippers.go:553] GET https://10.125.57.29:10250/metrics/resource 200 OK in 41 milliseconds
I0316 16:22:54.609545 1 round_trippers.go:553] GET https://10.125.49.145:10250/metrics/resource 200 OK in 32 milliseconds
I0316 16:22:54.609317 1 round_trippers.go:553] GET https://10.125.57.31:10250/metrics/resource 200 OK in 27 milliseconds
I0316 16:22:54.608613 1 scraper.go:137] "Scraping node" node="fr0-vsiaas-8687"
I0316 16:22:54.584143 1 round_trippers.go:553] GET https://10.125.49.253:10250/metrics/resource 200 OK in 9 milliseconds
I0316 16:22:54.582050 1 scraper.go:137] "Scraping node" node="fr0-vsiaas-8328"
I0316 16:22:54.580512 1 round_trippers.go:553] GET https://10.125.49.254:10250/metrics/resource 200 OK in 13 milliseconds
I0316 16:22:54.577120 1 scraper.go:137] "Scraping node" node="fr0-vsiaas-8674"
I0316 16:22:54.574790 1 scraper.go:137] "Scraping node" node="fr0-vsiaas-9363"
I0316 16:22:54.568485 1 scraper.go:137] "Scraping node" node="fr0-vsiaas-8325"
I0316 16:22:54.567136 1 scraper.go:137] "Scraping node" node="fr0-vsiaas-9361"
I0316 16:22:54.562893 1 scraper.go:115] "Scraping metrics from nodes" nodeCount=6

But Scrape finished is not printed.

Most likely the responseChannel channel did not receive all six results, which suggests something is blocking in the collectNode function. What caused it has not been fully analyzed yet.
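
For context, the scrape fan-in works roughly like this: one goroutine per node sends its result on a shared channel, and the collector waits for nodeCount results, so a single worker that never sends keeps "Scrape finished" from ever being logged. A simplified sketch of that pattern, with illustrative names rather than the actual scraper code:

package main

import (
	"context"
	"fmt"
	"time"
)

type nodeResult struct {
	node string
	err  error
}

// scrapeAll mirrors the fan-in pattern under suspicion: if one collectNode-style
// worker never sends on responseChannel, the receive loop below waits forever
// unless the whole cycle is bounded by ctx.
func scrapeAll(ctx context.Context, nodes []string) {
	responseChannel := make(chan nodeResult, len(nodes))
	for _, n := range nodes {
		go func(node string) {
			// Simulate a per-node scrape; a deadlocked worker would simply never send.
			responseChannel <- nodeResult{node: node}
		}(n)
	}
	for range nodes {
		select {
		case res := <-responseChannel:
			fmt.Println("scraped", res.node)
		case <-ctx.Done():
			fmt.Println("scrape cycle aborted:", ctx.Err())
			return
		}
	}
	fmt.Println("Scrape finished")
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	scrapeAll(ctx, []string{"fr0-vsiaas-8325", "fr0-vsiaas-9361"})
}
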
There is also a strange phenomenon: every time the problem occurs, it is accompanied by a large number of log entries like the following:

I0316 16:22:59.043319 1 handler.go:143] metrics-server: GET "/apis/metrics.k8s.io/v1beta1" satisfied by gorestful with webservice /apis/metrics.k8s.io/v1beta1
I0316 16:22:59.025958 1 httplog.go:129] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1" latency="197.592µs" userAgent="Go-http-client/ 2.0" audit-ID="7422b428-bb6e-4ec7-a364-54abbe36efe8" srcIP="10.121.194.65:34646" resp=200
I0316 16:22:59.025891 1 handler.go:143] metrics-server: GET "/apis/metrics.k8s.io/v1beta1" satisfied by gorestful with webservice /apis/metrics.k8s.io/v1beta1
I0316 16:22:59.024333 1 httplog.go:129] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1" latency="194.62µs" userAgent="Go-http-client/ 2.0" audit-ID="eeecd65f-f3e4-43e7-9172-af5cae1d5463" srcIP="10.121.194.65:34646" resp=200
I0316 16:22:59.024271 1 handler.go:143] metrics-server: GET "/apis/metrics.k8s.io/v1beta1" satisfied by gorestful with webservice /apis/metrics.k8s.io/v1beta1
I0316 16:22:59.024021 1 httplog.go:129] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1" latency="280.324µs" userAgent="Go-http-client/ 2.0" audit-ID="b75ca5f9-9678-472a-9530-bb27229b6ffd" srcIP="10.121.194.65:34646" resp=200
I0316 16:22:59.023881 1 handler.go:143] metrics-server: GET "/apis/metrics.k8s.io/v1beta1" satisfied by gorestful with webservice /apis/metrics.k8s.io/v1beta1
I0316 16:22:59.023487 1 httplog.go:129] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1" latency="719.679µs" userAgent="Go-http-client/ 2.0" audit-ID="ed4a3f0f-d3d1-4fcd-b34b-b1f7df522557" srcIP="10.121.194.65:34646" resp=200

I don't know if it's related.

I will continue to analyze in depth.
@grunlab, I wonder whether your environment is otherwise normal. Is the CPU and memory utilization not high?

@grunlab
Author

grunlab commented Mar 17, 2022

@yangjunmyfm192085 : I'm currently testing metrics-server 0.6.x on a staging cluster where there is no real activity. I use this cluster to test upgrades before upgrading my production environment.

The test cluster is composed of 3 masters and 3 workers hosted on small VMs (2 cores / 4GB RAM)

Cluster nodes:

$ kubectl get node -o wide
NAME              STATUS   ROLES                  AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION                CONTAINER-RUNTIME
fr0-vsiaas-8325   Ready    control-plane,master   300d   v1.23.4   10.125.57.29    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.59.1.el7.x86_64   cri-o://1.23.2
fr0-vsiaas-8328   Ready    control-plane,master   300d   v1.23.4   10.125.57.31    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.59.1.el7.x86_64   cri-o://1.23.2
fr0-vsiaas-8674   Ready    control-plane,master   300d   v1.23.4   10.125.49.145   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.59.1.el7.x86_64   cri-o://1.23.2
fr0-vsiaas-8687   Ready    <none>                 300d   v1.23.4   10.125.49.156   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.59.1.el7.x86_64   cri-o://1.23.2
fr0-vsiaas-9361   Ready    <none>                 300d   v1.23.4   10.125.49.254   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.59.1.el7.x86_64   cri-o://1.23.2
fr0-vsiaas-9363   Ready    <none>                 292d   v1.23.4   10.125.49.253   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.59.1.el7.x86_64   cri-o://1.23.2

Nodes current load:

$ kubectl top node
NAME              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
fr0-vsiaas-8325   286m         14%    2342Mi          63%
fr0-vsiaas-8328   274m         13%    2374Mi          64%
fr0-vsiaas-8674   267m         13%    2295Mi          62%
fr0-vsiaas-8687   139m         6%     1920Mi          52%
fr0-vsiaas-9361   102m         5%     1756Mi          47%
fr0-vsiaas-9363   117m         5%     1915Mi          52%

Metrics-server load:

$ kubectl top pod
NAME                              CPU(cores)   MEMORY(bytes)
metrics-server-7997cd54bc-hhh45   6m           17Mi

I noticed that metrics-server container restarts happen more or less when surrounding pods (in other namespaces) are switching to Terminating. Maybe metrics-server is trying to get metrics from containers that no longer exist, which could potentially explain the "metric collection didn't finish on time"?

If you need more logs (maybe at the cri-o or kubelet level), don't hesitate to ask.

@yangjunmyfm192085
Contributor

@grunlab Thank you for your information and analysis.
I will continue to analyze whether the pod restarts cause this problem.
Thank you.

@yangjunmyfm192085
Contributor

One thing that doesn't match your analysis is that the requests to https://x.x.x.x:10250/metrics/resource in the log are all OK, so there should be no scrape timeout problem:

I0316 16:22:54.629852 1 round_trippers.go:553] GET https://10.125.49.156:10250/metrics/resource 200 OK in 21 milliseconds
I0316 16:22:54.609715 1 round_trippers.go:553] GET https://10.125.57.29:10250/metrics/resource 200 OK in 41 milliseconds
I0316 16:22:54.609545 1 round_trippers.go:553] GET https://10.125.49.145:10250/metrics/resource 200 OK in 32 milliseconds
I0316 16:22:54.609317 1 round_trippers.go:553] GET https://10.125.57.31:10250/metrics/resource 200 OK in 27 milliseconds
I0316 16:22:54.584143 1 round_trippers.go:553] GET https://10.125.49.253:10250/metrics/resource 200 OK in 9 milliseconds
I0316 16:22:54.580512 1 round_trippers.go:553] GET https://10.125.49.254:10250/metrics/resource 200 OK in 13 milliseconds

@yangjunmyfm192085
Contributor

After further analysis, it is most likely that the operation _, err = io.Copy(buf, response.Body) is blocking. Can you keep watching the I/O for a while? @grunlab @serathius
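
To illustrate the suspicion: reading the body is a separate step from sending the request, so if that read were not bounded by the request's context it could block indefinitely. A minimal, self-contained sketch (hypothetical stalling server, not the metrics-server client) showing that a deadline attached via req.WithContext also aborts the io.Copy over response.Body:

package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// A server that sends headers immediately but then stalls the body.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.(http.Flusher).Flush() // headers go out right away
		select {
		case <-time.After(10 * time.Second): // the body never arrives in time
		case <-r.Context().Done(): // stop once the client gives up
		}
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	req, _ := http.NewRequest(http.MethodGet, srv.URL, nil)
	response, err := http.DefaultClient.Do(req.WithContext(ctx))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer response.Body.Close()

	buf := &bytes.Buffer{}
	_, err = io.Copy(buf, response.Body) // returns once the 1s deadline fires
	fmt.Println("body read error:", err) // expected: context deadline exceeded
}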

@serathius
Contributor

I expect that reading response.Body can block until the whole response has been returned; however, this call should still return if the request times out. @yangjunmyfm192085 can you double-check that we correctly pass a timeout and/or a context with a timeout when making the request?

@yangjunmyfm192085
Contributor

Yeah. To verify this scenario, I set up an unreachable URL to test and sent two requests in total, one with req.WithContext(ctx) and one without WithContext(ctx). The one with it returns normally, and the one without it gets stuck. This proves that response, err := kc.client.Do(req.WithContext(ctx)) can still return if the request times out.
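
A rough reconstruction of that experiment (the unreachable address and the 2s timeout are placeholders, not the values actually used):

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// 203.0.113.1 is a documentation address that should not answer, so the
	// connection attempt hangs rather than failing fast.
	req, _ := http.NewRequest(http.MethodGet, "https://203.0.113.1:10250/metrics/resource", nil)
	client := &http.Client{} // no Client.Timeout set

	// With a request context carrying a deadline, Do returns once the deadline fires.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	_, err := client.Do(req.WithContext(ctx))
	fmt.Println("with context:", err) // typically "context deadline exceeded" after ~2s

	// Without a deadline, Do blocks until the OS-level connect attempt gives up,
	// which can take minutes: the "stuck" case.
	// _, err = client.Do(req)
}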

@yangjunmyfm192085
Contributor

The response.Body case is not validated yet. I will design a test scenario and verify it tomorrow.

@yangjunmyfm192085
Contributor

@serathius, I did the following verification today:

  • Ran the command tc qdisc del dev ens3 root netem loss xx% on node 175.1.1.7 to set the network packet loss rate
  • Observed the following types of log output; all of them detect the timeout

Type 1 (the type of scenario our analysis focuses on):


I0322 06:18:54.731831       1 server.go:136] "Scraping metrics"
I0322 06:18:54.731944       1 scraper.go:115] "Scraping metrics from nodes" nodeCount=6
I0322 06:18:54.736179       1 scraper.go:137] "Scraping node" node="175.1.1.4"
I0322 06:18:54.743209       1 scraper.go:137] "Scraping node" node="175.1.1.6"
I0322 06:18:54.747620       1 round_trippers.go:553] GET https://175.1.1.6:10250/metrics/resource 200 OK in 4 milliseconds
I0322 06:18:54.749455       1 scraper.go:137] "Scraping node" node="175.1.1.3"
I0322 06:18:54.751015       1 scraper.go:137] "Scraping node" node="175.1.1.2"
I0322 06:18:54.755402       1 round_trippers.go:553] GET https://175.1.1.3:10250/metrics/resource 200 OK in 5 milliseconds
I0322 06:18:54.756747       1 round_trippers.go:553] GET https://175.1.1.4:10250/metrics/resource 200 OK in 20 milliseconds
I0322 06:18:54.760521       1 round_trippers.go:553] GET https://175.1.1.2:10250/metrics/resource 200 OK in 9 milliseconds
I0322 06:18:54.765058       1 scraper.go:137] "Scraping node" node="175.1.1.7"
I0322 06:18:54.767635       1 scraper.go:137] "Scraping node" node="175.1.1.5"
I0322 06:18:54.832036       1 round_trippers.go:553] GET https://175.1.1.5:10250/metrics/resource 200 OK in 64 milliseconds
I0322 06:19:01.156008       1 round_trippers.go:553] GET https://175.1.1.7:10250/metrics/resource 200 OK in 6390 milliseconds
I0322 06:19:03.492500       1 handler.go:153] metrics-server: GET "/readyz" satisfied by nonGoRestful
I0322 06:19:03.492545       1 pathrecorder.go:240] metrics-server: "/readyz" satisfied by exact match
I0322 06:19:03.492664       1 shared_informer.go:270] caches populated
I0322 06:19:03.492846       1 httplog.go:129] "HTTP" verb="GET" URI="/readyz" latency="1.028498ms" userAgent="kube-probe/1.22" audit-ID="8a194b7b-5933-4ece-8a98-97e48b4b4824" srcIP="175.1.1.2:43136" resp=200
I0322 06:19:03.497867       1 handler.go:153] metrics-server: GET "/livez" satisfied by nonGoRestful
I0322 06:19:03.497891       1 pathrecorder.go:240] metrics-server: "/livez" satisfied by exact match
I0322 06:19:03.498026       1 httplog.go:129] "HTTP" verb="GET" URI="/livez" latency="310.031µs" userAgent="kube-probe/1.22" audit-ID="79afaef7-cfa9-4b09-a40b-b03e82f98f5c" srcIP="175.1.1.2:43138" resp=200
E0322 06:19:08.233861       1 scraper.go:140] "Failed to scrape node" err="failed to read response body - context deadline exceeded" node="175.1.1.7"
I0322 06:19:08.233965       1 scraper.go:172] "Scrape finished" duration="13.501951875s" nodeCount=5 podCount=55
I0322 06:19:08.233990       1 server.go:139] "Storing metrics"

Type 2:


I0322 06:05:54.732545       1 server.go:136] "Scraping metrics"
I0322 06:05:54.732692       1 scraper.go:115] "Scraping metrics from nodes" nodeCount=6
I0322 06:05:54.737022       1 scraper.go:137] "Scraping node" node="175.1.1.2"
I0322 06:05:54.749244       1 round_trippers.go:553] GET https://175.1.1.2:10250/metrics/resource 200 OK in 11 milliseconds
I0322 06:05:54.753256       1 scraper.go:137] "Scraping node" node="175.1.1.5"
I0322 06:05:54.754775       1 scraper.go:137] "Scraping node" node="175.1.1.7"
I0322 06:05:54.760902       1 scraper.go:137] "Scraping node" node="175.1.1.6"
I0322 06:05:54.762340       1 scraper.go:137] "Scraping node" node="175.1.1.4"
I0322 06:05:54.764666       1 round_trippers.go:553] GET https://175.1.1.6:10250/metrics/resource 200 OK in 3 milliseconds
I0322 06:05:54.770056       1 scraper.go:137] "Scraping node" node="175.1.1.3"
I0322 06:05:54.777329       1 round_trippers.go:553] GET https://175.1.1.4:10250/metrics/resource 200 OK in 14 milliseconds
I0322 06:05:54.787128       1 round_trippers.go:553] GET https://175.1.1.3:10250/metrics/resource 200 OK in 16 milliseconds
I0322 06:05:54.830223       1 round_trippers.go:553] GET https://175.1.1.5:10250/metrics/resource 200 OK in 76 milliseconds
I0322 06:05:54.831374       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:00:48.001162496 +0000 UTC Timestamp:2022-03-22 06:05:53.414 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:05:54.831421       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-svd7x"
I0322 06:05:54.831499       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:00:52.994269696 +0000 UTC Timestamp:2022-03-22 06:05:35.57 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:05:54.831519       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-wtkkr"
I0322 06:05:54.831537       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:00:51.989779712 +0000 UTC Timestamp:2022-03-22 06:05:31.442 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:05:54.831553       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-979rh"
I0322 06:06:03.489525       1 handler.go:153] metrics-server: GET "/readyz" satisfied by nonGoRestful
I0322 06:06:03.489611       1 pathrecorder.go:240] metrics-server: "/readyz" satisfied by exact match
I0322 06:06:03.489767       1 shared_informer.go:270] caches populated
I0322 06:06:03.490503       1 httplog.go:129] "HTTP" verb="GET" URI="/readyz" latency="1.626018ms" userAgent="kube-probe/1.22" audit-ID="5571f982-c327-466f-a5e3-45651a15eb1b" srcIP="175.1.1.2:40562" resp=200
I0322 06:06:03.500569       1 handler.go:153] metrics-server: GET "/livez" satisfied by nonGoRestful
I0322 06:06:03.500606       1 pathrecorder.go:240] metrics-server: "/livez" satisfied by exact match
I0322 06:06:03.500968       1 httplog.go:129] "HTTP" verb="GET" URI="/livez" latency="409.68µs" userAgent="kube-probe/1.22" audit-ID="4bf8a843-331a-4a33-959c-40698d48e17f" srcIP="175.1.1.2:40564" resp=200
I0322 06:06:08.235057       1 round_trippers.go:553] GET https://175.1.1.7:10250/metrics/resource  in 13479 milliseconds
E0322 06:06:08.235405       1 scraper.go:140] "Failed to scrape node" err="Get \"https://175.1.1.7:10250/metrics/resource\": context deadline exceeded" node="175.1.1.7"
I0322 06:06:08.235524       1 scraper.go:172] "Scrape finished" duration="13.502779912s" nodeCount=5 podCount=62
I0322 06:06:08.235565       1 server.go:139] "Storing metrics"
I0322 06:06:08.236173       1 server.go:144] "Scraping cycle complete"
I0322 06:06:09.732621       1 server.go:136] "Scraping metrics"


Type 3:

I0322 06:09:24.732053       1 server.go:136] "Scraping metrics"
I0322 06:09:24.732202       1 scraper.go:115] "Scraping metrics from nodes" nodeCount=6
I0322 06:09:24.734745       1 scraper.go:137] "Scraping node" node="175.1.1.7"
I0322 06:09:24.752423       1 scraper.go:137] "Scraping node" node="175.1.1.2"
I0322 06:09:24.758202       1 scraper.go:137] "Scraping node" node="175.1.1.5"
I0322 06:09:24.767895       1 scraper.go:137] "Scraping node" node="175.1.1.4"
I0322 06:09:24.778185       1 scraper.go:137] "Scraping node" node="175.1.1.6"
I0322 06:09:24.778997       1 round_trippers.go:553] GET https://175.1.1.5:10250/metrics/resource 200 OK in 20 milliseconds
I0322 06:09:24.780647       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:03:54.989614848 +0000 UTC Timestamp:2022-03-22 06:08:55.403 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.780703       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-ctqwf"
I0322 06:09:24.780726       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:04:19.002719488 +0000 UTC Timestamp:2022-03-22 06:09:09.554 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.780753       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-87kg5"
I0322 06:09:24.780976       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:04:23.059934464 +0000 UTC Timestamp:2022-03-22 06:09:12.852 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.781017       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-zghvd"
I0322 06:09:24.781034       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:04:09.99024384 +0000 UTC Timestamp:2022-03-22 06:09:02.62 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.781048       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-pkbcz"
I0322 06:09:24.781064       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:04:21.99191424 +0000 UTC Timestamp:2022-03-22 06:09:16.611 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.781152       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-9nl4v"
I0322 06:09:24.781170       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:04:07.988933888 +0000 UTC Timestamp:2022-03-22 06:09:07.516 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.781222       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-j9sf2"
I0322 06:09:24.781239       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:03:51.99124352 +0000 UTC Timestamp:2022-03-22 06:09:14.357 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.781253       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-2hbqz"
I0322 06:09:24.781265       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:04:18.992357888 +0000 UTC Timestamp:2022-03-22 06:08:51.78 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.781278       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-wg4t6"
I0322 06:09:24.781308       1 decode.go:189] "Failed getting complete container metric" containerName="bbs" containerMetric={StartTime:2022-03-22 06:03:51.990928896 +0000 UTC Timestamp:2022-03-22 06:09:08.047 +0000 UTC CumulativeCpuUsed:0 MemoryUsage:0}
I0322 06:09:24.781322       1 decode.go:97] "Failed getting complete Pod metric" pod="default/bbs-59f798d8c4-b9c5z"
I0322 06:09:24.781465       1 scraper.go:137] "Scraping node" node="175.1.1.3"
I0322 06:09:24.790980       1 round_trippers.go:553] GET https://175.1.1.4:10250/metrics/resource 200 OK in 22 milliseconds
I0322 06:09:24.799395       1 round_trippers.go:553] GET https://175.1.1.3:10250/metrics/resource 200 OK in 17 milliseconds
I0322 06:09:24.807997       1 round_trippers.go:553] GET https://175.1.1.6:10250/metrics/resource 200 OK in 29 milliseconds
I0322 06:09:24.810657       1 round_trippers.go:553] GET https://175.1.1.2:10250/metrics/resource 200 OK in 58 milliseconds
I0322 06:09:33.493199       1 handler.go:153] metrics-server: GET "/livez" satisfied by nonGoRestful
I0322 06:09:33.493339       1 pathrecorder.go:240] metrics-server: "/livez" satisfied by exact match
I0322 06:09:33.493696       1 httplog.go:129] "HTTP" verb="GET" URI="/livez" latency="1.25371ms" userAgent="kube-probe/1.22" audit-ID="387b5689-d686-4891-b8be-9ccb6ac89b18" srcIP="175.1.1.2:41258" resp=200
I0322 06:09:33.507234       1 handler.go:153] metrics-server: GET "/readyz" satisfied by nonGoRestful
I0322 06:09:33.507298       1 pathrecorder.go:240] metrics-server: "/readyz" satisfied by exact match
I0322 06:09:33.507419       1 shared_informer.go:270] caches populated
I0322 06:09:33.507601       1 httplog.go:129] "HTTP" verb="GET" URI="/readyz" latency="590.117µs" userAgent="kube-probe/1.22" audit-ID="bd3829b7-f9c5-48c9-a775-02a4b85f3d9e" srcIP="175.1.1.2:41256" resp=200
I0322 06:09:37.896336       1 round_trippers.go:553] GET https://175.1.1.7:10250/metrics/resource 401 Unauthorized in 13161 milliseconds
E0322 06:09:37.896905       1 scraper.go:140] "Failed to scrape node" err="request failed, status: \"401 Unauthorized\"" node="175.1.1.7"
I0322 06:09:37.897026       1 scraper.go:172] "Scrape finished" duration="13.164760946s" nodeCount=5 podCount=52
I0322 06:09:37.897055       1 server.go:139] "Storing metrics"
I0322 06:09:37.897675       1 server.go:144] "Scraping cycle complete"

@yangjunmyfm192085
Contributor

Hi @grunlab, @serathius, I haven't made any progress on this issue lately.
Does the problem still exist in the environment? Can metrics-server get metrics normally when no pod is restarting? @grunlab

@grunlab
Author

grunlab commented Mar 28, 2022

@yangjunmyfm192085
The problem is still present.
The metrics-server container mainly restarts when other pods are deployed or stopped on the cluster, but it also happens (not as often) when nothing special is going on.
I suspect the issue could be linked to occasional disk I/O performance problems on the VMs hosting the cluster (the VMs are hosted on a shared ESX infrastructure).
In the meantime, I've deployed disk I/O monitoring on each node to see whether there is a relation between poor disk I/O performance and metrics-server container restarts ... we will see.
I'll keep you informed when I have some figures to share with you.

@evheniyt

I also have a problem with metrics-server restarts.
Kubernetes version: v1.23.1
metrics-server version: 0.6.1

I0328 18:33:58.224925       1 handler.go:153] metrics-server: GET "/livez" satisfied by nonGoRestful
I0328 18:33:58.224942       1 pathrecorder.go:240] metrics-server: "/livez" satisfied by exact match
I0328 18:33:58.224987       1 server.go:175] "Failed probe" probe="metric-collection-timely" err="metric collection didn't finish on time" duration="1m57.873242235s" maxDuration="1m30s"
I0328 18:33:58.225000       1 healthz.go:257] metric-collection-timely check failed: livez
[-]metric-collection-timely failed: metric collection didn't finish on time
I0328 18:33:58.225046       1 httplog.go:129] "HTTP" verb="GET" URI="/livez" latency="171.062µs" userAgent="kube-probe/1.23" audit-ID="fee3cd19-4487-4afc-b8fb-74a171c29206" srcIP="10.244.0.1:42424" resp=0
I0328 18:33:58.225138       1 handler.go:153] metrics-server: GET "/readyz" satisfied by nonGoRestful
I0328 18:33:58.225150       1 pathrecorder.go:240] metrics-server: "/readyz" satisfied by exact match
I0328 18:33:58.225195       1 shared_informer.go:270] caches populated
I0328 18:33:58.225243       1 httplog.go:129] "HTTP" verb="GET" URI="/readyz" latency="740.728µs" userAgent="kube-probe/1.23" audit-ID="ffdc70c8-b9c5-4d48-8ba9-a34829778622" srcIP="10.244.0.1:42422" resp=200
I0328 18:33:58.233404       1 genericapiserver.go:470] [graceful-termination] RunPreShutdownHooks has completed
I0328 18:33:58.233407       1 genericapiserver.go:419] "[graceful-termination] shutdown event" name="ShutdownInitiated"
I0328 18:33:58.233423       1 object_count_tracker.go:84] "StorageObjectCountTracker pruner is exiting"
I0328 18:33:58.233430       1 genericapiserver.go:422] "[graceful-termination] shutdown event" name="AfterShutdownDelayDuration"
I0328 18:33:58.233438       1 reflector.go:225] Stopping reflector *v1.Node (0s) from pkg/mod/k8s.io/client-go@v0.23.2/tools/cache/reflector.go:167
I0328 18:33:58.233453       1 genericapiserver.go:460] "[graceful-termination] shutdown event" name="InFlightRequestsDrained"
I0328 18:33:58.233500       1 tlsconfig.go:255] "Shutting down DynamicServingCertificateController"
I0328 18:33:58.233574       1 reflector.go:225] Stopping reflector *v1.ConfigMap (12h0m0s) from pkg/mod/k8s.io/client-go@v0.23.2/tools/cache/reflector.go:167
I0328 18:33:58.233506       1 requestheader_controller.go:183] Shutting down RequestHeaderAuthRequestController
I0328 18:33:58.233513       1 configmap_cafile_content.go:222] "Shutting down controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0328 18:33:58.233518       1 secure_serving.go:311] Stopped listening on [::]:8443
I0328 18:33:58.233520       1 configmap_cafile_content.go:222] "Shutting down controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0328 18:33:58.233581       1 dynamic_serving_content.go:145] "Shutting down controller" name="serving-cert::apiserver.local.config/certificates/apiserver.crt::apiserver.local.config/certificates/apiserver.key"
I0328 18:33:58.233588       1 reflector.go:225] Stopping reflector *v1.PartialObjectMetadata (0s) from pkg/mod/k8s.io/client-go@v0.23.2/tools/cache/reflector.go:167
I0328 18:33:58.233622       1 reflector.go:225] Stopping reflector *v1.ConfigMap (12h0m0s) from pkg/mod/k8s.io/client-go@v0.23.2/tools/cache/reflector.go:167
I0328 18:33:58.233676       1 genericapiserver.go:448] "[graceful-termination] shutdown event" name="HTTPServerStoppedListening"
I0328 18:33:58.233679       1 reflector.go:225] Stopping reflector *v1.ConfigMap (12h0m0s) from pkg/mod/k8s.io/client-go@v0.23.2/tools/cache/reflector.go:167
I0328 18:33:59.309314       1 genericapiserver.go:477] [graceful-termination] apiserver is exiting

Attaching full logs
metrics.logs.zip

@yangjunmyfm192085
Contributor

The logs show similar failures.

@grunlab
Author

grunlab commented Apr 14, 2022

FYI, since I deployed the disk I/O monitoring on all the cluster nodes, I haven't found any correlation between disk I/O spikes and metrics-server container restarts ... :-(

@yangjunmyfm192085
Contributor

Thanks for sharing the information

@yangjunmyfm192085
Contributor

Hi @grunlab, could you help confirm whether the issue is similar to the following?
#907 (comment)

@grunlab
Author

grunlab commented Apr 20, 2022

@yangjunmyfm192085

The issue does not look similar to #907.

I'm getting the container resources instantly on each cluster node when running this:

kubectl proxy
kubectl get --raw /api/v1/nodes/<cluster_nodes>/proxy/metrics/resource

I've also tried increasing --metric-resolution from 15s to 30s ... same issue, the metrics-server container is still restarting frequently :-(

@yangjunmyfm192085
Contributor

OK, so it looks like it might not be the same issue.

@AndrewSav

Good to know that the fix is released, but do we know what was causing the issue? Specifically, why were only some people observing the issue while others were not?

@grunlab
Author

grunlab commented Dec 13, 2022

FYI, no more container restarts since upgrading to 0.6.2 :-)
