metrics-server 0.6.x - regular container restarts linked to err="metric collection didn't finish on time" #983
Comments
This is the first time I've seen an issue with the probes, so it's very unusual. This probe is meant to check whether the metric collection pipeline is working. Metrics Server runs a dedicated goroutine that is expected to collect and save metrics every 15s; the probe fails if the collection loop hasn't finished within the last 22.5s (1.5 times 15s). There is no way to increase the This is meant to allow recovery from a deadlock in the metric collection code, as it restarts Metrics Server. There should be no big impact on your cluster, as 15 restarts in 25 days is pretty rare (availability >99.5%); however, it means there is a bug in the code, so I would appreciate it if you could help us find the cause. Could you enable higher log verbosity so we can get a fuller view of the events happening before the restart? Just add |
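The probe rule described above (a 15s collection interval, with a failure once the last collection cycle is older than 1.5 × 15s = 22.5s) can be sketched roughly as follows. This is an illustrative Python sketch, not Metrics Server's actual Go code: the `CollectionLoop` class and its method names are hypothetical, and measuring the delay from the start of the last cycle is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

RESOLUTION_S = 15.0     # collection interval mentioned in the comment above
MAX_DELAY_FACTOR = 1.5  # probe tolerance: 1.5 x 15s = 22.5s


@dataclass
class CollectionLoop:
    """Hypothetical stand-in for the metric collection loop."""

    resolution_s: float = RESOLUTION_S
    # Monotonic timestamp (seconds) of the last cycle start; None before the first cycle.
    last_tick_start: Optional[float] = None

    def tick(self, now: float) -> None:
        """Record that a new collection cycle started at `now`."""
        self.last_tick_start = now

    def check_liveness(self, now: float) -> Tuple[bool, str]:
        """Return (healthy, reason); unhealthy once the last cycle is too old."""
        if self.last_tick_start is None:
            # Assumption: the probe passes until the first cycle has started.
            return True, "no tick yet"
        max_duration = MAX_DELAY_FACTOR * self.resolution_s
        if now - self.last_tick_start > max_duration:
            return False, "metric collection didn't finish on time"
        return True, "ok"
```

With this rule, a cycle that started at t=100s still passes at t=122.5s and fails at any later check until a new cycle starts. A failing liveness probe is what makes the kubelet restart the container.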
Looks like an interesting bug |
OK, I've increased the log verbosity to 6. |
OK, the container has restarted 3 times in between:
Container start: Wed, 16 Mar 2022 15:06:39
Please find attached the logs (including the 3 restarts). Hope this helps. |
Yeah, I will also try to analyze it |
From the log, it prints
But Most likely the channel
I don't know if it's related. I will continue to analyze in depth. |
@yangjunmyfm192085: I'm currently testing metrics-server 0.6.x on a staging cluster where there is no real activity. I use this cluster to test upgrades before upgrading my production environment. The test cluster is composed of 3 masters and 3 workers hosted on small VMs (2 cores / 4GB RAM). Cluster nodes:
Nodes current load:
Metrics-server load:
I noticed that metrics-server container restarts happen more or less when surrounding pods (in other namespaces) are switching to Terminating. Maybe metrics-server is trying to get metrics from containers that no longer exist, which could potentially explain the "metric collection didn't finish on time" error? If you need more logs (maybe at the cri-o or kubelet level), don't hesitate to ask. |
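To make the hypothesis above concrete, here is a small Python sketch of how a single slow scrape target could hold up a whole collection cycle unless the cycle enforces a deadline. It is only an illustration of the idea, not Metrics Server's code and not a confirmed root cause; `scrape_all`, the node names, and the timings are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Any, Callable, Dict, List, Tuple


def scrape_all(
    scrapers: Dict[str, Callable[[], Any]], deadline_s: float
) -> Tuple[Dict[str, Any], List[str]]:
    """Run all scrapers concurrently; return (results, names skipped this cycle)."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(scrapers)))
    futures = {pool.submit(fn): name for name, fn in scrapers.items()}
    # Wait for at most `deadline_s`; slower targets are left in `not_done`.
    done, not_done = wait(list(futures), timeout=deadline_s)
    # Do not block the cycle on stragglers; their threads finish in the background.
    pool.shutdown(wait=False)

    results: Dict[str, Any] = {}
    skipped = [futures[f] for f in not_done]
    for f in done:
        if f.exception() is None:
            results[futures[f]] = f.result()
        else:
            skipped.append(futures[f])  # scrape errors are skipped as well
    return results, sorted(skipped)


if __name__ == "__main__":
    def slow_node() -> dict:
        time.sleep(0.5)  # simulates a target that answers too late
        return {"cpu": 2}

    results, skipped = scrape_all(
        {"node-a": lambda: {"cpu": 1}, "node-b": slow_node}, deadline_s=0.1
    )
    print(results, skipped)
```

Without the deadline, the cycle would last as long as its slowest target, which is the kind of delay that could push a cycle past the 22.5s probe limit.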
@grunlab Thank you for your information and analysis. |
One thing that doesn't match your analysis is that access to
|
After a recent analysis, it is most likely that, |
I expect that |
Yeah. To verify this scenario, I set up an unreachable URL as a test and sent two requests in total, one with |
The case of |
@serathius , I did the following verification today:
Type 1: (the type of analysis we focus on)
Type 2:
Type 3:
|
Hi, @grunlab, @serathius , I haven't made any progress on this issue lately. |
@yangjunmyfm192085 |
I also have a problem with metrics-server restarts
Attaching full logs |
The logs show similar failures. |
FYI, since I deployed disk I/O monitoring on all the cluster nodes, I haven't found any correlation between disk I/O peaks and metrics-server container restarts ... :-( |
Thanks for sharing the information |
Hi @grunlab, could you help confirm whether the issue is similar to the following? |
The issue does not look similar to #907. I get the container resources instantly on each cluster node when running this:
I've also tried to increase the |
ok, so it looks like it might not be the same issue |
Good to know that the fix is released, but do we know what was causing the issue? Specifically, why are only some people observing the issue while others are not? |
FYI, no more container restarts since upgrading to 0.6.2 :-) |
Since I upgraded metrics-server from 0.5.x to 0.6.x, the metrics-server container has been restarting frequently (this was not the case in 0.5.x):
Kubernetes version:
Metrics-server version:
For example, 15 restarts in 25 days:
metrics-server container logs just before the restart:
Is the error "metric collection didn't finish on time" normal?
Can I increase the maxDuration of 22.5s?
Thank you for your support.