since v0.10.0 pods (as part of deployments) are being killed every X minutes #432
Comments
I have the feeling this happens because of the k3os system-upgrade-controller (image: 'rancher/system-upgrade-controller:v0.4.0'). I just scaled down the replicaset of the system-upgrade-controller deployment to 0 pods. |
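(For reference, scaling the controller down can be done with something along these lines; the k3os-system namespace and deployment name are assumptions based on the k3OS defaults mentioned in this thread.)

```sh
# Stop the system-upgrade-controller by scaling its deployment to zero replicas.
kubectl -n k3os-system scale deployment system-upgrade-controller --replicas=0
```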
Yep, that seems to solve it! Another thing: about this section of the controller documentation:
Well... this is not behaving correctly, to be honest. And yes: I'm sure my plan's name is "k3os-latest". |
@kooskaspers I suspect this is a transcription error but
This tells me that the system-upgrade-controller still thinks that it has an upgrade to apply. The way the SUC (system-upgrade-controller) determines whether it should apply a plan to a node is this: the node has to match the selection criteria specified by the plan's node selector, and the node must not already be recorded as up to date with the plan. When the SUC applies a plan to a node, it records that fact in a label on the node, so the node is not selected again until the plan itself changes. To prevent application of the plan to a node, that label can be set to 'disabled'. For the behavior that you have described I can imagine two possible causes:
|
That's a typo indeed ;), I meant 'disabled' for sure.
Just had a look at that label you're mentioning. It says:
The last time the system-upgrade-controller ran its pod, the output was:
I have only one cronjob scheduled:
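(For context, scheduled jobs across all namespaces can be listed with something like the command below; this is a generic sketch, not the original listing.)

```sh
# List CronJobs in every namespace to rule out a scheduled task as the
# source of the roughly 15-minute restarts.
kubectl get cronjobs --all-namespaces
```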
So to my understanding, it can't be a CronJob running something every 15 minutes. And my single-node cluster seems to be updated fine. This conclusion is based on the log I mentioned above, plus:
So the big question is: why does the SUC still think it has to perform upgrades?
How can I check if that's the case? |
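(One way to check this, as a minimal sketch: the SUC keeps a per-plan bookkeeping label on each node, whose key is assumed here to follow the plan.upgrade.cattle.io/&lt;plan-name&gt; convention for the k3os-latest plan.)

```sh
# Show the per-plan label the controller maintains on each node; after a
# successful apply it holds a hash of the plan, and "disabled" opts the node out.
kubectl get nodes -L plan.upgrade.cattle.io/k3os-latest
```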
Your
If such is the case, I do not know what is causing the behavior that you are describing. If you scale the SUC deployment/replicaset back up to 1, does the flapping behavior come back?
|
Does this match the output of |
@dweomer here we go:
Yes, it does:
Let me know if you want me to scale the SUC replicaset back up to 1, and if so, whether you need additional logging. |
@kooskaspers I think the old upgrade jobs (from previous versions of the SUC) are confusing the current version/deployment of the SUC. Please delete all existing jobs in the k3os-system namespace. |
Just to chime in, I am also seeing this issue on my k3os cluster. I've scaled down the SUC replicaset to 0 and removed all upgrade jobs from the system. |
@kooskaspers and @andyschmid I appreciate your patience and willingness to work with me on tracking this one down. I was able to replicate the flapping. It happens because of an unforeseen interaction between the latest SUC and legacy SUC upgrade jobs. After upgrading to v0.10.0 and making sure the node is labeled accordingly, delete all of the existing upgrade jobs, e.g. `kubectl delete job -n k3os-system --all`. This is a work-around until I can spend some time with the SUC code (likely later this week). I will submit an issue there pointing to this one. |
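(A sketch of the full work-around as it is applied in the following comments; the namespace and deployment name are assumed from this thread.)

```sh
# 1. Remove the legacy upgrade jobs that confuse the current controller.
kubectl -n k3os-system delete job --all

# 2. Scale the system-upgrade-controller deployment back up.
kubectl -n k3os-system scale deployment system-upgrade-controller --replicas=1

# 3. Verify that no new apply-k3os-latest-* jobs or pods keep being created.
kubectl -n k3os-system get jobs,pods
```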
@dweomer thanks for taking a detailed look at this issue! I made some time this morning to test out your workaround (was a bit busy the last couple of days). Deleted all jobs:
All jobs are gone now:
scaled up the replicas:
controller starts running again:
And after a while, I no longer see any 'apply-k3os-latest-[...]' pod being scheduled:
Looks like we tamed the upgrade controller! Question: I don't want the controller to upgrade my kubernetes node overnight. I want to be fully aware of an upgrade taking place, so that when issues arise, I know where to look. FYI: I'm running lighting controllers, ventilation control, heating, DNS, websites and such on a single k8s node, and the wife is not amused when everything is down. So what's the best practice to disable the controller and enable it only when I want? Scaling the pods? Using the label trick (setting it to 'disabled')? Or hardcoding the current version (like 'v0.10.0') in the spec section of the upgrade plan? Just curious what your strategy would be. |
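(One of the options mentioned above, pinning the plan to the currently installed release, would look roughly like the sketch below; the plan name, namespace, and spec.version field are assumptions taken from this thread, so check the plan first with `kubectl get plan k3os-latest -n k3os-system -o yaml`.)

```sh
# Pin the k3os-latest plan to a fixed version so the controller has nothing
# newer to roll out; bump the value manually whenever an upgrade is wanted.
kubectl -n k3os-system patch plan k3os-latest --type merge \
  -p '{"spec":{"version":"v0.10.0"}}'
```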
@kooskaspers wrote:
The most reliable way to prevent the SUC from applying a plan to a node is by making sure that the node does not meet the selection criteria of the plan. This means removing the label that the plan's node selector matches on (or setting it to 'disabled'). |
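(In concrete terms, something along these lines, again assuming the plan.upgrade.cattle.io/k3os-latest label key:)

```sh
# Opt a node out of the k3os-latest plan by marking its plan label as disabled...
kubectl label node <node-name> plan.upgrade.cattle.io/k3os-latest=disabled --overwrite

# ...or remove the label entirely so the node no longer meets the plan's
# selection criteria (depending on how the plan's node selector is written).
kubectl label node <node-name> plan.upgrade.cattle.io/k3os-latest-
```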
I hope to fix rancher/system-upgrade-controller#58 next week and push out a bugfix release for k3OS that includes it. |
Version (k3OS / kernel)
k3os version v0.10.0
5.0.0-43-generic #47~18.04.1 SMP Wed Apr 1 16:27:01 UTC 2020
Architecture
x86_64
Describe the bug
Since v0.10.0 I'm experiencing pods being recreated every now and then, approximately every 15 minutes.
This does not happen to pods that are part of daemonsets; ONLY pods that are part of deployments (and of course replicasets) are being re-created:
see screenshot here:
The first part shows the pods of daemonsets, the second part the pods of deployments/replicasets.
In one of my Grafana dashboards, you can see this behavior quite clearly. Have a look at the missing values roughly every 15 minutes (it started around 23:15):
To Reproduce
It just happens approximately every 15 minutes.
Expected behavior
Stable deployments/replicasets; no pods being recreated every 15 minutes.
Actual behavior
Pods are being recreated every 15 minutes.
Additional context