All nodes have warning events when stood up with kOps 1.30 #16763
Comments
I believe that the problem was likely introduced by the following merged pull request:
@jim-barber-he Will try to figure it out for 1.30.1. Thanks for the report!
It seems this was missed for the kOps 1.30.1 release. I'm happy to raise a PR to disable those checks if you're happy to go with that; then you can re-introduce the new checks at your leisure as you work out the fixes for them.
Sorry, missed this. PR should be ready to go for the next release in 1-2 weeks.
Beware - this also affects existing kops-1.29.2 clusters, as soon as "kops update cluster" is done with kops-1.30.1.
Hi @hakman, I was wondering if 1.30.2 is still coming, since we are way past the 1-2 weeks, or if you are focusing on 1.31.0 now.
/kind bug
1. What `kops` version are you running? The command `kops version` will display this information.

2. What Kubernetes version are you running? `kubectl version` will print the version if a cluster is running or provide the Kubernetes version specified as a `kops` flag.

3. What cloud provider are you using?
4. What commands did you run? What is the simplest way to reproduce this issue?
The cluster is stood up via a manifest that contains:
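Roughly, the relevant part looks like the fragment below. This is a sketch only: the field values are illustrative rather than copied from the real manifest, and the assumption is that node-problem-detector is enabled via the cluster spec.

```yaml
# Sketch only: the fragment of a kOps Cluster manifest that enables
# node-problem-detector. Values are illustrative, not the real ones.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: my.example.com
spec:
  kubernetesVersion: 1.30.3   # assumed for illustration
  nodeProblemDetector:
    enabled: true
```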
5. What happened after the commands executed?
The cluster comes up fine and all pods are healthy; however, when using `kubectl describe node` against any of the nodes (control-plane or worker), they have the following Warning events:

These events do not clear over time.

The `InvalidDiskCapacity` event is what leads to the `KubeletUnhealthy` event, and the `ContainerdStart` event is what leads to the `ContainerdUnhealthy` event.

6. What did you expect to happen?
All events on the nodes to have a `Normal` status.

7. Please provide your cluster manifest. Execute `kops get --name my.example.com -o yaml` to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the `-v 10` flag. Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else do we need to know?
I believe that `node-problem-detector` is at fault. It can't find the `crictl` or `systemctl` commands.

Here are some logs from `node-problem-detector` related to the `ContainerdUnhealthy` status:

And logs for the `KubeletUnhealthy` status:

I was able to stop the errors by just disabling the checks for them, by making the following changes to the `node-problem-detector` daemonset and then rolling the nodes afterwards (since the ones already marked with the errors didn't clear):
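As a rough sketch of that kind of change, assuming the new checks are wired in via node-problem-detector's `--config.custom-plugin-monitor` flag and its stock `health-checker-*.json` configs (the exact args in the kOps-managed daemonset may differ):

```yaml
# Sketch only: node-problem-detector container args with the two
# health-checker custom plugin monitors dropped. The exact flags, paths
# and remaining args in the kOps-managed daemonset are assumptions.
containers:
  - name: node-problem-detector
    args:
      - --v=2
      - --logtostderr
      - --config.system-log-monitor=/config/kernel-monitor.json
      # Removed to disable the failing checks:
      # - --config.custom-plugin-monitor=/config/health-checker-kubelet.json,/config/health-checker-containerd.json
```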
I also tried this instead, to see if I could fix the problems so that it had access to the `crictl` and `systemctl` commands and the containerd socket:
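Roughly along these lines (all host paths here are assumptions and may differ per OS image):

```yaml
# Sketch only: hostPath mounts giving the node-problem-detector container
# access to the containerd socket, systemd, and the host's crictl binary.
containers:
  - name: node-problem-detector
    volumeMounts:
      - name: containerd-sock
        mountPath: /run/containerd/containerd.sock
        readOnly: true
      - name: run-systemd
        mountPath: /run/systemd
        readOnly: true
      - name: host-bin
        mountPath: /usr/local/bin   # wherever crictl lives on the host (assumed)
        readOnly: true
volumes:
  - name: containerd-sock
    hostPath:
      path: /run/containerd/containerd.sock
      type: Socket
  - name: run-systemd
    hostPath:
      path: /run/systemd
  - name: host-bin
    hostPath:
      path: /usr/local/bin
```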
The above fixed the `ContainerdStart` and `ContainerdUnhealthy` problems, but the `InvalidDiskCapacity` and `KubeletUnhealthy` problems still persist, with the warning message still saying `invalid capacity 0 on image filesystem`.

I haven't worked out how to fix it... I'm thinking perhaps something from the host still needs to be mounted into the container somewhere (wherever the image filesystem lives).
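If that guess is right, the extra mount might look something like this; the `/var/lib/containerd` path is only an assumption about where the image filesystem lives, and I haven't verified that it clears the warning:

```yaml
# Speculative sketch: expose the host's containerd image filesystem to the
# node-problem-detector container. Unverified; the path is an assumption.
containers:
  - name: node-problem-detector
    volumeMounts:
      - name: containerd-state
        mountPath: /var/lib/containerd
        readOnly: true
volumes:
  - name: containerd-state
    hostPath:
      path: /var/lib/containerd
```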
I tried running the `/home/kubernetes/bin/health-checker --component=kubelet --enable-repair=true --cooldown-time=1m --loopback-time=0 --health-check-timeout=10s` command manually in the container, passing in the `-v` option with various levels to try and get more verbosity, but it just returned messages like the following no matter what verbosity level I tried.

Here's a link to a Slack thread I created yesterday when I discovered and was working through the problem:
https://kubernetes.slack.com/archives/C3QUFP0QM/p1724120394886979