All nodes have warning events when stood up with kOps 1.30 #16763

Closed
jim-barber-he opened this issue Aug 20, 2024 · 6 comments · Fixed by #16831
Labels: kind/bug, kind/office-hours


@jim-barber-he
Contributor

/kind bug

1. What kops version are you running? The command kops version will display
this information.

$ kops version     
Client version: 1.30.0 (git-v1.30.0)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Server Version: v1.30.4

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

The cluster is stood up via a manifest that contains:

spec:
  nodeProblemDetector:
    enabled: true
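
For reference, here is a minimal sketch of where that setting sits in the full Cluster manifest (the cluster name and surrounding fields are placeholders, not from my actual manifest):

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: my.example.com   # placeholder cluster name
spec:
  nodeProblemDetector:
    enabled: true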

5. What happened after the commands executed?

The cluster comes up fine and all pods are healthy; however, running kubectl describe node against any of the nodes (control-plane or worker) shows the following Warning events:

  Warning  InvalidDiskCapacity      19m                kubelet                invalid capacity 0 on image filesystem
  Warning  ContainerdStart          15m                systemd-monitor        Starting containerd container runtime...
  Warning  ContainerdUnhealthy      15m                health-checker         Node condition ContainerRuntimeUnhealthy is now: True, reason: ContainerdUnhealthy, message: "cri:containerd was found unhealthy; repair flag : true"
  Warning  KubeletUnhealthy         15m                health-checker         Node condition KubeletUnhealthy is now: True, reason: KubeletUnhealthy, message: "kubelet:kubelet was found unhealthy; repair flag : true"

These events do not clear over time.
The InvalidDiskCapacity event is what leads to the KubeletUnhealthy event, and the ContainerdStart event is what leads to the ContainerdUnhealthy event.
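
For anyone reproducing this, the warnings can also be listed across the whole cluster instead of describing each node; a couple of example commands (not from my original debugging session):

# List Warning events cluster-wide
kubectl get events --all-namespaces --field-selector type=Warning

# Narrow it down to the reasons shown above
kubectl get events --all-namespaces --field-selector type=Warning | grep -E 'KubeletUnhealthy|ContainerdUnhealthy|ContainerdStart|InvalidDiskCapacity'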

6. What did you expect to happen?

All events on the nodes to have a Normal status.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

Skipping this for now since it's the same manifest I've used for many generations of kOps clusters.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

I believe that node-problem-detector is at fault.
It can't find the crictl or systemctl commands.
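
A quick way to confirm this is to look inside one of the node-problem-detector pods, for example (the label selector is a guess, so check your daemonset, and this assumes the image ships a shell):

# Find a node-problem-detector pod
kubectl -n kube-system get pods -l k8s-app=node-problem-detector -o name

# Check whether the binaries the health checker shells out to exist in the image
kubectl -n kube-system exec <npd-pod-name> -- sh -c 'command -v crictl systemctl'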

Here are some logs from node-problem-detector related to the ContainerdUnhealthy status:

I0820 02:06:49.633546       1 log_monitor.go:159] New status generated: &{Source:systemd-monitor Events:[{Severity:warn Timestamp:2024-08-20 02:05:43.239889 +0000 UTC Reason:ContainerdStart Message:Starting containerd container runtime...}] Conditions:[]}
I0820 02:06:50.213247       1 plugin.go:281] Start logs from plugin {Type:permanent Condition:ContainerRuntimeUnhealthy Reason:ContainerdUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=cri --enable-repair=true --cooldown-time=2m --health-check-timeout=60s] TimeoutString:0xc000060bb0 Timeout:3m0s}
 I0820 02:06:50.212369      18 health_checker.go:180] command /usr/bin/crictl --timeout=2s --runtime-endpoint=unix:///var/run/containerd/containerd.sock pods --latest failed: fork/exec /usr/bin/crictl: no such file or directory, 
I0820 02:06:50.212821      18 health_checker.go:180] command systemctl show containerd --property=InactiveExitTimestamp failed: exec: "systemctl": executable file not found in $PATH,
I0820 02:06:50.212835      18 health_checker.go:86] error in getting uptime for cri: exec: "systemctl": executable file not found in $PATH
I0820 02:06:50.213334       1 custom_plugin_monitor.go:283] New status generated: &{Source:health-checker Events:[{Severity:warn Timestamp:2024-08-20 02:06:50.213306946 +0000 UTC m=+8.109772440 Reason:ContainerdUnhealthy Message:Node condition ContainerRuntimeUnhealthy is now: True, reason: ContainerdUnhealthy, message: "cri:containerd was found unhealthy; repair flag : true"}] Conditions:[{Type:ContainerRuntimeUnhealthy Status:True Transition:2024-08-20 02:06:50.213306946 +0000 UTC m=+8.109772440 Reason:ContainerdUnhealthy Message:cri:containerd was found unhealthy; repair flag : true}]}

And logs for the KubeletUnhealthy status:

I0820 02:06:50.213530       1 plugin.go:281] Start logs from plugin {Type:permanent Condition:KubeletUnhealthy Reason:KubeletUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=kubelet --enable-repair=true --cooldown-time=1m --loopback-time=0 --health-check-timeout=10s] TimeoutString:0xc000060c90 Timeout:3m0s}
 I0820 02:06:50.213073      15 health_checker.go:180] command systemctl show kubelet --property=InactiveExitTimestamp failed: exec: "systemctl": executable file not found in $PATH,
I0820 02:06:50.213109      15 health_checker.go:86] error in getting uptime for kubelet: exec: "systemctl": executable file not found in $PATH
I0820 02:06:50.213564       1 plugin.go:282] End logs from plugin {Type:permanent Condition:KubeletUnhealthy Reason:KubeletUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=kubelet --enable-repair=true --cooldown-time=1m --loopback-time=0 --health-check-timeout=10s] TimeoutString:0xc000060c90 Timeout:3m0s}
I0820 02:06:50.213602       1 custom_plugin_monitor.go:283] New status generated: &{Source:health-checker Events:[{Severity:warn Timestamp:2024-08-20 02:06:50.213581121 +0000 UTC m=+8.110046618 Reason:KubeletUnhealthy Message:Node condition KubeletUnhealthy is now: True, reason: KubeletUnhealthy, message: "kubelet:kubelet was found unhealthy; repair flag : true"}] Conditions:[{Type:KubeletUnhealthy Status:True Transition:2024-08-20 02:06:50.213581121 +0000 UTC m=+8.110046618 Reason:KubeletUnhealthy Message:kubelet:kubelet was found unhealthy; repair flag : true}]}

I was able to stop the errors by disabling the checks for them, making the following change to the node-problem-detector daemonset and then rolling the nodes afterwards (since nodes already marked with the errors didn't clear):

@@ -36,7 +36,7 @@
         - /node-problem-detector
         - --logtostderr
         - --config.system-log-monitor=/config/kernel-monitor.json,/config/systemd-monitor.json
-        - --config.custom-plugin-monitor=/config/kernel-monitor-counter.json,/config/systemd-monitor-counter.json,/config/health-checker-containerd.json,/config/health-checker-kubelet.json
+        - --config.custom-plugin-monitor=/config/kernel-monitor-counter.json,/config/systemd-monitor-counter.json
         - --config.system-stats-monitor=/config/system-stats-monitor.json
         env:
         - name: NODE_NAME
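
For anyone wanting to try the same workaround, this is roughly how I applied it (the daemonset name and namespace below are what the addon uses in my cluster; adjust if yours differ):

# Edit the node-problem-detector daemonset in place
kubectl -n kube-system edit daemonset node-problem-detector

# Then roll the nodes so the already-recorded Warning events go away with them
kops rolling-update cluster --name my.example.com --yes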

I also tried the following instead, to see if I could fix the problems by giving it access to the crictl and systemctl commands and the containerd socket:

@@ -76,6 +76,14 @@
         - mountPath: /var/run/dbus/
           mountPropagation: Bidirectional
           name: dbus
+        - mountPath: /usr/bin/crictl
+          name: crictl
+          readOnly: true
+        - mountPath: /usr/bin/systemctl
+          name: systemctl
+          readOnly: true
+        - mountPath: /var/run/containerd/containerd.sock
+          name: containerd-sock
       dnsPolicy: ClusterFirst
       priorityClassName: system-node-critical
       restartPolicy: Always
@@ -115,6 +123,18 @@
           path: /var/run/dbus/
           type: ""
         name: dbus
+      - hostPath:
+          path: /usr/local/bin/crictl
+          type: File
+        name: crictl
+      - hostPath:
+          path: /usr/bin/systemctl
+          type: File
+        name: systemctl
+      - hostPath:
+          path: /var/run/containerd/containerd.sock
+          type: Socket
+        name: containerd-sock
   updateStrategy:
     rollingUpdate:
       maxSurge: 0

The above fixed the ContainerdStart and ContainerdUnhealthy problems, but the InvalidDiskCapacity and KubeletUnhealthy problems persist, with the warning message still saying: invalid capacity 0 on image filesystem.
I haven't worked out how to fix it... I'm thinking perhaps something from the host still needs to be mounted into the container somewhere (wherever the image filesystem lives).
I tried running the /home/kubernetes/bin/health-checker --component=kubelet --enable-repair=true --cooldown-time=1m --loopback-time=0 --health-check-timeout=10s command manually in the container, passing the -v option at various levels to get more verbosity, but it just returned messages like the following no matter what level I tried.

I0820 06:38:38.371286    1509 health_checker.go:89] kubelet is unhealthy, component uptime: 19.371223472s
kubelet:kubelet was found unhealthy; repair flag : true
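
For anyone else digging into the remaining InvalidDiskCapacity warning, one thing worth checking is what containerd itself reports for the image filesystem. This is run on the node, not inside the node-problem-detector container, and assumes the default containerd socket path:

# Ask the CRI what it reports for the image filesystem
sudo crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock imagefsinfo

A non-zero capacity there would suggest containerd is reporting the image filesystem correctly, which would point at a timing or visibility issue on the kubelet / health-checker side rather than at the runtime itself.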

Here's a link to a Slack thread I created yesterday when I discovered the problem and was working through it.
https://kubernetes.slack.com/archives/C3QUFP0QM/p1724120394886979

@k8s-ci-robot added the kind/bug label on Aug 20, 2024
@jim-barber-he
Contributor Author

I believe the problem was introduced by the following merged pull request:
#16537

@hakman
Member

hakman commented Aug 23, 2024

@jim-barber-he Will try to figure it out for 1.30.1. Thanks for the report!

@jim-barber-he
Contributor Author

It seems this was missed for the kOps 1.30.1 release.
For a quick fix, should we just disable these new checks for kOps 1.30.2 by removing ,/config/health-checker-containerd.json,/config/health-checker-kubelet.json from the --config.custom-plugin-monitor parameter that is passed to node-problem-detector?
That at least puts it back to being the same configuration that kOps 1.29.2 has.
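
That is, the flag would go back to what my first diff above shows:

        - --config.custom-plugin-monitor=/config/kernel-monitor-counter.json,/config/systemd-monitor-counter.json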

Then you can re-introduce the new checks at your leisure as you work out the fixes for them.

I'm happy to raise a PR to disable those checks if you're happy to go with that...

@hakman
Member

hakman commented Sep 14, 2024

I'm happy to raise a PR to disable those checks if you're happy to go with that...

Sorry, missed this. PR should be ready to go for next release in 1-2 weeks.

@vitaliyf
Contributor

Beware: this also affects existing kops 1.29.2 clusters as soon as "kops update cluster" is run with kops 1.30.1.
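
i.e. on an otherwise untouched 1.29.2 cluster, just running something like the following with the 1.30.1 binary is enough to roll out the new node-problem-detector config (cluster name is a placeholder):

kops update cluster --name my.example.com --yes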

@pwerth-drk

Hi @hakman, I was wondering if 1.30.2 is still coming, since we are well past the 1-2 weeks, or if you are focusing on 1.31.0 now?
