
cAdvisor stops updating metrics after k3s upgrade #3035

Closed
mnorrsken opened this issue Mar 9, 2021 · 25 comments
Labels: kind/bug Something isn't working

@mnorrsken
Contributor

Environmental Info:
K3s Version:
v1.20.4+k3s1

Node(s) CPU architecture, OS, and Version:
Linux mandalore 5.10.16-meson64 #21.02.2 SMP PREEMPT Sun Feb 14 21:50:52 CET 2021 aarch64 GNU/Linux
Linux alderaan 5.10.17-v8+ #1403 SMP PREEMPT Mon Feb 22 11:37:54 GMT 2021 aarch64 GNU/Linux
Linux glados 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64 GNU/Linux

Cluster Configuration:
Cluster#1: 1 master 3 workers
Cluster#2: 1 master (x86)

Describe the bug:
cAdvisor stops reporting new stats (CPU usage, memory usage) for running containers after doing an in-place upgrade of the system with
curl -sfL https://get.k3s.io | sh -s -
This seems to happen regardless of the k3s version.
The only way to get the stats working again is to restart the containers. Every container started after the upgrade reports stats correctly.

It seems to happen on both my arm64 cluster and my amd64 single master.
Other things I tried without luck:

  • Restarting k3s or k3s-agent
  • Restarting metrics-server
  • Restarting prometheus

Steps To Reproduce:

  • Upgrade k3s "in place": curl -sfL https://get.k3s.io | sh -s -
  • Containers running before the upgrade are now stuck on the same cAdvisor stats until restarted

Expected behavior:

  • cAdvisor stats continue to report correctly after upgrade

Actual behavior:

  • I have to restart all containers, for example via rolling node restarts (see the sketch below)
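
For reference, one way to do such a rolling restart with standard kubectl commands (a sketch only; <node-name> is a placeholder, and it assumes the workloads are managed by controllers that can reschedule them):

kubectl drain <node-name> --ignore-daemonsets
# reboot the node (or otherwise restart its containers), then:
kubectl uncordon <node-name>

Restarting a single workload with kubectl rollout restart deployment/<name> should also bring its stats back, since newly started containers report correctly.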

Additional context / logs:
I reported this earlier in #2895 but at the time I didn't know how to reproduce this issue.

@mnorrsken
Contributor Author

This also seems to happen when no upgrade actually takes place, simply by re-running the command below.
The full command I use for installing is:

      curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=v1.20 sh -s - \
        --write-kubeconfig-mode 640 \
        --disable local-storage \
        --disable servicelb \
        --disable traefik \
        --kubelet-arg=image-gc-high-threshold=85 \
        --kubelet-arg=image-gc-low-threshold=75 \
        --kubelet-arg=container-log-max-files=2 \
        --kubelet-arg=container-log-max-size=5Mi

After that I install nfs-client-provisioner, Traefik 2, and MetalLB.

@brandond
Member

Can you provide a sample command to retrieve the cadvisor metrics that you're seeing go stale?

@mnorrsken
Contributor Author

Before update:

$ kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/prom/pods/prom-prometheus-server-78bcb7cd77-55xs5  | jq
{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "prom-prometheus-server-78bcb7cd77-55xs5",
    "namespace": "prom",
    "selfLink": "/apis/metrics.k8s.io/v1beta1/namespaces/prom/pods/prom-prometheus-server-78bcb7cd77-55xs5",
    "creationTimestamp": "2021-03-10T21:15:40Z"
  },
  "timestamp": "2021-03-10T21:14:31Z",
  "window": "30s",
  "containers": [
    {
      "name": "prometheus-server",
      "usage": {
        "cpu": "13935439n",
        "memory": "512052Ki"
      }
    }
  ]
}

After "update" of k3s cpu usage reports 0 and memory usage seems to report the same value every time:

$ kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/prom/pods/prom-prometheus-server-78bcb7cd77-55xs5  | jq
{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "prom-prometheus-server-78bcb7cd77-55xs5",
    "namespace": "prom",
    "selfLink": "/apis/metrics.k8s.io/v1beta1/namespaces/prom/pods/prom-prometheus-server-78bcb7cd77-55xs5",
    "creationTimestamp": "2021-03-10T21:18:21Z"
  },
  "timestamp": "2021-03-10T21:17:42Z",
  "window": "30s",
  "containers": [
    {
      "name": "prometheus-server",
      "usage": {
        "cpu": "0",
        "memory": "517012Ki"
      }
    }
  ]
}

@brandond
Member

Do you have any errors in your metrics-server pod logs?

@mnorrsken
Contributor Author

No errors. I just did another "install/update" on my cluster. I checked other logs too, but I don't see anything related to metrics or cgroups.

@mnorrsken
Contributor Author

This is reproducible:
Install Debian 10 minimal on an amd64 VM, then run:

apt install curl jq
curl -sfL https://get.k3s.io | sh -

Create a file loop.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: busybox
spec:
  containers:
  - name: busybox
    image: busybox:latest
    command: [ "/bin/sh", "-ec", "--" ]
    args: [ "while true; do sleep 0.0001; done;" ]

kubectl apply -f loop.yaml

Running the following will now report proper CPU usage:
while true; do kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods/busybox | jq -r '.containers[0].usage.cpu'; done

Start another shell:
curl -sfL https://get.k3s.io | sh -

The CPU metric will soon start to report 0 (zero).

Recreate the pod:
kubectl delete -f loop.yaml ; kubectl apply -f loop.yaml

The CPU metric will start to report proper values again after a few errors.
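
The same repro condensed into a single shell sketch (assumes root on a fresh Debian 10 host with the loop.yaml above; a short sleep is added only to keep the output readable):

# initial install and test pod
curl -sfL https://get.k3s.io | sh -
kubectl apply -f loop.yaml

# terminal 1: watch the reported CPU usage
while true; do
  kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods/busybox \
    | jq -r '.containers[0].usage.cpu'
  sleep 5
done

# terminal 2: re-run the installer; the CPU metric soon drops to 0
curl -sfL https://get.k3s.io | sh -

# recreating the pod restores correct values after a few errors
kubectl delete -f loop.yaml && kubectl apply -f loop.yaml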

@mnorrsken
Contributor Author

mnorrsken commented Mar 13, 2021

I debugged the install script. The cause of the issue seems to be the line
systemctl disable k3s
in the "systemd_disable" function

Commenting out this line in the install script makes the issue disappear. However, I don't know enough about systemd internals to know why this happens.

@brandond
Member

That doesn't sound right - disabling the service doesn't actually stop it, it just prevents it from starting automatically on the next boot.

Does this happen when you upgrade the k3s binary and restart the service manually, using curl and systemctl restart?

Does this happen if you stop/disable/enable/start the service without upgrading the binary?
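
(For reference, a rough sketch of the manual tests being asked about; the binary path /usr/local/bin/k3s is the installer's default, and the release URL and asset name for an amd64 node are assumptions:)

# upgrade the binary by hand, then restart the service
curl -Lo /tmp/k3s https://github.com/k3s-io/k3s/releases/latest/download/k3s
install -m 755 /tmp/k3s /usr/local/bin/k3s
systemctl restart k3s

# exercise the service lifecycle without touching the binary
systemctl stop k3s
systemctl disable k3s
systemctl enable k3s
systemctl start k3s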

@mnorrsken
Contributor Author

mnorrsken commented Mar 13, 2021

Yes, I tested doing just "systemctl disable k3s" without running the install script, and the same thing happens. I suspect systemd/systemctl (at least on Debian Buster) does something more than just disabling the service.

@mnorrsken
Contributor Author

I've tried to read a bit of the systemctl code, and it could be that something is sent to the "cgroup subsystem" when disabling a service, because of the additional cgroup settings in the k3s unit file.

@Oats87
Member

Oats87 commented Mar 16, 2021

@mnorrsken On a Debian 10 system running K3s

k3s version v1.20.4+k3s1 (838a906a)

PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

(specifically the Debian 10 AMI in AWS) I'm unable to reproduce this issue through systemctl disable k3s. Metrics continue to be updated and CPU metrics do not go to zero.

Is there anything else that may be special about your system configuration?

@brandond
Member

@mnorrsken what do you mean when you say:

because of the additional cgroup settings in the k3s unit file.

Have you customized your systemd unit to add additional settings not present in the one we install by default? As far as I know we don't do any cgroup-specific configuration in the unit generated by the install script.

@mnorrsken
Contributor Author

Sorry, I thought these had something to do with cgroups, but they are ulimits.

LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity

Anyway, I can still reproduce this on a pristine Debian 10.8 VM.

After some more testing, it seems this only happens when running systemctl disable k3s after running rm -f /etc/systemd/system/k3s.service.
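
In other words, the failing sequence looks like this (a minimal sketch, assuming a running k3s.service on this Debian 10 / systemd 241 setup):

rm -f /etc/systemd/system/k3s.service   # unit file removed first
systemctl disable k3s                   # disabling the now-missing unit; after this,
                                        # stats for already-running containers go stale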

@mnorrsken
Contributor Author

Even more debugging. What happens is that entries in /sys/fs/cgroup/cpu,cpuacct disappear if the service file is removed before doing systemctl disable:

# ls /sys/fs/cgroup/cpu,cpuacct
apparmor.service
cgroup.clone_children
cgroup.procs
console-setup.service
cpuacct.stat
cpuacct.usage
cpuacct.usage_all
cpuacct.usage_percpu
cpuacct.usage_percpu_sys
cpuacct.usage_percpu_user
cpuacct.usage_sys
cpuacct.usage_user
cpu.cfs_period_us
cpu.cfs_quota_us
cpu.shares
cpu.stat
cron.service
dbus.service
dbus.socket
dev-hugepages.mount
dev-mqueue.mount
ifupdown-pre.service
ifup@ens192.service
k3s.service
keyboard-setup.service
kmod-static-nodes.service
-.mount
networking.service
notify_on_release
rsyslog.service
run-k3s-containerd-io.containerd.grpc.v1.cri-sandboxes-0ca054b089ba04b336662f1affd688687033836e925565c0fbc9dddd5eb2f9a7-shm.mount
run-k3s-containerd-io.containerd.grpc.v1.cri-sandboxes-0d21693f83cda82b3918fc0c93b16231bcfab8dac25a12ee647c6f376a227f9b-shm.mount
run-k3s-containerd-io.containerd.grpc.v1.cri-sandboxes-4b385dfcdea1c3d22f9811219fed515eb5d6d9bc2b84a99477411f9a1d6a82de-shm.mount
run-k3s-containerd-io.containerd.grpc.v1.cri-sandboxes-716aa9c1e01202b5af2ae9e82eb72cfb4bd4fe390aa62b24bdbdef680e81f7e8-shm.mount
run-k3s-containerd-io.containerd.grpc.v1.cri-sandboxes-d849981908e887c2b32e4d6cae062b32caef69558563454e43e3e1d83cf46a3c-shm.mount
run-k3s-containerd-io.containerd.grpc.v1.cri-sandboxes-fcdc40e469182f717812fe4f26544725b2536e4667390901336a8e86fa77d6fa-shm.mount
run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-0ca054b089ba04b336662f1affd688687033836e925565c0fbc9dddd5eb2f9a7-rootfs.mount
run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-0d21693f83cda82b3918fc0c93b16231bcfab8dac25a12ee647c6f376a227f9b-rootfs.mount
run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-2ff072d8b74eefe60121fc7f4dc16f07c2e9b7d0d9cad14c8361123505ef4321-rootfs.mount
run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-3c5e5640905cc781a9ee1d015b327f07b8863f6824ee150eb237327919e4ce67-rootfs.mount
run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-4b385dfcdea1c3d22f9811219fed515eb5d6d9bc2b84a99477411f9a1d6a82de-rootfs.mount
run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-53cf2fce914da4e0371d8150cf1c527d29914a0b932a56170c085ad3a7a8bf08-rootfs.mount
run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-716aa9c1e01202b5af2ae9e82eb72cfb4bd4fe390aa62b24bdbdef680e81f7e8-rootfs.mount
run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-9a4e359d4546723bfaec96788b31e92e186ddc1a76bd6c47e81a26d746f3bcde-rootfs.mount
run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-b12c9057f8dbd1ae87b2bcc4839f6fa89a2ecb4b75ca6c5d7557e82cb2aadb83-rootfs.mount
run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-d849981908e887c2b32e4d6cae062b32caef69558563454e43e3e1d83cf46a3c-rootfs.mount
run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-d9ed3b46a5ef038cce81ba3038db45a56bcae2ea95df08e9141a0d6e32df2419-rootfs.mount
run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-f64fc27fc4607ff948bcf8f2619e136e1da637dcc89c6483766e440eba2d5449-rootfs.mount
run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-fcdc40e469182f717812fe4f26544725b2536e4667390901336a8e86fa77d6fa-rootfs.mount
run-netns-cni\x2d4bbe587c\x2dbafe\x2d664f\x2d20a3\x2d23d9e1b817a7.mount
run-netns-cni\x2d5235a53b\x2d9264\x2d7374\x2db535\x2de019911cfbdf.mount
run-netns-cni\x2d8f89102d\x2dbb79\x2d9d3b\x2de051\x2da178e2ccee4e.mount
run-netns-cni\x2db5598530\x2d7b68\x2da6a6\x2d12b0\x2dc21177fbef9f.mount
run-netns-cni\x2dd57a9532\x2d7f92\x2d4cf5\x2ddcad\x2dd5e788c84027.mount
run-netns-cni\x2deb4057ec\x2d3832\x2d7188\x2deb5e\x2d91ec52f65f61.mount
run-user-1000.mount
ssh.service
sys-kernel-debug.mount
syslog.socket
systemd-fsckd.socket
systemd-initctl.socket
systemd-journald-audit.socket
systemd-journald-dev-log.socket
systemd-journald.service
systemd-journald.socket
systemd-journal-flush.service
systemd-logind.service
systemd-modules-load.service
systemd-random-seed.service
systemd-remount-fs.service
systemd-sysctl.service
systemd-sysusers.service
systemd-timesyncd.service
systemd-tmpfiles-setup-dev.service
systemd-tmpfiles-setup.service
systemd-udevd-control.socket
systemd-udevd-kernel.socket
systemd-udevd.service
systemd-udev-trigger.service
systemd-update-utmp.service
systemd-user-sessions.service
system-getty.slice
tasks
var-lib-kubelet-pods-1f425bdf\x2db33b\x2d485e\x2d80fd\x2d65338d12d7a2-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2d7dhj7.mount
var-lib-kubelet-pods-87d0741f\x2d605e\x2d4265\x2d9f22\x2d5135462238d1-volumes-kubernetes.io\x7esecret-local\x2dpath\x2dprovisioner\x2dservice\x2daccount\x2dtoken\x2dmd29n.mount
var-lib-kubelet-pods-8fe609a5\x2d12a9\x2d43ce\x2db7e7\x2def2c5fe912ae-volumes-kubernetes.io\x7esecret-coredns\x2dtoken\x2d4td8t.mount
var-lib-kubelet-pods-9acc4442\x2d7615\x2d4028\x2d9bf1\x2dce2babfe1450-volumes-kubernetes.io\x7esecret-metrics\x2dserver\x2dtoken\x2d54kln.mount
var-lib-kubelet-pods-c49f2827\x2dac9e\x2d48c1\x2d8e6c\x2d6e2f473798bb-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2d5w6p2.mount
var-lib-kubelet-pods-dba46519\x2ddc94\x2d4665\x2da667\x2d5ac5e9154d2d-volumes-kubernetes.io\x7esecret-ssl.mount
var-lib-kubelet-pods-dba46519\x2ddc94\x2d4665\x2da667\x2d5ac5e9154d2d-volumes-kubernetes.io\x7esecret-traefik\x2dtoken\x2dsnjph.mount
# rm -rf /etc/systemd/system/k3s.service
# systemctl disable k3s
# ls /sys/fs/cgroup/cpu,cpuacct
cgroup.clone_children
cgroup.procs
cpuacct.stat
cpuacct.usage
cpuacct.usage_all
cpuacct.usage_percpu
cpuacct.usage_percpu_sys
cpuacct.usage_percpu_user
cpuacct.usage_sys
cpuacct.usage_user
cpu.cfs_period_us
cpu.cfs_quota_us
cpu.shares
cpu.stat
k3s.service
notify_on_release
systemd-udevd.service
system-getty.slice
tasks

@mnorrsken
Contributor Author

That is, only doing

# systemctl disable k3s

without removing k3s.service does NOT affect the cgroup data in /sys/fs/cgroup/cpu,cpuacct

@mnorrsken
Contributor Author

root@k3stest:~# k3s --version
k3s version v1.20.4+k3s1 (838a906a)
go version go1.15.8
root@k3stest:~# cat /etc/debian_version
10.8
root@k3stest:~# uname -a
Linux k3stest 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64 GNU/Linux
root@k3stest:~# systemd --version
systemd 241 (241)
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid

@brandond
Member

brandond commented Mar 16, 2021

That output doesn't look right: the pods should be nested under /kubepods, and k3s should be in /system.slice/k3s.service:

[root@centos03 ~]# ls -l /sys/fs/cgroup/cpu,cpuacct/
total 0
-rw-r--r--.  1 root root 0 Mar 16 15:03 cgroup.clone_children
-rw-r--r--.  1 root root 0 Mar 16 15:00 cgroup.procs
-r--r--r--.  1 root root 0 Mar 16 15:03 cgroup.sane_behavior
-r--r--r--.  1 root root 0 Mar 16 15:03 cpuacct.stat
-rw-r--r--.  1 root root 0 Mar 16 15:03 cpuacct.usage
-r--r--r--.  1 root root 0 Mar 16 15:03 cpuacct.usage_all
-r--r--r--.  1 root root 0 Mar 16 15:03 cpuacct.usage_percpu
-r--r--r--.  1 root root 0 Mar 16 15:03 cpuacct.usage_percpu_sys
-r--r--r--.  1 root root 0 Mar 16 15:03 cpuacct.usage_percpu_user
-r--r--r--.  1 root root 0 Mar 16 15:03 cpuacct.usage_sys
-r--r--r--.  1 root root 0 Mar 16 15:03 cpuacct.usage_user
-rw-r--r--.  1 root root 0 Mar 16 15:03 cpu.cfs_period_us
-rw-r--r--.  1 root root 0 Mar 16 15:03 cpu.cfs_quota_us
-rw-r--r--.  1 root root 0 Mar 16 15:03 cpu.rt_period_us
-rw-r--r--.  1 root root 0 Mar 16 15:03 cpu.rt_runtime_us
-rw-r--r--.  1 root root 0 Mar 16 15:03 cpu.shares
-r--r--r--.  1 root root 0 Mar 16 15:03 cpu.stat
drwxr-xr-x.  4 root root 0 Mar 16 15:03 kubepods
-rw-r--r--.  1 root root 0 Mar 16 15:03 notify_on_release
-rw-r--r--.  1 root root 0 Mar 16 15:03 release_agent
drwxr-xr-x. 58 root root 0 Mar 16 15:00 system.slice
-rw-r--r--.  1 root root 0 Mar 16 15:03 tasks
drwxr-xr-x.  2 root root 0 Mar 16 15:00 user.slice

[root@centos03 ~]# systemd-cgtop -n 1 --depth 3 | cat
/                                                                117      -     1.5G        -        -
/kubepods                                                          -      -    61.3M        -        -
/kubepods/besteffort                                               -      -    46.9M        -        -
/kubepods/besteffort/podb77ee18e-8fc3-4929-968b-94cfa9bf5185       -      -    12.0M        -        -
/kubepods/besteffort/podd35119d3-8de3-4e24-9045-7ab80d445f46       -      -    16.9M        -        -
/kubepods/besteffort/podde5b0247-b65d-45e6-8612-f453bcd1bfee       -      -    10.6M        -        -
/kubepods/besteffort/podfd30f404-b2e1-4126-9f66-c638408ab761       -      -     3.0M        -        -
/kubepods/burstable                                                -      -    14.3M        -        -
/kubepods/burstable/podf270e1f6-2f35-4dc3-9a06-bae79ab3a002        -      -    14.3M        -        -
/system.slice                                                      -      -     1.3G        -        -
/system.slice/NetworkManager.service                               2      -    16.8M        -        -
/system.slice/auditd.service                                       1      -     4.1M        -        -
/system.slice/boot-efi.mount                                       -      -    44.0K        -        -
/system.slice/boot.mount                                           -      -    48.0K        -        -
/system.slice/chronyd.service                                      1      -     2.8M        -        -
/system.slice/crond.service                                        1      -     1.0M        -        -
/system.slice/dbus.service                                         1      -     2.7M        -        -
/system.slice/dev-hugepages.mount                                  -      -    76.0K        -        -
/system.slice/dev-mqueue.mount                                     -      -   128.0K        -        -
/system.slice/gssproxy.service                                     1      -     2.4M        -        -
/system.slice/irqbalance.service                                   1      -   900.0K        -        -
/system.slice/k3s.service                                          7      -     1.0G        -        -
/system.slice/lvm2-lvmetad.service                                 1      -     3.2M        -        -
/system.slice/polkit.service                                       1      -    15.0M        -        -
/system.slice/postfix.service                                      3      -     9.4M        -        -
/system.slice/qemu-guest-agent.service                             1      -   904.0K        -        -
/system.slice/rpcbind.service                                      1      -     2.0M        -        -
/system.slice/rsyslog.service                                      1      -     4.4M        -        -
/system.slice/sshd.service                                         1      -     7.2M        -        -
/system.slice/sys-kernel-debug.mount                               -      -   664.0K        -        -
/system.slice/system-getty.slice                                   1      -   384.0K        -        -
/system.slice/system-getty.slice/getty@tty1.service                1      -        -        -        -
/system.slice/system-lvm2\x2dpvscan.slice                          -      -   392.0K        -        -
/system.slice/systemd-journald.service                             1      -     2.6M        -        -
/system.slice/systemd-logind.service                               1      -     1.6M        -        -
/system.slice/systemd-udevd.service                                1      -    17.2M        -        -
/system.slice/tuned.service                                        1      -    25.6M        -        -
/system.slice/var-lib-nfs-rpc_pipefs.mount                         -      -    12.0K        -        -
/user.slice                                                        4      -   256.4M        -        -
/user.slice/user-0.slice/session-1.scope                           4      -        -        -        -

After deleting, everything is still there:

[root@centos03 ~]# rm -rf /etc/systemd/system/k3s.service

[root@centos03 ~]# ls -l /sys/fs/cgroup/cpu,cpuacct/
total 0
-rw-r--r--.  1 root root 0 Mar 16 15:03 cgroup.clone_children
-rw-r--r--.  1 root root 0 Mar 16 15:00 cgroup.procs
-r--r--r--.  1 root root 0 Mar 16 15:03 cgroup.sane_behavior
-r--r--r--.  1 root root 0 Mar 16 15:03 cpuacct.stat
-rw-r--r--.  1 root root 0 Mar 16 15:03 cpuacct.usage
-r--r--r--.  1 root root 0 Mar 16 15:03 cpuacct.usage_all
-r--r--r--.  1 root root 0 Mar 16 15:03 cpuacct.usage_percpu
-r--r--r--.  1 root root 0 Mar 16 15:03 cpuacct.usage_percpu_sys
-r--r--r--.  1 root root 0 Mar 16 15:03 cpuacct.usage_percpu_user
-r--r--r--.  1 root root 0 Mar 16 15:03 cpuacct.usage_sys
-r--r--r--.  1 root root 0 Mar 16 15:03 cpuacct.usage_user
-rw-r--r--.  1 root root 0 Mar 16 15:03 cpu.cfs_period_us
-rw-r--r--.  1 root root 0 Mar 16 15:03 cpu.cfs_quota_us
-rw-r--r--.  1 root root 0 Mar 16 15:03 cpu.rt_period_us
-rw-r--r--.  1 root root 0 Mar 16 15:03 cpu.rt_runtime_us
-rw-r--r--.  1 root root 0 Mar 16 15:03 cpu.shares
-r--r--r--.  1 root root 0 Mar 16 15:03 cpu.stat
drwxr-xr-x.  4 root root 0 Mar 16 15:03 kubepods
-rw-r--r--.  1 root root 0 Mar 16 15:03 notify_on_release
-rw-r--r--.  1 root root 0 Mar 16 15:03 release_agent
drwxr-xr-x. 58 root root 0 Mar 16 15:00 system.slice
-rw-r--r--.  1 root root 0 Mar 16 15:03 tasks
drwxr-xr-x.  2 root root 0 Mar 16 15:00 user.slice

[root@centos03 ~]# systemd-cgtop -n 1 --depth 3 | cat
/                                                                117      -     1.5G        -        -
/kubepods                                                          -      -    81.8M        -        -
/kubepods/besteffort                                               -      -    60.6M        -        -
/kubepods/besteffort/podb77ee18e-8fc3-4929-968b-94cfa9bf5185       -      -    11.8M        -        -
/kubepods/besteffort/podd35119d3-8de3-4e24-9045-7ab80d445f46       -      -    23.4M        -        -
/kubepods/besteffort/podde5b0247-b65d-45e6-8612-f453bcd1bfee       -      -    18.0M        -        -
/kubepods/besteffort/podfd30f404-b2e1-4126-9f66-c638408ab761       -      -     3.0M        -        -
/kubepods/burstable                                                -      -    21.2M        -        -
/kubepods/burstable/podf270e1f6-2f35-4dc3-9a06-bae79ab3a002        -      -    21.2M        -        -
/system.slice                                                      -      -     1.3G        -        -
/system.slice/NetworkManager.service                               2      -    16.8M        -        -
/system.slice/auditd.service                                       1      -     4.1M        -        -
/system.slice/boot-efi.mount                                       -      -    44.0K        -        -
/system.slice/boot.mount                                           -      -    48.0K        -        -
/system.slice/chronyd.service                                      1      -     2.8M        -        -
/system.slice/crond.service                                        1      -     1.0M        -        -
/system.slice/dbus.service                                         1      -     2.7M        -        -
/system.slice/dev-hugepages.mount                                  -      -    76.0K        -        -
/system.slice/dev-mqueue.mount                                     -      -   128.0K        -        -
/system.slice/gssproxy.service                                     1      -     2.4M        -        -
/system.slice/irqbalance.service                                   1      -   940.0K        -        -
/system.slice/k3s.service                                          7      -     1.0G        -        -
/system.slice/lvm2-lvmetad.service                                 1      -     3.2M        -        -
/system.slice/polkit.service                                       1      -    15.0M        -        -
/system.slice/postfix.service                                      3      -     9.4M        -        -
/system.slice/qemu-guest-agent.service                             1      -   904.0K        -        -
/system.slice/rpcbind.service                                      1      -     2.0M        -        -
/system.slice/rsyslog.service                                      1      -     4.4M        -        -
/system.slice/sshd.service                                         1      -     7.2M        -        -
/system.slice/sys-kernel-debug.mount                               -      -   664.0K        -        -
/system.slice/system-getty.slice                                   1      -   384.0K        -        -
/system.slice/system-getty.slice/getty@tty1.service                1      -        -        -        -
/system.slice/system-lvm2\x2dpvscan.slice                          -      -   392.0K        -        -
/system.slice/systemd-journald.service                             1      -     2.6M        -        -
/system.slice/systemd-logind.service                               1      -     1.6M        -        -
/system.slice/systemd-udevd.service                                1      -    17.2M        -        -
/system.slice/tuned.service                                        1      -    25.6M        -        -
/system.slice/var-lib-nfs-rpc_pipefs.mount                         -      -    12.0K        -        -
/user.slice                                                        4      -   256.4M        -        -
/user.slice/user-0.slice/session-1.scope                           4      -        -        -        -
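
A quick way to check whether the pod cgroups are still present on an affected node (a sketch; paths follow the cgroup v1 layout shown in the listing above):

ls /sys/fs/cgroup/cpu,cpuacct/kubepods
cat /sys/fs/cgroup/cpu,cpuacct/kubepods/cpuacct.usage   # should keep increasing while pods run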

@mnorrsken
Contributor Author

systemd logs when doing it the "wrong" way:

Mar 16 23:13:13 k3stest systemd[1]: Reloading.
Mar 16 23:13:13 k3stest systemd[1]: k3s.service: Current command vanished from the unit file, execution of the command list won't be resumed.

So this can be solved by running systemctl disable before removing the unit files. I just tried that and it works.
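
(For reference, the ordering that avoids the problem, as described above: disable while the unit file still exists, then remove it.)

systemctl disable k3s                    # unit file still on disk, cgroup data stays intact
rm -f /etc/systemd/system/k3s.service    # remove the file afterwards
systemctl daemon-reload                  # optional tidy-up (an assumption, not from this thread)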

@brandond
Member

brandond commented Mar 16, 2021

Are you saying that it is automatically triggering systemctl daemon-reload when you delete the unit file from disk? That is not something I have seen before.

@mnorrsken
Contributor Author

No, what I'm saying is that systemd "bugs out" when trying to disable a unit file that doesn't exist.

@mnorrsken
Contributor Author

You are running CentOS; which systemd version are you using? The issue could be Debian-specific.

@mnorrsken
Contributor Author

Indeed, it seems to be an OS issue. I installed the Debian backports kernel and systemd, and the problem disappears.

root@k3stest:~# uname -a
Linux k3stest 5.10.0-0.bpo.3-amd64 #1 SMP Debian 5.10.13-1~bpo10+1 (2021-02-11) x86_64 GNU/Linux
root@k3stest:~# k3s --version
k3s version v1.20.4+k3s1 (838a906a)
go version go1.15.8
root@k3stest:~# systemd --version
systemd 247 (247.3-1~bpo10+1)
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified

However, Debian Buster is still the stable release, and in any case the order I suggest, disabling before removing the unit files, is more logical (the reverse of the order used when installing a service).

@brandond brandond added this to the v1.20.5+k3s1 milestone Mar 18, 2021
@brandond brandond self-assigned this Mar 18, 2021
@brandond brandond added the kind/bug Something isn't working label Mar 18, 2021
@brandond
Member

Note for QA - this appears to only affect whatever version of systemd Debian Buster is currently shipping.

@mnorrsken
Contributor Author

I've run my install/upgrade Ansible script several times now against get.k3s.io without this issue appearing on any of my Debian servers, so as far as I'm concerned this issue is solved.

@ShylajaDevadiga
Contributor

Reproduced using k3s version v1.20.4+k3s1, following the instructions in #3035 (comment). The CPU metric reports brief 0s before reporting actual values again.
$ cat /etc/debian_version
10.8

Error from server (NotFound): podmetrics.metrics.k8s.io "default/busybox" not found
Error from server (NotFound): podmetrics.metrics.k8s.io "default/busybox" not found
Error from server (NotFound): podmetrics.metrics.k8s.io "default/busybox" not found
551626208n
551626208n
551626208n
551626208n
551626208n
551626208n
551626208n
551626208n
0
0
0
0
0
0
0
0
0
0
0
551992545n
551992545n
551992545n
551992545n
551992545n
551992545n

Validated the fix in k3s version v1.20.5-rc1+k3s1, following the same instructions; the CPU metric does not report 0.
