cAdvisor stops updating metrics after k3s upgrade #3035
Comments
This also seems to happen when no upgrade is taking place, just by running the command below again:
After that, I install nfs-client-provisioner, Traefik 2, and MetalLB. |
Can you provide a sample command to retrieve the cadvisor metrics that you're seeing go stale? |
Before update:
After the "update" of k3s, CPU usage reports 0 and memory usage seems to report the same value every time:
|
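For anyone wanting to sample the raw cAdvisor counters directly, here is a minimal sketch. It is not the command used in this thread (that output was not preserved). It assumes the kubelet's cAdvisor endpoint is reachable through the API server proxy, that `NODE` holds a node name, and that the pod is called `loop`; all three are illustrative assumptions.

```shell
# Sketch: sum the container_cpu_usage_seconds_total samples for one pod,
# reading Prometheus text format on stdin. The pod name is passed as $1.
# Every matching series is summed (pod-level and per-container), so the
# result is only meaningful for before/after comparison, not as an absolute.
pod_cpu_seconds() {
    pod="$1"
    awk -v pod="$pod" '
        index($0, "container_cpu_usage_seconds_total{") == 1 &&
        index($0, "pod=\"" pod "\"") {
            # The sample value is the first field after the closing brace.
            rest = substr($0, index($0, "} ") + 2)
            split(rest, parts, " ")
            sum += parts[1]
        }
        END { printf "%.2f\n", sum }
    '
}

# Hypothetical usage (NODE and the pod name "loop" are assumptions):
#   kubectl get --raw "/api/v1/nodes/${NODE}/proxy/metrics/cadvisor" | pod_cpu_seconds loop
```

Running the pipeline twice a few seconds apart and comparing the two numbers should show whether the counter is still advancing; an unchanged value for a busy pod would match the stale behaviour described here.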
Do you have any errors in your metrics-server pod logs? |
No errors. I just did another "install/update" on my cluster. I checked other logs too, but I don't see anything related to metrics or cgroups. |
This is reproducible:
create file loop.yaml:
Running the following will now report proper CPU usage:
Start another shell:
The CPU metric will soon start to report 0 (zero).
Recreate the pod:
The CPU metric will start to report proper values again after a few errors. |
I debugged the install script. The cause of the issue seems to be the line:
Commenting out this line in the install script makes the issue disappear. However, I don't know enough about systemd specifics to know why this happens. |
That doesn't sound right - disabling the service doesn't actually stop it, it just prevents it from starting automatically on the next boot. Does this happen when you upgrade the k3s binary and restart the service manually, using curl and systemctl restart? Does this happen if you stop/disable/enable/start the service without upgrading the binary? |
Yes, I tested doing just "systemctl disable k3s" without running the install script and the same thing happens. I suspect systemd/systemctl (at least on Debian Buster) does something more than just disabling the service. |
I've tried to read a bit of the systemctl code, and it could be that something is sent to the "cgroup subsystem" when disabling a service, because of the additional cgroup settings in the k3s unit file. |
@mnorrsken On a Debian 10 system running K3s
(specifically the Debian 10 AMI in AWS) I'm unable to reproduce this issue through
Is there anything else that may be special about your system configuration? |
@mnorrsken what do you mean when you say:
Have you customized your systemd unit to add additional settings not present in the one we install by default? As far as I know we don't do any cgroup-specific configuration in the unit generated by the install script. |
Sorry, I thought these had something to do with cgroups, but it was ulimits.
Anyway, I can still reproduce this on a pristine Debian 10.8 VM. After some new testing, it seems this only happens when running |
Even more debugging. What happens is that the entries in /sys/fs/cgroup/cpu,cpuacct disappear if the service file is removed before running systemctl disable:
|
That is, only doing
without removing k3s.service does NOT affect the cgroup data in |
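A rough way to observe the effect described above is to count the directories under the cgroup v1 hierarchy before and after the operation. The sketch below assumes the `/sys/fs/cgroup/cpu,cpuacct` path mentioned in this thread as the default root; the function name is illustrative.

```shell
# Sketch: count cgroup directories below a hierarchy root.
# Default root is the cgroup v1 path discussed in this thread; pass another
# path as $1 to inspect a different hierarchy.
count_cgroup_dirs() {
    root="${1:-/sys/fs/cgroup/cpu,cpuacct}"
    # -mindepth 1 excludes the root itself; tr strips padding that some
    # wc implementations add.
    find "$root" -mindepth 1 -type d 2>/dev/null | wc -l | tr -d ' '
}

# Hypothetical usage:
#   before=$(count_cgroup_dirs)
#   ... perform the disable/remove steps under test ...
#   after=$(count_cgroup_dirs)
#   [ "$after" -lt "$before" ] && echo "cgroup directories disappeared"
```

A drop in the count right after the step under test would be consistent with the pod cgroups being removed from the hierarchy.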
|
That output doesn't look right? The pods should be nested under
After deleting, everything is still there:
|
systemd logs when doing it the "wrong" way:
So this can be solved by running systemctl disable before removing the unit files. I just tried that and it is working. |
Are you saying that it is automatically triggering |
No, what I'm saying is that systemd "bugs out" when trying to disable a unit file that doesn't exist. |
You are running CentOS. Which systemd version are you using? The issue could be Debian-specific. |
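To compare systemd versions across hosts, a small helper like the following may be handy. It assumes the first line of `systemctl --version` has the form `systemd <number> (...)`, and the function name is illustrative.

```shell
# Sketch: print the systemd version number reported by systemctl.
# Assumes the first output line looks like "systemd 241 (241)".
systemd_version() {
    systemctl --version | awk 'NR == 1 { print $2 }'
}

# Hypothetical usage:
#   echo "systemd version on $(hostname): $(systemd_version)"
```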
Indeed, it seems to be an OS issue. I installed systemd from Debian backports and the problem disappears.
However, Debian Buster is still the stable version, and in any case the order I suggest, disabling before removing the unit files, is more logical (the reverse of the order used when installing a service). |
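The ordering suggested above can be sketched as a small helper. This is not the k3s install script's code; `UNIT_DIR` and `remove_unit` are hypothetical names used only for illustration.

```shell
# Sketch only: disable a systemd unit BEFORE deleting its unit file.
# UNIT_DIR and remove_unit are illustrative names, not taken from the
# k3s install script.
UNIT_DIR="${UNIT_DIR:-/etc/systemd/system}"

remove_unit() {
    unit="$1"
    unit_file="${UNIT_DIR}/${unit}.service"

    # Disable first, while systemd can still read the unit file.
    if [ -f "$unit_file" ]; then
        systemctl disable "$unit" || return 1
    fi

    # Only then remove the file and let systemd re-read its configuration.
    rm -f "$unit_file"
    systemctl daemon-reload
}

# Hypothetical usage (requires root):
#   remove_unit k3s
```

Skipping the disable call when the file is already gone avoids asking systemd to disable a unit file that no longer exists, which is the situation reported as problematic on Debian Buster.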
Note for QA - this appears to only affect whatever version of systemd Debian Buster is currently shipping. |
I've run my install/upgrade Ansible script several times now against get.k3s.io without this issue appearing on any of my Debian servers, so as far as I'm concerned this issue is solved. |
Reproduced using k3s version v1.20.4+k3s1, following the instructions in #3035 (comment): the CPU metric briefly reports 0 before reporting actual values.
Validated the fix in k3s version v1.20.5-rc1+k3s1, following the same instructions: the CPU metric does not report 0. |
Environmental Info:
K3s Version:
v1.20.4+k3s1
Node(s) CPU architecture, OS, and Version:
Linux mandalore 5.10.16-meson64 #21.02.2 SMP PREEMPT Sun Feb 14 21:50:52 CET 2021 aarch64 GNU/Linux
Linux alderaan 5.10.17-v8+ #1403 SMP PREEMPT Mon Feb 22 11:37:54 GMT 2021 aarch64 GNU/Linux
Linux glados 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64 GNU/Linux
Cluster Configuration:
Cluster#1: 1 master 3 workers
Cluster#2: 1 master (x86)
Describe the bug:
cAdvisor stops reporting new stats for containers (cpu usage, memory usage) after doing an in-place upgrade of the system
curl -sfL https://get.k3s.io | sh -s -
this seems to happen regardless of k3s version
The only way to get the stats working again is to restart containers. For every new container starting after upgrading, stats are correct
It seems to happen on both my arm64 cluster and my amd64 single master.
Other things I tried without luck:
Steps To Reproduce:
curl -sfL https://get.k3s.io | sh -s -
Expected behavior:
Actual behavior:
Additional context / logs:
I reported this earlier in #2895 but at the time I didn't know how to reproduce this issue.