
Resource temporarily unavailable: Due to logs filling up /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io #3253

Closed
prashilgupta opened this issue Oct 28, 2021 · 12 comments

@prashilgupta

Description

Describe the bug:
The /run partition is filling up, causing containerd and Kubernetes to become unusable.

```
df -ah /run
Filesystem Size Used Avail Use% Mounted on
tmpfs 395M 395M 0 100% /run

kubectl get pods -A
bash: fork: retry: Resource temporarily unavailable
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x404fcf7 m=0 sigcode=18446744073709551610
...
```
Steps To Reproduce:

K3s has been installed and running for 65 days. One of the pods (an Alpine-based keepalived pod) has a liveness probe configured:

```yaml
livenessProbe:
  exec:
    command:
      - pidof
      - keepalived
  failureThreshold: 3
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
```

We suspect this probe is contributing to the issue.
```
:/run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io# ls -lah 7d99079fde994c2c54c695ccc47eead5336305e083afdaf585120686d5b498d5/log.json
-rw-r--r-- 1 root root 35M Oct 26 08:00 7d99079fde994c2c54c695ccc47eead5336305e083afdaf585120686d5b498d5/log.json
```
[Screenshot 2021-10-26 at 1:31:59 PM]
```
E1022 18:03:02.434927 13149 remote_runtime.go:394] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = failed to exec in container: failed to start exec "0e5a8e1539902911f76888152b289f0976dc3d6245f9660952ec
E1022 18:02:42.467601 13149 remote_runtime.go:394] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = failed to exec in container: failed to start exec "8dd086b3d8cddc215260a5778bb43ce2e00784981911930804f64ccdb846f16a": OCI runtime exec failed: exec failed: write /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/7d99079fde994c2c54c695ccc47eead5336305e083afdaf585120686d5b498d5/.8dd086b3d8cddc215260a5778bb43ce2e00784981911930804f64cc
```

[Screenshot 2021-10-26 at 1:37:02 PM]

Expected behavior:
The size of the log files under /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io should be limited.

Actual behavior:
The log file under /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io keeps growing, making containerd and k3s unstable.
Additional context / logs:
Attached screenshots of the logs: [Screenshot 2021-10-26 at 1:37:02 PM], [Screenshot 2021-10-26 at 1:31:59 PM]

Environmental Info:
K3s Version:

```
kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1+k3s1", GitCommit:"75dba57f9b1de3ec0403b148c52c348e1dee2a5e", GitTreeState:"clean", BuildDate:"2021-05-21T16:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1+k3s1", GitCommit:"75dba57f9b1de3ec0403b148c52c348e1dee2a5e", GitTreeState:"clean", BuildDate:"2021-05-21T16:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
```

Node(s) CPU architecture, OS, and Version:

```
uname -a
Linux ztna-gateway 4.15.0-132-generic #136-Ubuntu SMP Tue Jan 12 14:58:42 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
8166b995-e5cd-4358-a7c3-3db4a2796a30 Ready control-plane,etcd,master 66d v1.21.1+k3s1 10.224.71.116 Ubuntu 18.04.5 LTS 4.15.0-132-generic containerd://1.4.4-k3s2
e3de9562-ff8a-4ccc-91ea-2183895edf10 Ready control-plane,etcd,master 66d v1.21.1+k3s1 10.224.71.115 Ubuntu 18.04.5 LTS 4.15.0-132-generic containerd://1.4.4-k3s2
f702c5aa-c5ae-4322-8536-ff2f012cfeef Ready control-plane,etcd,master 66d v1.21.1+k3s1 10.224.71.117 Ubuntu 18.04.5 LTS 4.15.0-132-generic containerd://1.4.4-k3s2
```

Reference ticket in k3s
k3s-io/k3s#4327

@kolyshkin
Contributor

This does not look like a runc bug; runc does not write any logs per se.

@prashilgupta
Author

@kolyshkin below is a sample of the log, and it is coming from:

```c
write_log(DEBUG, "forward stage-1 (%d) and stage-2 (%d) pids to runc",
```

Can you check again?

```
{"level":"error","msg":"failed to decode "{\"level\":\"debug\", \"msg\": \"nsexec-0[24983]: forward stage-1 (24984) and stage-2 (24985) pids to runc{\"level\":\"debug\", \"msg\": \"nsexec-1[24984]: signal completion to stage-0\"}" to json: invalid character 'l' after object key:value pair","time":"2021-10-26T06:11:37Z"}
```

@cwarje

cwarje commented Dec 2, 2021

@kolyshkin would you consider reopening this issue, given @prashilgupta's last comment? I'm having the exact same problem.

@brandond

brandond commented Dec 9, 2021

Yeah, I think this needs to be reopened. I think the root cause is that (unless I'm misreading how the nsenter call is set up) f1b703f changed the log level to DEBUG by default:

```c
static int loglevel = DEBUG;
```

runc expects all the messages output by nsenter to be parsable as JSON, but the debug messages are not. This causes runc to log a bunch of errors when it fails to parse the messages, which are output every time nsenter is used when debugging is enabled.

@cyphar
Member

cyphar commented Dec 10, 2021

@brandond I don't think your explanation is correct:

  1. While we set DEBUG as the default, this is changed to the log level specified by the user almost immediately in nsexec.c (this is done because there are debug messages output before we get all of the environment variables -- you wouldn't be able to ever get them using --debug otherwise). The logs are additionally filtered by the Go code before being output as well.
  2. All log messages in nsexec.c are JSON -- write_log wraps them all in JSON and JSON-escapes the strings too.

I suspect that what's happening is that the JSON-encoded log messages are being interleaved somehow (which would explain why the log messages have interwoven JSON in the error message) -- but this really shouldn't happen (the log messages are all well under 4096 bytes, which means they're basically guaranteed by Linux to be written in one block with no interleaving).

I'll reopen this since it does appear to be an issue within runc, though I can't see how it could happen -- maybe the system has a reduced global pipe buffer size (though I just did some testing and even setting /proc/sys/fs/pipe-max-size to 1 still doesn't cause the logging issue).
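The atomicity claim above can be checked with a small standalone program (a hypothetical illustration written for this thread, not runc code): two forked writers emit short JSON-style lines to a shared pipe, and the reader verifies that none of them interleave, since POSIX guarantees writes of at most PIPE_BUF bytes to a pipe are atomic.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

/* Hypothetical demo, not runc code: returns the number of corrupted
 * (interleaved) lines observed, and stores the total line count in
 * *total. Expected result on Linux: 0 broken out of 2000. */
long demo_atomic_pipe_writes(long *total)
{
	int fds[2];
	long lines = 0, broken = 0;

	if (pipe(fds) != 0)
		return -1;

	for (int child = 0; child < 2; child++) {
		if (fork() == 0) {
			char msg[128];
			int len = snprintf(msg, sizeof(msg),
				"{\"level\":\"debug\", \"msg\": \"nsexec-%d[%d]: test\"}\n",
				child, getpid());
			/* ~55 bytes, far below PIPE_BUF (>= 512), so each
			 * write() is guaranteed to land as one block. */
			for (int i = 0; i < 1000; i++)
				write(fds[1], msg, len);
			_exit(0);
		}
	}
	close(fds[1]);

	/* Every line read back must be one intact JSON object. */
	FILE *r = fdopen(fds[0], "r");
	char line[256];
	while (fgets(line, sizeof(line), r)) {
		lines++;
		if (line[0] != '{' || !strstr(line, "\"}\n"))
			broken++;
	}
	fclose(r);
	wait(NULL);
	wait(NULL);
	*total = lines;
	return broken;
}
```

On a healthy kernel this reports no broken lines, which is what makes the interleaving seen in the reported log.json so puzzling.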

@cyphar cyphar reopened this Dec 10, 2021
@brandond

brandond commented Dec 10, 2021

While we set DEBUG as the default, this is changed to the log-level specified by the user almost immediately in nsexec.c

Yeah, I did a little more tracing through the code after that comment and it looked like the log level environment variable was always set. Debug seems like an odd choice for a default though.

write_log wraps them all in JSON and JSON-escapes the strings too.

I think that might be the source of the problem. Just based on the messages included in the screenshots above, it looks like the asprintf call at

```c
ret = asprintf(&json, "{\"level\":\"%s\", \"msg\": \"%s[%d]: %s\"}\n",
	       level_str[level], stage, getpid(), message);
```

is truncating the JSON payload just after the message. I'm not sure if it's tripping on null termination or something, but the end result is that multiple unterminated JSON log payloads end up on the same line. I'm not sure if that's more or less likely than the interleaving of log lines that you described?

I'm new to this code base, so apologies if none of this makes any sense.

@cyphar
Member

cyphar commented Dec 10, 2021

Debug seems like an odd choice for a default though.

I was trying to explain this above, but maybe it was a bit unclear. The reason this is necessary is that you need to have debug be the default in order for runc --debug to work for log messages emitted before the log level is set in nsexec.c from the environment variable. In practice this is like 2-3 log events, but without debug as the default there would be no point logging debug messages before the log level is set (because they would always be discarded, even if you ran runc with --debug).

runc's actual default log level is warn (or info?), not debug.

... is truncating the json payload just after the message. I'm not sure if it's tripping on null termination or something, but the end result is that multiple unterminated JSON log payloads end up on the same line. I'm not sure if that's more or less likely than the interleaving of log lines that you described?

But I'm not sure how that can happen -- %s won't print a naked null byte. However, the two log messages are coming from different processes (nsexec-0 and nsexec-1), which seems to suggest that it's a race, because there are also JSON decoding failures on the trailing "} (indicating that those bytes are getting printed, just later). But I'm not sure how it could be interleaved either. Very strange.
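For reference, the formatting step under discussion can be reproduced with a standalone helper (an illustrative sketch reusing the same format string as the nsexec.c snippet quoted above, not runc's actual code). Since asprintf's %s copies bytes up to the first NUL, a single call always yields one complete, newline-terminated JSON line; truncation mid-message would require a NUL byte inside the message itself.

```c
#define _GNU_SOURCE /* asprintf is a GNU extension */
#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch of the write_log formatting (hypothetical helper,
 * not runc code): builds one complete JSON log line, or returns NULL
 * on allocation failure. Caller frees the result. */
char *format_log(const char *level, const char *stage, int pid,
		 const char *message)
{
	char *json = NULL;

	if (asprintf(&json, "{\"level\":\"%s\", \"msg\": \"%s[%d]: %s\"}\n",
		     level, stage, pid, message) < 0)
		return NULL;
	return json;
}
```

For example, format_log("debug", "nsexec-0", 42, "hello") produces the single line {"level":"debug", "msg": "nsexec-0[42]: hello"} followed by a newline, never a partial object.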

@kolyshkin
Contributor

This looks very much like a race that we fixed in #3130 (the fix made its way to runc 1.0.2). Since I do not see a runc version anywhere in this bug report, I assume you're using a version prior to 1.0.2. Please upgrade to 1.0.2 or 1.0.3.

@kolyshkin
Contributor

IOW this is a duplicate of #3119

@brandond

brandond commented Dec 10, 2021

We're updating to v1.0.3 for our December release cycle - we'll ask users to test once that's available!

@prashil-g

@brandond What version of k3s will have this fix?

@brandond

The ones that come out later this week.
