
Better handle hitting the memory limit #2812

Merged: 10 commits, Feb 26, 2021

Conversation

@kolyshkin (Contributor) commented Feb 19, 2021

This improves runc error reporting in cases where memory limit is set too low.

NOTE: while this is clearly an enhancement, I want this in rc94 because

  • it does not change runc behavior other than error text;
  • it helps a lot with some issues reported by customers.

1. Improve EBUSY handling on setting cgroup memory limit

On cgroup v1 + fs driver, setting memory.limit_in_bytes results in EBUSY. This is reported as

ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: failed to write "1000": write /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-1011.scope/xe3/memory.limit_in_bytes: device or resource busy

To decipher the error a user needs to know that EBUSY is returned by the kernel when the limit being set is too low. In addition, it would be handy to know what the current usage is.

Handle EBUSY and report this:

ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: unable to set memory limit to 1000 (current usage: 1204224, peak usage: 2936832)

Related to #2736

2. Check for OOM kill on container start

When a container fails to start because the memory limit is set too low and container init is OOM-killed,
the error messages returned by runc are semi-random and rather cryptic. Here are a few examples
(truncated to remove the common prefix for clarity):

  • process_linux.go:348: copying bootstrap data to pipe caused: write init-p: broken pipe (cgroup v1 + systemd driver)
  • process_linux.go:352: getting the final child's pid from pipe caused: EOF (cgroup v1 + systemd driver)
  • process_linux.go:495: container init caused: read init-p: connection reset by peer (cgroup v2)
  • process_linux.go:484: writing syncT 'resume' caused: write init-p: broken pipe (cgroup v2)

On the container start error path, add a check for whether an OOM kill has happened, and report that instead of the original (cryptic) error:

ERRO[0000] container_linux.go:367: starting container process caused: container init was OOM-killed (memory limit too low?)

(or, if --debug is set, also provide the original error):

ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:343: container init was OOM-killed (memory limit too low?) caused: process_linux.go:520: container init caused: process_linux.go:509: writing syncT 'resume' caused: write init-p: broken pipe

3. Check for OOM kill on container exec

Same as above, with some nuances:

  1. The container is already running and OOM kill counter might not be
    zero. This is why we have to read the counter before exec and after
    it failed.

  2. An unrelated OOM kill event might occur in parallel with our exec
    (and I see no way to find out which process was killed, except to
    parse kernel logs which seems excessive and not very reliable).
    This is why we report possible OOM kill.

The error message changed from

ERRO[0000] exec failed: container_linux.go:367: starting container process caused: read init-p: connection reset by peer

to

ERRO[0000] exec failed: container_linux.go:367: starting container process caused: process_linux.go:105: possibly OOM-killed caused: read init-p: connection reset by peer

4. Some minor improvements and refactoring along the way.

Please see individual commits for details.


@kolyshkin kolyshkin force-pushed the memory-tight branch 3 times, most recently from 7e961df to 560b22e Compare February 19, 2021 03:49
@kolyshkin (Contributor, Author) commented:

Before:

ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: failed to write "1000": write /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-1011.scope/xe3/memory.limit_in_bytes: device or resource busy

After:

ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: unable to set memory limit to 1000 (current usage: 1204224, peak usage: 2936832)

This at least gives some info about what's wrong as well as a hint about the lowest possible limit.

This also sheds some light on what is going on in #2736: apparently runc init has some peak memory usage on start, which is why retrying on EBUSY helps @wzshiming.

Yet I am not convinced this should be handled by a retry.

@kolyshkin kolyshkin force-pushed the memory-tight branch 5 times, most recently from a416213 to 7a3b640 Compare February 19, 2021 19:51
@kolyshkin (Contributor, Author) commented:

Failure on CentOS 7 seems like a random one (some rare race in old systemd?):

not ok 68 ps -e -x
(in test file tests/integration/ps.bats, line 53)
  `[ "$status" -eq 0 ]' failed
 runc spec (status=0):
 runc run -d --console-socket /tmp/console.sock test_busybox (status=1):
 time="2021-02-19T11:44:35Z" level=error msg="container_linux.go:367: starting container process
    caused: process_linux.go:502: container init
    caused: process_linux.go:465: setting cgroup config for procHooks process
    caused: Unit runc-test_busybox.scope is not loaded."

@kolyshkin (Contributor, Author) commented:

@wzshiming can you test this PR on your setup to reproduce #2736 and copy-paste the runc errors here? It will help to understand what is going on in your case.

@kolyshkin kolyshkin changed the title [WIP] better handle the attempt to set memory limit too low better handle the attempt to set memory limit too low Feb 22, 2021
@kolyshkin kolyshkin marked this pull request as ready for review February 22, 2021 20:37
@kolyshkin (Contributor, Author) commented:

No longer a WIP. The runc init memory usage analysis was not done, but it won't change anything in this code.

@kolyshkin kolyshkin marked this pull request as draft February 22, 2021 22:57
@kolyshkin kolyshkin changed the title better handle the attempt to set memory limit too low Better handle hitting the memory limit Feb 23, 2021
@kolyshkin kolyshkin marked this pull request as ready for review February 23, 2021 23:54
@kolyshkin (Contributor, Author) commented:

Moved to draft to add OOM handling in runc exec. Added, tested, ready for prime time!

Add integration test coverage for code initially added by commit
d8b8f76 ("Fix problem when update memory and swap memory",
2016-04-05).

This is in addition to existing unit test,
TestMemorySetSwapSmallerThanMemory.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Currently, we read and parse 5 different files while we only need 1.

Use GetCgroupParamUint() directly to get current limit.

While at it, remove the workaround previously needed for the unit test,
and make it a bit more verbose.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Factor out setMemory and setSwap
2. Pass cgroup.Resources (rather than cgroup) to setMemoryAndSwap().
3. Merge the duplicated "set memory, set swap" case.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
EBUSY when trying to set memory limit may mean the new limit is too low
(lower than the current usage, and the kernel can't do anything).
Provide a more specific error for such a case.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. This is the only function in the package with Get prefix
   that does not read a file (but parses a string). Rename
   accordingly, and convert the callers.

	GetCgroupParamKeyValue -> ParseKeyValue

2. Use strings.Split rather than strings.Fields. Split by a space
   is 2x faster, plus we can limit the splitting. The downside is
   we have to strip a newline in one of the callers.

3. Improve the doc and the code flow.

4. Fix a test case with invalid data (spaces at BOL).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
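A sketch of the renamed helper, under the assumption that its shape matches the description above (split by a single space with a limited field count, with callers stripping any trailing newline themselves):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseKeyValue parses a "key value" line as found in flat cgroup files
// such as memory.stat. SplitN on a single space is faster than
// strings.Fields and caps the number of fields produced.
func ParseKeyValue(t string) (string, uint64, error) {
	parts := strings.SplitN(t, " ", 3)
	if len(parts) != 2 {
		return "", 0, fmt.Errorf("line %q is not in key value format", t)
	}
	value, err := strconv.ParseUint(parts[1], 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("unable to parse %q as uint64: %w", parts[1], err)
	}
	return parts[0], value, nil
}

func main() {
	k, v, err := ParseKeyValue("oom_kill 7")
	fmt.Println(k, v, err)
	// → oom_kill 7 <nil>
}
```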
This is to benefit from openat2() implementation, on kernels
that support it. Theoretically this also improves security.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Generalize the libct/getValueFromCgroup() as fscommon.GetValueByKey(),
and document it.

No changes other than using fscommon.ParseUint to convert the value.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This makes the code simpler and more future-proof, in case
more values appear in hugetlb.*.events.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In some cases, container init fails to start because it is killed by
the kernel OOM killer. The errors returned by runc in such cases are
semi-random and rather cryptic. Below are a few examples.

On cgroup v1 + systemd cgroup driver:

> process_linux.go:348: copying bootstrap data to pipe caused: write init-p: broken pipe

> process_linux.go:352: getting the final child's pid from pipe caused: EOF

On cgroup v2:

> process_linux.go:495: container init caused: read init-p: connection reset by peer

> process_linux.go:484: writing syncT 'resume' caused: write init-p: broken pipe

This commit adds the OOM method to cgroup managers, which tells whether
the container was OOM-killed. In case that has happened, the original error
is discarded (unless --debug is set), and the new OOM error is reported
instead:

> ERRO[0000] container_linux.go:367: starting container process caused: container init was OOM-killed (memory limit too low?)

Also, fix the rootless test cases that are failing because they expect
an error in the first line, and we have an additional warning now:

> unable to get oom kill count" error="no directory specified for memory.oom_control

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
An exec may fail due to memory shortage (the cgroup memory limit being too
tight), and the error message provided in this case is clueless:

> $ sudo ../runc/runc exec xx56 top
> ERRO[0000] exec failed: container_linux.go:367: starting container process caused: read init-p: connection reset by peer

Same as the previous commit for run/start, check the OOM kill counter
and report an OOM kill.

The differences from run are

1. The container is already running and OOM kill counter might not be
   zero.  This is why we have to read the counter before exec and after
   it failed.

2. An unrelated OOM kill event might occur in parallel with our exec
   (and I see no way to find out which process was killed, except to
   parse kernel logs which seems excessive and not very reliable).
   This is why we report _possible_ OOM kill.

With this commit, the error message looks like:

> ERRO[0000] exec failed: container_linux.go:367: starting container process caused: process_linux.go:105: possibly OOM-killed caused: read init-p: connection reset by peer

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>