
Better handle hitting the memory limit #2812

Merged: 10 commits, Feb 26, 2021

Conversation

@kolyshkin (Contributor) commented Feb 19, 2021

This improves runc error reporting in cases where memory limit is set too low.

NOTE: while this is clearly an enhancement, I want this in rc94 because

  • it does not change runc behavior other than error text;
  • it helps a lot with some issues reported by customers.

1. Improve EBUSY handling on setting cgroup memory limit

On cgroup v1 + fs driver, setting memory.limit_in_bytes results in EBUSY. This is reported as

ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: failed to write "1000": write /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-1011.scope/xe3/memory.limit_in_bytes: device or resource busy

To decipher the error a user needs to know that EBUSY is returned by the kernel when the limit being set is too low. In addition, it would be handy to know what the current usage is.

Handle EBUSY and report this:

ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: unable to set memory limit to 1000 (current usage: 1204224, peak usage: 2936832)

Related to #2736

2. Check for OOM kill on container start

When a container fails to start because the memory limit is set too low and container init is OOM-killed,
the error messages returned by runc are semi-random and rather cryptic. Here are a few examples
(truncated to remove the common prefix for clarity):

  • process_linux.go:348: copying bootstrap data to pipe caused: write init-p: broken pipe (cgroup v1 + systemd driver)
  • process_linux.go:352: getting the final child's pid from pipe caused: EOF (cgroup v1 + systemd driver)
  • process_linux.go:495: container init caused: read init-p: connection reset by peer (cgroup v2)
  • process_linux.go:484: writing syncT 'resume' caused: write init-p: broken pipe (cgroup v2)

On the container start error path, add a check for whether an OOM kill has happened, and report that instead of the original (cryptic) error:

ERRO[0000] container_linux.go:367: starting container process caused: container init was OOM-killed (memory limit too low?)

(or, if --debug is set, also provide the original error):

ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:343: container init was OOM-killed (memory limit too low?) caused: process_linux.go:520: container init caused: process_linux.go:509: writing syncT 'resume' caused: write init-p: broken pipe

3. Check for OOM kill on container exec

Same as above, with some nuances:

  1. The container is already running and OOM kill counter might not be
    zero. This is why we have to read the counter before exec and after
    it failed.

  2. An unrelated OOM kill event might occur in parallel with our exec
    (and I see no way to find out which process was killed, except to
    parse kernel logs which seems excessive and not very reliable).
    This is why we report possible OOM kill.

The error message changed from

ERRO[0000] exec failed: container_linux.go:367: starting container process caused: read init-p: connection reset by peer

to

ERRO[0000] exec failed: container_linux.go:367: starting container process caused: process_linux.go:105: possibly OOM-killed caused: read init-p: connection reset by peer

4. Some minor improvements and refactoring along the way.

Please see individual commits for details.


@kolyshkin kolyshkin force-pushed the memory-tight branch 3 times, most recently from 7e961df to 560b22e Compare February 19, 2021 03:49
@kolyshkin (Contributor, Author) commented:

Before:

ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: failed to write "1000": write /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-1011.scope/xe3/memory.limit_in_bytes: device or resource busy

After:

ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: unable to set memory limit to 1000 (current usage: 1204224, peak usage: 2936832)

This at least gives some info about what's wrong as well as a hint about the lowest possible limit.

This also sheds some light on what is going on in #2736: apparently runc init has some peak memory usage on start, which is why retrying on EBUSY helps @wzshiming.

Yet I am not convinced this should be handled by a retry.

@kolyshkin kolyshkin force-pushed the memory-tight branch 5 times, most recently from a416213 to 7a3b640 Compare February 19, 2021 19:51
@kolyshkin (Contributor, Author) commented:

Failure on CentOS 7 seems like a random one (some rare race in old systemd?):

not ok 68 ps -e -x
(in test file tests/integration/ps.bats, line 53)
  `[ "$status" -eq 0 ]' failed
 runc spec (status=0):
 runc run -d --console-socket /tmp/console.sock test_busybox (status=1):
 time="2021-02-19T11:44:35Z" level=error msg="container_linux.go:367: starting container process
    caused: process_linux.go:502: container init
    caused: process_linux.go:465: setting cgroup config for procHooks process
    caused: Unit runc-test_busybox.scope is not loaded."

@kolyshkin (Contributor, Author) commented:

@wzshiming can you test this PR on your setup to reproduce #2736 and copy-paste the runc errors here? It will help to understand what is going on in your case.

@kolyshkin kolyshkin changed the title [WIP] better handle the attempt to set memory limit too low better handle the attempt to set memory limit too low Feb 22, 2021
@kolyshkin kolyshkin marked this pull request as ready for review February 22, 2021 20:37
@kolyshkin (Contributor, Author) commented:

No longer a WIP. The runc init memory usage analysis was not done, but it won't change anything in this code.

@kolyshkin kolyshkin marked this pull request as draft February 22, 2021 22:57
@kolyshkin kolyshkin changed the title better handle the attempt to set memory limit too low Better handle hitting the memory limit Feb 23, 2021
@kolyshkin kolyshkin marked this pull request as ready for review February 23, 2021 23:54
@kolyshkin (Contributor, Author) commented:

Moved to draft to add OOM handling in runc exec. Added, tested, ready for prime time!

Add integration test coverage for code initially added by commit
d8b8f76 ("Fix problem when update memory and swap memory",
2016-04-05).

This is in addition to existing unit test,
TestMemorySetSwapSmallerThanMemory.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Currently, we read and parse 5 different files while we only need 1.

Use GetCgroupParamUint() directly to get current limit.

While at it, remove the workaround previously needed for the unit test,
and make it a bit more verbose.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Factor out setMemory and setSwap
2. Pass cgroup.Resources (rather than cgroup) to setMemoryAndSwap().
3. Merge the duplicated "set memory, set swap" case.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
EBUSY when trying to set memory limit may mean the new limit is too low
(lower than the current usage, and the kernel can't do anything).
Provide a more specific error for such a case.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. This is the only function in the package with Get prefix
   that does not read a file (but parses a string). Rename
   accordingly, and convert the callers.

	GetCgroupParamKeyValue -> ParseKeyValue

2. Use strings.Split rather than strings.Fields. Split by a space
   is 2x faster, plus we can limit the splitting. The downside is
   we have to strip a newline in one of the callers.

3. Improve the doc and the code flow.

4. Fix a test case with invalid data (spaces at BOL).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
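A sketch of the renamed helper, under the assumption that its shape matches the description above (split by a single space with a limited field count, with callers stripping any trailing newline themselves):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseKeyValue parses a "key value" line as found in flat cgroup files
// such as memory.stat. SplitN on a single space is faster than
// strings.Fields and caps the number of fields produced.
func ParseKeyValue(t string) (string, uint64, error) {
	parts := strings.SplitN(t, " ", 3)
	if len(parts) != 2 {
		return "", 0, fmt.Errorf("line %q is not in key value format", t)
	}
	value, err := strconv.ParseUint(parts[1], 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("unable to parse %q as uint64: %w", parts[1], err)
	}
	return parts[0], value, nil
}

func main() {
	k, v, err := ParseKeyValue("oom_kill 7")
	fmt.Println(k, v, err)
	// → oom_kill 7 <nil>
}
```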
This is to benefit from openat2() implementation, on kernels
that support it. Theoretically this also improves security.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Generalize the libct/getValueFromCgroup() as fscommon.GetValueByKey(),
and document it.

No changes other than using fscommon.ParseUint to convert the value.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This makes the code simpler and more future-proof, in case
more values appear in hugetlb.*.events.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In some cases, container init fails to start because it is killed by
the kernel OOM killer. The errors returned by runc in such cases are
semi-random and rather cryptic. Below are a few examples.

On cgroup v1 + systemd cgroup driver:

> process_linux.go:348: copying bootstrap data to pipe caused: write init-p: broken pipe

> process_linux.go:352: getting the final child's pid from pipe caused: EOF

On cgroup v2:

> process_linux.go:495: container init caused: read init-p: connection reset by peer

> process_linux.go:484: writing syncT 'resume' caused: write init-p: broken pipe

This commit adds the OOM method to cgroup managers, which tells whether
the container was OOM-killed. In case that has happened, the original error
is discarded (unless --debug is set), and the new OOM error is reported
instead:

> ERRO[0000] container_linux.go:367: starting container process caused: container init was OOM-killed (memory limit too low?)

Also, fix the rootless test cases that are failing because they expect
an error in the first line, and we have an additional warning now:

> unable to get oom kill count" error="no directory specified for memory.oom_control

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
An exec may fail due to memory shortage (the cgroup memory limit being too
tight), and the error message provided in this case is clueless:

> $ sudo ../runc/runc exec xx56 top
> ERRO[0000] exec failed: container_linux.go:367: starting container process caused: read init-p: connection reset by peer

Same as the previous commit for run/start, check the OOM kill counter
and report an OOM kill.

The differences from run are

1. The container is already running and OOM kill counter might not be
   zero.  This is why we have to read the counter before exec and after
   it failed.

2. An unrelated OOM kill event might occur in parallel with our exec
   (and I see no way to find out which process was killed, except to
   parse kernel logs which seems excessive and not very reliable).
   This is why we report _possible_ OOM kill.

With this commit, the error message looks like:

> ERRO[0000] exec failed: container_linux.go:367: starting container process caused: process_linux.go:105: possibly OOM-killed caused: read init-p: connection reset by peer

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>