raw_exec: make raw exec driver work with cgroups v2 #12419

shoenig · 2022-03-31T15:37:09Z

This PR adds support for the raw_exec driver on systems with only cgroups v2.

The raw exec driver is may use cgroups to manage processes. This happens
only on Linux, when exec_driver is enabled, and the no_cgroups option is not
set. The driver uses the freezer controller to freeze processes of a task,
issue a sigkill, then unfreeze. Previously the implementation assumed cgroups
v1, and now it also supports cgroups v2. Because the driver manages cgroups
directly and not via containerd, we must account for the v2 changes ourselves.

There is a bit of refactoring in this PR, but the fundamental design remains
the same.

Unlike the v1 implementation, the v2 implementation respects the client cgroup_parent
configuration option. This value is plumbed through to the driver via a task
environment variable, since it (and allocID / taskName) are not available otherwise
from the task config protobuf. A new environment variable NOMAD_PARENT_CGROUP
is now part of the task runtime when supported.

A few irrelevant tests have been stubbed out with SkipSlow because they are very flaky
on GHA, which skips slow tests. Circle is still configured to run slow tests.

Closes #12351 #12348
Partial #10251 (leaving open for the memory stats)

Reviewers:
Neither GHA or Circle are yet to offer machine types using only cgroupsv2. Instead of running everything manually I forked Nomad into shoenig/nomad and setup a temporary self-hosted GHA runner running ubuntu-21.10. The run based on c8d2402 (this PR) is HERE. The show-system job indicates the cgroupv2 mount point on /sys/fs/cgroup, which activates Nomad's cgroup v2 code paths and tests. Which you can't the output of unless you're me 😞

shoenig@opti3010:~$ cat /etc/os-release 
PRETTY_NAME="Ubuntu 21.10"
NAME="Ubuntu"
VERSION_ID="21.10"
VERSION="21.10 (Impish Indri)"
VERSION_CODENAME=impish
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=impish
shoenig@opti3010:~$ mount -l | grep cgroup 
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

shoenig · 2022-03-31T19:06:42Z

.github/workflows/test-core.yaml

@@ -78,7 +78,8 @@ jobs:
        run: |
          make bootstrap
          make generate-all
-          make test-nomad-module
+          sudo sed -i 's!Defaults!#Defaults!g' /etc/sudoers
+          sudo -E env "PATH=$PATH" make test-nomad-module


TestFS_Log checks the alloc becomes complete, which it doesn't because it gets blocked by the missing cgroup for cpuset unless running as root.

This PR adds support for the raw_exec driver on systems with only cgroups v2. The raw exec driver is able to use cgroups to manage processes. This happens only on Linux, when exec_driver is enabled, and the no_cgroups option is not set. The driver uses the freezer controller to freeze processes of a task, issue a sigkill, then unfreeze. Previously the implementation assumed cgroups v1, and now it also supports cgroups v2. There is a bit of refactoring in this PR, but the fundamental design remains the same. Closes #12351 #12348

tgross

LGTM @shoenig. I've left some comments but they're mostly more like questions. 😀

tgross · 2022-04-05T17:14:51Z

nomad/structs/structs.go

+	// CgroupParent for this node (linux only)
+	CgroupParent string


Should this show up in the api.Node struct as well?

good catch! fixed

tgross · 2022-04-05T17:16:49Z

nomad/core_sched_test.go

@@ -2244,7 +2244,7 @@ func TestCoreScheduler_CSIPluginGC(t *testing.T) {
 }

 func TestCoreScheduler_CSIVolumeClaimGC(t *testing.T) {
-	ci.Parallel(t)
+	ci.SkipSlow(t, "flaky on GHA; #12358")


This isn't a streaming test as suggested in #12358. I think it's fixed with #12439 which merged yesterday. Have you seen otherwise?

Ah yup this is out of date, removed!

tgross · 2022-04-05T17:34:45Z

client/lib/resources/containment_default.go

+type containment struct {
+	// non-linux executors currently do not create resources to be cleaned up
+}


I'm having a little trouble following what we're doing with the Containment interface. We don't implement it for the non-Linux case (here), so isn't there only ever 1 implementation?

Good point, the only implementation applies to Linux and the creation / usage is all gated in build-tagged files. The implementation in containment_default.go isn't even complete much less used; let's just get rid of it.

tgross · 2022-04-05T17:48:15Z

client/lib/cgutil/group_killer.go

+	// kill the processes in cgroup
+	for _, pid := range pids {


Do we not expect that all the PIDs are going to be in the same process group? That is, could we kill via -pid here the same way we do for non-Linux use cases, or is this so we can avoid killing the executor even though we moved it into a different cgroup?

This is honestly just copying the original implementation, but I believe this defends against sub-sub-sub processes that get daemonized. So like

Executor -> A -> B -> C

If B dies C gets owed by PID 1, not A and is no longer part of the process group ... right? 🤔

Children always get the parent's process group when forked, so the chain would be preserved even in that case (I don't think it gets changed after the fact on wait). But this implementation works, so let's keep it.

This is pretty in the weeds but I guess it does protect against callers of setsid or setpgid

That's a good point. 👍

tgross · 2022-04-05T18:05:55Z

drivers/rawexec/driver_unix_test.go

+		// todo(shoenig) - Setting AllocID causes test to fail - with or without
+		//  cgroups, and with or without chroot. It has to do with MkAllocDir
+		//  creating the directory structure, but the actual root cause is still
+		//  TBD. The symptom is that any command you try to execute will result
+		//  in "permission denied" coming from os/exec. This test is imperfect,
+		//  the actual feature of running commands as another user works fine.
+		// AllocID: uuid.Generate()


If we leave AllocID out then the AllocDir.SharedDir and AllocDir.AllocDir then both end up directly under the tempdirthen ends up directly under the tempdir (which is also task.AllocDir), instead of under the allocation ID one level down.

That is instead of /tmp/nomad_driver_harness-/alloc/:alloc_id and /tmp/nomad_driver_harness-/alloc we get /tmp/nomad_driver_harness-/alloc and /tmp/nomad_driver_harness- and maybe the latter has the wrong permissions now?

tgross · 2022-04-05T18:12:33Z

drivers/shared/executor/executor_universal_linux.go

+		// use the task environment variables for determining the cgroup path -
+		// not ideal but plumbing the values directly requires grpc protobuf changes
+		parent := lookup(e.commandCfg.Env, taskenv.CgroupParent)
+		allocID := lookup(e.commandCfg.Env, taskenv.AllocID)
+		task := lookup(e.commandCfg.Env, taskenv.TaskName)


And those changes would be platform-specific, which would be sad to have to plumb into the protobufs if we can avoid it. It's probably worth considering whether we want to have a "platform variables" field or something in the future so we're not trying to smuggle things in thru env vars that we might not necessarily want to expose to the workload.

"platform variables" field

That's a clever idea I'm definitely going to forget, so I created #12468

shoenig · 2022-04-05T21:41:52Z

Final CI run on V2 https://github.com/shoenig/nomad/actions/runs/2098586202

github-actions · 2022-10-22T02:43:28Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

shoenig force-pushed the exec-cleanup branch from cf909fa to c8d2402 Compare March 31, 2022 15:43

vercel bot deployed to Preview – nomad-storybook-and-ui March 31, 2022 15:43 View deployment

vercel bot deployed to Preview – nomad March 31, 2022 15:43 View deployment

shoenig commented Mar 31, 2022

View reviewed changes

shoenig marked this pull request as ready for review March 31, 2022 19:11

shoenig requested review from tgross and schmichael and removed request for tgross March 31, 2022 19:11

shoenig force-pushed the exec-cleanup branch from c8d2402 to 06a3872 Compare April 4, 2022 20:57

vercel bot deployed to Preview – nomad April 4, 2022 20:57 View deployment

vercel bot deployed to Preview – nomad-storybook-and-ui April 4, 2022 20:57 View deployment

shoenig force-pushed the exec-cleanup branch from 06a3872 to be7ec8d Compare April 4, 2022 21:11

vercel bot deployed to Preview – nomad April 4, 2022 21:11 View deployment

vercel bot deployed to Preview – nomad-storybook-and-ui April 4, 2022 21:11 View deployment

lgfa29 added this to the 1.3.0 milestone Apr 5, 2022

tgross approved these changes Apr 5, 2022

View reviewed changes

shoenig mentioned this pull request Apr 5, 2022

Support for platform variables field in task drivers #12468

Open

raw_exec: fixup review comments

1d2e2c0

vercel bot deployed to Preview – nomad-storybook-and-ui April 5, 2022 20:21 View deployment

vercel bot temporarily deployed to Preview – nomad April 5, 2022 20:21 Inactive

shoenig merged commit 1334712 into main Apr 5, 2022

shoenig deleted the exec-cleanup branch April 5, 2022 21:42

This was referenced Apr 5, 2022

Implement raw_exec cgroups v2 support #12348

Closed

Memory stats and freezer management with cgroupv2 #10251

Closed

github-actions bot locked as resolved and limited conversation to collaborators Oct 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

raw_exec: make raw exec driver work with cgroups v2 #12419

raw_exec: make raw exec driver work with cgroups v2 #12419

shoenig commented Mar 31, 2022 •

edited

Loading

shoenig Mar 31, 2022 •

edited

Loading

tgross left a comment

tgross Apr 5, 2022

shoenig Apr 5, 2022

tgross Apr 5, 2022

shoenig Apr 5, 2022

tgross Apr 5, 2022

shoenig Apr 5, 2022

tgross Apr 5, 2022

shoenig Apr 5, 2022

tgross Apr 5, 2022 •

edited

Loading

shoenig Apr 5, 2022

tgross Apr 5, 2022

tgross Apr 5, 2022

tgross Apr 5, 2022

shoenig Apr 5, 2022

shoenig commented Apr 5, 2022

github-actions bot commented Oct 22, 2022

		// CgroupParent for this node (linux only)
		CgroupParent string

raw_exec: make raw exec driver work with cgroups v2 #12419

raw_exec: make raw exec driver work with cgroups v2 #12419

Conversation

shoenig commented Mar 31, 2022 • edited Loading

shoenig Mar 31, 2022 • edited Loading

Choose a reason for hiding this comment

tgross left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tgross Apr 5, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shoenig commented Apr 5, 2022

github-actions bot commented Oct 22, 2022

shoenig commented Mar 31, 2022 •

edited

Loading

shoenig Mar 31, 2022 •

edited

Loading

tgross Apr 5, 2022 •

edited

Loading