Task using exec driver fails with "cgroup path must be set" #14797

Closed
soupdiver opened this issue Oct 4, 2022 · 15 comments

Nomad version

Nomad v1.3.5 (1359c2580fed080295840fb888e28f0855e42d50)

Operating system and Environment details

Ubuntu 22.04 inside Proxmox CT

Issue

Task configured with exec driver fails with error message: cgroup path must be set

Reproduction steps

Expected Result

Task starts

Actual Result

Task fails

Job file (if appropriate)

job "home-tor" {
    datacenters = ["home"]
    type = "service"
    
    group "tor" {
        count = 1

        service {
            provider = "nomad"
        }

        task "tor" {
            driver = "exec"
            user = "debian-tor"

            config {
                command = "tor"
            }

            resources {
                memory = "1000"
                cpu    = "1000"
            }
        }
    }
}

I run my worker inside a Proxmox CT, which is an LXC container. Not sure if the nesting causes trouble.
My worker reports unique.cgroup.mountpoint | /sys/fs/cgroup and cgroups v2.
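For reference, the fingerprinted cgroup attributes can be read back from the client itself; a minimal check (assuming the CLI can reach the local agent):

nomad node status -self -verbose | grep -i cgroup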

I could not find anything via Google except the source code with the error:

if command.Resources == nil || command.Resources.LinuxResources == nil || command.Resources.LinuxResources.CpusetCgroupPath == "" {
    return errors.New("cgroup path must be set")
}

But I could not really make sense of it or of what actually goes wrong.

@soupdiver soupdiver changed the title Task with exec driver fails with "cgroup path must be set" Task using exec driver fails with "cgroup path must be set" Oct 4, 2022
@tgross tgross self-assigned this Oct 4, 2022
tgross commented Oct 4, 2022

Hi @soupdiver! I'm going to be honest here and tell you that running Nomad clients in a container is life on hard mode, to the point where we've resisted shipping a Nomad Docker image to discourage it. We need to put together some documentation on how to do this (and figure it out ourselves in detail, for that matter).

The error you're getting there looks to be because the taskrunner isn't setting the CpusetCgroupPath on the resources that it passes to the driver. In the cgroups v2 case that's calling CgroupPathFor but I can't seem to find a code path that'll return an empty value without an error. That suggests that the cpuset manager isn't being created at all.

Check the client logs for a log line containing the phrase "disable cpuset management" (ref cpuset_manager_v2.go#L62 and cpuset_manager_v2.go#L69). I'm looking at the docs and I suspect that line is meant to say "disabling cpuset management" but the results don't seem to agree either. I might need to circle up with one of my colleagues who worked on this a bunch, but they're currently on vacation so that might need to wait till next week.
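A quick way to search for that line on a systemd-managed client (a sketch; the unit name nomad is an assumption, adjust to your setup):

journalctl -u nomad | grep -i "cpuset management"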

In the meantime, it might help if you can provide the configuration you're using for the LXC container. That might give a hint as to what needs to happen here.

soupdiver commented Oct 4, 2022

What I can find in the logs is

Oct 04 17:40:36 nomad-worker-1 nomad[949]:     2022-10-04T17:40:26.730Z [WARN]  client.cpuset.v2: failed to lookup cpus from parent cgroup; disable cpuset management: error="openat2 /sys/fs/cgroup/nomad.slice/cpuset.cpus.effective: no such file or directory"
Oct 04 17:40:36 nomad-worker-1 nomad[949]:     2022-10-04T17:40:26.792Z [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
Oct 04 17:40:36 nomad-worker-1 nomad[949]:     2022-10-04T17:40:26.794Z [WARN]  client.fingerprint_mgr.cpu: failed to detect set of reservable cores: error="openat2 /sys/fs/cgroup/nomad.slice/cpuset.cpus.effective: no such file or directory"

On the other hand:

ll /sys/fs/cgroup/nomad.slice/cpuset.cpus.effective
-r--r--r-- 1 root root 0 Oct  4 17:53 /sys/fs/cgroup/nomad.slice/cpuset.cpus.effective

I found another issue that claimed this would only happen on the first start on a new system. The issue was fixed/closed, and that was all I could find related to my error message.

In the meantime, it might help if you can provide the configuration you're using for the LXC container. That might give a hint as to what needs to happen here.

Proxmox CT, unprivileged.
Not 100% sure what LXC details Proxmox is configuring.

tgross commented Oct 4, 2022

On the other hand:

Is that command being run from inside or outside the container?

I found another issue that claimed this would only happen on the first start on a new system. The issue was fixed/closed, and that was all I could find related to my error message.

Let's ignore that for now... there was a flurry of similar-sounding issues around cgroups v2 support with a lot of different causes, but your client is telling us where to look.

Proxmox CT, unprivileged.
Not 100% sure what LXC details Proxmox is configuring.

Maybe here? https://pve.proxmox.com/wiki/Linux_Container#pct_configuration

But if you're running unprivileged you're going to hit additional walls really quickly here. Nomad is running tasks and needs a high level of privilege to do so. See https://www.nomadproject.io/docs/install/production/requirements#linux-capabilities (but also #13669 for more discussion about future possibilities)
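For illustration, the relevant Proxmox CT settings might look like the following; unprivileged and features are standard pct options, but treat the exact values as assumptions to verify against the Proxmox docs:

# /etc/pve/lxc/<vmid>.conf (illustrative)
unprivileged: 0       # run the CT privileged
features: nesting=1   # allow nested namespace/cgroup use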

@soupdiver

I gave it another try with a privileged container but get the same error message.
Here are my logs from starting the Nomad agent.

==> Nomad agent started! Log data will stream in below:

    2022-10-05T08:12:26.701Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/data/plugins
    2022-10-05T08:12:26.702Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2022-10-05T08:12:26.702Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2022-10-05T08:12:26.702Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2022-10-05T08:12:26.702Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2022-10-05T08:12:26.702Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2022-10-05T08:12:26.702Z [WARN]  client.cpuset.v2: failed to lookup cpus from parent cgroup; disable cpuset management: error="openat2 /sys/fs/cgroup/nomad.slice/cpuset.cpus.effective: no such file or directory"
    2022-10-05T08:12:26.702Z [INFO]  client: using state directory: state_dir=/opt/nomad/data/client
    2022-10-05T08:12:26.702Z [INFO]  client: using alloc directory: alloc_dir=/opt/nomad/data/alloc
    2022-10-05T08:12:26.702Z [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
    2022-10-05T08:12:26.764Z [WARN]  client.fingerprint_mgr: failed to detect bridge kernel module, bridge network mode disabled:
  error=
  | 3 errors occurred:
  | 	* module bridge not in /proc/modules
  | 	* failed to open /lib/modules/5.15.60-1-pve/modules.builtin: open /lib/modules/5.15.60-1-pve/modules.builtin: no such file or directory
  | 	* failed to open /lib/modules/5.15.60-1-pve/modules.dep: open /lib/modules/5.15.60-1-pve/modules.dep: no such file or directory
  |

    2022-10-05T08:12:26.764Z [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
    2022-10-05T08:12:26.766Z [WARN]  client.fingerprint_mgr.cpu: failed to detect set of reservable cores: error="openat2 /sys/fs/cgroup/nomad.slice/cpuset.cpus.effective: no such file or directory"
    2022-10-05T08:12:36.773Z [INFO]  client.plugin: starting plugin manager: plugin-type=csi
    2022-10-05T08:12:36.773Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
    2022-10-05T08:12:36.773Z [INFO]  client.plugin: starting plugin manager: plugin-type=device
    2022-10-05T08:12:36.804Z [INFO]  client: started client: node_id=ca3bdf13-ed3d-6a33-0124-e32b68d57f03

This time the file is indeed missing:

ll /sys/fs/cgroup/nomad.slice/cpuset.cpus.effective
ls: cannot access '/sys/fs/cgroup/nomad.slice/cpuset.cpus.effective': No such file or directory
ll /sys/fs/cgroup/nomad.slice/
total 0
drwxr-xr-x 2 root root 0 Oct  5 08:04 ./
drwxr-xr-x 7 root root 0 Oct  5 08:04 ../
-r--r--r-- 1 root root 0 Oct  5 08:15 cgroup.controllers
-r--r--r-- 1 root root 0 Oct  5 08:15 cgroup.events
-rw-r--r-- 1 root root 0 Oct  5 08:15 cgroup.freeze
--w------- 1 root root 0 Oct  5 08:15 cgroup.kill
-rw-r--r-- 1 root root 0 Oct  5 08:15 cgroup.max.depth
-rw-r--r-- 1 root root 0 Oct  5 08:15 cgroup.max.descendants
-rw-r--r-- 1 root root 0 Oct  5 08:15 cgroup.procs
-r--r--r-- 1 root root 0 Oct  5 08:15 cgroup.stat
-rw-r--r-- 1 root root 0 Oct  5 08:15 cgroup.subtree_control
-rw-r--r-- 1 root root 0 Oct  5 08:15 cgroup.threads
-rw-r--r-- 1 root root 0 Oct  5 08:15 cgroup.type
-rw-r--r-- 1 root root 0 Oct  5 08:15 cpu.pressure
-r--r--r-- 1 root root 0 Oct  5 08:15 cpu.stat
-rw-r--r-- 1 root root 0 Oct  5 08:15 io.pressure
-r--r--r-- 1 root root 0 Oct  5 08:15 memory.current
-r--r--r-- 1 root root 0 Oct  5 08:15 memory.events
-r--r--r-- 1 root root 0 Oct  5 08:15 memory.events.local
-rw-r--r-- 1 root root 0 Oct  5 08:15 memory.high
-rw-r--r-- 1 root root 0 Oct  5 08:15 memory.low
-rw-r--r-- 1 root root 0 Oct  5 08:15 memory.max
-rw-r--r-- 1 root root 0 Oct  5 08:15 memory.min
-r--r--r-- 1 root root 0 Oct  5 08:15 memory.numa_stat
-rw-r--r-- 1 root root 0 Oct  5 08:15 memory.oom.group
-rw-r--r-- 1 root root 0 Oct  5 08:15 memory.pressure
-r--r--r-- 1 root root 0 Oct  5 08:15 memory.stat
-r--r--r-- 1 root root 0 Oct  5 08:15 memory.swap.current
-r--r--r-- 1 root root 0 Oct  5 08:15 memory.swap.events
-rw-r--r-- 1 root root 0 Oct  5 08:15 memory.swap.high
-rw-r--r-- 1 root root 0 Oct  5 08:15 memory.swap.max
-r--r--r-- 1 root root 0 Oct  5 08:15 pids.current
-r--r--r-- 1 root root 0 Oct  5 08:15 pids.events
-rw-r--r-- 1 root root 0 Oct  5 08:15 pids.max

For completeness, my Proxmox container config:

arch: amd64
cores: 4
hostname: nomad-worker-2
memory: 2048
net0: name=eth0,bridge=vmbr0,hwaddr=02:A0:15:50:73:A1,ip=dhcp,type=veth
ostype: ubuntu
rootfs: b-hdd-disks:subvol-108-disk-0,size=16G
swap: 1024

Googling for the error message, I found the following two issues (and not much else):
#14229
#14494

tgross commented Oct 5, 2022

Googling for the error message I found the following 2 issues (and not much otherwise)

@soupdiver other issues aren't likely to be relevant here because you're running in an unsupported / undocumented way in the first place (you might note from your logs that bridge networking is going to be broken too). If you were to run the agent outside the container and got the same thing, that would be interesting, but not as it is.

This time the file is indeed missing:

The missing file is from the cgroups v2 cpuset controller and represents the set of CPUs allowed to be used by tasks within this group. I'm assuming this ls command is being run inside the container (you didn't say). You'll need to look into whether Proxmox is setting that value such that the child container can't use the cpuset. There may be a configuration value you need to change, but I'm not familiar enough with Proxmox to tell you what that is.
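One way to compare what the host delegates with what the container sees (the host-side cgroup path for a Proxmox CT is an assumption; locate it via /proc/<ct-init-pid>/cgroup if it differs):

# on the Proxmox host (path is an assumption)
cat /sys/fs/cgroup/lxc/<vmid>/cgroup.controllers
# inside the container
cat /sys/fs/cgroup/cgroup.controllers
cat /sys/fs/cgroup/cgroup.subtree_control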

There's probably an argument to be made here that the client fingerprint should fail open and just not let you schedule workloads with cores, which would fix this issue for you (but not the bridge networking, which is just going to be broken unless you bind-mount the appropriate modules into your container). That's sort of what I'd expect to see from the no-op cpuset manager you're getting. I can look into whether there's anything we can do around that, but I'm also going to strongly discourage you from this path if you're not very comfortable with configuring all the details of the Proxmox container.

@soupdiver

other issues aren't likely to be relevant here because you're running in an unsupported / undocumented way in the first place

Yup, I got that. Just wanted to post them for completeness.

If you were to run the agent outside the container and got the same thing, that might be interesting but not as it is.

I can replicate the setup in a VM instead of a container in my homelab and see if that works out. Maybe the VM approach is better than a container in this case.

I'm assuming this ls command is being run inside the container (you didn't say)

You assume right; I ran the ls from inside the container.

You'll need to look into whether Proxmox is setting that value such that the child container can't use the cpuset. There may be a configuration value you need to change, but I'm not familiar enough with Proxmox to tell you what that is.

I will check that in the Proxmox forum.

just not let you schedule workloads with cores

What does "worklaod with cores" mean? That a cpu resource limit is set or what does cores refer to?

but I'm also going to strongly discourage you from this path if you're not very comfortable with configuring all the details of the Proxmox container.

but not the bridge networking, which is just going to be broken unless you bind-in the appropriate modules to your container

What module are you referring to? Is that some cgroup/namespace/??? feature this is missing? (I'm familiar with k8s/lxc and all that jazz but haven't worked much on this low level, so I might miss some terms)

I'm happy to poke around. That's what I have the homelab for :) And maybe this can help make the undocumented use case a little more documented.

Thanks for your help so far, @tgross!

tgross commented Oct 5, 2022

What does "worklaod with cores" mean? That a cpu resource limit is set or what does cores refer to?

Ah, sorry, should've linked it. I meant cores, which means reserving a whole core (or cores) for the workload. That's what this whole cpuset management is actually for. So if we could just not worry about cpuset management you'd be able to start the task, but you'd only be able to specify resources.cpu and not resources.cores, which may be a reasonable workaround if the environment doesn't allow for it.
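In jobspec terms, the distinction looks like this (a minimal sketch; the values are placeholders):

resources {
  cpu   = 1000   # a CPU bandwidth share in MHz; works without cpuset management
  # cores = 1    # reserves whole cores for the task; requires the cpuset controller
}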

What module are you referring to? Is that some cgroup/namespace/??? feature this is missing? (I'm familiar with k8s/lxc and all that jazz but haven't worked much on this low level, so I might miss some terms)

I mean the error message you got here in the client logs:

    2022-10-05T08:12:26.764Z [WARN]  client.fingerprint_mgr: failed to detect bridge kernel module, bridge network mode disabled:
  error=
  | 3 errors occurred:
  | 	* module bridge not in /proc/modules
  | 	* failed to open /lib/modules/5.15.60-1-pve/modules.builtin: open /lib/modules/5.15.60-1-pve/modules.builtin: no such file or directory
  | 	* failed to open /lib/modules/5.15.60-1-pve/modules.dep: open /lib/modules/5.15.60-1-pve/modules.dep: no such file or directory
  |

Nomad is trying to look in your kernel modules to make sure the environment has bridge networking available. This approach is not great, tbh, and we have some open issues around it (#10902, #9344, and #2633), but we'll likely fix it in #6618 (which is on the near-ish-term roadmap). In the meantime, you might be able to work around it by bind-mounting /lib/modules from the host into the container.
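One way to do that bind mount for an LXC-based CT is sketched below; the lxc.mount.entry syntax is standard LXC, but whether your Proxmox version passes raw lxc.* keys through unchanged is an assumption to verify:

# append to /etc/pve/lxc/<vmid>.conf (illustrative)
lxc.mount.entry: /lib/modules lib/modules none bind,ro 0 0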

@soupdiver

but you'd only be able to specify resources.cpu and not resources.cores. Which may be a reasonable workaround if the environment doesn't allow for it.

Ah, gotcha! I mean... yeah, I only specified cpu anyway.

I mean the error message you got here in the client logs:

Ah, yeah, the message says "kernel module"; I missed that. I guess that corresponds to how Proxmox is setting up the container.

@soupdiver

I also tried the same setup inside a QEMU VM instead of a container.
I get the exact same error and behaviour. Is running Nomad inside a VM also an issue, or is something odd with my base system? 🤔

tgross commented Oct 5, 2022

No, running Nomad in a VM should work just fine; Nomad doesn't try to sniff out whether it's in a hypervisor or anything like that, and assumes it has the run of whatever the kernel tells it. But that information eliminates a whole category of problems!

This suggests there's something wrong in the code introduced in #14230 which was intended to fix #14229, or that #14494 applies to your environment as well. What's the base distro you're using?

soupdiver commented Oct 5, 2022

What's the base distro you're using?

For the container it was ubuntu-22.04-minimal, and for the VM I installed from ubuntu-22.04-live-server.

So in both cases a quite minimal installation. Maybe I'm missing some package?

This suggests there's something wrong in the code introduced in #14230 which was intended to fix #14229, or that #14494 applies to your environment as well. What's the base distro you're using?

Yeah, exactly the issues I linked initially. They were the only results that came up in my search, but I could not tell how related they actually are.

But that information eliminates a whole category of problems!

👍

Update:
Digging some more:
On the VM:

cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
cat /sys/fs/cgroup/cgroup.subtree_control
memory pids

OK, fewer controllers are listed here than in the other cases.

In the container:

cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory hugetlb pids rdma misc

Update 2:
When running the Nomad client on bare metal (the Proxmox host itself), it actually works.

Update 3:
OK, at least for the VM I got it working. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/using-cgroups-v2-to-control-distribution-of-cpu-time-for-applications_managing-monitoring-and-updating-the-kernel helped:

# echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control
# echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control

@soupdiver

I tinkered around yesterday evening, but it seems I messed things up a bit during my updates. I might have confused some container/VM terminal sessions 😬

Anyway, everything seems to work as expected now. I have the Nomad agent running using the exec driver. No more errors during Nomad startup or task creation.

What did I do?
On a new container I have the following:

cat /sys/fs/cgroup/cgroup.subtree_control 
memory pids

It seems only the memory and pids controllers are enabled by default(?).

echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control
echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control

This seems to enable the other needed controllers. Restart Nomad and things are fine.
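Note that writes to cgroup.subtree_control don't survive a reboot. A minimal sketch of one way to persist them with a systemd oneshot unit (the unit name and the Before= ordering are assumptions; adapt to your setup):

# /etc/systemd/system/cgroup-controllers.service
[Unit]
Description=Enable cpu and cpuset cgroup v2 controllers before Nomad
Before=nomad.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo "+cpu +cpuset" > /sys/fs/cgroup/cgroup.subtree_control'

[Install]
WantedBy=multi-user.target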
I also applied this "trick": #10902 (comment) to get rid of the bridge kernel module error message.

As far as I can tell everything works just fine.

It "seems" running nomad inside a container isn't that painful at all... if you figure out the fundamentals and not much tweaking or anything seems needed

tgross commented Oct 6, 2022

It seems only the memory and pids controllers are enabled by default(?).

Yeah, the systemd project has some docs around delegation that don't quite seem current (ref https://systemd.io/CGROUP_DELEGATION/):

systemd supports a number of controllers (but not all). Specifically, supported are:

  • on cgroup v1: cpu, cpuacct, blkio, memory, devices, pids
  • on cgroup v2: cpu, io, memory, pids

It is our intention to natively support all cgroup v2 controllers as they are added to the kernel.

I don't know if that means that systemd hasn't caught up yet or what (at least on the version shipping on recent distros), but obviously we've got to live with their behavior given systemd's position in the Linux ecosystem. 😀 I checked on my personal Ubuntu Jammy desktop and for some reason cpuset is already active, but I'm going to admit that I may have messed with that in the past. In any case, that leaves us with a couple of open items:

  • cgroups: unable to initialize cpuset manager on CentOS9 #14494 is actually going to apply to an increasing number of Linux hosts with systemd as time goes on, so we should figure out what we're going to do about it.
  • We need to do some documentation around what needs to be done to run Nomad in less fully-privileged environments and/or containers, what features are impacted, and whether it's meaningful in the context of the Nomad security model to do so. We have a planned "docs days" focus window coming up and that was already on the list of items to cover, so I'll make sure that's on my personal TODO list.
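On the delegation point above, a quick way to check what your distro's systemd actually enabled at the root, and whether the Nomad unit requests delegation (a sketch, assuming a systemd-managed client with a unit named nomad):

cat /sys/fs/cgroup/cgroup.subtree_control
systemctl show -p Delegate nomad.service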

At this point I think the specific problems we've got here are dupes of issues covered elsewhere, so if there are no objections, I'm going to close this issue?

@soupdiver

At this point I think the specific problems we've got here are dupes of issues covered elsewhere, so if there are no objections, I'm going to close this issue?

Yeah, I guess so. Whatever my issue was, it seems I somehow managed to solve it (for now) 😁


github-actions bot commented Feb 7, 2023

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Feb 7, 2023