Task using exec driver fails with "cgroup path must be set" #14797
Comments
Hi @soupdiver! I'm going to be honest here and tell you that running Nomad clients in a container is life on hard mode, to the point where we've resisted shipping a Nomad Docker image to discourage it. We need to put together some documentation on how to do this (and figure it out ourselves in detail, for that matter). The error you're getting there looks to be because the taskrunner isn't setting the Check the client logs for a log line containing the phrase "disable cpuset management" (ref In the meantime, it might help if you can provide the configuration you're using for the LXC container. That might give a hint as to what needs to happen here.
What I can find in the logs is
On the other hand:
I found another issue that claimed this would only happen on a first start on a new system. The issue was fixed/closed and that was all I could find related to my error message.
Proxmox CT, unprivileged.
Is that command being run from inside or outside the container?
Let's ignore that for now... there was a flurry of similar-sounding issues around cgroups v2 support with a lot of different causes, but your client is telling us where to look.
Maybe here? https://pve.proxmox.com/wiki/Linux_Container#pct_configuration But if you're running unprivileged you're going to hit additional walls really quickly here. Nomad is running tasks and needs a high level of privilege to do so. See https://www.nomadproject.io/docs/install/production/requirements#linux-capabilities (but also #13669 for more discussion about future possibilities)
I gave it another try with a privileged container but got the same error message.
This time the file is indeed missing:
For completeness, my Proxmox container config:
Googling for the error message, I found the following 2 issues (and not much else)
@soupdiver other issues aren't likely to be relevant here because you're running in an unsupported / undocumented way in the first place (you might note from your logs that bridge networking is going to be broken too). If you were to run the agent outside the container and got the same thing, that might be interesting but not as it is.
The missing file is from the cgroups v2 cpuset controller and represents the set of CPUs allowed to be used by tasks within this group. I'm assuming this There's probably an argument to be made here that the client fingerprint should fail open and just not let you schedule workloads with
Yup, I got that. Just wanted to post for completeness.
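A rough sketch for checking this by hand. The file name is not preserved in the thread, so this assumes the file in question is `cpuset.cpus.effective`, the cgroups v2 file that lists the CPUs actually granted to a group. The helper names `expand_cpu_list` and `effective_cpus` are made up for illustration and are not part of Nomad.

```shell
# Expand a kernel CPU list such as "0-3,8" into space-separated CPU ids.
# The function body is a subshell, so IFS and variables do not leak out.
expand_cpu_list() (
  out=""
  IFS=','
  for part in $1; do
    case "$part" in
      *-*) start="${part%-*}"; end="${part#*-}" ;;   # a range like 0-3
      *)   start="$part"; end="$part" ;;              # a single CPU id
    esac
    i="$start"
    while [ "$i" -le "$end" ]; do
      out="${out:+$out }$i"
      i=$((i + 1))
    done
  done
  printf '%s\n' "$out"
)

# Print the effective CPUs of a cgroup directory, or fail if the file is absent.
effective_cpus() (
  dir="${1:-/sys/fs/cgroup}"
  file="$dir/cpuset.cpus.effective"   # assumed file name, see note above
  if [ ! -r "$file" ]; then
    echo "missing: $file (is the cpuset controller enabled for this group?)" >&2
    exit 1   # exits only this function's subshell
  fi
  expand_cpu_list "$(tr -d '[:space:]' < "$file")"
)
```

Calling `effective_cpus /sys/fs/cgroup` inside the container should either print the CPU ids or report the missing file, which is roughly the distinction the client log is hinting at.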
I can replicate the setup on a VM and not a container in my homelab and see if that works out. Maybe VM approach is better than container in this case.
You assume right, I did the
I will check that in the Proxmox forum.
What does "workload with cores" mean? That a cpu resource limit is set or what does
What module are you referring to? Is that some cgroup/namespace/??? feature this is missing? (I'm familiar with k8s/lxc and all that jazz but haven't worked much on this low level, so I might miss some terms) I'm happy to poke around. That's what I have the homelab for :) And maybe this can help to make the undocumented usecase a lil more documented. Thanks for your information so far @tgross!
Ah, sorry, should've linked it. I meant
I mean the error message you got here in the client logs:
Nomad is trying to look in your kernel modules to make sure that the environment has
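To see what the kernel reports from inside the container, something like the following can help. This is only a sketch and not Nomad's actual fingerprinting code. The function name `module_loaded` is hypothetical, and `bridge` in the usage line is an assumed module name, since the one from the log message is not preserved here.

```shell
# Succeed if module NAME appears in a /proc/modules-style file.
# The first whitespace-separated field of each line is the module name.
module_loaded() (
  name="$1"
  file="${2:-/proc/modules}"
  awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file"
)

# Usage (module name is an assumption):
#   module_loaded bridge && echo "loaded" || echo "not listed in /proc/modules"
```

Note that a module compiled into the kernel does not show up in `/proc/modules`, so "not listed" is not the same as "not available".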
Ah gotcha! I mean... yea.. I only specified
Ah, yea the message says "kernel module". I missed that. I guess that corresponds to how Proxmox is setting up the container.
I also tried the same setup inside a qemu vm and not a container.
No, running Nomad in a VM should work just fine; Nomad doesn't try to sniff out if it's in a hypervisor or anything like that and assumes it has the run of whatever the kernel tells it. But that information eliminates a whole category of problems! This suggests there's something wrong in the code introduced in #14230 which was intended to fix #14229, or that #14494 applies to your environment as well. What's the base distro you're using?
For the container it was So in both cases quite minimal installation. Maybe I'm missing some package or so?
Yea, exactly the issue that I initially linked. The only few results that came up on my search but I could not tell how related they actually are.
👍
- cat /sys/fs/cgroup/cgroup.controllers
  cpuset cpu io memory hugetlb pids rdma misc
- cat /sys/fs/cgroup/cgroup.subtree_control
  memory pids
- cat /sys/fs/cgroup/cgroup.controllers
  cpuset cpu io memory hugetlb pids rdma misc
- cat /sys/fs/cgroup/cgroup.subtree_control
  cpuset cpu io memory hugetlb pids rdma misc
Update 2:
Update 3:
I tinkered around yesterday evening but it seems I messed things up a bit during my updates. I might have confused some container/VM terminal sessions 😬 Anyway, everything seems to work as expected now. I have the nomad agent running using the What have I done?
It seems only memory and pids controller are enabled by default(?).
Seems to enable the other needed controllers. Restart Nomad and things are fine. As far as I can tell, everything works just fine. It "seems" running Nomad inside a container isn't that painful at all... if you figure out the fundamentals, not much tweaking seems to be needed.
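For reference, a minimal sketch of that check and fix using the standard cgroups v2 interface. The exact command used above is not preserved, so this is not necessarily what was run. The helper names `missing_controllers` and `enable_tokens` are made up for illustration.

```shell
# List controllers present in cgroup.controllers but absent from
# cgroup.subtree_control, i.e. available but not passed on to child groups.
missing_controllers() (
  dir="${1:-/sys/fs/cgroup}"
  available="$(cat "$dir/cgroup.controllers")" || exit 1
  enabled=" $(cat "$dir/cgroup.subtree_control") " || exit 1
  out=""
  for c in $available; do
    case "$enabled" in
      *" $c "*) ;;                    # already enabled for children
      *) out="${out:+$out }$c" ;;     # available but not enabled
    esac
  done
  printf '%s\n' "$out"
)

# Turn "cpuset cpu io" into "+cpuset +cpu +io", the token format that
# cgroup.subtree_control accepts on write.
enable_tokens() (
  out=""
  for c in $1; do
    out="${out:+$out }+$c"
  done
  printf '%s\n' "$out"
)

# Usage (as root; the change does not persist across reboots):
#   enable_tokens "$(missing_controllers /sys/fs/cgroup)" \
#     > /sys/fs/cgroup/cgroup.subtree_control
```

With the values from the first listing above, `missing_controllers` would report `cpuset cpu io hugetlb rdma misc`. The kernel applies a write to `cgroup.subtree_control` all-or-nothing, so if one controller cannot be enabled the whole write fails; enabling only the ones you need (for example `+cpuset +cpu`) may be the safer route.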
Yeah, the systemd project has some docs around delegation that don't quite seem current (ref https://systemd.io/CGROUP_DELEGATION/):
I don't know if that means that systemd hasn't caught up yet or what (at least on the version shipping on recent distros), but obviously we've got to live with their behavior given systemd's position in the Linux ecosystem. 😀 I checked on my personal Ubuntu Jammy desktop and for some reason cpuset is already active, but I'm going to admit that I may have messed with that in the past. In any case, that leaves us with a couple of open items:
At this point I think the specific problems we've got here are dupes of issues covered elsewhere, so if there are no objections, I'm going to close this issue?
Yea, I guess so. Seems whatever was my issue I somehow managed to solve (for now) 😁
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v1.3.5 (1359c2580fed080295840fb888e28f0855e42d50)
Operating system and Environment details
Ubuntu 22.04 inside Proxmox CT
Issue
A task configured with the exec driver fails with the error message: cgroup path must be set
Reproduction steps
Expected Result
Task starts
Actual Result
Task fails
Job file (if appropriate)
I run my worker inside a Proxmox CT which is a lxc container. Not sure if the nesting causes trouble.
My worker reports unique.cgroup.mountpoint | /sys/fs/cgroup and v2
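A quick way to cross-check the reported version from a shell, as a sketch only. It relies on the fact that a cgroups v2 (unified) mount has a `cgroup.controllers` file at its root, while a v1 mount does not. The function name `cgroup_version` is hypothetical and this is not how Nomad itself detects the version.

```shell
# Print "v2" if ROOT looks like a cgroups v2 unified hierarchy, else "v1".
cgroup_version() (
  root="${1:-/sys/fs/cgroup}"
  if [ -f "$root/cgroup.controllers" ]; then
    echo v2
  else
    echo v1
  fi
)

# Usage:
#   cgroup_version /sys/fs/cgroup
```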
I could not find anything via google except the source code with the error:
nomad/drivers/shared/executor/executor_linux.go
Lines 668 to 670 in d3a5591
But I could not really make sense out of it and what goes actually wrong