nomad 1.3.0-rc.1 cgroupsv2 /dev/ strangeness #12877
Comments
I failed to mention this is with #12875 applied. Without it, I couldn't get far enough to create a semi-reliable example (I was originally going to report this bug, but in trying to reproduce what I was seeing with my "real" jobs, I hit #12863). Sorry about that, and thanks!

Another thing I didn't mention: to reproduce with the provided job file, it usually breaks pretty quickly for me. However, there has been a time or two where I've had more success stopping and then restarting the job or a random allocation. The count is not important, just a way to tickle the bug out quicker.

FWIW, this smells like some kind of kernel bug to me. I'm curious if anyone else will be able to reproduce.
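For anyone following along, that restart dance can be done with the standard Nomad CLI; the job and allocation names below are placeholders, not taken from the original job file:

```sh
# Stop and re-run the whole job (job/file names are placeholders):
nomad job stop example && nomad job run example.nomad

# Or restart a single allocation picked from `nomad job status example`:
nomad alloc restart <alloc-id>
```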
I was able to reproduce this on Arch Linux 5.17.5-arch1-1 #1 SMP PREEMPT Wed, 27 Apr 2022 20:56:11 +0000 x86_64 GNU/Linux.
I'm not 100% sure my issue is related to this one beyond the error message, but I have run into the same "Operation not permitted" issue with the Cinder CSI Driver container after updating one of our client nodes to 1.3.0. Ever since updating the node, I'm seeing the error messages below when a job using a Cinder CSI volume is scheduled on it:

To clarify, the only thing done between the node working and getting the error above was draining the client, stopping the service, then updating Nomad on the client and restarting it. The servers were already updated to 1.3.0. Since this is a Docker container task I suspect this may be a different issue, but figured I'd reach out here first before opening a new issue. If this is different enough to warrant a new issue, I'm happy to open one. I have left that node running (it still runs non-CSI jobs just fine), so if there is any info I could gather that may help, let me know and I'll round it up.
@RickyGrassmuck that sure smells pretty similar. Can you confirm you are also using cgroups v2?
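One quick way to check is the filesystem type mounted at /sys/fs/cgroup:

```sh
# "cgroup2fs" means the unified cgroups v2 hierarchy; "tmpfs" means v1/hybrid.
stat -fc %T /sys/fs/cgroup/
```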
For sure, I'll check on that when I get in the office in a few hours.

Hmmm.... Interesting, that node is not using cgroups v2, it seems.

Ok, sorry for the noise; I was actually able to stop that particular error from occurring after redeploying my CSI plugin. Having some new problems that warrant their own issue, but they don't seem related to the cgroup issues anymore.
So I had a little time today and played with this a bit on 1.3.2-dev. After a few runs, I was able to tease this out of the logs:

I can get that fairly consistently, though not as frequently as the OP. To me it looks like a symptom of the same problem. So on a whim I tried this patch:

Which breaks tasks consistently with a similar-looking error message, albeit much earlier, with:

Kind of all leads me back to #12875? In random googling it also sounds like when launching nomad with systemd .. This is still easily reproducible for me on a vanilla Arch Linux install running the nomad binary as root from the command line, and on Ubuntu 22.04. I'm happy to try any patches or wild ideas.
@tgross I've updated to Nomad 1.3.1 but am getting this error when running CSI plugin jobs with
Yeah let's open a new issue for that @mr-karan. There's far more moving parts involved and we'll want to get all the CSI plugin logs and whatnot to diagnose that. (Although I'll note that CSI plugins generally aren't intended to be run without isolation, so that they work as
I have noticed that when I updated the machines running the Nomad client to cgroupv2, we started having issues with an application we run via raw_exec. We raw_exec teleport:
And when the machines are configured to run cgroupv2, teleport regularly errors after startup, with this message:
Switching back to cgroupv1 seems to fix the issue. This looks similar to the reported issue, but if you want me to raise a new issue please let me know. We are running Nomad 1.3.3 on Flatcar Linux 3227.2.1.

This is what I am experiencing as well here: #13538 (comment). I'm running 1.3.1.
I have absolutely no clue if any of this is useful, as up to an hour ago I knew nothing about how cgroups work etc., but this problem has hit me like a truck from the left. I was nearly finished setting up the cluster I have been working on for the last few weeks .. only to notice random outages when reacting to things like AWS ASG lifecycle hooks .. which run, as you guessed it, with the

So, the important bits for context: I have a job that runs with the

Initially, the jobs always start fine. They can do everything as you would expect. Just after a few minutes / hours (haven't measured this yet), they lose access to

So diving into cgroups, you end up in BPF world, as that decides whether the file can be accessed or not. Now this is where it gets weird. When just starting a job, and checking out the BPF, it looks like this (dumped with
With a lot of
Now I can fully understand that all device access fails with a BPF like that. It literally says that (

Again, no clue if this is useful, or whether it is just "how it is supposed to work". At least I now know why it returns "not permitted". Now to find out why that happens :)

Edit: Some of these jobs have been running for 20 hours, others for 3 hours, and Nomad mentions no activity on the jobs. Yet they all have their BPF changed at the exact same time.
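For reference, a sketch of how such a dump can be taken with bpftool, assuming the default Nomad 1.3 cgroup v2 layout under /sys/fs/cgroup/nomad.slice (the scope name and program ID below are placeholders):

```sh
# List the eBPF programs attached to the task's cgroup; the device filter is
# the entry with the "device" attach type.
sudo bpftool cgroup show /sys/fs/cgroup/nomad.slice/<alloc-id>.<task>.scope

# Dump the filter's instructions using the program ID printed above.
sudo bpftool prog dump xlated id <prog-id>
```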
Thanks for the detailed update @TrueBrain - do you mind adding the exact Nomad and Linux kernel versions?
And another discovery: whenever a job starts with the
So to reproduce: Start a job with driver
It tells me ID

I run
Notice: 699 is 512B in xlated. I stop a job with the
I am not sure what is happening, but this doesn't feel right. But again, just to repeat myself too many times: I know nothing about what I am actually looking at; just tracing symptoms here :) But it seems that whenever a job with the

Edit: this also holds true if you restart another job that uses the
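A sketch of that before/after check with bpftool and the Nomad CLI, under the same layout assumption as above (paths, job names, and IDs are placeholders; the other job is assumed to use the docker driver, as the follow-up comments suggest):

```sh
# Note the device program currently attached to a running raw_exec task's cgroup:
sudo bpftool cgroup show /sys/fs/cgroup/nomad.slice/<raw-exec-alloc>.<task>.scope

# Stop (or re-place) an unrelated docker-driver job:
nomad job stop example-docker

# Check again: a different program ID is now attached, and its size differs
# (compare the "xlated" size reported here, e.g. 512B):
sudo bpftool cgroup show /sys/fs/cgroup/nomad.slice/<raw-exec-alloc>.<task>.scope
sudo bpftool prog show id <new-prog-id>
```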
Oh wow, a reproduction is a huge step forward @TrueBrain, thanks again. No doubt docker is managing the device controller interface for the shared parent cgroup even though we don't want it to. The device controller is (currently) beyond my skill set, but perhaps a dumb solution would be to just move the docker cgroups somewhere else. https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#device-controller
Haha, tnx @shoenig; the question here is only: can anyone else reproduce it :D

I found even simpler steps: Start job A with

Job B can no longer access any files in
Okay, to finalize, here are some very simple steps to reproduce (sorry, I am always really happy when you can condense an issue to a few simple steps :D):

Now that should make debugging significantly easier, as it breaks on a developer machine too :D As for the solution .. I wouldn't even know where to start looking :(

nomad.hcl
PS: I did notice it only breaks when a new allocation is made, not when a process is restarted within its own allocation.
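A rough sketch of this kind of two-job reproduction, pieced together from the surrounding comments (job names and files are placeholders; job A is assumed to use the docker driver, job B a raw_exec task that touches /dev):

```sh
# 1. Run both jobs: A (docker driver) and B (raw_exec, e.g. looping
#    `dd if=/dev/zero of=/dev/null count=1`).
nomad job run job-a-docker.nomad
nomad job run job-b-rawexec.nomad

# 2. Force a new allocation of job A; per the PS above, a restart inside the
#    same allocation is not enough.
nomad job stop job-a-docker
nomad job run job-a-docker.nomad

# 3. Watch job B's task logs: access to /dev now fails with
#    "Operation not permitted".
nomad alloc logs -f <job-b-alloc-id>
```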
Some poking in the code later (I had never touched a Go project .. that was interesting; but I have to say, getting up and running is very smooth), the issue seems to be caused by the cpuset manager. From what I can tell, every time an allocation changes, the cgroup is written to for all allocations:

nomad/client/lib/cgutil/cpuset_manager_v2.go, lines 225 to 233 (at ee8cf15)

And very specifically:

nomad/client/lib/cgutil/cpuset_manager_v2.go, lines 332 to 337 (at ee8cf15)
Which ends up, I think, here:

This seems to suggest it removes all device groups when that function is called without. But it also suggests it removes all memory restrictions etc., and I am pretty sure that is not true; in other words, I am pretty sure I am not actually following what happens, but I just hope that if I talk long enough someone goes: owh, shit, this is broken, and fixes the issue :)

Sadly, still no clue how to work around this, as that is kind of the only thing I am looking for :D

Edit: ah, okay, CPU etc. aren't changed if all values are zero. Seems that isn't the case for devices .. delving deeper into the rabbit hole :)

And finally found the answer here:

I will make a Pull Request in a bit.
Right, I created a Pull Request with a change that fixes it, for me at least. So there is that. Sadly, I haven't found any way to work around this problem, short of disabling
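Elsewhere in the thread the workaround that reportedly helps is switching the host back to cgroups v1. A sketch of one common way to do that on Ubuntu 22.04 (an assumed mechanism, not quoted from this thread; adapt to your boot loader):

```sh
# Add systemd.unified_cgroup_hierarchy=0 to the kernel command line, e.g. in
# /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="... systemd.unified_cgroup_hierarchy=0"
sudo update-grub
sudo reboot

# Afterwards /sys/fs/cgroup should report "tmpfs" (v1/hybrid) instead of "cgroup2fs":
stat -fc %T /sys/fs/cgroup/
```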
Nomad version
Nomad v1.3.0-rc.1 (31b0a18)
Operating system and Environment details
Ubuntu 22.04 Jammy Jellyfish 5.15.0-27-generic #28-Ubuntu SMP Thu Apr 14 04:55:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Issue
Something funky happens when using cgroups v2 where (at the very least) the ability to read /dev/zero and write to /dev/null is lost. Using a raw_exec job and
dd if=/dev/zero of=/dev/null count=1
I get

dd: failed to open '/dev/zero': Operation not permitted

fairly randomly but consistently. The original jobs are actually python/java, not that it seems to matter. If I disable cgroupsv2 with the following, the problem disappears.
Output from a failed run. I'll note it seems to always work on the "first" run; it only seems to break on restarts under the same allocation ID.
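As an aside, a sketch of the sort of probe loop one could run inside a raw_exec task to spot the moment /dev access is lost (illustrative only; the reporter's actual jobs are python/java, and the loop and interval here are made up):

```sh
#!/usr/bin/env sh
# Keep exercising /dev/zero and /dev/null; once the allocation loses /dev
# access, the task log shows the same "Operation not permitted" error.
while true; do
  if ! dd if=/dev/zero of=/dev/null count=1; then
    echo "lost /dev access at $(date)"
  fi
  sleep 5
done
```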
Job file (if appropriate)