Few Quick Questions #3
Hey Nicholas,
So this was compiled against the previous kernel version and works OK. I can't promise that will hold for every kernel upgrade.
The job from the snippet above shows up in Slurm like this:
We monitor the nodes with Prometheus + node_exporter, and we have used that to indirectly see whether a node is in the "bad" OOM state. What we've experienced is the same as you: degradation severe enough to bring nodes down to the point of us having to power-cycle them.
This fired every time a Singularity job triggered the issue, and we keep it around to tell us when kp_oom doesn't do its job. So, if you have any method of monitoring your nodes, check the IOPS (a quick spot-check sketch is below); otherwise... nothing comes to mind right now. If something does, I will let you know (Friday night 😄).
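For illustration only (the actual Prometheus alert rule isn't shown above): a quick way to spot-check IOPS is to run iostat on the node itself, and the raw disk counters such an alert would typically be built on can be pulled straight from node_exporter's metrics endpoint. The port and metric names below assume a default node_exporter setup.

```bash
# Not the original alert rule (that was omitted above) -- just a quick spot-check.
# Per-device I/O statistics; the r/s and w/s columns show current read/write IOPS.
iostat -x 1 5

# Raw disk counters a Prometheus/node_exporter alert would usually be built on
# (assumes node_exporter listening on its default port 9100).
curl -s http://localhost:9100/metrics \
  | grep -E 'node_disk_(reads|writes)_completed_total'
```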
So I gave it a try on CentOS 7 with kernel 3.10.0-1160.49.1.el7.x86_64, and the moment kp_oom is triggered the node immediately crashes and reboots. I installed it using the following commands.
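The exact commands aren't reproduced above; purely as a hypothetical sketch, building and loading an out-of-tree kernel module like kp_oom on CentOS 7 generally looks something like this (the repository URL, Makefile target, and module file name are assumptions, not taken from the original comment).

```bash
# Hypothetical sketch -- not the commands from the original comment.
sudo yum install -y gcc make "kernel-devel-$(uname -r)"  # headers matching the running kernel
git clone https://github.com/pja237/kp_oom.git           # assumed repository URL
cd kp_oom
make                                                     # assumed: builds the .ko against the running kernel
sudo insmod kp_oom.ko                                    # assumed module file name
lsmod | grep kp_oom                                      # confirm the module is loaded
```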
I then triggered kp_oom using Singularity and R (creating a really big array to trigger an OOM); a sketch of that kind of test is below.
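A minimal sketch of that kind of test, assuming a Slurm memory limit and an R container image (the image name, memory request, and allocation size are illustrative, not from the original report): allocate far more memory inside the container than the job is allowed, so the cgroup memory limit is exceeded and the OOM path fires.

```bash
# Illustrative only: request 2 GB from Slurm, then allocate ~80 GB of doubles in R
# inside the container so the job blows past its memory limit and triggers an OOM.
srun --mem=2G \
  singularity exec r-base.sif \
  Rscript -e 'x <- numeric(1e10); print(sum(x))'
```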
Here is the log if it provides any clues:
I can reproduce this, and it should be a quick one to fix. The issue was caused by the eventfd part of the code, which has been unnecessary for a while now but was left in for... honestly, no good reason.
Looks like your fix corrected the issue. I still need to do some more testing to make sure it properly kills our R code that was triggering the bug, but this looks promising. Thank you for the quick turnaround!
@nlvw glad to hear it worked. I would be curious to hear whether it helped in the end, or did you manage to find another workaround?
My simple test cases have passed without issue, and now I'm trying to reproduce the problem using the actual user code that was triggering the bug. This has been more difficult, as I believe GPFS (our shared filesystem) has been playing a part in why some processes refuse to be killed by the OOM killer. Once I get an Ansible deployment put together, I'll probably roll this out to our interactive/debug partition for user testing.
So the results are in, and your code is handling the bug great! I did notice it took 4 attempts to kill the process in question (1 before the OOM event and 3 afterwards). That said, it successfully killed the Slurm job and the user processes were cleaned up. I've included a before and after of running the same job that exceeds its memory request. Before:
After:
That's great to hear :) Also, thank you for that original crash log; it actually helped explain our "mysterious" node reboot in the grid partition from last week. We've rolled out this new version with the fix as well. PS: if you eventually find it works for you and roll it out to prod, we'd be more than happy to receive a postcard from your site at this address (attention: IT department) https://www.imp.ac.at/about/contact/ 😉
Hi @pja237, I wanted to follow up on your post from apptainer/singularity#5850 (comment) with a few questions. I figured an issue here would be a more appropriate place to continue.
Thank you for making this public and taking the time to answer questions!