Few Quick Questions #3

Closed
nlvw opened this issue Feb 4, 2022 · 8 comments

nlvw commented Feb 4, 2022

Hi @pja237, I wanted to follow up on your post from apptainer/singularity#5850 (comment) with a few questions. I figured an issue here would be a more appropriate place to continue.

  1. Does this need to be rebuilt and reinstalled after every kernel upgrade?
  2. Which branch do you recommend installing from? (I saw there were quite a few.)
  3. When this initiates killing a job, how is it reported in Slurm? Does it show up as failed due to OOM, or does it just show as canceled?
  4. Do you have this deployed alongside Open Science Grid (OSG)? If so, have there been any issues?
  5. Do you have any recommendations for monitoring when this has failed to kill a job? In our case the issue causes massive system performance degradation until eventually the system completely hangs. Our monitoring can already determine when the system hangs, but we haven't figured out how to detect this earlier, before it gets to that point. I haven't seen a full crash and reboot yet.

Thank you for making this public and taking the time to answer questions!

pja237 commented Feb 4, 2022

Hey Nicholas,
No problem, it's perfectly fine here.
On to the answers...

  1. We have done a few kernel upgrades without recompiling and it seems to keep working. We have also recompiled, but I'm not 100% sure it was really necessary in that case.
    Keep in mind that it can STOP working after a non-kernel yum upgrade: the machine is fine, but the probe doesn't fire in OOM situations. rmmod/modprobe resolves that.
    I've seen kexec being used in the logs of the upgrade, so perhaps that cleared up the kprobes. No harm was done, the thing just didn't work as it should.
    Right now we're doing a kernel upgrade from 3.10.0-1160.45.1.el7.x86_64 -> 3.10.0-1160.53.1.el7.x86_64 (due to the polkit vuln.); we didn't recompile, and here is how it looks on one of the upgraded nodes:
# modinfo kp_oom
filename:       /lib/modules/3.10.0-1160.53.1.el7.x86_64/extra/kp_oom.ko
version:        1.0
license:        GPL
author:         pj
description:    kprobes kernel module
retpoline:      Y
rhelversion:    7.9
srcversion:     DB2C1637BE8D7C381917349
depends:        
vermagic:       3.10.0-1160.45.1.el7.x86_64 SMP mod_unload modversions 

So this module was compiled against the previous kernel version and works OK. I can't promise that will be the case for every upgrade.
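
A rough post-upgrade sanity check could look something like this (just a sketch of what I'd check by hand, adapt to your setup): compare the module's vermagic against the running kernel, reload the probe, and confirm it registered.

# quick post-upgrade check of kp_oom (a sketch, not a guarantee)
modinfo -F vermagic kp_oom       # kernel version the module was built against
uname -r                         # kernel version currently running
rmmod kp_oom && modprobe kp_oom  # re-register the kprobe after an upgrade
dmesg | grep -i kp_oom | tail    # look for the module's init lines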

  2. Run the HEAD of master and ignore all the branches; those were my experiments with various attack vectors on the problem, with a lot of dead ends there, so just ignore them.

  3. As an OUT_OF_MEMORY job, because right before we SIGKILL the slurmstepd child, we send an OOM event count to the
    slurmstepd of the job (using the eventfd mechanism Slurm uses to monitor OOMs received by its jobs).
    Also, right after we send the SIGKILL, the regular cgroup OOM will happen, so you will see both in dmesg: the KP_OOM line and the whole OOM dump from the kernel.

[Tue Feb  1 14:35:43 2022] KP_OOM: Call send_sig(SIGKILL) on pid=2444 comm=starter-suid parent.comm=slurmstepd uid=12797
, current.pid=19095 current.comm=python
[Tue Feb  1 14:35:43 2022] python invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
[Tue Feb  1 14:35:43 2022] python cpuset=step_0 mems_allowed=0-1
[Tue Feb  1 14:35:43 2022] CPU: 7 PID: 18950 Comm: python Kdump: loaded Tainted: G        W  OE  ------------ T 3.10.0-1160.53.1.el7.x86_64 #1
[Tue Feb  1 14:35:43 2022] Hardware name: Red Hat OpenStack Compute, BIOS 1.11.0-2.el7 04/01/2014
[Tue Feb  1 14:35:43 2022] Call Trace:
[Tue Feb  1 14:35:43 2022]  [<ffffffff9b383579>] dump_stack+0x19/0x1b
[Tue Feb  1 14:35:43 2022]  [<ffffffff9b37e618>] dump_header+0x90/0x229
[Tue Feb  1 14:35:43 2022]  [<ffffffff9af0f1db>] ? cred_has_capability+0x6b/0x120
[Tue Feb  1 14:35:43 2022]  [<ffffffff9acdae02>] ? default_wake_function+0x12/0x20
[Tue Feb  1 14:35:43 2022]  [<ffffffff9adc1fe6>] ? find_lock_task_mm+0x56/0xc0
[Tue Feb  1 14:35:43 2022]  [<ffffffff9adc254d>] oom_kill_process+0x2cd/0x490
[Tue Feb  1 14:35:43 2022]  [<ffffffff9ae416cc>] mem_cgroup_oom_synchronize+0x55c/0x590
[Tue Feb  1 14:35:43 2022]  [<ffffffff9ae40b30>] ? mem_cgroup_charge_common+0xc0/0xc0
[Tue Feb  1 14:35:43 2022]  [<ffffffff9adc2e34>] pagefault_out_of_memory+0x14/0x90
[Tue Feb  1 14:35:43 2022]  [<ffffffff9b37cb55>] mm_fault_error+0x6a/0x157
[Tue Feb  1 14:35:43 2022]  [<ffffffff9b3908d1>] __do_page_fault+0x491/0x500
[Tue Feb  1 14:35:43 2022]  [<ffffffff9b390a26>] trace_do_page_fault+0x56/0x150
[Tue Feb  1 14:35:43 2022]  [<ffffffff9b38ffa2>] do_async_page_fault+0x22/0xf0
[Tue Feb  1 14:35:43 2022]  [<ffffffff9b38c7a8>] async_page_fault+0x28/0x30
[Tue Feb  1 14:35:43 2022] Task in /slurm/uid_12797/job_34106540/step_0/task_0 killed as a result of limit of /slurm/uid_12797/job_34106540
[Tue Feb  1 14:35:43 2022] memory: usage 125829088kB, limit 125829120kB, failcnt 284400
[Tue Feb  1 14:35:43 2022] memory+swap: usage 125829028kB, limit 9007199254740988kB, failcnt 0
[Tue Feb  1 14:35:43 2022] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[Tue Feb  1 14:35:43 2022] Memory cgroup stats for /slurm/uid_12797/job_34106540: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Tue Feb  1 14:35:43 2022] Memory cgroup stats for /slurm/uid_12797/job_34106540/step_extern: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Tue Feb  1 14:35:43 2022] Memory cgroup stats for /slurm/uid_12797/job_34106540/step_extern/task_0: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Tue Feb  1 14:35:44 2022] Memory cgroup stats for /slurm/uid_12797/job_34106540/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Tue Feb  1 14:35:44 2022] Memory cgroup stats for /slurm/uid_12797/job_34106540/step_batch/task_0: cache:0KB rss:3980KB rss_huge:2048KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:3980KB inactive_file:0KB active_file:0KB unevictable:0KB
[Tue Feb  1 14:35:44 2022] Memory cgroup stats for /slurm/uid_12797/job_34106540/step_0: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Tue Feb  1 14:35:44 2022] Memory cgroup stats for /slurm/uid_12797/job_34106540/step_0/task_0: cache:19840KB rss:124373740KB rss_huge:0KB mapped_file:992KB swap:0KB inactive_anon:3264KB active_anon:124439172KB inactive_file:2576KB active_file:13932KB unevictable:0KB
[Tue Feb  1 14:35:44 2022] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Tue Feb  1 14:35:44 2022] [ 2359]     0  2359    27014       88       8        0             0 sleep
[Tue Feb  1 14:35:44 2022] [ 2424] 12797  2424    28324      383      12        0             0 slurm_script
[Tue Feb  1 14:35:44 2022] [ 2427] 12797  2427    82513     1362      40        0             0 srun
[Tue Feb  1 14:35:44 2022] [ 2429] 12797  2429    11973      188      27        0             0 srun
[Tue Feb  1 14:35:44 2022] [18950] 12797 18950  2065280  1631878    3281        0             0 python
[Tue Feb  1 14:35:44 2022] Memory cgroup out of memory: Kill process 19037 (python) score 157 or sacrifice child

The job from the snippet above shows up in Slurm like this:

[root@clip-c2-88 ~]# sacct -j 34106540 -o 'jobid,jobname,state,exitcode'
       JobID    JobName      State ExitCode 
------------ ---------- ---------- -------- 
34106540     jupyter_c+ OUT_OF_ME+    0:125 
34106540.ba+      batch OUT_OF_ME+    0:125 
34106540.ex+     extern OUT_OF_ME+    0:125 
34106540.0   singulari+ OUT_OF_ME+    0:125 
  4. I'm not familiar with the OSG software stack and how it differs from the WLCG one, but we're running their software stack and haven't had issues with it so far. A job provokes an OOM situation... the job dies. What happens below that in Slurm and on the nodes themselves shouldn't be relevant to the grid components.

  5. Tricky one 🤔 we haven't really done that in a precise way; we've sort of done it indirectly 😄

We monitor the nodes with Prometheus + node_exporter, and we have used that to indirectly see whether a node is in the "bad" OOM state. What we've experienced is the same as you: severe degradation, bringing nodes down to the point of us power-cycling them.
This degradation shows up in the metrics as a spike in IOPS towards the local disk (probable cause: page thrashing).
(Note: apart from a minimal OS, all of our software stack + data for the nodes comes over the network (NFS), so our local-drive IOPS are normally close to zero.)
So we use a Prometheus alerting expression to fire an alarm if a node shows increased IOPS (more than 500):

expr: irate(node_disk_reads_completed_total{job="os-sd-inst-prod"}[2m]) > 500

This fired every time a Singularity job triggered the issue, and we keep it around to tell us when kp_oom doesn't do its job.

So, if you have any method of monitoring your nodes, check the IOPS; otherwise, nothing comes to mind right now. If it does, I'll let you know (Friday night 😄).
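
For a quick manual spot-check without the Prometheus side, something like this shows the same counter directly (a sketch, assuming node_exporter on its default port 9100 and the sysstat tools installed on the node):

# raw counter that the alert expression above is built on
curl -s http://localhost:9100/metrics | grep '^node_disk_reads_completed_total'
# or watch local-disk read IOPS live; persistently high values on a mostly-NFS node are suspicious
iostat -dx 2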

nlvw commented Feb 4, 2022

So I gave it a try on CentOS 7 with kernel 3.10.0-1160.49.1.el7.x86_64, and the moment kp_oom is triggered the node immediately crashes and reboots. I installed it using the following commands:

git clone https://github.com/pja237/kp_oom.git /tmp/kp_oom
pushd /tmp/kp_oom
make -C "/lib/modules/$(uname -r)/build" M="$(pwd)"
make -C "/lib/modules/$(uname -r)/build" M="$(pwd)" modules_install
popd
rm -rf /tmp/kp_oom
modprobe kp_oom
lsmod | grep kp_oom

I then triggered kp_oom using Singularity and R (creating a really big array to force the OOM):

srun -n 1 -c 1 --mem-per-cpu=1g -t 06:00:00 -p interactive --reservation=root_10 --pty /bin/bash
singularity shell singularity/R_4.1.2-DADA2.sif
R
dat <- do.call(data.frame, replicate(80000, rep(FALSE, 80000), simplify=FALSE))

Here is the log if it provides any clues:

[ 2116.889955] BUG: unable to handle kernel paging request at fffffffffffffff2
[ 2116.896941] IP: [<ffffffffb638bb12>] _raw_spin_lock_irqsave+0x22/0x40
[ 2116.903392] PGD 1ec14067 PUD 1ec16067 PMD 0
[ 2116.907721] Oops: 0002 [#1] SMP
[ 2116.910982] Modules linked in: squashfs loop overlay(T) kp_oom(OE) mmfs26(OE) mmfslinux(OE) tracedev(OE) cuse fuse rdma_ucm(OE) nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 rdma_cm(OE) xt_conntrack iw_cm(OE) nf_conntrack iptable_filter ib_ipoib(OE) ib_cm(OE) ib_umad(OE) dell_smbios iTCO_wdt iTCO_vendor_support dell_wmi_descriptor dcdbas vfat fat skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg i2c_i801 mei_me lpc_ich mei wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad binfmt_misc auth_rpcgss sunrpc ip_tables xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) sd_mod crc_t10dif crct10dif_generic ib_core(OE) mgag200 i2c_algo_bit drm_kms_helper mlx5_core(OE) syscopyarea sysfillrect sysimgblt fb_sys_fops mlxfw(OE) ttm psample ahci auxiliary(OE) crct10dif_pclmul crct10dif_common tg3 libahci drm mlx_compat(OE) bnxt_en crc32c_intel libata megaraid_sas ptp devlink drm_panel_orientation_quirks pps_core nfit libnvdimm dm_mirror dm_region_hash dm_log dm_mod xpmem(OE) [last unloaded: kp_oom]
[ 2117.015804] CPU: 2 PID: 12684 Comm: starter-suid Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1160.49.1.el7.x86_64 #1
[ 2117.027761] Hardware name: Dell Inc. PowerEdge R640/0H28RR, BIOS 2.12.2 07/09/2021
[ 2117.035309] task: ffff96f77a112100 ti: ffff96c66de28000 task.ti: ffff96c66de28000
[ 2117.042771] RIP: 0010:[<ffffffffb638bb12>]  [<ffffffffb638bb12>] _raw_spin_lock_irqsave+0x22/0x40
[ 2117.051644] RSP: 0000:ffff96c66de2bc80  EFLAGS: 00010046
[ 2117.056943] RAX: 0000000000000000 RBX: 0000000000000046 RCX: 0000000000000000
[ 2117.064059] RDX: 0000000000000001 RSI: 0000000000000001 RDI: fffffffffffffff2
[ 2117.071175] RBP: ffff96c66de2bc88 R08: 000000000000000a R09: 0000000000000000
[ 2117.078291] R10: 0000000000000ac4 R11: ffff96c66de2b916 R12: fffffffffffffff2
[ 2117.085406] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
[ 2117.092522] FS:  00002b4b8709d700(0000) GS:ffff96c77ee40000(0000) knlGS:0000000000000000
[ 2117.100591] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2117.106321] CR2: fffffffffffffff2 CR3: 0000002e6e912000 CR4: 00000000007607e0
[ 2117.113439] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2117.120555] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2117.127671] PKRU: 55555554
[ 2117.130377] Call Trace:
[ 2117.132829]  [<ffffffffb5ea0bd2>] eventfd_signal+0x22/0x70
[ 2117.138298]  [<ffffffffc0461558>] kp_pre+0x548/0x7a0 [kp_oom]
[ 2117.144032]  [<ffffffffb5e41171>] ? mem_cgroup_oom_synchronize+0x1/0x590
[ 2117.150713]  [<ffffffffb638feda>] kprobe_ftrace_handler+0xba/0x120
[ 2117.156877]  [<ffffffffb5e41175>] ? mem_cgroup_oom_synchronize+0x5/0x590
[ 2117.163563]  [<ffffffffb5e41170>] ? mem_cgroup_select_victim_node+0x240/0x240
[ 2117.170680]  [<ffffffffb5dc2e34>] ? pagefault_out_of_memory+0x14/0x90
[ 2117.177105]  [<ffffffffb5d62164>] ftrace_ops_list_func+0xf4/0x120
[ 2117.183183]  [<ffffffffb6399bd4>] ftrace_regs_call+0x5/0x81
[ 2117.188741]  [<ffffffffb5e41171>] ? mem_cgroup_oom_synchronize+0x1/0x590
[ 2117.195425]  [<ffffffffb5e41175>] ? mem_cgroup_oom_synchronize+0x5/0x590
[ 2117.202108]  [<ffffffffb5dc2e34>] ? pagefault_out_of_memory+0x14/0x90
[ 2117.208534]  [<ffffffffb637cb15>] mm_fault_error+0x6a/0x157
[ 2117.214091]  [<ffffffffb63908d1>] __do_page_fault+0x491/0x500
[ 2117.219825]  [<ffffffffb6390975>] do_page_fault+0x35/0x90
[ 2117.225211]  [<ffffffffb638c778>] page_fault+0x28/0x30
[ 2117.230337] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 53 9c 58 0f 1f 44 00 00 48 89 c3 fa 66 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 06 48 89 d8 5b 5d c3 89 c6 e8 c1 21 ff ff
[ 2117.250594] RIP  [<ffffffffb638bb12>] _raw_spin_lock_irqsave+0x22/0x40
[ 2117.257123]  RSP <ffff96c66de2bc80>
[ 2117.260607] CR2: fffffffffffffff2
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.10.0-1160.49.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Nov 30 15:51:32 UTC 2021

pja237 commented Feb 5, 2022

I can reproduce this, and it should be a quick one to fix. The issue was caused by the eventfd part of the code, which has been unnecessary for a while but was left in for... honestly, no good reason.
I've cleaned it up and tested it on a node here; it works the same as before. Could you recompile from my last commit in main and try?
(Note: in certain OOM situations it triggers immediately, and in others it takes up to half a minute or a minute before the kernel triggers the OOM and this kicks in.)
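
Retesting should just be a matter of repeating your steps against a fresh checkout (a sketch, using the same commands as in your comment above):

# re-clone, rebuild and modules_install exactly as before, then swap the module in place
depmod -a                        # refresh module info after modules_install (may be redundant)
rmmod kp_oom 2>/dev/null         # unload the old build if it is still loaded
modprobe kp_oom                  # load the rebuilt module
dmesg | grep -i kp_oom | tail    # the init lines should show up again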

nlvw commented Feb 7, 2022

It looks like your fix corrected the issue. I still need to do some more testing to make sure it properly kills our R code that was triggering the bug, but this looks promising. Thank you for the quick turnaround!

[4130192.911883] kp_oom init kp_oom-1.1
[4130192.915453] kp_oom init of kprobe
[4130192.919888] kp_oom found symbol mem_cgroup_oom_synchronize at: ffffffffb6241170
[4142297.301180] KP_OOM: Call send_sig(SIGKILL) on pid=147396 comm=bash parent.comm=slurmstepd uid=572850
, current.pid=147485 current.comm=starter-suid
[4142297.314456] starter-suid invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
[4142297.322553] starter-suid cpuset=step_0 mems_allowed=0-1
[4142297.327947] CPU: 18 PID: 147485 Comm: starter-suid Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1160.49.1.el7.x86_64 #1
[4142297.340252] Hardware name: Dell Inc. PowerEdge R640/0H28RR, BIOS 2.12.2 07/09/2021
[4142297.347972] Call Trace:
[4142297.350597]  [<ffffffffb6783539>] dump_stack+0x19/0x1b
[4142297.355902]  [<ffffffffb677e5d8>] dump_header+0x90/0x229
[4142297.361378]  [<ffffffffb60dae02>] ? default_wake_function+0x12/0x20
[4142297.367811]  [<ffffffffb62646b3>] ? pollwake+0x73/0x90
[4142297.373122]  [<ffffffffb61c1fe6>] ? find_lock_task_mm+0x56/0xc0
[4142297.379207]  [<ffffffffb623d2d8>] ? try_get_mem_cgroup_from_mm+0x28/0x60
[4142297.386070]  [<ffffffffb61c254d>] oom_kill_process+0x2cd/0x490
[4142297.392063]  [<ffffffffb62416cc>] mem_cgroup_oom_synchronize+0x55c/0x590
[4142297.398918]  [<ffffffffb6240b30>] ? mem_cgroup_charge_common+0xc0/0xc0
[4142297.405605]  [<ffffffffb61c2e34>] pagefault_out_of_memory+0x14/0x90
[4142297.412037]  [<ffffffffb677cb15>] mm_fault_error+0x6a/0x157
[4142297.417773]  [<ffffffffb67908d1>] __do_page_fault+0x491/0x500
[4142297.423683]  [<ffffffffb6790975>] do_page_fault+0x35/0x90
[4142297.429243]  [<ffffffffb678c778>] page_fault+0x28/0x30
[4142297.434543] Task in /slurm/uid_572850/job_1786694/step_0/task_0 killed as a result of limit of /slurm/uid_572850/job_1786694/step_0
[4142297.446522] memory: usage 2097152kB, limit 2097152kB, failcnt 1117391
[4142297.453123] memory+swap: usage 2097152kB, limit 2097152kB, failcnt 0
[4142297.459632] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[4142297.465797] Memory cgroup stats for /slurm/uid_572850/job_1786694/step_0: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[4142297.485723] Memory cgroup stats for /slurm/uid_572850/job_1786694/step_0/task_0: cache:68KB rss:1707036KB rss_huge:10240KB mapped_file:4KB swap:0KB inactive_anon:12KB active_anon:1707804KB inactive_file:28KB active_file:24KB unevictable:0KB
[4142297.530176] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[4142297.538321] [147457] 572850 147457   272455     3741      46        0             0 starter-suid
[4142297.547257] [147484] 572850 147484     3954      330      13        0             0 bash
[4142297.555500] Memory cgroup out of memory: Kill process 147552 (R) score 993 or sacrifice child
[4142299.472439] 99-clean.sh (147646): drop_caches: 3

nlvw closed this as completed Feb 7, 2022

pja237 commented Feb 9, 2022

@nlvw glad to hear it worked.

I'd be curious to hear whether it helped in the end, or did you manage to find another workaround?

nlvw commented Feb 9, 2022

My simple test cases have passed without issue, and now I'm trying to reproduce the problem using the actual user code that was triggering the bug. This has been more difficult, as I believe GPFS (our shared filesystem) has been playing a part in why some processes refuse to be killed by the OOM killer.

Once I get an Ansible deployment put together I'll probably be rolling this out to our interactive/debug partition for user testing.

nlvw commented Feb 9, 2022

So the results are in, and your code is handling the bug great! I did notice it took 4 attempts to kill the process in question (1 before the OOM event and 3 afterwards). That being said, it successfully killed the Slurm job and the user processes were cleaned up. I've included a before and after of running the same job that exceeds its memory request.

Before:

[4304745.812279] starter-suid invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
[4304745.820191] starter-suid cpuset=step_batch mems_allowed=0-1
[4304745.825936] CPU: 50 PID: 167240 Comm: starter-suid Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1160.49.1.el7.x86_64 #1
[4304745.838252] Hardware name: Dell Inc. PowerEdge R640/0H28RR, BIOS 2.12.2 07/09/2021
[4304745.845980] Call Trace:
[4304745.848610]  [<ffffffffb6783539>] dump_stack+0x19/0x1b
[4304745.853920]  [<ffffffffb677e5d8>] dump_header+0x90/0x229
[4304745.859403]  [<ffffffffb60dae02>] ? default_wake_function+0x12/0x20
[4304745.865835]  [<ffffffffb62646b3>] ? pollwake+0x73/0x90
[4304745.871145]  [<ffffffffb61c1fe6>] ? find_lock_task_mm+0x56/0xc0
[4304745.877230]  [<ffffffffb623d2d8>] ? try_get_mem_cgroup_from_mm+0x28/0x60
[4304745.884094]  [<ffffffffb61c254d>] oom_kill_process+0x2cd/0x490
[4304745.890095]  [<ffffffffb62416cc>] mem_cgroup_oom_synchronize+0x55c/0x590
[4304745.896959]  [<ffffffffb6240b30>] ? mem_cgroup_charge_common+0xc0/0xc0
[4304745.903653]  [<ffffffffb61c2e34>] pagefault_out_of_memory+0x14/0x90
[4304745.910087]  [<ffffffffb677cb15>] mm_fault_error+0x6a/0x157
[4304745.915829]  [<ffffffffb67908d1>] __do_page_fault+0x491/0x500
[4304745.921739]  [<ffffffffb6790975>] do_page_fault+0x35/0x90
[4304745.927300]  [<ffffffffb678c778>] page_fault+0x28/0x30
[4304745.932612] Task in /slurm/uid_572850/job_1787216/step_batch/task_0 killed as a result of limit of /slurm/uid_572850/job_1787216/step_batch
[4304745.945291] memory: usage 1048576kB, limit 1048576kB, failcnt 162187
[4304745.951799] memory+swap: usage 1048576kB, limit 1048576kB, failcnt 0
[4304745.958311] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[4304745.964478] Memory cgroup stats for /slurm/uid_572850/job_1787216/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[4304745.984203] Memory cgroup stats for /slurm/uid_572850/job_1787216/step_batch/task_0: cache:160KB rss:1048408KB rss_huge:12288KB mapped_file:0KB swap:0KB inactive_anon:4KB active_anon:1048420KB inactive_file:120KB active_file:20KB unevictable:0KB
[4304746.006365] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[4304746.014483] [167056] 572850 167056    28358       86      11        0             0 slurm_script
[4304746.023421] [167089] 572850 167089    28359       86      11        0             0 bash
[4304746.031669] [167219] 572850 167219   272521     3252      45        0             0 starter-suid
[4304746.040600] [167238] 572850 167238    55248     1404      50        0             0 rserver
[4304746.049099] [167325] 572850 167325   431563   203335     574        0             0 rsession
[4304746.057689] [167452] 572850 167452   435471   203349     549        0             0 rsession
[4304746.066285] Memory cgroup out of memory: Kill process 167430 (rsession) score 777 or sacrifice child
[4304746.075572] Killed process 167452 (rsession), UID 572850, total-vm:1741884kB, anon-rss:813392kB, file-rss:4kB, shmem-rss:0kB
[4304858.440912] INFO: task rsession:167325 blocked for more than 120 seconds.
[4304858.447855] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4304858.455842] rsession        D ffffa09927fcf080     0 167325 167295 0x00000080
[4304858.463183] Call Trace:
[4304858.465807]  [<ffffffffb6789179>] schedule+0x29/0x70
[4304858.470933]  [<ffffffffc0701725>] squashfs_cache_get+0x105/0x3c0 [squashfs]
[4304858.478049]  [<ffffffffc0701ec8>] ? squashfs_read_metadata+0x58/0x130 [squashfs]
[4304858.485598]  [<ffffffffb60c6f50>] ? wake_up_atomic_t+0x30/0x30
[4304858.491587]  [<ffffffffc0701ff1>] squashfs_get_datablock+0x21/0x30 [squashfs]
[4304858.498876]  [<ffffffffc0703272>] squashfs_readpage+0x8a2/0xc30 [squashfs]
[4304858.505907]  [<ffffffffb61cbe98>] __do_page_cache_readahead+0x248/0x260
[4304858.512675]  [<ffffffffb61cc451>] ra_submit+0x21/0x30
[4304858.517889]  [<ffffffffb61c0e35>] filemap_fault+0x105/0x420
[4304858.523624]  [<ffffffffb61ee7aa>] __do_fault.isra.61+0x8a/0x100
[4304858.529700]  [<ffffffffb61eed5c>] do_read_fault.isra.63+0x4c/0x1b0
[4304858.536041]  [<ffffffffb61f65a0>] handle_mm_fault+0xa20/0xfb0
[4304858.541955]  [<ffffffffb60e4807>] ? update_curr+0x127/0x1e0
[4304858.547687]  [<ffffffffb6790653>] __do_page_fault+0x213/0x500
[4304858.553592]  [<ffffffffb6790975>] do_page_fault+0x35/0x90
[4304858.559150]  [<ffffffffb678c778>] page_fault+0x28/0x30
[4304858.564451] INFO: task rsession:167329 blocked for more than 120 seconds.
[4304858.571394] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4304858.579373] rsession        D ffffa09927fce400     0 167329 167295 0x00000080
[4304858.586713] Call Trace:
[4304858.589335]  [<ffffffffb6789179>] schedule+0x29/0x70
[4304858.594464]  [<ffffffffc0701725>] squashfs_cache_get+0x105/0x3c0 [squashfs]
[4304858.601577]  [<ffffffffc0701ec8>] ? squashfs_read_metadata+0x58/0x130 [squashfs]
[4304858.609128]  [<ffffffffb60c6f50>] ? wake_up_atomic_t+0x30/0x30
[4304858.615118]  [<ffffffffc0701ff1>] squashfs_get_datablock+0x21/0x30 [squashfs]
[4304858.622407]  [<ffffffffc0703272>] squashfs_readpage+0x8a2/0xc30 [squashfs]
[4304858.629438]  [<ffffffffb61cbe98>] __do_page_cache_readahead+0x248/0x260
[4304858.636216]  [<ffffffffb61cc451>] ra_submit+0x21/0x30
[4304858.641431]  [<ffffffffb61c0e35>] filemap_fault+0x105/0x420
[4304858.647160]  [<ffffffffb61ee7aa>] __do_fault.isra.61+0x8a/0x100
[4304858.653238]  [<ffffffffb61eed5c>] do_read_fault.isra.63+0x4c/0x1b0
[4304858.659576]  [<ffffffffb61f65a0>] handle_mm_fault+0xa20/0xfb0
[4304858.665482]  [<ffffffffb6790653>] __do_page_fault+0x213/0x500
[4304858.671387]  [<ffffffffb6790975>] do_page_fault+0x35/0x90
[4304858.676948]  [<ffffffffb678c778>] page_fault+0x28/0x30
[4304858.682248] INFO: task rsession:167330 blocked for more than 120 seconds.
[4304858.689193] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4304858.697171] rsession        D ffffa082efc35280     0 167330 167295 0x00000080
[4304858.704511] Call Trace:
[4304858.707134]  [<ffffffffb6789179>] schedule+0x29/0x70
[4304858.712260]  [<ffffffffc0701725>] squashfs_cache_get+0x105/0x3c0 [squashfs]
[4304858.719376]  [<ffffffffb60c6f50>] ? wake_up_atomic_t+0x30/0x30
[4304858.725368]  [<ffffffffc0701ff1>] squashfs_get_datablock+0x21/0x30 [squashfs]
[4304858.732659]  [<ffffffffc0703272>] squashfs_readpage+0x8a2/0xc30 [squashfs]
[4304858.739689]  [<ffffffffb61cbe98>] __do_page_cache_readahead+0x248/0x260
[4304858.746455]  [<ffffffffb61cc451>] ra_submit+0x21/0x30
[4304858.751670]  [<ffffffffb61c0e35>] filemap_fault+0x105/0x420
[4304858.757404]  [<ffffffffb6790975>] ? do_page_fault+0x35/0x90
[4304858.763136]  [<ffffffffb61ee7aa>] __do_fault.isra.61+0x8a/0x100
[4304858.769214]  [<ffffffffb61eed5c>] do_read_fault.isra.63+0x4c/0x1b0
[4304858.775553]  [<ffffffffb67885bd>] ? schedule_hrtimeout_range_clock+0x12d/0x150
[4304858.782928]  [<ffffffffb61f65a0>] handle_mm_fault+0xa20/0xfb0
[4304858.788834]  [<ffffffffb629d509>] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
[4304858.795775]  [<ffffffffb6790653>] __do_page_fault+0x213/0x500
[4304858.801681]  [<ffffffffb6790975>] do_page_fault+0x35/0x90
[4304858.807241]  [<ffffffffb678c778>] page_fault+0x28/0x30
[4304858.812540] INFO: task rsession:167345 blocked for more than 120 seconds.
[4304858.819482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4304858.827464] rsession        D ffffa09914026300     0 167345 167295 0x00000080
[4304858.834802] Call Trace:
[4304858.837426]  [<ffffffffb6789179>] schedule+0x29/0x70
[4304858.842552]  [<ffffffffc0701725>] squashfs_cache_get+0x105/0x3c0 [squashfs]
[4304858.849667]  [<ffffffffc0701ec8>] ? squashfs_read_metadata+0x58/0x130 [squashfs]
[4304858.857214]  [<ffffffffb60c6f50>] ? wake_up_atomic_t+0x30/0x30
[4304858.863209]  [<ffffffffc0701ff1>] squashfs_get_datablock+0x21/0x30 [squashfs]
[4304858.870498]  [<ffffffffc0703272>] squashfs_readpage+0x8a2/0xc30 [squashfs]
[4304858.877526]  [<ffffffffb61cbe98>] __do_page_cache_readahead+0x248/0x260
[4304858.884299]  [<ffffffffb61cc451>] ra_submit+0x21/0x30
[4304858.889511]  [<ffffffffb61c0e35>] filemap_fault+0x105/0x420
[4304858.895243]  [<ffffffffb61ee7aa>] __do_fault.isra.61+0x8a/0x100
[4304858.901322]  [<ffffffffb61eed5c>] do_read_fault.isra.63+0x4c/0x1b0
[4304858.907661]  [<ffffffffb61f65a0>] handle_mm_fault+0xa20/0xfb0
[4304858.913566]  [<ffffffffb6790653>] __do_page_fault+0x213/0x500
[4304858.919471]  [<ffffffffb6790975>] do_page_fault+0x35/0x90
[4304858.925030]  [<ffffffffb678c778>] page_fault+0x28/0x30
[4304858.930329] INFO: task rsession:167430 blocked for more than 120 seconds.
[4304858.937274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4304858.945255] rsession        D ffffa09927fcf080     0 167430 167295 0x00000080
[4304858.952593] Call Trace:
[4304858.955214]  [<ffffffffb6789179>] schedule+0x29/0x70
[4304858.960343]  [<ffffffffc0701725>] squashfs_cache_get+0x105/0x3c0 [squashfs]
[4304858.967459]  [<ffffffffc0701ec8>] ? squashfs_read_metadata+0x58/0x130 [squashfs]
[4304858.975006]  [<ffffffffb60c6f50>] ? wake_up_atomic_t+0x30/0x30
[4304858.980999]  [<ffffffffc0701ff1>] squashfs_get_datablock+0x21/0x30 [squashfs]
[4304858.988286]  [<ffffffffc0703272>] squashfs_readpage+0x8a2/0xc30 [squashfs]
[4304858.995316]  [<ffffffffb61cbe98>] __do_page_cache_readahead+0x248/0x260
[4304859.002087]  [<ffffffffb61cc451>] ra_submit+0x21/0x30
[4304859.007301]  [<ffffffffb61c0e35>] filemap_fault+0x105/0x420
[4304859.013056]  [<ffffffffb61ee7aa>] __do_fault.isra.61+0x8a/0x100
[4304859.019136]  [<ffffffffb61eed5c>] do_read_fault.isra.63+0x4c/0x1b0
[4304859.025477]  [<ffffffffb61f65a0>] handle_mm_fault+0xa20/0xfb0
[4304859.031380]  [<ffffffffb60ca3e9>] ? hrtimer_try_to_cancel+0x29/0x120
[4304859.037891]  [<ffffffffb6790653>] __do_page_fault+0x213/0x500
[4304859.043796]  [<ffffffffb6790975>] do_page_fault+0x35/0x90
[4304859.049357]  [<ffffffffb678c778>] page_fault+0x28/0x30
[4304859.054658] INFO: task rsession:167452 blocked for more than 120 seconds.
[4304859.061599] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4304859.069578] rsession        D ffffa09927fcf080     0 167452 167325 0x00100084
[4304859.076918] Call Trace:
[4304859.079541]  [<ffffffffb6789179>] schedule+0x29/0x70
[4304859.084669]  [<ffffffffc0701725>] squashfs_cache_get+0x105/0x3c0 [squashfs]
[4304859.091784]  [<ffffffffb60c6f50>] ? wake_up_atomic_t+0x30/0x30
[4304859.097778]  [<ffffffffc0701ff1>] squashfs_get_datablock+0x21/0x30 [squashfs]
[4304859.105063]  [<ffffffffc0703272>] squashfs_readpage+0x8a2/0xc30 [squashfs]
[4304859.112095]  [<ffffffffb61cbe98>] __do_page_cache_readahead+0x248/0x260
[4304859.118864]  [<ffffffffb61cc451>] ra_submit+0x21/0x30
[4304859.124077]  [<ffffffffb61c0e35>] filemap_fault+0x105/0x420
[4304859.129812]  [<ffffffffb6201948>] ? page_add_new_anon_rmap+0xb8/0x170
[4304859.136410]  [<ffffffffb61ee7aa>] __do_fault.isra.61+0x8a/0x100
[4304859.142486]  [<ffffffffb61eed5c>] do_read_fault.isra.63+0x4c/0x1b0
[4304859.148823]  [<ffffffffb61f65a0>] handle_mm_fault+0xa20/0xfb0
[4304859.154726]  [<ffffffffb60e4807>] ? update_curr+0x127/0x1e0
[4304859.160459]  [<ffffffffb6790653>] __do_page_fault+0x213/0x500
[4304859.166364]  [<ffffffffb6790975>] do_page_fault+0x35/0x90
[4304859.171926]  [<ffffffffb678c778>] page_fault+0x28/0x30

After:

[4304194.578497] KP_OOM: Call send_sig(SIGKILL) on pid=165625 comm=slurm_script parent.comm=slurmstepd uid=572850
, current.pid=165758 current.comm=starter-suid
[4304194.592470] starter-suid invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
[4304194.600564] starter-suid cpuset=step_batch mems_allowed=0-1
[4304194.606305] CPU: 50 PID: 165758 Comm: starter-suid Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1160.49.1.el7.x86_64 #1
[4304194.618619] Hardware name: Dell Inc. PowerEdge R640/0H28RR, BIOS 2.12.2 07/09/2021
[4304194.626348] Call Trace:
[4304194.628974]  [<ffffffffb6783539>] dump_stack+0x19/0x1b
[4304194.634281]  [<ffffffffb677e5d8>] dump_header+0x90/0x229
[4304194.639761]  [<ffffffffb60dae02>] ? default_wake_function+0x12/0x20
[4304194.646193]  [<ffffffffb62646b3>] ? pollwake+0x73/0x90
[4304194.651503]  [<ffffffffb61c1fe6>] ? find_lock_task_mm+0x56/0xc0
[4304194.657591]  [<ffffffffb623d2d8>] ? try_get_mem_cgroup_from_mm+0x28/0x60
[4304194.664454]  [<ffffffffb61c254d>] oom_kill_process+0x2cd/0x490
[4304194.670457]  [<ffffffffb62416cc>] mem_cgroup_oom_synchronize+0x55c/0x590
[4304194.677319]  [<ffffffffb6240b30>] ? mem_cgroup_charge_common+0xc0/0xc0
[4304194.684003]  [<ffffffffb61c2e34>] pagefault_out_of_memory+0x14/0x90
[4304194.690428]  [<ffffffffb677cb15>] mm_fault_error+0x6a/0x157
[4304194.696170]  [<ffffffffb67908d1>] __do_page_fault+0x491/0x500
[4304194.702093]  [<ffffffffb6790975>] do_page_fault+0x35/0x90
[4304194.707659]  [<ffffffffb678c778>] page_fault+0x28/0x30
[4304194.712961] Task in /slurm/uid_572850/job_1787215/step_batch/task_0 killed as a result of limit of /slurm/uid_572850/job_1787215/step_batch
[4304194.725642] memory: usage 1048468kB, limit 1048576kB, failcnt 197087
[4304194.732151] memory+swap: usage 1048468kB, limit 1048576kB, failcnt 1
[4304194.738664] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[4304194.744828] Memory cgroup stats for /slurm/uid_572850/job_1787215/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[4304194.764625] Memory cgroup stats for /slurm/uid_572850/job_1787215/step_batch/task_0: cache:88KB rss:1048060KB rss_huge:16384KB mapped_file:0KB swap:0KB inactive_anon:4KB active_anon:1048392KB inactive_file:36KB active_file:8KB unevictable:0KB
[4304194.786533] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[4304194.794675] [165658] 572850 165658    28359       86      11        0             0 bash
[4304194.802922] [165745] 572850 165745   256201     3800      45        0             0 starter-suid
[4304194.811862] [165756] 572850 165756    55248     1402      50        0             0 rserver
[4304194.820366] [165850] 572850 165850   431571   204140     579        0             0 rsession
[4304194.828960] [166000] 572850 166000   435479   204153     549        0             0 rsession
[4304194.837554] Memory cgroup out of memory: Kill process 165980 (rsession) score 780 or sacrifice child
[4304194.846845] Killed process 166000 (rsession), UID 572850, total-vm:1741916kB, anon-rss:816608kB, file-rss:4kB, shmem-rss:0kB
[4304194.859472] KP_OOM: Call send_sig(SIGKILL) on pid=165625 comm=slurm_script parent.comm=slurmstepd uid=572850
, current.pid=165758 current.comm=starter-suid
[4304194.874631] KP_OOM: Call send_sig(SIGKILL) on pid=165625 comm=slurm_script parent.comm=slurmstepd uid=572850
[4304194.874631] , current.pid=165758 current.comm=starter-suidKP_OOM: Call send_sig(SIGKILL) on pid=165625 comm=slurm_script parent.comm=slurmstepd uid=572850
, current.pid=165758 current.comm=starter-suid
[4304194.905456] KP_OOM: Call send_sig(SIGKILL) on pid=165625 comm=slurm_script parent.comm=slurmstepd uid=572850
[4304194.905456] , current.pid=165758 current.comm=starter-suid[4304214.314071] 99-clean.sh (166158): drop_caches: 3

pja237 commented Feb 10, 2022

That's great to hear :)

Also, thank you for that original crash log; it actually helped explain our "mysterious" node reboot in the grid partition last week. We've rolled out this new version with the fix as well.
Multiple kp_oom triggers for a single job are a normal occurrence; depending on what the process does (heavy I/O or so), sometimes quite a few SIGKILLs are dispatched before it dies.

P.S. If you eventually find it works for you and roll it out to prod, we'd be more than happy to receive a postcard from your site at this address (attn: IT department): https://www.imp.ac.at/about/contact/ 😉
