Kernel 5.13+ panics resulting in deadlock #940

Closed
wkruse opened this issue Aug 27, 2021 · 48 comments

@wkruse

wkruse commented Aug 27, 2021

Describe the bug

We are using Typhoon (https://typhoon.psdn.io/fedora-coreos/bare-metal/) to provision Fedora CoreOS and Kubernetes on bare metal. The last stable version without the issue was 34.20210711.3.0. Starting from 34.20210725.3.0 up to 34.20210808.3.0 we see system freezes; to force a reboot we have to use the power switch.

This is the kernel crash log just before the hard reboot.

------------[ cut here ]------------
audit: type=1325 audit(1629974912.370:1367): table=nat family=2 entries=63 op=xt_replace pid=158894 subj=system_u:system_r:spc_t:s0 comm="iptables-restor"
------------[ cut here ]------------
NETDEV WATCHDOG: eno1 (ixgbe): transmit queue 17 timed out
WARNING: CPU: 17 PID: 0 at net/sched/sch_generic.c:467 dev_watchdog+0x24d/0x260
Modules linked in: ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net veth ipip tunnel4 ip_tunnel bpf_preload wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 curve25519_x86_64 libcurve25519_generic libchacha libblake2s_generic ip6_udp_tunnel udp_tunnel xt_set ip_set_hash_ipportnet ip_set_bitmap_port ip_set_hash_ipport ip_set_hash_ipportip ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr iptable_mangle xt_comment xt_mark nf_tables xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat br_netfilter bridge stp llc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay team_mode_activebackup team rfkill iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_msr dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass snd_pcsp rapl snd_pcm intel_cstate snd_timer snd intel_uncore soundcore ipmi_ssif mgag200 drm_kms_helper mxm_wmi joydev
 cec mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm ip_tables xfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel igb ixgbe i2c_algo_bit ghash_clmulni_intel megaraid_sas dca mdio wmi fuse
CPU: 17 PID: 0 Comm: swapper/17 Kdump: loaded Not tainted 5.13.4-200.fc34.x86_64 #1
Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.12.1 12/04/2020
RIP: 0010:dev_watchdog+0x24d/0x260
Code: 49 66 fd ff eb a9 4c 89 f7 c6 05 27 c8 4d 01 01 e8 28 61 fa ff 44 89 e9 4c 89 f6 48 c7 c7 80 47 6a bc 48 89 c2 e8 8d 36 17 00 <0f> 0b eb 8a 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44
RSP: 0018:ffffc06d86838eb0 EFLAGS: 00010282
RAX: 000000000000003a RBX: ffff9c3d54df4ec0 RCX: 0000000000000027
RDX: ffff9c4c9fc18a08 RSI: 0000000000000001 RDI: ffff9c4c9fc18a00
RBP: ffff9c3d55ee03dc R08: ffffffffbcc66880 R09: 0000000000000001
R10: ffffffffffffffff R11: ffffffffbd794fa2 R12: ffff9c3d55ee0480
R13: 0000000000000011 R14: ffff9c3d55ee0000 R15: ffff9c3d54df4f40
FS:  0000000000000000(0000) GS:ffff9c4c9fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000ee796000 CR3: 000000011f5c8003 CR4: 00000000001706e0
Call Trace:
 <IRQ>
 ? pfifo_fast_enqueue+0x150/0x150
 call_timer_fn+0x29/0xf0
 __run_timers.part.0+0x1b1/0x210
 ? __hrtimer_run_queues+0x129/0x250
 ? recalibrate_cpu_khz+0x10/0x10
 ? ktime_get+0x38/0x90
 ? lapic_next_deadline+0x28/0x30
 run_timer_softirq+0x26/0x50
 __do_softirq+0xd0/0x28f
 __irq_exit_rcu+0xcc/0x100
 sysvec_apic_timer_interrupt+0x72/0x90
 </IRQ>
 asm_sysvec_apic_timer_interrupt+0x12/0x20
RIP: 0010:cpuidle_enter_state+0xc7/0x350
Code: 8b 3d d5 97 69 44 e8 28 bf 79 ff 49 89 c5 0f 1f 44 00 00 31 ff e8 59 d7 79 ff 45 84 ff 0f 85 fa 00 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 06 01 00 00 49 63 d6 4c 2b 2c 24 48 8d 04 52 48 8d
RSP: 0018:ffffc06d864cbeb0 EFLAGS: 00000246
RAX: ffff9c4c9fc2ac80 RBX: 0000000000000002 RCX: 000000000000001f
RDX: 0000000000000000 RSI: 00000000313b18a8 RDI: 0000000000000000
RBP: ffffe06d7fc00168 R08: 0000056ba6474e2b R09: 0000000000000008
R10: 000000000000003b R11: 000000000000001f R12: ffffffffbce5c500
R13: 0000056ba6474e2b R14: 0000000000000002 R15: 0000000000000000
 cpuidle_enter+0x29/0x40
 do_idle+0x1ce/0x270
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
---[ end trace 81ee11fe7294464a ]---
ixgbe 0000:01:00.0 eno1: initiating reset due to tx timeout
ixgbe 0000:01:00.0 eno1: Reset adapter
watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [runc:158948]
Modules linked in: ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net veth ipip tunnel4 ip_tunnel bpf_preload wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 curve25519_x86_64 libcurve25519_generic libchacha libblake2s_generic ip6_udp_tunnel udp_tunnel xt_set ip_set_hash_ipportnet ip_set_bitmap_port ip_set_hash_ipport ip_set_hash_ipportip ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr iptable_mangle xt_comment xt_mark nf_tables xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat br_netfilter bridge stp llc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay team_mode_activebackup team rfkill iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_msr dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass snd_pcsp rapl snd_pcm intel_cstate snd_timer snd intel_uncore soundcore ipmi_ssif mgag200 drm_kms_helper mxm_wmi joydev
 cec mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm ip_tables xfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel igb ixgbe i2c_algo_bit ghash_clmulni_intel megaraid_sas dca mdio wmi fuse
CPU: 5 PID: 158948 Comm: runc Kdump: loaded Tainted: G        W         5.13.4-200.fc34.x86_64 #1
Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.12.1 12/04/2020
RIP: 0010:smp_call_function_single+0x7f/0xf0
Code: d7 33 e9 44 a9 00 01 ff 00 75 7f 85 c9 75 32 48 c7 c6 40 bd 02 00 65 48 03 35 ad cd e8 44 8b 46 08 a8 01 74 09 f3 90 8b 46 08 <a8> 01 75 f7 83 4e 08 01 4c 89 46 10 48 89 56 18 e8 ac fe ff ff c9
RSP: 0018:ffffc06da1d57c80 EFLAGS: 00000202
RAX: 0000000000000001 RBX: 0000001745ac8754 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9c4c9faabd40 RDI: 0000000000000004
RBP: ffffc06da1d57ca0 R08: ffffffffbb03e770 R09: 0000000000000000
R10: 0000000000000004 R11: 006f666e69757063 R12: 0000056630e8c841
R13: 0000000000000001 R14: ffff9c469c709110 R15: 0000000000000000
FS:  00007f1be4f8e740(0000) GS:ffff9c4c9fa80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000c000210008 CR3: 0000000b0f35e002 CR4: 00000000001706e0
Call Trace:
 aperfmperf_snapshot_cpu+0x57/0x70
 arch_freq_prepare_all+0x6b/0xb0
 ? proc_reg_poll+0x90/0x90
 cpuinfo_open+0xe/0x20
 do_dentry_open+0x14b/0x360
 path_openat+0xab3/0x1020
 ? filename_lookup+0x125/0x1a0
 ? avc_has_perm+0x6d/0x160
 do_filp_open+0x8f/0x120
 ? __check_object_size+0x136/0x150
 ? alloc_fd+0x50/0x170
 do_sys_openat2+0x7a/0x130
 __x64_sys_openat+0x45/0x70
 do_syscall_64+0x40/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x564d1a44caca
Code: e8 bb b5 fa ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 4c 8b 54 24 28 4c 8b 44 24 30 4c 8b 4c 24 38 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 40 ff ff ff ff 48 c7 44 24 48
RSP: 002b:000000c0000d2f08 EFLAGS: 00000216 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 000000c000034800 RCX: 0000564d1a44caca
RDX: 0000000000080000 RSI: 000000c000092b70 RDI: ffffffffffffff9c
RBP: 000000c0000d2f88 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000216 R12: 00000000000000b8
R13: 00000000000000b7 R14: 0000000000000200 R15: ffffffffffffffff
watchdog: BUG: soft lockup - CPU#18 stuck for 22s! [kworker/18:2:114194]
Modules linked in: ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net veth ipip tunnel4 ip_tunnel bpf_preload wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 curve25519_x86_64 libcurve25519_generic libchacha libblake2s_generic ip6_udp_tunnel udp_tunnel xt_set ip_set_hash_ipportnet ip_set_bitmap_port ip_set_hash_ipport ip_set_hash_ipportip ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr iptable_mangle xt_comment xt_mark nf_tables xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat br_netfilter bridge stp llc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay team_mode_activebackup team rfkill iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_msr dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass snd_pcsp rapl snd_pcm intel_cstate snd_timer snd intel_uncore soundcore ipmi_ssif mgag200 drm_kms_helper mxm_wmi joydev
 cec mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm ip_tables xfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel igb ixgbe i2c_algo_bit ghash_clmulni_intel megaraid_sas dca mdio wmi fuse
CPU: 18 PID: 114194 Comm: kworker/18:2 Kdump: loaded Tainted: G        W    L    5.13.4-200.fc34.x86_64 #1
Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.12.1 12/04/2020
Workqueue: events netstamp_clear
RIP: 0010:smp_call_function_many_cond+0x118/0x2c0
Code: 8b 75 08 e8 0a 6c 50 00 3b 05 38 e0 dd 01 89 c7 73 22 48 63 c7 48 8b 4d 00 48 03 0c c5 00 69 6b bc 8b 41 08 a8 01 74 0a f3 90 <8b> 51 08 83 e2 01 75 f6 eb cb 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e
RSP: 0018:ffffc06d8d077da0 EFLAGS: 00000202
RAX: 0000000000000011 RBX: 0000000000000001 RCX: ffff9c449fa32d20
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff9c449fc6bdc0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000246
R13: 0000000000000000 R14: 0000000000000020 R15: ffff9c449fc6bdc0
FS:  0000000000000000(0000) GS:ffff9c449fc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fae28431018 CR3: 0000000908c10006 CR4: 00000000001706e0
Call Trace:
 ? text_poke_loc_init+0x100/0x100
 on_each_cpu_cond_mask+0x19/0x20
 text_poke_bp_batch+0xa3/0x1e0
 text_poke_finish+0x1b/0x30
 arch_jump_label_transform_apply+0x16/0x30
 static_key_enable_cpuslocked+0x57/0x90
 static_key_enable+0x16/0x20
 process_one_work+0x1ec/0x380
 worker_thread+0x53/0x3e0
 ? process_one_work+0x380/0x380
 kthread+0x127/0x150
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
watchdog: BUG: soft lockup - CPU#31 stuck for 22s! [migration/31:170]
Modules linked in: ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net veth ipip tunnel4 ip_tunnel bpf_preload wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 curve25519_x86_64 libcurve25519_generic libchacha libblake2s_generic ip6_udp_tunnel udp_tunnel xt_set ip_set_hash_ipportnet ip_set_bitmap_port ip_set_hash_ipport ip_set_hash_ipportip ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr iptable_mangle xt_comment xt_mark nf_tables xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat br_netfilter bridge stp llc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay team_mode_activebackup team rfkill iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_msr dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass snd_pcsp rapl snd_pcm intel_cstate snd_timer snd intel_uncore soundcore ipmi_ssif mgag200 drm_kms_helper mxm_wmi joydev
 cec mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm ip_tables xfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel igb ixgbe i2c_algo_bit ghash_clmulni_intel megaraid_sas dca mdio wmi fuse
CPU: 31 PID: 170 Comm: migration/31 Kdump: loaded Tainted: G        W    L    5.13.4-200.fc34.x86_64 #1
Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.12.1 12/04/2020
Stopper: multi_cpu_stop+0x0/0x110 <- migrate_swap+0x8f/0xe0
RIP: 0010:rcu_momentary_dyntick_idle+0x24/0x30
Code: 84 00 00 00 00 00 48 c7 c0 40 ba 02 00 65 c6 05 f5 39 ed 44 00 65 48 03 05 f9 93 eb 44 ba 04 00 00 00 f0 0f c1 90 20 01 00 00 <83> e2 02 74 01 c3 0f 0b c3 0f 1f 00 0f 1f 44 00 00 31 c0 65 48 8b
RSP: 0000:ffffc06d86a83e60 EFLAGS: 00000202
RAX: ffff9c4c9fdeba40 RBX: ffffc06d8778fb88 RCX: 0000000000000000
RDX: 000000006288ebe2 RSI: ffffc06d8778fbe0 RDI: ffffffffbc211c20
RBP: ffffc06d8778fbac R08: ffff9c4c9fddd9b0 R09: 0000000000000000
R10: 0000000000000000 R11: 000000000000001f R12: 0000000000000001
R13: ffffffffbc211c20 R14: 0000000000000000 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff9c4c9fdc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000004f42160 CR3: 000000089a3d6001 CR4: 00000000001706e0
Call Trace:
 multi_cpu_stop+0xb9/0x110
 ? stop_machine_yield+0x10/0x10
 cpu_stopper_thread+0xd1/0x140
 smpboot_thread_fn+0xc5/0x160
 ? smpboot_register_percpu_thread+0xf0/0xf0
 kthread+0x127/0x150
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [migration/20:115]
Modules linked in: ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net veth ipip tunnel4 ip_tunnel bpf_preload wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 curve25519_x86_64 libcurve25519_generic libchacha libblake2s_generic ip6_udp_tunnel udp_tunnel xt_set ip_set_hash_ipportnet ip_set_bitmap_port ip_set_hash_ipport ip_set_hash_ipportip ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr iptable_mangle xt_comment xt_mark nf_tables xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat br_netfilter bridge stp llc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay team_mode_activebackup team rfkill iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_msr dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass snd_pcsp rapl snd_pcm intel_cstate snd_timer snd intel_uncore soundcore ipmi_ssif mgag200 drm_kms_helper mxm_wmi joydev
 cec mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm ip_tables xfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel igb ixgbe i2c_algo_bit ghash_clmulni_intel megaraid_sas dca mdio wmi fuse
CPU: 20 PID: 115 Comm: migration/20 Kdump: loaded Tainted: G        W    L    5.13.4-200.fc34.x86_64 #1
Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.12.1 12/04/2020
Stopper: multi_cpu_stop+0x0/0x110 <- migrate_swap+0x8f/0xe0
RIP: 0010:rcu_momentary_dyntick_idle+0x24/0x30
Code: 84 00 00 00 00 00 48 c7 c0 40 ba 02 00 65 c6 05 f5 39 ed 44 00 65 48 03 05 f9 93 eb 44 ba 04 00 00 00 f0 0f c1 90 20 01 00 00 <83> e2 02 74 01 c3 0f 0b c3 0f 1f 00 0f 1f 44 00 00 31 c0 65 48 8b
RSP: 0000:ffffc06d8689fe60 EFLAGS: 00000202
RAX: ffff9c449fcaba40 RBX: ffffc06d8c53fb88 RCX: 0000000000000000
RDX: 000000003a1f919a RSI: ffffc06d8c53fbe0 RDI: ffffffffbc212020
RBP: ffffc06d8c53fbac R08: ffff9c449fc9d9b0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000014 R12: 0000000000000001
R13: ffffffffbc212020 R14: 0000000000000000 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff9c449fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000c0010ab558 CR3: 000000012ff24006 CR4: 00000000001706e0
Call Trace:
 multi_cpu_stop+0xb9/0x110
 ? stop_machine_yield+0x10/0x10
 cpu_stopper_thread+0xd1/0x140
 smpboot_thread_fn+0xc5/0x160
 ? smpboot_register_percpu_thread+0xf0/0xf0
 kthread+0x127/0x150
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
watchdog: BUG: soft lockup - CPU#5 stuck for 48s! [runc:158948]
Modules linked in: ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net veth ipip tunnel4 ip_tunnel bpf_preload wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 curve25519_x86_64 libcurve25519_generic libchacha libblake2s_generic ip6_udp_tunnel udp_tunnel xt_set ip_set_hash_ipportnet ip_set_bitmap_port ip_set_hash_ipport ip_set_hash_ipportip ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr iptable_mangle xt_comment xt_mark nf_tables xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat br_netfilter bridge stp llc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay team_mode_activebackup team rfkill iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_msr dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass snd_pcsp rapl snd_pcm intel_cstate snd_timer snd intel_uncore soundcore ipmi_ssif mgag200 drm_kms_helper mxm_wmi joydev
 cec mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm ip_tables xfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel igb ixgbe i2c_algo_bit ghash_clmulni_intel megaraid_sas dca mdio wmi fuse
CPU: 5 PID: 158948 Comm: runc Kdump: loaded Tainted: G        W    L    5.13.4-200.fc34.x86_64 #1
Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.12.1 12/04/2020
RIP: 0010:smp_call_function_single+0x7c/0xf0
Code: 65 8b 05 d7 33 e9 44 a9 00 01 ff 00 75 7f 85 c9 75 32 48 c7 c6 40 bd 02 00 65 48 03 35 ad cd e8 44 8b 46 08 a8 01 74 09 f3 90 <8b> 46 08 a8 01 75 f7 83 4e 08 01 4c 89 46 10 48 89 56 18 e8 ac fe
RSP: 0018:ffffc06da1d57c80 EFLAGS: 00000202
RAX: 0000000000000001 RBX: 0000001745ac8754 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9c4c9faabd40 RDI: 0000000000000004
RBP: ffffc06da1d57ca0 R08: ffffffffbb03e770 R09: 0000000000000000
R10: 0000000000000004 R11: 006f666e69757063 R12: 0000056630e8c841
R13: 0000000000000001 R14: ffff9c469c709110 R15: 0000000000000000
FS:  00007f1be4f8e740(0000) GS:ffff9c4c9fa80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000c000210008 CR3: 0000000b0f35e002 CR4: 00000000001706e0
Call Trace:
 aperfmperf_snapshot_cpu+0x57/0x70
 arch_freq_prepare_all+0x6b/0xb0
 ? proc_reg_poll+0x90/0x90
 cpuinfo_open+0xe/0x20
 do_dentry_open+0x14b/0x360
 path_openat+0xab3/0x1020
 ? filename_lookup+0x125/0x1a0
 ? avc_has_perm+0x6d/0x160
 do_filp_open+0x8f/0x120
 ? __check_object_size+0x136/0x150
 ? alloc_fd+0x50/0x170
 do_sys_openat2+0x7a/0x130
 __x64_sys_openat+0x45/0x70
 do_syscall_64+0x40/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x564d1a44caca
Code: e8 bb b5 fa ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 4c 8b 54 24 28 4c 8b 44 24 30 4c 8b 4c 24 38 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 40 ff ff ff ff 48 c7 44 24 48
RSP: 002b:000000c0000d2f08 EFLAGS: 00000216 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 000000c000034800 RCX: 0000564d1a44caca
RDX: 0000000000080000 RSI: 000000c000092b70 RDI: ffffffffffffff9c
RBP: 000000c0000d2f88 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000216 R12: 00000000000000b8
R13: 00000000000000b7 R14: 0000000000000200 R15: ffffffffffffffff
watchdog: BUG: soft lockup - CPU#18 stuck for 48s! [kworker/18:2:114194]
Modules linked in: ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net veth ipip tunnel4 ip_tunnel bpf_preload wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 curve25519_x86_64 libcurve25519_generic libchacha libblake2s_generic ip6_udp_tunnel udp_tunnel xt_set ip_set_hash_ipportnet ip_set_bitmap_port ip_set_hash_ipport ip_set_hash_ipportip ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr iptable_mangle xt_comment xt_mark nf_tables xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat br_netfilter bridge stp llc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay team_mode_activebackup team rfkill iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_msr dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass snd_pcsp rapl snd_pcm intel_cstate snd_timer snd intel_uncore soundcore ipmi_ssif mgag200 drm_kms_helper mxm_wmi joydev
 cec mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm ip_tables xfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel igb ixgbe i2c_algo_bit ghash_clmulni_intel megaraid_sas dca mdio wmi fuse
CPU: 18 PID: 114194 Comm: kworker/18:2 Kdump: loaded Tainted: G        W    L    5.13.4-200.fc34.x86_64 #1
Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.12.1 12/04/2020
Workqueue: events netstamp_clear
RIP: 0010:smp_call_function_many_cond+0x11b/0x2c0
Code: e8 0a 6c 50 00 3b 05 38 e0 dd 01 89 c7 73 22 48 63 c7 48 8b 4d 00 48 03 0c c5 00 69 6b bc 8b 41 08 a8 01 74 0a f3 90 8b 51 08 <83> e2 01 75 f6 eb cb 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3
RSP: 0018:ffffc06d8d077da0 EFLAGS: 00000202
RAX: 0000000000000011 RBX: 0000000000000001 RCX: ffff9c449fa32d20
RDX: 0000000000000011 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff9c449fc6bdc0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000246
R13: 0000000000000000 R14: 0000000000000020 R15: ffff9c449fc6bdc0
FS:  0000000000000000(0000) GS:ffff9c449fc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fae28431018 CR3: 0000000908c10006 CR4: 00000000001706e0
Call Trace:
 ? text_poke_loc_init+0x100/0x100
 on_each_cpu_cond_mask+0x19/0x20
 text_poke_bp_batch+0xa3/0x1e0
 text_poke_finish+0x1b/0x30
 arch_jump_label_transform_apply+0x16/0x30
 static_key_enable_cpuslocked+0x57/0x90
 static_key_enable+0x16/0x20
 process_one_work+0x1ec/0x380
 worker_thread+0x53/0x3e0
 ? process_one_work+0x380/0x380
 kthread+0x127/0x150
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
rcu:         27-...!: (2 GPs behind) idle=9c6/1/0x4000000000000002 softirq=400673/400673 fqs=1269 
        (detected by 29, t=60002 jiffies, g=1841281, q=62941)
Sending NMI from CPU 29 to CPUs 27:
NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
Modules linked in: ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net veth ipip tunnel4 ip_tunnel bpf_preload wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 curve25519_x86_64 libcurve25519_generic libchacha libblake2s_generic ip6_udp_tunnel udp_tunnel xt_set ip_set_hash_ipportnet ip_set_bitmap_port ip_set_hash_ipport ip_set_hash_ipportip ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr iptable_mangle xt_comment xt_mark nf_tables xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat br_netfilter bridge stp llc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay team_mode_activebackup team rfkill iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_msr dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass snd_pcsp rapl snd_pcm intel_cstate snd_timer snd intel_uncore soundcore ipmi_ssif mgag200 drm_kms_helper mxm_wmi joydev
 cec mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm ip_tables xfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel igb ixgbe i2c_algo_bit ghash_clmulni_intel megaraid_sas dca mdio wmi fuse
CPU: 0 PID: 155856 Comm: kworker/0:0 Kdump: loaded Not tainted 5.13.4-200.fc34.x86_64 #1
Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.12.1 12/04/2020
Workqueue: events vmstat_shepherd
RIP: 0010:native_queued_spin_lock_slowpath+0x61/0x1d0
Code: 2a 08 0f 92 c1 8b 02 0f b6 c9 c1 e1 08 30 e4 09 c8 a9 00 01 ff ff 0f 85 11 01 00 00 85 c0 74 0e 8b 02 84 c0 74 08 f3 90 8b 02 <84> c0 75 f8 b8 01 00 00 00 66 89 02 c3 8b 37 b9 00 02 00 00 81 fe
RSP: 0018:ffffc06d8efdfe08 EFLAGS: 00000002
RAX: 0000000000000101 RBX: ffff9c4c9fd72900 RCX: 0000000000000000
RDX: ffff9c4c9fd6a600 RSI: 0000000000000000 RDI: ffff9c4c9fd6a600
RBP: 0000000000000025 R08: ffff9c4c9fd6a600 R09: 0000000000000036
R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000001b
R13: 0000000000027ba8 R14: ffff9c44c004aa00 R15: ffff9c4c9fd676e0
FS:  0000000000000000(0000) GS:ffff9c449fa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc0d0000010 CR3: 00000001527f6004 CR4: 00000000001706f0
Call Trace:
 _raw_spin_lock+0x1a/0x20
 __queue_work+0x16d/0x3e0
 ? __switch_to_asm+0x42/0x70
 queue_delayed_work_on+0x31/0x50
 vmstat_shepherd+0x6d/0xa0
 process_one_work+0x1ec/0x380
 worker_thread+0x53/0x3e0
 ? process_one_work+0x380/0x380
 kthread+0x127/0x150
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
Modules linked in: ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net veth ipip tunnel4 ip_tunnel bpf_preload wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 curve25519_x86_64 libcurve25519_generic libchacha libblake2s_generic ip6_udp_tunnel udp_tunnel xt_set ip_set_hash_ipportnet ip_set_bitmap_port ip_set_hash_ipport ip_set_hash_ipportip ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr iptable_mangle xt_comment xt_mark nf_tables xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat br_netfilter bridge stp llc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay team_mode_activebackup team rfkill iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_msr dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass snd_pcsp rapl snd_pcm intel_cstate snd_timer snd intel_uncore soundcore ipmi_ssif mgag200 drm_kms_helper mxm_wmi joydev
 cec mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm ip_tables xfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel igb ixgbe i2c_algo_bit ghash_clmulni_intel megaraid_sas dca mdio wmi fuse
CPU: 4 PID: 177 Comm: kauditd Kdump: loaded Not tainted 5.13.4-200.fc34.x86_64 #1
Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.12.1 12/04/2020
RIP: 0010:vprintk_emit+0x1f1/0x260
Code: c7 78 00 79 bd c6 07 00 0f 1f 40 00 81 e3 00 02 00 00 84 d2 74 5c 0f b6 05 b4 96 64 02 84 c0 74 0d f3 90 0f b6 15 a7 96 64 02 <84> d2 75 f3 e8 36 09 00 00 48 85 db 0f 84 4a ff ff ff fb 66 0f 1f
RSP: 0018:ffffc06d86b3fdb8 EFLAGS: 00000002
RAX: 0000000000000001 RBX: 0000000000000200 RCX: ffff9c44c0328000
RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffffffffbd790078
RBP: 00000000ffffffff R08: ffffffffbcc66880 R09: 00000000bd794eaa
R10: ffffffffffffffff R11: ffffffffbd794eaa R12: 000000000000009a
R13: 0000000000000000 R14: ffffffffbc5c7720 R15: 0000000000000246
FS:  0000000000000000(0000) GS:ffff9c449fa80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055f141567668 CR3: 000000089a3d6005 CR4: 00000000001706e0
Call Trace:
 printk+0x48/0x4a
 kauditd_hold_skb.cold+0x14/0x19
 ? auditd_conn_free+0x70/0x70
 kauditd_send_queue+0x111/0x150
 ? audit_log_lost+0x90/0x90
 kauditd_thread+0x22b/0x2b0
 ? finish_wait+0x80/0x80
 ? auditd_reset+0x90/0x90
 kthread+0x127/0x150
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
Modules linked in: ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net veth ipip tunnel4 ip_tunnel bpf_preload wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 curve25519_x86_64 libcurve25519_generic libchacha libblake2s_generic ip6_udp_tunnel udp_tunnel xt_set ip_set_hash_ipportnet ip_set_bitmap_port ip_set_hash_ipport ip_set_hash_ipportip ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr iptable_mangle xt_comment xt_mark nf_tables xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat br_netfilter bridge stp llc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay team_mode_activebackup team rfkill iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_msr dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass snd_pcsp rapl snd_pcm intel_cstate snd_timer snd intel_uncore soundcore ipmi_ssif mgag200 drm_kms_helper mxm_wmi joydev
 cec mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm ip_tables xfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel igb ixgbe i2c_algo_bit ghash_clmulni_intel megaraid_sas dca mdio wmi fuse
CPU: 27 PID: 0 Comm: swapper/27 Kdump: loaded Not tainted 5.13.4-200.fc34.x86_64 #1
Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.12.1 12/04/2020
RIP: 0010:native_queued_spin_lock_slowpath+0x61/0x1d0
Code: 2a 08 0f 92 c1 8b 02 0f b6 c9 c1 e1 08 30 e4 09 c8 a9 00 01 ff ff 0f 85 11 01 00 00 85 c0 74 0e 8b 02 84 c0 74 08 f3 90 8b 02 <84> c0 75 f8 b8 01 00 00 00 66 89 02 c3 8b 37 b9 00 02 00 00 81 fe
RSP: 0018:ffffc06d869f0a38 EFLAGS: 00000002
RAX: 0000000000000101 RBX: ffff9c456690a780 RCX: 0000000000000000
RDX: ffff9c4c9fd6ac80 RSI: 0000000000000000 RDI: ffff9c4c9fd6ac80
RBP: ffff9c4c9fd6ac80 R08: ffff9c4c9fd6a620 R09: ffff9c4c9fd6a620
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000002 R14: ffff9c456690b3c4 R15: 000000000000001b
FS:  0000000000000000(0000) GS:ffff9c4c9fd40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc11c0710d8 CR3: 0000000984f70004 CR4: 00000000001706e0
Call Trace:
 <IRQ>
 _raw_spin_lock+0x1a/0x20
 try_to_wake_up+0x13f/0x4d0
 ? insert_work+0x46/0xc0
 __queue_work+0x1d1/0x3e0
 queue_work_on+0x31/0x50
 soft_cursor+0x1ab/0x230
 bit_cursor+0x374/0x5a0
 ? fbcon_cursor+0x109/0x130
 hide_cursor+0x2a/0x90
 vt_console_print+0x3c5/0x3d0
 console_unlock+0x383/0x520
 vprintk_emit+0x152/0x260
 printk+0x48/0x4a
 __warn_printk+0x37/0x64
 ? enqueue_entity+0x18c/0x7b0
 enqueue_task_fair+0x26f/0x6a0
 ? psi_task_change+0x9b/0xe0
 ttwu_do_activate+0x75/0x180
 try_to_wake_up+0x19b/0x4d0
 ? __hrtimer_init+0xc0/0xc0
 hrtimer_wakeup+0x1e/0x30
 __hrtimer_run_queues+0x11a/0x250
 hrtimer_interrupt+0x110/0x2c0
 __sysvec_apic_timer_interrupt+0x5f/0xd0
 sysvec_apic_timer_interrupt+0x6d/0x90
 </IRQ>
 asm_sysvec_apic_timer_interrupt+0x12/0x20
RIP: 0010:cpuidle_enter_state+0xc7/0x350
Code: 8b 3d d5 97 69 44 e8 28 bf 79 ff 49 89 c5 0f 1f 44 00 00 31 ff e8 59 d7 79 ff 45 84 ff 0f 85 fa 00 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 06 01 00 00 49 63 d6 4c 2b 2c 24 48 8d 04 52 48 8d
RSP: 0018:ffffc06d8651beb0 EFLAGS: 00000246
RAX: ffff9c4c9fd6ac80 RBX: 0000000000000002 RCX: 000000000000001f
RDX: 0000000000000000 RSI: 00000000313b18a8 RDI: 0000000000000000
RBP: ffffe06d7fd40168 R08: 00000564af795329 R09: 0000000000000018
R10: 0000000000003b2e R11: 0000000000001b88 R12: ffffffffbce5c500
R13: 00000564af795329 R14: 0000000000000002 R15: 0000000000000000
 cpuidle_enter+0x29/0x40
 do_idle+0x1ce/0x270
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
NMI backtrace for cpu 27
CPU: 27 PID: 0 Comm: swapper/27 Kdump: loaded Tainted: G        W   
Lost 55 message(s)!
rcu: rcu_sched kthread timer wakeup didn't happen for 54446 jiffies! g1841281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
rcu:         Possible timer handling issue on cpu=4 timer-softirq=98640
rcu: rcu_sched kthread starved for 54449 jiffies! g1841281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=4
rcu:         Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_sched       state:I stack:    0 pid:   15 ppid:     2 flags:0x00004000
Call Trace:
 __schedule+0x2f3/0x990
 schedule+0x46/0xb0
 schedule_timeout+0x7d/0x120
 ? prepare_to_swait_event+0x82/0x130
 ? __bpf_trace_tick_stop+0x10/0x10
 rcu_gp_kthread+0x566/0xc30
 ? rcu_all_qs+0x70/0x70
 kthread+0x127/0x150
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
rcu: Stack dump where RCU GP kthread last ran:
Sending NMI from CPU 29 to CPUs 4:
NMI backtrace for cpu 4
CPU: 4 PID: 177 Comm: kauditd Kdump: loaded Tainted: G        W    L    5.13.4-200.fc34.x86_64 #1
Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.12.1 12/04/2020
RIP: 0010:vprintk_emit+0x1f1/0x260
Code: c7 78 00 79 bd c6 07 00 0f 1f 40 00 81 e3 00 02 00 00 84 d2 74 5c 0f b6 05 b4 96 64 02 84 c0 74 0d f3 90 0f b6 15 a7 96 64 02 <84> d2 75 f3 e8 36 09 00 00 48 85 db 0f 84 4a ff ff ff fb 66 0f 1f
RSP: 0018:ffffc06d86b3fdb8 EFLAGS: 00000002
RAX: 0000000000000001 RBX: 0000000000000200 RCX: ffff9c44c0328000
RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffffffffbd790078
RBP: 00000000ffffffff R08: ffffffffbcc66880 R09: 00000000bd794eaa
R10: ffffffffffffffff R11: ffffffffbd794eaa R12: 000000000000009a
R13: 0000000000000000 R14: ffffffffbc5c7720 R15: 0000000000000246
FS:  0000000000000000(0000) GS:ffff9c449fa80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055f141567668 CR3: 000000089a3d6005 CR4: 00000000001706e0
Call Trace:
 printk+0x48/0x4a
 kauditd_hold_skb.cold+0x14/0x19
 ? auditd_conn_free+0x70/0x70
 kauditd_send_queue+0x111/0x150
 ? audit_log_lost+0x90/0x90
 kauditd_thread+0x22b/0x2b0
 ? finish_wait+0x80/0x80
 ? auditd_reset+0x90/0x90
 kthread+0x127/0x150
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
watchdog: BUG: soft lockup - CPU#31 stuck for 48s! [migration/31:170]
Modules linked in: ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net veth ipip tunnel4 ip_tunnel bpf_preload wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 curve25519_x86_64 libcurve25519_generic libchacha libblake2s_generic ip6_udp_tunnel udp_tunnel xt_set ip_set_hash_ipportnet ip_set_bitmap_port ip_set_hash_ipport ip_set_hash_ipportip ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr iptable_mangle xt_comment xt_mark nf_tables xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat br_netfilter bridge stp llc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay team_mode_activebackup team rfkill iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_msr dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass snd_pcsp rapl snd_pcm intel_cstate snd_timer snd intel_uncore soundcore ipmi_ssif mgag200 drm_kms_helper mxm_wmi joydev
 cec mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm ip_tables xfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel igb ixgbe i2c_algo_bit ghash_clmulni_intel megaraid_sas dca mdio wmi fuse
CPU: 31 PID: 170 Comm: migration/31 Kdump: loaded Tainted: G        W    L    5.13.4-200.fc34.x86_64 #1
Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.12.1 12/04/2020
Stopper: multi_cpu_stop+0x0/0x110 <- migrate_swap+0x8f/0xe0
RIP: 0010:multi_cpu_stop+0x9f/0x110
Code: 0f 8b 43 20 8b 4b 10 83 c0 01 89 4b 24 89 43 20 e8 d6 fb fa ff 41 83 ff 04 74 34 45 89 fc 4c 89 ef e8 55 ff ff ff 44 8b 7b 20 <45> 39 fc 75 aa 41 83 ff 01 76 0a e8 61 48 01 00 e8 2c 3f 01 00 e8
RSP: 0000:ffffc06d86a83e68 EFLAGS: 00000202
RAX: ffff9c4c9fdeba40 RBX: ffffc06d8778fb88 RCX: 0000000000000000
RDX: 0000000000000002 RSI: ffffc06d8778fbe0 RDI: ffffffffbc211c20
RBP: ffffc06d8778fbac R08: ffff9c4c9fddd9b0 R09: 0000000000000000
R10: 0000000000000000 R11: 000000000000001f R12: 0000000000000001
R13: ffffffffbc211c20 R14: 0000000000000000 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff9c4c9fdc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000004f42160 CR3: 000000089a3d6001 CR4: 00000000001706e0
Call Trace:
 ? stop_machine_yield+0x10/0x10
 cpu_stopper_thread+0xd1/0x140
 smpboot_thread_fn+0xc5/0x160
 ? smpboot_register_percpu_thread+0xf0/0xf0
 kthread+0x127/0x150
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
watchdog: BUG: soft lockup - CPU#20 stuck for 49s! [migration/20:115]
Modules linked in: ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net veth ipip tunnel4 ip_tunnel bpf_preload wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 curve25519_x86_64 libcurve25519_generic libchacha libblake2s_generic ip6_udp_tunnel udp_tunnel xt_set ip_set_hash_ipportnet ip_set_bitmap_port ip_set_hash_ipport ip_set_hash_ipportip ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr iptable_mangle xt_comment xt_mark nf_tables xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat br_netfilter bridge stp llc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay team_mode_activebackup team rfkill iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_msr dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass snd_pcsp rapl snd_pcm intel_cstate snd_timer snd intel_uncore soundcore ipmi_ssif mgag200 drm_kms_helper mxm_wmi joydev
 cec mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm ip_tables xfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel igb ixgbe i2c_algo_bit ghash_clmulni_intel megaraid_sas dca mdio wmi fuse
CPU: 20 PID: 115 Comm: migration/20 Kdump: loaded Tainted: G        W    L    5.13.4-200.fc34.x86_64 #1
Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.12.1 12/04/2020
Stopper: multi_cpu_stop+0x0/0x110 <- migrate_swap+0x8f/0xe0
RIP: 0010:rcu_momentary_dyntick_idle+0xf/0x30
Code: c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 48 c7 c0 40 ba 02 00 65 c6 05 f5 39 ed 44 00 <65> 48 03 05 f9 93 eb 44 ba 04 00 00 00 f0 0f c1 90 20 01 00 00 83
RSP: 0000:ffffc06d8689fe60 EFLAGS: 00000246
RAX: 000000000002ba40 RBX: ffffc06d8c53fb88 RCX: 0000000000000000
RDX: 0000000000000002 RSI: ffffc06d8c53fbe0 RDI: ffffffffbc212020
RBP: ffffc06d8c53fbac R08: ffff9c449fc9d9b0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000014 R12: 0000000000000001
R13: ffffffffbc212020 R14: 0000000000000000 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff9c449fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000c0010ab558 CR3: 000000012ff24006 CR4: 00000000001706e0
Call Trace:
 multi_cpu_stop+0xb9/0x110
 ? stop_machine_yield+0x10/0x10
 cpu_stopper_thread+0xd1/0x140
 smpboot_thread_fn+0xc5/0x160
 ? smpboot_register_percpu_thread+0xf0/0xf0
 kthread+0x127/0x150
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
watchdog: BUG: soft lockup - CPU#26 stuck for 22s! [migration/26:145]
Modules linked in: ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net veth ipip tunnel4 ip_tunnel bpf_preload wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 curve25519_x86_64 libcurve25519_generic libchacha libblake2s_generic ip6_udp_tunnel udp_tunnel xt_set ip_set_hash_ipportnet ip_set_bitmap_port ip_set_hash_ipport ip_set_hash_ipportip ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr iptable_mangle xt_comment xt_mark nf_tables xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat br_netfilter bridge stp llc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay team_mode_activebackup team rfkill iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_msr dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass snd_pcsp rapl snd_pcm intel_cstate snd_timer snd intel_uncore soundcore ipmi_ssif mgag200 drm_kms_helper mxm_wmi joydev
 cec mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm ip_tables xfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel igb ixgbe i2c_algo_bit ghash_clmulni_intel megaraid_sas dca mdio wmi fuse
CPU: 26 PID: 145 Comm: migration/26 Kdump: loaded Tainted: G        W    L    5.13.4-200.fc34.x86_64 #1
Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.12.1 12/04/2020
Stopper: multi_cpu_stop+0x0/0x110 <- migrate_swap+0x8f/0xe0
RIP: 0010:stop_machine_yield+0x2/0x10
Code: 94 fc ff ff 84 c0 74 19 e8 db ee a6 00 48 8d 7c 24 10 e8 21 f7 a6 00 8b 44 24 0c 4c 8b 65 f8 c9 c3 b8 fe ff ff ff eb f3 f3 90 <c3> 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 41 57 41
RSP: 0000:ffffc06d869a7e60 EFLAGS: 00000202
RAX: ffff9c449fd6ba40 RBX: ffffc06d8c2f7b88 RCX: 0000000000000000
RDX: 0000000000000002 RSI: ffffc06d8c2f7be0 RDI: ffffffffbc212820
RBP: ffffc06d8c2f7bac R08: ffff9c449fd5d9b0 R09: 0000000000000000
R10: 0000000000000000 R11: 000000000000001a R12: 0000000000000001
R13: ffffffffbc212820 R14: 0000000000000000 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff9c449fd40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000c0023e5000 CR3: 000000012ff24003 CR4: 00000000001706e0
Call Trace:
 multi_cpu_stop+0x9b/0x110
 ? stop_machine_yield+0x10/0x10
 cpu_stopper_thread+0xd1/0x140
 smpboot_thread_fn+0xc5/0x160
 ? smpboot_register_percpu_thread+0xf0/0xf0
 kthread+0x127/0x150
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [calico:159305]

The last working testing-devel was 34.20210720.20.0; starting from 34.20210720.20.1 it is broken for us.

The diff

kernel 5.12.15-300.fc34.x86_64 → 5.13.4-200.fc34.x86_64
kernel-core 5.12.15-300.fc34.x86_64 → 5.13.4-200.fc34.x86_64
kernel-modules 5.12.15-300.fc34.x86_64 → 5.13.4-200.fc34.x86_64

We are provisioning with

kernel_arguments:
  should_exist:
    - mitigations=auto
    - crashkernel=512M
  should_not_exist:
    - mitigations=auto,nosmt
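
(As a side note for anyone comparing setups: a minimal way to confirm what actually landed on a provisioned node, assuming shell access. These are generic commands, not part of the Typhoon config above.)

# show the live kernel command line and the deployment's kernel arguments
cat /proc/cmdline
sudo rpm-ostree kargs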

We also provisioned different Kubernetes versions (1.21.0, 1.22.0 and 1.22.1), which all showed the same behavior.

Reproduction steps
Steps to reproduce the behavior:

  1. Provision Typhoon on bare metal
  2. Run moderate load (we saw it even without any load, it just took a bit longer)

Expected behavior
No panics, no deadlocks.

Actual behavior
Kernel panics resulting in deadlock.

System details

  • Bare Metal, Dell Inc. PowerEdge R630
@dustymabe
Member

Hey @wkruse. Thanks for the detailed report.

My experience with kernel bugs like this is that (thankfully) there is usually a fix already landed or in the works upstream. Just to cover all bases here, do you mind trying with:

  • latest testing-devel: 34.20210825.20.0 has kernel-5.13.12-200.fc34.x86_64
  • latest rawhide: 36.20210827.91.0 has kernel-5.14.0-0.rc6.20210820gitd992fe5318d8.50.fc36.x86_64

You can find the links in the unofficial builds browser: https://builds.coreos.fedoraproject.org/browser

@dustymabe
Member

hey @wkruse - mind trying out the latest testing-devel and rawhide - they have even newer kernels than what I mentioned above.

@wkruse
Author

wkruse commented Sep 3, 2021

We tested the following versions:

  • 34.20210825.20.0 (testing-devel)
  • 36.20210829.91.0 (rawhide)
    Explicit deadlocks in this version were logged
    ======================================================
    WARNING: possible circular locking dependency detected
    5.14.0-0.rc6.20210820gitd992fe5318d8.50.fc36.x86_64 #1 Not tainted
    ------------------------------------------------------
    swapper/1/0 is trying to acquire lock:
    ffffffffa2f671d8 ((console_sem).lock){-...}-{2:2}, at: down_trylock+0xf/0x30
    
    but task is already holding lock:
    ffff89251dfefa18 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x1e/0x80
    
    which lock already depends on the new lock.
    
    
    the existing dependency chain (in reverse order) is:
    
    -> #2 (&rq->__lock){-.-.}-{2:2}:
          _raw_spin_lock_nested+0x2f/0x80
          raw_spin_rq_lock_nested+0x1e/0x80
          task_fork_fair+0x32/0x1b0
          sched_fork+0x115/0x280
          copy_process+0x87a/0x1fb0
          kernel_clone+0x8b/0x3d0
          kernel_thread+0x47/0x50
          rest_init+0x1e/0x280
          start_kernel+0x9b4/0x9c4
          secondary_startup_64_no_verify+0xc2/0xcb
    
    -> #1 (&p->pi_lock){-.-.}-{2:2}:
          _raw_spin_lock_irqsave+0x4d/0x90
          try_to_wake_up+0x43/0x910
          up+0x40/0x60
          __up_console_sem+0x3b/0x70
          console_unlock+0x329/0x5e0
          fbmem_init+0xcb/0xe1
          do_one_initcall+0x67/0x320
          kernel_init_freeable+0x284/0x2d0
          kernel_init+0x16/0x120
          ret_from_fork+0x22/0x30
    
    -> #0 ((console_sem).lock){-...}-{2:2}:
          __lock_acquire+0x11fe/0x1e00
          lock_acquire+0xc4/0x2e0
          _raw_spin_lock_irqsave+0x4d/0x90
          down_trylock+0xf/0x30
          __down_trylock_console_sem+0x32/0xa0
          vprintk_emit+0x16b/0x3a0
          printk+0x48/0x4a
          __warn_printk+0x37/0x64
          enqueue_task_fair+0x318/0x7a0
          enqueue_task+0x4a/0x140
          ttwu_do_activate+0x73/0xf0
          sched_ttwu_pending+0x100/0x1d0
          flush_smp_call_function_from_idle+0x59/0x90
          do_idle+0x188/0x2b0
          cpu_startup_entry+0x19/0x20
          secondary_startup_64_no_verify+0xc2/0xcb
    
    other info that might help us debug this:
    
    Chain exists of:
     (console_sem).lock --> &p->pi_lock --> &rq->__lock
    
    Possible unsafe locking scenario:
    
          CPU0                    CPU1
          ----                    ----
     lock(&rq->__lock);
                                  lock(&p->pi_lock);
                                  lock(&rq->__lock);
     lock((console_sem).lock);
    
    *** DEADLOCK ***
    
    1 lock held by swapper/1/0:
    #0: ffff89251dfefa18 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x1e/0x80
    
    stack backtrace:
    CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Not tainted 5.14.0-0.rc6.20210820gitd992fe5318d8.50.fc36.x86_64 #1
    Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.12.1 12/04/2020
    Call Trace:
    dump_stack_lvl+0x57/0x72
    check_noncircular+0xdf/0x100
    ? lock_is_held_type+0xa7/0x120
    __lock_acquire+0x11fe/0x1e00
    lock_acquire+0xc4/0x2e0
    ? down_trylock+0xf/0x30
    ? vprintk_store+0x3ba/0x430
    ? printk+0x48/0x4a
    _raw_spin_lock_irqsave+0x4d/0x90
    ? down_trylock+0xf/0x30
    down_trylock+0xf/0x30
    ? printk+0x48/0x4a
    __down_trylock_console_sem+0x32/0xa0
    vprintk_emit+0x16b/0x3a0
    printk+0x48/0x4a
    ? update_load_avg+0x674/0x7d0
    __warn_printk+0x37/0x64
    ? enqueue_entity+0x16a/0x900
    enqueue_task_fair+0x318/0x7a0
    enqueue_task+0x4a/0x140
    ttwu_do_activate+0x73/0xf0
    sched_ttwu_pending+0x100/0x1d0
    flush_smp_call_function_from_idle+0x59/0x90
    do_idle+0x188/0x2b0
    cpu_startup_entry+0x19/0x20
    secondary_startup_64_no_verify+0xc2/0xcb
    
  • 36.20210901.91.0 (rawhide)
    We upgraded the BIOS to 2.13.0 before testing this version

We could reproduce system freezes running moderate load in all of the versions above. It looks like starting with kernel 5.13 it broke for us.

We couldn't reproduce the issue running an FCOS 34.20210808.3.0 cluster in VirtualBox for two days, though.

We will try 36.20210902.91.1 (rawhide) with kernel 5.15 next.

@wkruse
Author

wkruse commented Sep 3, 2021

Also 36.20210902.91.1 (rawhide) produces system freezes on moderate load.

@dustymabe
Member

dustymabe commented Sep 3, 2021

so let me summarize:

  • 34.20210711.3.0 (stable) - kernel-5.12.19-300.fc34.x86_64 - no freeze
  • 34.20210725.3.0 (stable) - kernel-5.13.4-200.fc34.x86_64 - freeze
  • 34.20210808.3.0 (stable) - kernel-5.13.7-200.fc34.x86_64 - freeze
  • 36.20210901.91.0 (rawhide) - kernel-5.14.0-61.fc36.x86_64 - freeze
  • 36.20210902.91.1 (rawhide) - kernel-5.15.0-0.rc0.20210901git9e9fb7655ed5.2.fc36.x86_64 - freeze
  • 34.20210825.20.0 (testing-devel) - kernel-5.13.12-200.fc34.x86_64 - freeze

@dustymabe
Member

@dghubble - are other typhoon users seeing this?

@wkruse
Author

wkruse commented Sep 7, 2021

On Bare Metal, Dell Inc. PowerEdge R630

  • 34.20210808.3.0 (stable) - kernel-5.13.7-200.fc34.x86_64 - freeze

(on VirtualBox no freeze, I was just trying to reproduce it in a virtual environment, but didn't succeed, so it has something to do with the real hardware)

@wkruse
Author

wkruse commented Sep 10, 2021

Maybe related #957.

@wkruse
Author

wkruse commented Sep 10, 2021

We don't have the CPU metrics, but if we compare the power usage in watts

  • machine was running without issues

[power usage graph]

  • machine was in deadlock the whole night

[power usage graph]

we were at 100% also.

@jlebon
Member

jlebon commented Sep 10, 2021

From the traces, this looks like it could be console printing-related. Wonder if e.g. dropping the serial console karg might help as a test?

Anyway I think it's probably better at this point to track this in RHBZ where kernel SMEs can take a look. I filed https://bugzilla.redhat.com/show_bug.cgi?id=2003168. Feel free to add details there. In particular, there is one question I wasn't entirely sure about:

  1. Are you running any modules that are not shipped directly with Fedora's kernel?

Can anyone confirm this either way in the RHBZ?

@sjenning

sjenning commented Sep 13, 2021

This seems to be a scheduler regression in the 5.13 kernel. Unfortunately, the SCHED_WARN_ON that would tell us what is happening causes a circular lock dep in the synchronous printk.

The stack causing the issue is in enqueue_task_fair:

 console_unlock+0x383/0x520
 vprintk_emit+0x152/0x260
 printk+0x48/0x4a
 __warn_printk+0x37/0x64
 ? enqueue_entity+0x18c/0x7b0
 enqueue_task_fair+0x26f/0x6a0
 ? psi_task_change+0x9b/0xe0
 ttwu_do_activate+0x75/0x180

Looking at the listing file for fair.c and calculating the offset into the function (0x8ac0 + 0x26f = 0x8d2f):

0000000000008ac0 <enqueue_task_fair>:
...
        SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
    8cfb:       49 8d 85 50 09 00 00    lea    0x950(%r13),%rax
    8d02:       49 39 85 60 09 00 00    cmp    %rax,0x960(%r13)
    8d09:       0f 84 f8 fe ff ff       je     8c07 <enqueue_task_fair+0x147>
    8d0f:       80 3d 00 00 00 00 00    cmpb   $0x0,0x0(%rip)        # 8d16 <enqueue_task_fair+0x256>
                        8d11: R_X86_64_PC32     .data.once+0x4
    8d16:       0f 85 eb fe ff ff       jne    8c07 <enqueue_task_fair+0x147>
    8d1c:       48 c7 c7 00 00 00 00    mov    $0x0,%rdi
                        8d1f: R_X86_64_32S      .rodata.str1.8+0x190
    8d23:       c6 05 00 00 00 00 01    movb   $0x1,0x0(%rip)        # 8d2a <enqueue_task_fair+0x26a>
                        8d25: R_X86_64_PC32     .data.once+0x4
    8d2a:       e8 00 00 00 00          callq  8d2f <enqueue_task_fair+0x26f>
                        8d2b: R_X86_64_PLT32    __warn_printk-0x4
    --> 8d2f:       0f 0b                   ud2

That is the inlined call to assert_list_leaf_cfs_rq, indicating a bug in this code. Looking through commits that went into 5.13, I found torvalds/linux@fdaba61ef8a2 which might have some error in it.

There is a secondary bug in that we should not call SCHED_WARN_ON with the rq->lock held, as (console_sem).lock eventually needs that lock, resulting in a circular lock dependency.
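
(A rough sketch of how to repeat this mapping without hand-computing offsets from a listing file, assuming the matching kernel-debuginfo vmlinux and a kernel source checkout are available; the paths below are illustrative.)

# resolve the symbol+offset from the trace to a source line
./scripts/faddr2line /usr/lib/debug/lib/modules/5.13.4-200.fc34.x86_64/vmlinux enqueue_task_fair+0x26f/0x6a0

# the same lookup with gdb
gdb -batch -ex 'info line *(enqueue_task_fair+0x26f)' /usr/lib/debug/lib/modules/5.13.4-200.fc34.x86_64/vmlinux

# list the scheduler commits that landed between v5.12 and v5.13
git log --oneline v5.12..v5.13 -- kernel/sched/fair.c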

@dustymabe
Member

Just started a scratch build with fdaba61 reverted. If it builds I'll create some media and we'll see if the problem goes away for @wkruse.

@dustymabe
Member

dustymabe commented Sep 14, 2021

@wkruse can you try this dev build:

Note, you'll have to disable secureboot if you have that enabled.

@dustymabe
Member

Alternatively you can just replace the kernel on an existing node:

sudo systemctl stop zincati
sudo rpm-ostree override replace https://kojipkgs.fedoraproject.org//work/tasks/2324/75662324/kernel{,-core,-modules}-5.13.16-200.fc34.dusty.x86_64.rpm --reboot
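
(For anyone else trying this: a hedged sketch of how to verify the override and later undo it, using standard rpm-ostree commands rather than anything specific to this scratch build.)

# after the reboot, confirm which kernel is actually deployed
rpm-ostree status
uname -r

# once testing is done, drop the override and re-enable automatic updates
sudo rpm-ostree override reset kernel kernel-core kernel-modules --reboot
sudo systemctl start zincati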

@wkruse
Author

wkruse commented Sep 15, 2021

Running moderate load on FCOS 34.20210821.3.0 with the replaced kernel 5.13.16-200.fc34.dusty.x86_64 (via the rpm-ostree command above) also results in a deadlock, but without the circular locking dependency logging shown above. I've attached the kernel log to https://bugzilla.redhat.com/show_bug.cgi?id=2003168

@wkruse
Author

wkruse commented Sep 17, 2021

We added Dell Inc. PowerEdge R640 servers to the test cluster, just to make sure that the issue is not specific to the R630.

kernel: DMI: Dell Inc. PowerEdge R640/0RGP26, BIOS 2.10.2 02/24/2021

Same as above: a deadlock without the circular locking dependency logging. The kernel log is similar to the previous one.

@dustymabe
Member

@wkruse - this code was touched again recently in the merge commit 5d3c0db (part of v5.15-rc1 and later). It's a shot in the dark, but it would be worth seeing if it has any positive effect here.

Want to try with 36.20210921.91.0 (kernel-5.15.0-0.rc1.12.fc36.x86_64)? Also come visit us in #fedora-coreos on IRC (libera.chat) and maybe we can iterate a bit faster on this.

@wkruse
Author

wkruse commented Sep 22, 2021

@dustymabe We tried 36.20210921.91.0 and are still getting deadlocks. I'll jump on IRC on Friday.

@dghubble
Member

@dustymabe I've also been seeing this on SuperMicro boxes and have pinned clusters to the last FCOS stable with 5.12 for now. Unfortunately I don't have much capacity to look into this for a bit 🤕

@dustymabe
Member

If we think this isn't limited to specific hardware (i.e. affects lots of bare metal) and we can get a small reproducer (ideally single machine) I can throw the reproducer at some hardware I've got here at home.

@wkruse
Author

wkruse commented Oct 1, 2021

We were able to reproduce it on a single node (a single-node K8s cluster with one controller node provisioned with Typhoon) running a warm-up of our regular test. But we don't have a synthetic reproducer yet. It seems to be related to stuff running in Kubernetes. Running busy loops directly on the node or in a container on the node didn't result in a deadlock. It also seems to happen more often on freshly provisioned nodes. On our single test node, the deadlock appeared after 2 warm-ups, then again after 3 warm-ups; after that we were able to run 15 warm-ups in a row without a deadlock.

@wkruse
Author

wkruse commented Oct 1, 2021

Maybe also related #965.

@wkruse
Author

wkruse commented Oct 1, 2021

Is there a way in FCOS to collect more information to debug kernel crashes?

@dustymabe
Member

Hey @wkruse - I don't have a lot of personal experience using it, but docs for how to get kdump set up on FCOS are here.
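
(A rough sketch of a typical kdump setup on an FCOS node, not taken verbatim from the linked docs; the crashkernel size is only an example and needs tuning, see below.)

# reserve memory for the crash kernel and enable the kdump service
sudo rpm-ostree kargs --append=crashkernel=300M
sudo systemctl enable kdump.service
sudo systemctl reboot

# after the reboot, check that kdump loaded the crash kernel
systemctl status kdump.service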

@dustymabe
Member

Maybe also related #965.

That one is specific to running aarch64 on OpenStack. You aren't running aarch64 are you?

@wkruse
Author

wkruse commented Oct 1, 2021

Nope, we are running x86_64.

@wkruse
Author

wkruse commented Oct 8, 2021

Running our test on the latest stable 34.20210919.3.0 with crashkernel=auto, we hit the deadlock and left the node in that state for a couple of hours. After the reboot, /var/crash was empty.

@dustymabe
Member

Hmm. Anything in the logs at all that would indicate issues with kdump running?

I do see our docs say:

crashkernel=auto is likely insufficient memory on Fedora CoreOS. It is recommended to start testing with 300M

so maybe start with crashkernel=300M or higher?
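
(Some generic checks that may help figure out why /var/crash stays empty; nothing here is FCOS-specific, and the 512M value is just an example.)

# did the running kernel actually reserve crash memory?
dmesg | grep -i crashkernel

# did the kdump service start cleanly on this boot?
systemctl status kdump.service
journalctl -b -u kdump.service

# bump the reservation if it looks too small (takes effect after a reboot)
sudo rpm-ostree kargs --replace=crashkernel=512M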

@baryluk

baryluk commented Oct 18, 2021

The stack traces posted here look very similar to what I posted in the other bug: #957 (comment) (the attached archive has logs from a few machines). I did not see anything strictly related to CFS, but who knows.

@wkruse
Author

wkruse commented Oct 22, 2021

We were also running crashkernel=512M for a while, still no crash logs.

We would be happy to run bisected kernels to help pinpoint the kernel commit that broke it for us.
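
(In case it helps with that offer, a minimal sketch of the bisect run this would feed into, assuming a mainline kernel tree and the ability to build and boot test kernels; the good/bad tags follow the summary earlier in this issue.)

git bisect start
git bisect bad v5.13      # 5.13.x kernels freeze
git bisect good v5.12     # 5.12.x kernels are fine
# build and boot the suggested commit, run the workload, then mark it:
#   git bisect good    (no freeze)
#   git bisect bad     (freeze)
# repeat until git reports the first bad commit, then clean up:
git bisect reset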

@dustymabe
Member

@wkruse - any chance we ever got down to a small reproducer on a single machine? With that power we could pretty quickly bisect and find the culprit.

@wkruse
Author

wkruse commented Nov 1, 2021

@dustymabe We can reproduce it quickly and reliably with our tests, but we weren't able to create a synthetic reproducer. For us it looks like running Kubernetes with some load triggers the issue. Looking at #957, that seems to match what the OKD folks observe.

@aneagoe

aneagoe commented Nov 1, 2021

@wkruse can you clarify how you managed to reproduce this? On one node running under KVM I've tried doing CPU load testing but was not able to reproduce it. I almost thought this doesn't happen on VMs, but another user reported seeing it on VMs as well in #957.

@wkruse
Author

wkruse commented Nov 1, 2021

@aneagoe We have a custom application running in Kubernetes and a test environment to test it. We don't have a synthetic reproducer. We were also running basic CPU load tests (#940 (comment)) and weren't able to reproduce the issue.

@baryluk
Copy link

baryluk commented Nov 3, 2021

@wkruse Could you share some details about the app that triggers it? Was it the only workload on the node? Any details about the workload, e.g. mostly I/O, mostly network, compiled C++/Go code, Java, Python, etc.? Is it highly multithreaded?

@wkruse
Copy link
Author

wkruse commented Nov 5, 2021

@baryluk It is a distributed, highly multithreaded Java 16/17 app that makes heavy use of multiple Redis instances for queueing and storage. Relatively low network bandwidth, but very latency-sensitive.

@wkruse
Copy link
Author

wkruse commented Nov 5, 2021

Maybe one additional hint: starting with F34 we had another problem with two of our services, which showed huge lags in responses right after deployment. The root cause was an old Java 11 base image combined with cgroups v2 (introduced by F34), which was not fully supported in that version of Java. The fix was to upgrade to the latest Java 11 base image.
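For anyone hitting the same symptom, a quick way to sanity-check the node and the JVM (just a sketch; the exact flags depend on your base image):

# On the node: prints "cgroup2fs" when the host is on cgroups v2, "tmpfs" on v1.
stat -fc %T /sys/fs/cgroup

# Inside the container: check which JDK build the base image ships and its container
# support flag; old Java 11 builds don't understand cgroups v2 limits.
java -version
java -XX:+PrintFlagsFinal -version | grep -i UseContainerSupport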

@dustymabe changed the title from "F34: Kernel panics resulting in deadlock" to "Kernel 5.13+ panics resulting in deadlock" on Nov 11, 2021
@wkruse
Copy link
Author

wkruse commented Nov 12, 2021

We cannot reproduce it anymore. The last broken testing-devel release is 34.20211011.20.1 (kernel 5.14.9); starting from 34.20211012.20.0 (kernel 5.14.10) it seems to work. The last two stable releases, 34.20211016.3.0 and 34.20211031.3.0, also seem to be fixed. At least we weren't able to reproduce it within a week.

CC: @baryluk, @aneagoe, @jlebon

@jlebon
Copy link
Member

jlebon commented Nov 15, 2021

@wkruse That's great news! Maybe let's keep it open for a few more days to be sure and then we can close this (and the filed RHBZ).

@dustymabe
Copy link
Member

Thanks all for collaborating and helping us to find when this issue was fixed. I wish we could narrow it down to a particular kernel commit that fixed the problem, but the fact that it's fixed in 34.20211016.3.0 and later should suffice.

Note, though, that the F35 rebase is landing in the next stable release, so running your workloads against the current testing stream will let you know if there are any problems on the horizon.

@graysky2
Copy link

but the fact that it's fixed in 34.20211016.3.0 and later should suffice.

What kernel version is included in that release?

@lucab
Copy link
Contributor

lucab commented Dec 10, 2021

@graysky2 it's Fedora's kernel-5.14.14-200.fc34.

@vinisman
Copy link

vinisman commented Mar 1, 2022

We had the same problem on OKD 4.8, but after updating to OKD 4.9.0-0.okd-2022-02-12-140851 (kernel 5.14.14-200.fc34.x86_64) the worker nodes now sometimes reboot, but no longer hang.

@alibo
Copy link

alibo commented Mar 20, 2022

@vinisman

now worker nodes sometime reboots but no hangs

We have a similar issue: after upgrading the kernel to 5.14.14 we're facing sudden reboots/crashes without any pattern. My hypothesis is that worker nodes with more crash-looping pods hit this issue more often, but I couldn't reproduce it.
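In case it helps anyone correlate this, a rough way to count crash-looping pods on a given node (NODE_NAME is a placeholder):

# Count pods in CrashLoopBackOff that are scheduled on NODE_NAME:
kubectl get pods -A --field-selector spec.nodeName=NODE_NAME --no-headers \
  | grep -c CrashLoopBackOff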

Kernel Panic/Crash logs:

[50756.150736] general protection fault, probably for non-canonical address 0x8d67162b31f67a05: 0000 [#1] SMP NOPTI
[50756.155871] CPU: 20 PID: 218835 Comm: erl_child_setup Not tainted 5.14.14-200.fc34.x86_64 #1
[50756.158746] Hardware name: RDO OpenStack Compute, BIOS 1.11.0-2.el7 04/01/2014
[50756.161210] RIP: 0010:kmem_cache_alloc_node_trace+0xf7/0x2d0
[50756.163345] Code: 00 48 85 c9 0f 84 88 01 00 00 41 83 fd ff 74 10 48 8b 09 48 c1 e9 36 41 39 cd 0f 85 72 01 00 00 8b 4d 28 48 8b 7d 00 48 01 c1 <48> 8b 19 48 89 ce 48 33 9d b8 00 00 00 48 0f ce 48 31 f3 40 f6 c7
[50756.169337] RSP: 0018:ffffa0b28338be50 EFLAGS: 00010286
[50756.171514] RAX: 8d67162b31f67905 RBX: 0000000000000dc0 RCX: 8d67162b31f67a05
[50756.174882] RDX: 00000000004ad1b6 RSI: 0000000000000dc0 RDI: 0000000000030140
[50756.177510] RBP: ffff91ea00042a00 R08: ffff91f8ffd30140 R09: 0000000000000000
[50756.179927] R10: 0000000000000013 R11: 0000000000000000 R12: 0000000000000dc0
[50756.182497] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffff8212b118
[50756.185046] FS:  00007fa10a088500(0000) GS:ffff91f8ffd00000(0000) knlGS:0000000000000000
[50756.187819] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50756.189807] CR2: 00005609c58c0058 CR3: 0000000b10c6c002 CR4: 00000000007706e0
[50756.192554] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[50756.195130] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[50756.198259] PKRU: 55555554
[50756.200022] Call Trace:
[50756.201457]  alloc_fair_sched_group+0x138/0x210
[50756.203214]  sched_create_group+0x2f/0x90
[50756.204846]  sched_autogroup_create_attach+0x3b/0x170
[50756.206663]  ksys_setsid+0xe8/0x100
[50756.208040]  __do_sys_setsid+0xa/0x10
[50756.209453]  do_syscall_64+0x38/0x90
[50756.210874]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[50756.212728] RIP: 0033:0x7fa109f8d797
[50756.214182] Code: 73 01 c3 48 8b 0d f9 36 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 70 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 36 0f 00 f7 d8 64 89 01 48
[50756.220460] RSP: 002b:00007ffe0cace088 EFLAGS: 00000206 ORIG_RAX: 0000000000000070
[50756.223622] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fa109f8d797
[50756.226347] RDX: 0000000000000020 RSI: 0000000000000002 RDI: 0000000000000008
[50756.228894] RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000077
[50756.231194] R10: fffffffffffff81a R11: 0000000000000206 R12: 00007fa10a088480
[50756.233457] R13: 00005609c70993c4 R14: 00005609c709ef30 R15: 0000000000000000
[50756.235830] Modules linked in: cls_bpf sch_ingress xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 xt_CT veth ip_set_hash_ip ip_set nfnetlink xt_socket nf_socket_ipv4 nf_socket_ipv6 ip6table_raw iptable_raw iptable_mangle ip6table_mangle ip6table_filter xt_MASQUERADE xt_conntrack xt_comment iptable_filter xt_mark ip6table_nat ip6_tables iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill isofs overlay intel_rapl_msr intel_rapl_common isst_if_common nfit libnvdimm cirrus drm_kms_helper cec i2c_piix4 virtio_balloon joydev drm ip_tables xfs crct10dif_pclmul crc32_pclmul crc32c_intel virtio_net ghash_clmulni_intel net_failover ata_generic virtio_console virtio_scsi serio_raw failover pata_acpi qemu_fw_cfg dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua fuse
[50756.259270] ---[ end trace 0ba844b14ea0e37e ]---
[50756.261101] RIP: 0010:kmem_cache_alloc_node_trace+0xf7/0x2d0
[50756.263203] Code: 00 48 85 c9 0f 84 88 01 00 00 41 83 fd ff 74 10 48 8b 09 48 c1 e9 36 41 39 cd 0f 85 72 01 00 00 8b 4d 28 48 8b 7d 00 48 01 c1 <48> 8b 19 48 89 ce 48 33 9d b8 00 00 00 48 0f ce 48 31 f3 40 f6 c7
[50756.269343] RSP: 0018:ffffa0b28338be50 EFLAGS: 00010286
[50756.271230] RAX: 8d67162b31f67905 RBX: 0000000000000dc0 RCX: 8d67162b31f67a05
[50756.273715] RDX: 00000000004ad1b6 RSI: 0000000000000dc0 RDI: 0000000000030140
[50756.276185] RBP: ffff91ea00042a00 R08: ffff91f8ffd30140 R09: 0000000000000000
[50756.278623] R10: 0000000000000013 R11: 0000000000000000 R12: 0000000000000dc0
[50756.281080] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffff8212b118
[50756.283470] FS:  00007fa10a088500(0000) GS:ffff91f8ffd00000(0000) knlGS:0000000000000000
[50756.286069] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50756.288032] CR2: 00005609c58c0058 CR3: 0000000b10c6c002 CR4: 00000000007706e0
[50756.290371] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[50756.293097] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[50756.295358] PKRU: 55555554
[50756.296724] Kernel panic - not syncing: Fatal exception
[50756.299485] Kernel Offset: 0x1000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[50756.302702] Rebooting in 10 seconds..
(After the reboot, the console shows the GRUB menu:)

Use the ^ and v keys to change the selection.
Press 'e' to edit the selected item, or 'c' for a command prompt.
  Fedora CoreOS 34 48.34.202111121953-0 (ostree:0)
  Fedora CoreOS 33.20210117.3.2 (ostree:1)

Kernel version:

Linux okd4-worker-worker-4 5.14.14-200.fc34.x86_64 #1 SMP Wed Oct 20 16:15:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

RPM OSTREE / Fedora Version:

● pivot://quay.io/openshift/okd-content@sha256:934603caf00c11f753c73f71ecebfaf5dc360f40a81c7b641b2694b34506e5d0
              CustomOrigin: Managed by machine-config-operator
                   Version: 48.34.202111121953-0 (2021-11-12T19:57:00Z)

  fedora:fedora/x86_64/coreos/stable
                   Version: 33.20210117.3.2 (2021-02-03T18:13:41Z)
                    Commit: 20de1953c18bd432a8ed4e19b91c64978100dba7d1c4813f91f8cf4d4d2411b4
              GPGSignature: Valid signature by 963A2BEB02009608FE67EA4249FD77499570FF31

OKD Version:

Server Version: 4.8.0-0.okd-2021-11-14-052418

@markusdd
Copy link

markusdd commented Jul 7, 2022

Hi all, we also run an OKD 4.10 cluster and this problem is hitting us hard right now, to the point where this morning 70% of our cluster (7 workers, 3 masters, and 1 special-purpose node in a VM) was suddenly in the 'NotReady' state.
In fact, the servers get so stuck that only a reset or power cycle helps. In the iLO console you sometimes still see the kernel prints, but no keyboard input works.

These are all HP ProLiant DL360 machines of G7/G8/G9 vintage, so a variety of different CPU generations.

We have never observed this on the one node that is a VM, which runs on an oVirt cluster that in turn also runs on HP ProLiant machines. But that node only hosts one pod (our GitLab runner); all the other nodes essentially act as our CI cluster, so they experience a huge variety of loads from software/firmware builds, simulations, Python linting, etc.

We managed to improve the situation by turning hyper-threading off in the BIOS, but now even that no longer seems to help.
In the previous CentOS 7-based OKD 3 cluster none of this ever happened, and many nodes were migrated recently, so all of them having hardware issues is more than unlikely.

So even in the newest kernels there must be a fundamental issue, and this is turning into a very high priority problem for us: basically every day we now have multiple boot failures that we need to attend to manually by restarting the nodes, and sometimes we even have to repair the file system or delete the CI workspaces because they crashed in the middle of whatever operation was running.

On the last node I checked today I also get the dreaded smp_call_function_single / multi_cpu_stop messages.

Any advice on what this could be and how to work around or fix it? This is a huge problem.
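(Side note in case it helps anyone else experiment: SMT can apparently also be toggled at runtime via sysfs instead of a BIOS round-trip; untested on our side, and it does not persist across reboots.)

# Check whether SMT is currently active (1 = on, 0 = off):
cat /sys/devices/system/cpu/smt/active
# Disable SMT until the next reboot:
echo off | sudo tee /sys/devices/system/cpu/smt/control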

@aneagoe
Copy link

aneagoe commented Jul 8, 2022

@markusdd I had the same on version 4.10.0-0.okd-2022-05-07-021833. There was something off with that kernel... you can either downgrade or upgrade the kernel to something else. You can try the following and test:

rpm-ostree override replace https://kojipkgs.fedoraproject.org/packages/kernel/5.17.9/200.fc35/x86_64/kernel{,-core,-modules}-5.17.9-200.fc35.x86_64.rpm

The affected kernel version in my case was 5.16.18-200.fc35. You can test with some other versions as well; one that was rock solid for me was 5.14.14. If you want that one, you can do this instead:

rpm-ostree override replace https://kojipkgs.fedoraproject.org/packages/kernel/5.14.14/200.fc34/x86_64/kernel{,-core,-modules}-5.14.14-200.fc34.x86_64.rpm
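And once you're done testing, going back to the kernel shipped with the current deployment should just be a matter of resetting the override:

# Drop the pinned kernel packages and return to the base image's kernel:
sudo rpm-ostree override reset kernel kernel-core kernel-modules
sudo systemctl reboot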

@markusdd
Copy link

markusdd commented Jul 8, 2022

As documented here (#1249 (comment)), the latest OKD4 update lifts the kernel to 5.18.5.

We will observe how this behaves. If problems persist, we might think about trying the older one.

@lucab
Copy link
Contributor

lucab commented Jul 8, 2022

I'm going to lock this ticket as it is becoming an attractor for reports of misbehavior on old kernels.
At this point in time all FCOS streams are based on F36 and ship 5.18+ kernels.
If you encounter issues on kernels >= 5.18, please open a dedicated ticket.
If you encounter issues with older kernels in use by OKD, please report them to https://github.com/openshift/okd/issues.

@coreos coreos locked as resolved and limited conversation to collaborators Jul 8, 2022