Node reports kernel: NMI watchdog: BUG: soft lockup - CPU#34 stuck for 22s! [runc:[1:CHILD]:185318] #300

Closed
Hwting opened this issue Aug 26, 2018 · 1 comment


Hwting commented Aug 26, 2018

Has anyone run into the same error? The cluster is deployed with one master and two nodes. Every couple of days it reports BUG: soft lockup; the load climbs steadily until the machine finally hangs. The only pods running in the cluster are the ones in kube-system.

Aug 25 21:55:29 localhost kernel: NMI watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [migration/19:104]
Aug 25 21:55:45 localhost kernel: NMI watchdog: BUG: soft lockup - CPU#34 stuck for 22s! [runc:[1:CHILD]:185318]
Aug 25 21:55:57 localhost kernel: NMI watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [migration/19:104]
Aug 25 21:56:13 localhost kernel: NMI watchdog: BUG: soft lockup - CPU#34 stuck for 22s! [runc:[1:CHILD]:185318]
Aug 25 21:56:25 localhost kernel: NMI watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [migration/19:104]
Aug 25 21:56:53 localhost kernel: NMI watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [migration/19:104]
Aug 25 21:56:53 localhost kernel: NMI watchdog: BUG: soft lockup - CPU#34 stuck for 23s! [runc:[1:CHILD]:185318]
Aug 25 21:57:21 localhost kernel: NMI watchdog: BUG: soft lockup - CPU#19 stuck for 23s! [migration/19:104]
Aug 25 21:57:21 localhost kernel: NMI watchdog: BUG: soft lockup - CPU#34 stuck for 23s! [runc:[1:CHILD]:185318]
Aug 25 21:57:49 localhost kernel: NMI watchdog: BUG: soft lockup - CPU#19 stuck for 23s! [migration/19:104]
Aug 25 21:57:49 localhost kernel: NMI watchdog: BUG: soft lockup - CPU#34 stuck for 23s! [runc:[1:CHILD]:185318]
-----------------------------------
Aug 23 02:46:45 localhost kubelet: I0823 02:46:45.527160   33275 kubelet.go:1777] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 1h54m19.873028409s ago; threshold is 3m0s]
Aug 23 02:46:45 localhost kernel: NMI watchdog: BUG: soft lockup - CPU#62 stuck for 23s! [migration/62:319]
Aug 23 02:46:45 localhost kernel: Modules linked in: xt_statistic xt_ipvs xt_multiport xt_set iptable_mangle iptable_raw ip_set_hash_ip ip_set_hash_net ipip tunnel4 ip_tunnel veth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables xt_physdev xt_nat ipt_REJECT nf_reject_ipv4 xt_mark xt_comment ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay(T) iptable_filter sunrpc ext4 sb_edac edac_core intel_powerclamp mbcache coretemp jbd2 intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper
Aug 23 02:46:45 localhost kernel: ablk_helper iTCO_wdt cryptd iTCO_vendor_support mxm_wmi dcdbas ipmi_ssif pcspkr sg mei_me mei lpc_ich shpchp ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter binfmt_misc ip_tables xfs libcrc32c sd_mod sr_mod cdrom crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci libata megaraid_sas tg3 i2c_core ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
Aug 23 02:46:45 localhost kernel: CPU: 62 PID: 319 Comm: migration/62 Tainted: G             L ------------ T 3.10.0-693.el7.x86_64 #1
Aug 23 02:46:45 localhost kernel: Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.5.5 08/16/2017
Aug 23 02:46:45 localhost kernel: task: ffff8801687a0fd0 ti: ffff8801687ac000 task.ti: ffff8801687ac000
Aug 23 02:46:45 localhost kernel: RIP: 0010:[<ffffffff8111680a>]  [<ffffffff8111680a>] multi_cpu_stop+0x4a/0x100
Aug 23 02:46:45 localhost kernel: RSP: 0000:ffff8801687afd98  EFLAGS: 00000246
Aug 23 02:46:45 localhost kernel: RAX: 0000000000000001 RBX: ffff88103edd6cc0 RCX: dead000000000200
Aug 23 02:46:45 localhost kernel: RDX: ffff88103edcffb0 RSI: 0000000000000286 RDI: ffff882031ecbb70
Aug 23 02:46:45 localhost kernel: RBP: ffff8801687afdc0 R08: ffff882031ecbb40 R09: 0000000000000001
Aug 23 02:46:45 localhost kernel: R10: 000000000000b544 R11: 0000000000000001 R12: 0000000000000000
Aug 23 02:46:45 localhost kernel: R13: ffff8801687afdec R14: ffff88017fc1c000 R15: ffff88016803a808
Aug 23 02:46:45 localhost kernel: FS:  0000000000000000(0000) GS:ffff88103edc0000(0000) knlGS:0000000000000000
Aug 23 02:46:45 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 23 02:46:45 localhost kernel: CR2: 000000c42287e010 CR3: 00000000019f2000 CR4: 00000000003407e0
Aug 23 02:46:45 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 23 02:46:45 localhost kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 23 02:46:45 localhost kernel: Stack:
Aug 23 02:46:45 localhost kernel: ffff88103edcffa8 ffff882031ecbb70 ffff88103edcffa0 ffff882031ecbb98
Aug 23 02:46:45 localhost kernel: ffffffff811167c0 ffff8801687afe90 ffffffff81116ab7 ffff8801687affd8
--
Aug 23 02:46:45 localhost kernel: [<ffffffff810b909f>] smpboot_thread_fn+0x12f/0x180
Aug 23 02:46:45 localhost kernel: [<ffffffff810b8f70>] ? lg_double_unlock+0x40/0x40
Aug 23 02:46:45 localhost kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Aug 23 02:46:45 localhost kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Aug 23 02:46:45 localhost kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
Aug 23 02:46:45 localhost kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Aug 23 02:46:45 localhost kernel: Code: 1f 44 00 00 49 89 c5 48 8b 47 18 48 85 c0 0f 84 ab 00 00 00 0f a3 18 19 db 85 db 41 0f 95 c6 45 31 ff 31 c0 0f 1f 44 00 00 f3 90 <41> 8b 5c 24 20 39 c3 74 5d 83 fb 02 74 68 83 fb 03 75 05 45 84 
Aug 23 02:46:46 localhost kubelet: E0823 02:46:46.240617   33275 remote_runtime.go:169] ListPodSandbox with filter nil from runtime service failed: rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Aug 23 02:46:46 localhost kubelet: E0823 02:46:46.240638   33275 kuberuntime_sandbox.go:195] ListPodSandbox failed: rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Aug 23 02:46:46 localhost kubelet: E0823 02:46:46.240646   33275 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Aug 23 02:46:46 localhost kubelet: E0823 02:46:46.649401   33275 kubelet.go:2124] Container runtime not ready: RuntimeReady=false reason:DockerDaemonNotReady message:docker: failed to get docker version: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Aug 23 02:46:47 localhost kubelet: E0823 02:46:47.241155   33275 remote_runtime.go:169] ListPodSandbox with filter nil from runtime service failed: rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Environment:
Memory: 128 GB, CPUs: 64
CentOS 7.4, kernel 3.10.0-693.el7.x86_64
Kubernetes v1.11
Docker version: 18.03.1-ce
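
For context on the message itself: the soft-lockup watchdog reports a CPU as "stuck" once it has failed to schedule for more than about twice kernel.watchdog_thresh seconds (10 s by default), which is why the figures above sit in the 22-23 s range. Assuming a stock CentOS 7 node, the relevant settings can be inspected with:

# Soft-lockup threshold in seconds; the BUG line fires at about 2x this value
sysctl kernel.watchdog_thresh
# Whether a soft lockup should panic the machine (0 = log only, the default)
sysctl kernel.softlockup_panic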

gjmzj (Collaborator) commented Aug 26, 2018

I just did a round of searching on Bing; possible causes of this error:
1. On a physical machine, it may be a hardware problem or an issue with a kernel upgrade.
2. On a virtual machine, it may be resource overcommitment, e.g. of memory or disk.
Consider upgrading the CentOS kernel; the default one is too old. See https://github.com/gjmzj/kubeasz/blob/master/docs/guide/kernel_upgrade.md
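
For reference, a minimal sketch of one common way to do this on CentOS 7, using the ELRepo kernel packages (the release RPM URL below reflects what was current around this time; check elrepo.org for the latest version, and see the linked guide for the project's own steps):

# Import the ELRepo signing key and install the repository definition
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh https://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
# Install the long-term-support kernel (use kernel-ml for the mainline series)
yum --enablerepo=elrepo-kernel install -y kernel-lt
# Make the newly installed kernel the default boot entry, then reboot
grub2-set-default 0
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot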

gjmzj closed this as completed May 22, 2019