
Out of memory (OOM) node failure when using a Windows docker image on Talos Linux #9215

Closed
datapedd opened this issue Aug 22, 2024 · 11 comments

@datapedd

Bug Report

Talos does not correctly ensure that memory is evicted before the node runs out when a Windows docker image is run on Linux.

Description

Running Talos Linux and deploying the Windows docker image (https://github.com/dockur/windows) on Talos Kubernetes (but with privileged: false and no KVM mapping).
Pods are at first all successfully up and running, but after some time all pods and services are down (only the flannel pods survive it). The worker node has state: NotReady.
The kubelet fails with OOM and can't recover; the full worker node is not reachable anymore and can't recover on its own. It has to be restarted and the pod removed.

Logs

support.zip

Environment

  • Talos version: v1.7.6
  • Kubernetes version: v1.30.3
  • Platform: Bare metal
@smira
Member

smira commented Aug 22, 2024

Why is that a Talos issue?

You can configure reservations for system cgroups yourself via the kubelet configuration; anything beyond that is up to your deployment: make sure your pods have proper resource limits set.
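
For context, a minimal sketch of what such per-pod limits look like (the pod name and values here are illustrative, not taken from this report; the memcg OOM killer enforces the memory limit per container):

    apiVersion: v1
    kind: Pod
    metadata:
      name: windows-vm        # hypothetical name
    spec:
      containers:
        - name: windows
          image: dockurr/windows
          resources:
            requests:
              memory: 4Gi
              cpu: "2"
            limits:
              memory: 8Gi     # exceeding this triggers a container-level OOM kill
              cpu: "4"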

@datapedd
Author

datapedd commented Aug 22, 2024

Yes, but this happens even with memory limits set. And I would not expect the whole node to crash along with its services (only the pod should be removed).

Normal Starting 2m39s kube-proxy
Normal NodeHasSufficientMemory 44m (x3 over 65m) kubelet Node talos-default-worker-1 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 44m (x3 over 65m) kubelet Node talos-default-worker-1 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 44m (x3 over 65m) kubelet Node talos-default-worker-1 status is now: NodeHasSufficientPID
Warning ContainerGCFailed 44m kubelet rpc error: code = DeadlineExceeded desc = context deadline exceeded
Normal NodeNotReady 44m kubelet Node talos-default-worker-1 status is now: NodeNotReady
Normal NodeReady 43m (x2 over 65m) kubelet Node talos-default-worker-1 status is now: NodeReady
Warning InvalidDiskCapacity 2m41s kubelet invalid capacity 0 on image filesystem
Normal NodeAllocatableEnforced 2m41s kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 2m41s (x3 over 2m41s) kubelet Node talos-default-worker-1 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 2m41s (x3 over 2m41s) kubelet Node talos-default-worker-1 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 2m41s (x3 over 2m41s) kubelet Node talos-default-worker-1 status is now: NodeHasSufficientPID
Normal NodeNotReady 2m41s kubelet Node talos-default-worker-1 status is now: NodeNotReady
Normal NodeReady 2m41s kubelet Node talos-default-worker-1 status is now: NodeReady
Normal Starting 2m41s kubelet Starting kubelet.
Normal NodeNotReady 25s (x4 over 111m) node-controller Node talos-default-worker-1 status is now: NodeNotReady

@datapedd
Author

datapedd commented Aug 22, 2024

Is evictionHard in the machineconfig still working? I can't find any documentation. I have set memory limits on the pods and the node is still crashing.

kubelet:
    image: ghcr.io/siderolabs/kubelet:v1.27.1
    defaultRuntimeSeccompProfileEnabled: true
    disableManifestsDirectory: true

    extraArgs:
        feature-gates: NewVolumeManagerReconstruction=false

    extraConfig:
        maxPods: 250
        kubeReserved:
            cpu: "1"
            memory: 2Gi
            ephemeral-storage: 1Gi
        evictionHard:
            memory.available: 500Mi

(screenshot attached) Seems like the kubelet is failing.

support_2.zip
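
One generic way to verify that the evictionHard settings actually reached the kubelet (a standard kubelet check, not something from this thread) is to read the node's live configuration via the configz endpoint:

    # Dump the running kubelet configuration for the node and extract evictionHard.
    # Node name taken from the events above; requires API server proxy access and jq.
    kubectl get --raw "/api/v1/nodes/talos-default-worker-1/proxy/configz" \
      | jq .kubeletconfig.evictionHard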

@datapedd
Author

pglazyfreed 0
thp_fault_alloc 2376
thp_collapse_alloc 39
kern: info: [2024-08-22T17:35:35.092500011Z]: Tasks state (memory values in pages):
kern: info: [2024-08-22T17:35:35.092972011Z]: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
kern: info: [2024-08-22T17:35:35.093882011Z]: [ 111399] 0 111399 327797 14980 315392 3597 0 init
kern: info: [2024-08-22T17:35:35.094530011Z]: [ 111502] 0 111502 321409 7285 237568 1464 -999 containerd
kern: info: [2024-08-22T17:35:35.095307011Z]: [ 111741] 0 111741 309537 3636 114688 0 -998 containerd-shim
kern: info: [2024-08-22T17:35:35.096105011Z]: [ 111763] 50 111763 327117 10310 274432 1724 -998 apid
kern: info: [2024-08-22T17:35:35.096807011Z]: [ 111729] 0 111729 323226 12559 282624 1507 -500 containerd
kern: info: [2024-08-22T17:35:35.097565011Z]: [ 112021] 0 112021 309537 2229 118784 1436 -499 containerd-shim
kern: info: [2024-08-22T17:35:35.098517011Z]: [ 112899] 0 112899 309473 2305 118784 1945 -499 containerd-shim
kern: info: [2024-08-22T17:35:35.099420011Z]: [ 113021] 0 113021 309473 3114 122880 1019 -499 containerd-shim
kern: info: [2024-08-22T17:35:35.100297011Z]: [ 115727] 0 115727 309537 2575 118784 1213 -499 containerd-shim
kern: info: [2024-08-22T17:35:35.101230011Z]: [ 116268] 0 116268 309537 2263 114688 1341 -499 containerd-shim
kern: info: [2024-08-22T17:35:35.102081011Z]: [ 116833] 0 116833 309537 2284 114688 1720 -499 containerd-shim
kern: info: [2024-08-22T17:35:35.102916011Z]: [ 117205] 0 117205 309537 3301 122880 115 -499 containerd-shim
kern: info: [2024-08-22T17:35:35.103751011Z]: [ 124201] 0 124201 309601 3355 114688 38 -499 containerd-shim
kern: info: [2024-08-22T17:35:35.104744011Z]: [ 112040] 0 112040 664611 7290 499712 3216 -450 kubelet
kern: info: [2024-08-22T17:35:35.105783011Z]: [ 113044] 65535 113044 249 1 28672 0 -998 pause
kern: info: [2024-08-22T17:35:35.106466011Z]: [ 113605] 0 113605 317612 2538 200704 1048 -997 flanneld
kern: info: [2024-08-22T17:35:35.107332011Z]: [ 116861] 65535 116861 249 1 28672 0 -998 pause
kern: info: [2024-08-22T17:35:35.108096011Z]: [ 116943] 101 116943 55 0 24576 3 998 dumb-init
kern: info: [2024-08-22T17:35:35.108894011Z]: [ 116978] 101 116978 318004 2618 204800 838 998 nginx-ingress-c
kern: info: [2024-08-22T17:35:35.109816011Z]: [ 117446] 101 117446 30538 418 122880 742 998 nginx
kern: info: [2024-08-22T17:35:35.110627011Z]: [ 117452] 101 117452 33436 523 110592 2248 998 nginx
kern: info: [2024-08-22T17:35:35.111391011Z]: [ 117453] 101 117453 33437 456 118784 2425 998 nginx
kern: info: [2024-08-22T17:35:35.112224011Z]: [ 117454] 101 117454 33436 723 110592 2189 998 nginx
kern: info: [2024-08-22T17:35:35.113035011Z]: [ 117455] 101 117455 33436 707 110592 2190 998 nginx
kern: info: [2024-08-22T17:35:35.113762011Z]: [ 117456] 101 117456 33436 2333 110592 526 998 nginx
kern: info: [2024-08-22T17:35:35.114684011Z]: [ 117457] 101 117457 33436 2433 110592 444 998 nginx
kern: info: [2024-08-22T17:35:35.115489011Z]: [ 117458] 101 117458 33436 2360 110592 471 998 nginx
kern: info: [2024-08-22T17:35:35.116153011Z]: [ 117459] 101 117459 33436 2475 110592 396 998 nginx
kern: info: [2024-08-22T17:35:35.116797011Z]: [ 117460] 101 117460 33436 2508 110592 395 998 nginx
kern: info: [2024-08-22T17:35:35.117729011Z]: [ 117461] 101 117461 33436 2510 110592 397 998 nginx
kern: info: [2024-08-22T17:35:35.118638011Z]: [ 117462] 101 117462 33436 2484 110592 395 998 nginx
kern: info: [2024-08-22T17:35:35.119334011Z]: [ 117463] 101 117463 33436 481 110592 2351 998 nginx
kern: info: [2024-08-22T17:35:35.120115011Z]: [ 117464] 101 117464 30088 314 73728 275 998 nginx
kern: info: [2024-08-22T17:35:35.120979011Z]: [ 117233] 65535 117233 249 1 28672 0 -998 pause
kern: info: [2024-08-22T17:35:35.121554011Z]: [ 117370] 1001 117370 53829 3361 155648 4 996 postgres
kern: info: [2024-08-22T17:35:35.122292011Z]: [ 118312] 1001 118312 53857 628 135168 8 996 postgres
kern: info: [2024-08-22T17:35:35.122965011Z]: [ 118313] 1001 118313 53829 566 135168 8 996 postgres
kern: info: [2024-08-22T17:35:35.123862011Z]: [ 118336] 1001 118336 53829 1632 131072 6 996 postgres
kern: info: [2024-08-22T17:35:35.124721011Z]: [ 118337] 1001 118337 54223 920 143360 6 996 postgres
kern: info: [2024-08-22T17:35:35.125847011Z]: [ 118338] 1001 118338 54193 772 135168 7 996 postgres
kern: info: [2024-08-22T17:35:35.126975011Z]: [ 119538] 1001 119538 54327 1534 151552 9 996 postgres
kern: info: [2024-08-22T17:35:35.128095011Z]: [ 119586] 1001 119586 54868 2926 163840 5 996 postgres
kern: info: [2024-08-22T17:35:35.129279011Z]: [ 119588] 1001 119588 55434 3501 176128 5 996 postgres
kern: info: [2024-08-22T17:35:35.130302011Z]: [ 119799] 1001 119799 54794 3017 159744 15 996 postgres
kern: info: [2024-08-22T17:35:35.131207011Z]: [ 112919] 65535 112919 249 1 28672 0 -998 pause
kern: info: [2024-08-22T17:35:35.132143011Z]: [ 113074] 0 113074 321452 1884 237568 3078 -999 kube-proxy
kern: info: [2024-08-22T17:35:35.133049011Z]: [ 115757] 65535 115757 249 1 28672 0 -998 pause
kern: info: [2024-08-22T17:35:35.133855011Z]: [ 115807] 0 115807 316614 1687 180224 574 1000 local-path-prov
kern: info: [2024-08-22T17:35:35.134632011Z]: [ 116288] 65535 116288 249 1 28672 0 -998 pause
kern: info: [2024-08-22T17:35:35.140756011Z]: [ 116319] 1000 116319 416657 16745 1011712 58063 1000 coder
kern: info: [2024-08-22T17:35:35.141670011Z]: [ 124221] 65535 124221 249 1 28672 0 -998 pause
kern: info: [2024-08-22T17:35:35.142430011Z]: [ 124285] 0 124285 628 7 45056 16 1000 tini
kern: info: [2024-08-22T17:35:35.143162011Z]: [ 124297] 0 124297 1801 9 45056 870 1000 bash
kern: info: [2024-08-22T17:35:35.143933011Z]: [ 124355] 0 124355 3112 0 57344 269 1000 nginx
kern: info: [2024-08-22T17:35:35.144636011Z]: [ 124356] 33 124356 3199 0 57344 352 1000 nginx
kern: info: [2024-08-22T17:35:35.145414011Z]: [ 124860] 0 124860 9338 0 114688 4542 1000 python3
kern: info: [2024-08-22T17:35:35.146701011Z]: [ 124889] 0 124889 1486354 423024 7168000 434063 1000 windows
kern: info: [2024-08-22T17:35:35.148265011Z]: [ 124896] 0 124896 1801 8 45056 870 1000 bash
kern: info: [2024-08-22T17:35:35.149767011Z]: [ 124897] 0 124897 640 0 45056 26 1000 tail
kern: info: [2024-08-22T17:35:35.150710011Z]: [ 124898] 0 124898 631 0 40960 24 1000 sleep
kern: info: [2024-08-22T17:35:35.151539011Z]: [ 124899] 0 124899 667 0 40960 26 1000 cat
kern: info: [2024-08-22T17:35:35.152480011Z]: [ 124900] 0 124900 633 0 45056 26 1000 tee
kern: info: [2024-08-22T17:35:35.153595011Z]: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=9b88d91e632e74c3d6c978b471211eaefe6f3b73a2bcae5155c208e9aa71e188,mems_allowed=0,oom_memcg=/docker/163b1cd0beee59c717f6c76df0f3d45f9af192d904819a1b763903cabab27c35,task_memcg=/docker/163b1cd0beee59c717f6c76df0f3d45f9af192d904819a1b763903cabab27c35/kubepods/besteffort/pod7c0c5dd6-b0f7-4d0f-aa80-c47675ba8bf0/bb7edb368f8a81ea125b1aa2429eeee6beb633ec4fa95fc7daf35fa749ffd884,task=windows,pid=124889,uid=0
kern: err: [2024-08-22T17:35:35.157300011Z]: Memory cgroup out of memory: Killed process 124889 (windows) total-vm:5945416kB, anon-rss:1692096kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:7000kB oom_score_adj:1000
kern: warning: [2024-08-22T17:52:26.311330011Z]: flanneld invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-997
kern: alert: [2024-08-22T17:52:26.313537011Z]: CPU: 11 PID: 113617 Comm: flanneld Not tainted 5.15.153.1-microsoft-standard-WSL2 #1
kern: alert: [2024-08-22T17:52:26.314086011Z]: Call Trace:
kern: alert: [2024-08-22T17:52:26.314268011Z]:
kern: alert: [2024-08-22T17:52:26.314566011Z]: dump_stack_lvl+0x34/0x48
kern: alert: [2024-08-22T17:52:26.315004011Z]: dump_header+0x4a/0x18b
kern: alert: [2024-08-22T17:52:26.315369011Z]: oom_kill_process.cold+0xb/0x10
kern: alert: [2024-08-22T17:52:26.315770011Z]: out_of_memory+0x1f7/0x2a0
kern: alert: [2024-08-22T17:52:26.316150011Z]: mem_cgroup_out_of_memory+0x13a/0x150
kern: alert: [2024-08-22T17:52:26.317050011Z]: try_charge_memcg+0x6f4/0x7b0
kern: alert: [2024-08-22T17:52:26.317464011Z]: charge_memcg+0x3f/0x90
kern: alert: [2024-08-22T17:52:26.317797011Z]: __mem_cgroup_charge+0x2c/0x90
kern: alert: [2024-08-22T17:52:26.318206011Z]: __add_to_page_cache_locked+0x2eb/0x360
kern: alert: [2024-08-22T17:52:26.318668011Z]: ? scan_shadow_nodes+0x30/0x30
kern: alert: [2024-08-22T17:52:26.319045011Z]: add_to_page_cache_lru+0x48/0xd0
kern: alert: [2024-08-22T17:52:26.319597011Z]: pagecache_get_page+0x194/0x4f0
kern: alert: [2024-08-22T17:52:26.319902011Z]: filemap_fault+0x436/0xa00
kern: alert: [2024-08-22T17:52:26.320182011Z]: ? filemap_map_pages+0x120/0x5e0
kern: alert: [2024-08-22T17:52:26.320495011Z]: __do_fault+0x35/0x90
kern: alert: [2024-08-22T17:52:26.320740011Z]: __handle_mm_fault+0xc3a/0x13b0
kern: alert: [2024-08-22T17:52:26.321245011Z]: handle_mm_fault+0xbf/0x290
kern: alert: [2024-08-22T17:52:26.321566011Z]: do_user_addr_fault+0x1b2/0x650
kern: alert: [2024-08-22T17:52:26.321886011Z]: exc_page_fault+0x5d/0x100
kern: alert: [2024-08-22T17:52:26.322313011Z]: asm_exc_page_fault+0x22/0x30
kern: alert: [2024-08-22T17:52:26.322578011Z]: RIP: 0033:0x45c6f0
kern: alert: [2024-08-22T17:52:26.322941011Z]: Code: Unable to access opcode bytes at RIP 0x45c6c6.
kern: alert: [2024-08-22T17:52:26.323366011Z]: RSP: 002b:000000c0000b1950 EFLAGS: 00010216
kern: alert: [2024-08-22T17:52:26.323715011Z]: RAX: 000000000005b590 RBX: 000000c0000b19d8 RCX: 0000000002bd2300
kern: alert: [2024-08-22T17:52:26.324194011Z]: RDX: 00000000000004a9 RSI: 000000000000e868 RDI: 00000000000004aa
kern: alert: [2024-08-22T17:52:26.324883011Z]: RBP: 000000c0000b1960 R08: 0000000002635fa0 R09: 00000000026384f0
kern: alert: [2024-08-22T17:52:26.325517011Z]: R10: 00007fe15b2f56e0 R11: 0000000000000018 R12: 000000c00009ff40
kern: alert: [2024-08-22T17:52:26.326236011Z]: R13: 0000000000000000 R14: 000000c000006c40 R15: 0000000000000010
kern: alert: [2024-08-22T17:52:26.326790011Z]:
kern: info: [2024-08-22T17:52:26.327003011Z]: memory: usage 2097004kB, limit 2097152kB, failcnt 382754
kern: info: [2024-08-22T17:52:26.327511011Z]: memory+swap: usage 4194144kB, limit 4194304kB, failcnt 13544518
kern: info: [2024-08-22T17:52:26.327979011Z]: kmem: usage 40668kB, limit 9007199254740988kB, failcnt 0
kern: info: [2024-08-22T17:52:26.328529011Z]: Memory cgroup stats for /docker/163b1cd0beee59c717f6c76df0f3d45f9af192d904819a1b763903cabab27c35:
kern: info: [2024-08-22T17:52:26.328570011Z]: anon 2015346688
file 14581760
kernel_stack 11632640
pagetables 15847424
percpu 1147200
sock 0
shmem 13230080
file_mapped 13176832
file_dirty 0
file_writeback 0
swapcached 4399087616
anon_thp 1826619392
file_thp 0
shmem_thp 0
inactive_anon 1871720448
active_anon 232488960
inactive_file 847872
active_file 0
unevictable 0
slab_reclaimable 2793496
slab_unreclaimable 8409256
slab 11202752
workingset_refault_anon 38155
workingset_refault_file 43552078
workingset_activate_anon 816
workingset_activate_file 42568134
workingset_restore_anon 125
workingset_restore_file 42152184
workingset_nodereclaim 871
pgfault 8730084
pgmajfault 2196837
pgrefill 61209748
pgscan 1058497423
pgsteal 44779812
pgactivate 15905462
pgdeactivate 58641401
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 4186
thp_collapse_alloc 45
kern: info: [2024-08-22T17:52:26.335013011Z]: Tasks state (memory values in pages):
kern: info: [2024-08-22T17:52:26.335335011Z]: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
kern: info: [2024-08-22T17:52:26.336133011Z]: [ 111399] 0 111399 327797 17048 315392 2113 0 init
kern: info: [2024-08-22T17:52:26.336650011Z]: [ 111502] 0 111502 321409 7873 237568 1011 -999 containerd
kern: info: [2024-08-22T17:52:26.337212011Z]: [ 111741] 0 111741 309537 3179 114688 372 -998 containerd-shim
kern: info: [2024-08-22T17:52:26.337785011Z]: [ 111763] 50 111763 327117 11360 274432 1014 -998 apid
kern: info: [2024-08-22T17:52:26.338224011Z]: [ 111729] 0 111729 323261 12157 282624 1100 -500 containerd
kern: info: [2024-08-22T17:52:26.338751011Z]: [ 112021] 0 112021 309537 2935 118784 644 -499 containerd-shim
kern: info: [2024-08-22T17:52:26.339314011Z]: [ 112899] 0 112899 309473 3069 118784 953 -499 containerd-shim
kern: info: [2024-08-22T17:52:26.339900011Z]: [ 113021] 0 113021 309473 3104 122880 950 -499 containerd-shim
kern: info: [2024-08-22T17:52:26.340402011Z]: [ 115727] 0 115727 309537 3245 118784 607 -499 containerd-shim
kern: info: [2024-08-22T17:52:26.340956011Z]: [ 116268] 0 116268 309537 3143 118784 490 -499 containerd-shim
kern: info: [2024-08-22T17:52:26.341527011Z]: [ 116833] 0 116833 309537 3141 114688 838 -499 containerd-shim
kern: info: [2024-08-22T17:52:26.342154011Z]: [ 117205] 0 117205 309537 3048 122880 393 -499 containerd-shim
kern: info: [2024-08-22T17:52:26.342675011Z]: [ 124201] 0 124201 309601 3690 114688 18 -499 containerd-shim
kern: info: [2024-08-22T17:52:26.343295011Z]: [ 112040] 0 112040 719910 8186 532480 3073 -450 kubelet
kern: info: [2024-08-22T17:52:26.343766011Z]: [ 113044] 65535 113044 249 1 28672 0 -998 pause
kern: info: [2024-08-22T17:52:26.344296011Z]: [ 113605] 0 113605 317682 2678 200704 1045 -997 flanneld
kern: info: [2024-08-22T17:52:26.344908011Z]: [ 135794] 0 135794 418 31 45056 0 -997 iptables
kern: info: [2024-08-22T17:52:26.345433011Z]: [ 116861] 65535 116861 249 1 28672 0 -998 pause
kern: info: [2024-08-22T17:52:26.346027011Z]: [ 116943] 101 116943 55 0 24576 3 998 dumb-init
kern: info: [2024-08-22T17:52:26.346559011Z]: [ 116978] 101 116978 318260 3412 212992 274 998 nginx-ingress-c
kern: info: [2024-08-22T17:52:26.347132011Z]: [ 117446] 101 117446 30538 418 122880 742 998 nginx
kern: info: [2024-08-22T17:52:26.347649011Z]: [ 117452] 101 117452 33500 448 114688 2454 998 nginx
kern: info: [2024-08-22T17:52:26.348253011Z]: [ 117453] 101 117453 33501 551 122880 2420 998 nginx
kern: info: [2024-08-22T17:52:26.348817011Z]: [ 117454] 101 117454 33500 443 114688 2452 998 nginx
kern: info: [2024-08-22T17:52:26.349399011Z]: [ 117455] 101 117455 33500 464 114688 2453 998 nginx
kern: info: [2024-08-22T17:52:26.350041011Z]: [ 117456] 101 117456 33500 675 114688 2175 998 nginx
kern: info: [2024-08-22T17:52:26.350495011Z]: [ 117457] 101 117457 33500 960 114688 1934 998 nginx
kern: info: [2024-08-22T17:52:26.350908011Z]: [ 117458] 101 117458 33500 442 114688 2452 998 nginx
kern: info: [2024-08-22T17:52:26.351336011Z]: [ 117459] 101 117459 33500 1579 114688 1329 998 nginx
kern: info: [2024-08-22T17:52:26.351725011Z]: [ 117460] 101 117460 33500 1804 114688 1068 998 nginx
kern: info: [2024-08-22T17:52:26.352159011Z]: [ 117461] 101 117461 33500 975 114688 1935 998 nginx
kern: info: [2024-08-22T17:52:26.352630011Z]: [ 117462] 101 117462 33500 1608 114688 1278 998 nginx
kern: info: [2024-08-22T17:52:26.353151011Z]: [ 117463] 101 117463 33500 417 114688 2451 998 nginx
kern: info: [2024-08-22T17:52:26.353738011Z]: [ 117464] 101 117464 30088 295 73728 350 998 nginx
kern: info: [2024-08-22T17:52:26.354330011Z]: [ 117233] 65535 117233 249 1 28672 0 -998 pause
kern: info: [2024-08-22T17:52:26.354912011Z]: [ 117370] 1001 117370 53829 2180 155648 379 996 postgres
kern: info: [2024-08-22T17:52:26.355630011Z]: [ 118312] 1001 118312 53857 703 147456 382 996 postgres
kern: info: [2024-08-22T17:52:26.356309011Z]: [ 118313] 1001 118313 53829 507 135168 403 996 postgres
kern: info: [2024-08-22T17:52:26.356970011Z]: [ 118336] 1001 118336 53829 1313 131072 406 996 postgres
kern: info: [2024-08-22T17:52:26.357481011Z]: [ 118337] 1001 118337 54223 499 143360 441 996 postgres
kern: info: [2024-08-22T17:52:26.358107011Z]: [ 118338] 1001 118338 54193 391 135168 440 996 postgres
kern: info: [2024-08-22T17:52:26.358587011Z]: [ 119538] 1001 119538 54327 901 151552 573 996 postgres
kern: info: [2024-08-22T17:52:26.359313011Z]: [ 119586] 1001 119586 54868 2576 163840 426 996 postgres
kern: info: [2024-08-22T17:52:26.359965011Z]: [ 119588] 1001 119588 55434 3066 176128 404 996 postgres
kern: info: [2024-08-22T17:52:26.360536011Z]: [ 119799] 1001 119799 54794 2610 159744 322 996 postgres
kern: info: [2024-08-22T17:52:26.361083011Z]: [ 126930] 1001 126930 54319 1333 151552 338 996 postgres
kern: info: [2024-08-22T17:52:26.361542011Z]: [ 127726] 1001 127726 54319 1102 147456 338 996 postgres
kern: info: [2024-08-22T17:52:26.362124011Z]: [ 128737] 1001 128737 53864 373 122880 319 996 postgres
kern: info: [2024-08-22T17:52:26.362639011Z]: [ 132010] 1001 132010 53864 346 122880 324 996 postgres
kern: info: [2024-08-22T17:52:26.363202011Z]: [ 135052] 1001 135052 53829 168 114688 371 996 postgres
kern: info: [2024-08-22T17:52:26.363853011Z]: [ 112919] 65535 112919 249 1 28672 0 -998 pause
kern: info: [2024-08-22T17:52:26.364544011Z]: [ 113074] 0 113074 321516 1899 237568 3146 -999 kube-proxy
kern: info: [2024-08-22T17:52:26.365234011Z]: [ 115757] 65535 115757 249 1 28672 0 -998 pause
kern: info: [2024-08-22T17:52:26.366039011Z]: [ 115807] 0 115807 316614 2105 180224 676 1000 local-path-prov
kern: info: [2024-08-22T17:52:26.366708011Z]: [ 116288] 65535 116288 249 1 28672 0 -998 pause
kern: info: [2024-08-22T17:52:26.367126011Z]: [ 116319] 1000 116319 416657 5157 946176 3146 1000 coder
kern: info: [2024-08-22T17:52:26.367550011Z]: [ 124221] 65535 124221 249 1 28672 0 -998 pause
kern: info: [2024-08-22T17:52:26.367971011Z]: [ 125784] 0 125784 628 8 40960 15 1000 tini
kern: info: [2024-08-22T17:52:26.368372011Z]: [ 125796] 0 125796 1801 1 53248 869 1000 bash
kern: info: [2024-08-22T17:52:26.368817011Z]: [ 125855] 0 125855 3112 1 61440 267 1000 nginx
kern: info: [2024-08-22T17:52:26.369229011Z]: [ 125856] 33 125856 3199 0 61440 350 1000 nginx
kern: info: [2024-08-22T17:52:26.369722011Z]: [ 126019] 0 126019 9338 0 110592 4541 1000 python3
kern: info: [2024-08-22T17:52:26.370286011Z]: [ 126051] 0 126051 1487379 429020 7688192 492387 1000 windows
kern: info: [2024-08-22T17:52:26.370870011Z]: [ 126062] 0 126062 640 0 40960 24 1000 tail
kern: info: [2024-08-22T17:52:26.371548011Z]: [ 126064] 0 126064 667 0 40960 26 1000 cat
kern: info: [2024-08-22T17:52:26.372015011Z]: [ 126065] 0 126065 633 0 45056 25 1000 tee
kern: info: [2024-08-22T17:52:26.372637011Z]: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=eb1664d77f7533eaeec0caa232a87d44ff493cf4932ec24504b9a6371fe7157b,mems_allowed=0,oom_memcg=/docker/163b1cd0beee59c717f6c76df0f3d45f9af192d904819a1b763903cabab27c35,task_memcg=/docker/163b1cd0beee59c717f6c76df0f3d45f9af192d904819a1b763903cabab27c35/kubepods/besteffort/pod7c0c5dd6-b0f7-4d0f-aa80-c47675ba8bf0/49be279589fe699f097308a0d6c23d49d18c70f34918c4825b09af5dd91dded2,task=windows,pid=126051,uid=0
kern: err: [2024-08-22T17:52:26.376733011Z]: Memory cgroup out of memory: Killed process 126051 (windows) total-vm:5949516kB, anon-rss:1716080kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:7508kB oom_score_adj:1000
kern: info: [2024-08-22T17:52:34.489223011Z]: cni0: port 5(vethe87eeeaf) entered disabled state
kern: info: [2024-08-22T17:52:34.495543011Z]: device vethe87eeeaf left promiscuous mode

@smira
Member

smira commented Aug 23, 2024

So I'm not sure what the issue is that we should fix here, as the Kubernetes cluster operator should set proper memory limits and reservations.

We will consider tuning reservations for Talos components, but kubelet resource usage depends on the workload, so it can't be set once for every cluster.

See #7081
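
For reference, a minimal sketch of what such reservations could look like in the Talos machine config (values are illustrative and must be tuned per cluster; systemReserved, kubeReserved, and evictionHard are standard KubeletConfiguration fields passed through extraConfig):

    machine:
      kubelet:
        extraConfig:
          systemReserved:       # held back for Talos system components
            cpu: 500m
            memory: 1Gi
          kubeReserved:         # held back for kubelet / container runtime
            cpu: "1"
            memory: 2Gi
          evictionHard:
            memory.available: 500Mi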

@smira closed this as not planned on Aug 23, 2024
@datapedd
Author

datapedd commented Aug 24, 2024

The resource limits are properly set in Kubernetes. Also, I have 31GB of allocatable memory on the node, and the windows docker pod is still killing the services due to OOM.

@datapedd
Author

datapedd commented Aug 24, 2024

For me this looks very strange, as it seems more Windows-related. You could clone this and apply the kubernetes.yml from https://github.com/dockur/windows to see that it will also crash your node with limits in place.

@datapedd
Author

Could this potentially also be because of Docker and WSL2? I use Talos inside Docker on my Windows 11 host, and inside that run a Windows 11 pod.

kern: warning: [2024-08-24T20:31:19.332892508Z]: init invoked oom-killer: gfp_mask=0xdc0(GFP_KERNEL|__GFP_ZERO), order=0, oom_score_adj=0
kern: alert: [2024-08-24T20:31:19.338718508Z]: CPU: 3 PID: 1628 Comm: init Not tainted 5.15.153.1-microsoft-standard-WSL2 #1
kern: alert: [2024-08-24T20:31:19.340969508Z]: Call Trace:
kern: alert: [2024-08-24T20:31:19.341335508Z]:
kern: alert: [2024-08-24T20:31:19.341646508Z]: dump_stack_lvl+0x34/0x48
kern: alert: [2024-08-24T20:31:19.342108508Z]: dump_header+0x4a/0x18b
kern: alert: [2024-08-24T20:31:19.342557508Z]: oom_kill_process.cold+0xb/0x10
kern: alert: [2024-08-24T20:31:19.343033508Z]: out_of_memory+0x1f7/0x2a0
kern: alert: [2024-08-24T20:31:19.343664508Z]: mem_cgroup_out_of_memory+0x13a/0x150

@datapedd
Author

dmesg.log

@datapedd
Author

datapedd commented Aug 24, 2024

Seems to be Docker-related, as Docker only has 2GB of memory allocated at maximum. There should be a notice in the Talos docs to change that (other docker containers get a larger default memory allocation for me).
(screenshot attached)
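
On Docker Desktop with the WSL2 backend, that cap comes from the WSL2 VM itself; a minimal sketch of raising it via %UserProfile%\.wslconfig (the 10GB figure mirrors the fix reported below; run wsl --shutdown afterwards so the change takes effect):

    [wsl2]
    memory=10GB   # memory cap for the WSL2 VM backing Docker Desktop
    swap=4GB      # optional, illustrative value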

@datapedd
Author

With 10GB allocated to Docker, this is fixed and there are no crashes anymore.
