
kubelet should track tcp_mem stats also along with cpu/ram/disk #62334

Open
shahidhk opened this issue Apr 10, 2018 · 48 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@shahidhk

shahidhk commented Apr 10, 2018

/kind feature
/sig node

What happened:

A program started leaking TCP memory, which filled up the node's TCP stack memory. Network performance on the node degraded, and connections to pods running on the node either timed out or hung for a long time.

The node's dmesg had lines saying TCP: out of memory -- consider tuning tcp_mem

Further reading and investigation revealed that this can happen when the TCP stack runs out of the memory pages allocated by the kernel, or when there are a lot of orphaned/open sockets.

TCP stack limits: max 86514

$  cat /proc/sys/net/ipv4/tcp_mem
43257	57676	86514
# min pressure max

Usage when the issue happened: mem 87916

$ cat /proc/net/sockstat
sockets: used 1386
TCP: inuse 24 orphan 0 tw 58 alloc 863 mem 87916
UDP: inuse 3 mem 3
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
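
For reference, a quick way to put those two numbers side by side on the node (a minimal sketch, not something kubelet does today; both values are counted in pages):

$ awk '/^TCP:/ {print "tcp pages in use:", $NF}' /proc/net/sockstat
$ awk '{print "tcp_mem max:", $3}' /proc/sys/net/ipv4/tcp_mem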

kubelet posts node status as ready.

What you expected to happen:

kubelet should report the node as not ready.

It would be great if kubelet could track tcp_mem stats along with CPU/RAM/disk, since the network is also an important resource. If the tcp_mem limit is hit, for whatever reason, the node is not usable. Notifying the user that the node has a problem would help with debugging and identifying the cause.
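
For comparison, kubelet today reports pressure conditions for memory, disk and PIDs, but nothing for the TCP stack. The conditions it does report can be listed like this (illustrative; <node-name> is a placeholder):

$ kubectl get node <node-name> \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'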

How to reproduce it (as minimally and precisely as possible):

  1. Create a GKE cluster
  2. Label a node to carry out the tests
$ kubectl label node <node-name> node=leak-test
  3. Create an nginx deployment with a LoadBalancer Service that can serve a large file
$ kubectl create -f https://raw.githubusercontent.com/shahidhk/k8s-tcp-mem-leak/master/nginx.yaml
  4. Check that you can download the large file
$ curl -o large-file <ip>/large-file
  5. Create a deployment that can fill up the TCP stack memory
$ kubectl create -f https://raw.githubusercontent.com/shahidhk/k8s-tcp-mem-leak/master/leak-repro.yaml
  6. SSH into the node, observe cat /proc/sys/net/ipv4/tcp_mem and cat /proc/net/sockstat (see the example after this list), and scale the deployment until the current mem exceeds the limit
  7. Try downloading the large file again. It will either be very slow or fail entirely
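
For step 6, one convenient way to watch both counters on the node while scaling (illustrative; the refresh interval is arbitrary):

$ watch -n 5 'cat /proc/sys/net/ipv4/tcp_mem /proc/net/sockstat'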

Anything else we need to know?:

This is more of a feature request for kubelet than a bug report. TCP memory can fill up if the node is running a lot of TCP-heavy workloads; it need not be a leak. Since kubelet is ultimately responsible for reporting the node's health, the network should also be a parameter.

Environment:

  • Kubernetes version (use kubectl version): v1.9.6-gke.0
  • Cloud provider or hardware configuration: GKE, 1 node, n1-standard-1
  • OS (e.g. from /etc/os-release): Container-Optimized OS
  • Kernel (e.g. uname -a): 4.4.111+
  • Install tools: -
  • Others: -
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Apr 10, 2018
@shahidhk
Author

shahidhk commented Apr 10, 2018

/cc @thockin continuing our discussion at https://twitter.com/thockin/status/973965476173725696, it took me some time to fix up the repro steps 😄

@cizixs

cizixs commented Apr 11, 2018

Since we are collecting network status information, should connection tracking count/limit be considered?

@shahidhk
Author

@cizixs I think it should. Do you know of any other parameters that can cause network failure/degradation but are easy to detect?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 15, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 14, 2018
@jaredallard

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 5, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 4, 2018
@bgrant0607 bgrant0607 added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 7, 2018
@utkarshmani1997

/remove-lifecycle stale

@AWKIF

AWKIF commented Jan 29, 2019

Just ran into this issue, and I'd like that feature as well :)

@somashekhar

Observed the issue in our systems.
Looks like this feature would help a lot.

@thockin thockin added the triage/unresolved Indicates an issue that can not or will not be resolved. label Mar 8, 2019
@cyrus-mc

Somewhat tangential to this, but more of an informational thing: from the network perspective, what is namespaced and what isn't? I am currently trying to debug a "performance" issue and was starting to focus on the network.

From my research it appears settings like tcp_rmem and tcp_wmem (read and write buffers) are namespaced, meaning you can set those values within a container and they don't affect the host settings.

But a setting like tcp_mem (which lists the maximum page allocations for the TCP stack) seems to only be settable at the host level. Yet I would think the tcp_mem setting directly affects what you can set in tcp_rmem and tcp_wmem.
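
A quick way to see the difference from the host (an illustrative sketch; assumes nsenter is available and <container-pid> is a placeholder for the PID of a containerized process; on recent kernels tcp_rmem/tcp_wmem are per network namespace, while tcp_mem is host-wide):

$ nsenter -t <container-pid> -n cat /proc/sys/net/ipv4/tcp_rmem   # value inside the container's netns
$ cat /proc/sys/net/ipv4/tcp_rmem                                 # value on the host
$ cat /proc/sys/net/ipv4/tcp_mem                                  # host-wide, shared by all pods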

@anjuls

anjuls commented Mar 29, 2019

Having the same issue on the master node. Resources are not getting deleted because the API server is unable to take new requests. Kubelet is reporting healthy, though.


[1667891.052298] TCP: out of memory -- consider tuning tcp_mem
[1668316.259318] TCP: out of memory -- consider tuning tcp_mem
[1668316.997397] TCP: out of memory -- consider tuning tcp_mem
[admin@xx~]$ cat /proc/sys/net/ipv4/tcp_mem
4096    4096    4096
[admin@xx~]$  cat /proc/net/sockstat
sockets: used 582
TCP: inuse 259 orphan 0 tw 18 alloc 289 mem 36
UDP: inuse 4 mem 0
UDPLITE: inuse 0
RAW: inuse 1
FRAG: inuse 0 memory 0
[admin@xx~]$

@freehan freehan removed the triage/unresolved Indicates an issue that can not or will not be resolved. label May 16, 2019
@asc-adean

asc-adean commented May 21, 2019

This caught me today: several socket hang-ups in running applications, resulting in HTTP timeouts. I did a drain/rolling restart of all nodes to get us back to a happy place.

Azure AKS, Kubernetes v1.13.5

@tanya-borisova

We have run into this issue as well: a workload was leaking open connections, which led to a whole node being unusable and introduced a noisy-neighbor problem.

@suryababy

suryababy commented Sep 3, 2019

Recently we experienced an interesting production problem. The application was running on multiple AWS EC2 instances behind an Elastic Load Balancer, on GNU/Linux, Java 8, and the Tomcat 8 application server. All of a sudden, one of the application instances became unresponsive while all the other instances kept handling traffic properly. Whenever an HTTP request was sent to this instance from the browser, we got the following response:

Proxy Error

The proxy server received an invalid response from an upstream server. The proxy server could not handle the request GET /.

Reason: Error reading from remote server

We resolved this issue by assigning the following values for these properties on the server:

net.core.netdev_max_backlog=30000
net.core.rmem_max=134217728
net.core.wmem_max=134217728
net.ipv4.tcp_max_syn_backlog=8192
net.ipv4.tcp_rmem=4096 87380 67108864
net.ipv4.tcp_wmem=4096 87380 67108864

TCP: out of memory — consider tuning tcp_mem
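
For reference, the settings above could be applied at runtime roughly like this (illustrative only; the values are the ones quoted in this comment, and an entry under /etc/sysctl.d/ would be needed for them to survive a reboot):

$ sudo sysctl -w \
    net.core.netdev_max_backlog=30000 \
    net.core.rmem_max=134217728 \
    net.core.wmem_max=134217728 \
    net.ipv4.tcp_max_syn_backlog=8192 \
    net.ipv4.tcp_rmem="4096 87380 67108864" \
    net.ipv4.tcp_wmem="4096 87380 67108864"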

@linjmeyer

We've had this problem as well on GKE nodes (Container-Optimized OS). It would be great to see Kubernetes handle this as it can effectively break the network stack of an entire node.

Slightly off topic, does anyone have any tips for determining which container/process is leaking the TCP memory? As a quick workaround we have increased the TCP memory but that can't work forever.

@swatisehgal
Contributor

/triage accepted
/priority backlog

@k8s-ci-robot k8s-ci-robot removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jan 20, 2024
@thockin
Member

thockin commented Mar 14, 2024

Workarounds are good, but it's not clear to me if we should be doing more here - anyone who has direct context?

@utix

utix commented Mar 14, 2024

Workarounds are good, but it's not clear to me if we should be doing more here - anyone who has direct context?

If a pod is leaking connections the pod will kill the node, without any alert or monitoring.
A pod should not be able to kill the node, or at least we need to monitor it.

Happy to give more context if you need it.

@thockin
Member

thockin commented Mar 14, 2024

This is an old issue, which I won't have time to tackle in the near future - any context you can add here, to make it more approachable by some volunteer (could be you!) would help.

@shaneutt
Member

/remove-lifecycle frozen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Mar 22, 2024
@kong62

kong62 commented May 23, 2024

[root@pub-k8stx-mgt-prd-004037-cvm ~]# dmesg -T
[Wed May 22 19:08:35 2024] TCP: out of memory -- consider tuning tcp_mem
[Wed May 22 19:08:47 2024] TCP: out of memory -- consider tuning tcp_mem

[root@pub-k8stx-mgt-prd-004037-cvm ~]# sysctl -a 2>&1 | grep tcp_mem
net.ipv4.tcp_mem = 1501206      2001609 3002412

[root@pub-k8stx-mgt-prd-004037-cvm ~]# cat /proc/net/sockstat
sockets: used 8165
TCP: inuse 64 orphan 0 tw 1157 alloc 6881 mem 3003353
UDP: inuse 6 mem 2
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

@adrianmoisey
Member

/assign

@ssichynskyi

We still experience periodic GKE node outages because of TCP OOM. Pods hosted on an affected node become inoperable.

@adrianmoisey
Member

👍 Thanks for the info. I'm busy working on this. I hope to have a PR created soon.

@adrianmoisey
Member

/accept

@adrianmoisey
Member

/triage accepted

@k8s-ci-robot
Contributor

@adrianmoisey: The label triage/accepted cannot be applied. Only GitHub organization members can add the label.

In response to this:

/triage accepted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@shaneutt
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 20, 2024
@adrianmoisey
Member

I've been working on this. It's taking me some time, sorry about that.

I've gotten my code to mostly work, but I need to spend time finishing up the specifics.
I just want to get an idea of exactly what this change should be doing.

When the socket buffer is full, which of these should happen:

  1. Node becomes unready (specifically, the "Ready" condition becomes False)
  2. kubelet evicts Pods
  3. Both 1 and 2 ?

Based on this conversation, I assume only bullet point 1 should happen (Node becomes unready).

Additionally, does this feature need to be feature gated?

(cc @thockin @aojea @shaneutt)

@MartinEmrich

@adrianmoisey I would prefer "2": kubelet should evict the pod causing the memory usage, just as it would evict pods exceeding their memory or ephemeralStorage allowance.
If the pod in question is evicted (and thus its processes are ended and the sockets claiming tcp_mem are closed), the tcp_mem usage goes down and the node itself stays operational.

@aojea
Member

aojea commented Aug 15, 2024

/cc @aojea
From the SIG Network discussion, some follow-up questions:

Is tcp_mem accounted as part of the global memory? Is it namespaced? What behaviours do we want to implement? TCP is bursty by nature; what happens if there are peaks of congestion?

@bowei
Member

bowei commented Aug 15, 2024

I see a reference here in the Linux kernel documentation:

Search for section 2.7.1: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

2.7.1 Current Kernel Memory resources accounted

  • stack pages: every process consumes some stack pages. By accounting into
    kernel memory, we prevent new processes from being created when the kernel
    memory usage is too high.

  • slab pages: pages allocated by the SLAB or SLUB allocator are tracked. A copy
    of each kmem_cache is created every time the cache is touched by the first time
    from inside the memcg. The creation is done lazily, so some objects can still be
    skipped while the cache is being created. All objects in a slab page should
    belong to the same memcg. This only fails to hold when a task is migrated to a
    different memcg during the page allocation by the cache.

  • sockets memory pressure: some sockets protocols have memory pressure
    thresholds. The Memory Controller allows them to be controlled individually
    per cgroup, instead of globally.

  • tcp memory pressure: sockets memory pressure for the tcp protocol.

@aojea
Member

aojea commented Aug 17, 2024

This is more of a feature request for kubelet than a bug report. TCP memory can fill up if the node is running a lot of TCP-heavy workloads; it need not be a leak. Since kubelet is ultimately responsible for reporting the node's health, the network should also be a parameter.

Independently of everything else, we depend on the mechanisms exposed by the kernel. Based on https://lpc.events/event/16/contributions/1212/attachments/1079/2052/LPC%202022%20-%20TCP%20memory%20isolation.pdf this is still WIP, and there are also some interesting lessons learned:

● For multi-tenant servers, static tcp_mem is harmful.

@adrianmoisey
Member

Interesting share, thanks @aojea
I did some digging and found that the memory.stat file in a cgroup has a sock counter; here's an example:

cat memory.stat
anon 0
file 126976
kernel 24576
kernel_stack 0
pagetables 0
sec_pagetables 0
percpu 0
sock 0                  <-------------
vmalloc 0
shmem 0
zswap 0
zswapped 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 0
active_anon 0
inactive_file 126976
active_file 0
unevictable 0
slab_reclaimable 21008
slab_unreclaimable 1064
slab 22072
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgscan 19
pgsteal 19
pgscan_kswapd 19
pgscan_direct 0
pgscan_khugepaged 0
pgsteal_kswapd 19
pgsteal_direct 0
pgsteal_khugepaged 0
pgfault 235
pgmajfault 8
pgrefill 32
pgactivate 0
pgdeactivate 0
pglazyfree 0
pglazyfreed 0
zswpin 0
zswpout 0
zswpwb 0
thp_fault_alloc 0
thp_collapse_alloc 0
thp_swpout 0
thp_swpout_fallback 0

From https://docs.kernel.org/admin-guide/cgroup-v2.html:

sock (npn)
Amount of memory used in network transmission buffers

This makes me think that it may be possible to evict pods that are using up too much TCP transmission buffer memory.

I'm not sure if it's what we want to do though.

From an end user perspective, if kubelet is going to be evicting pods based on some behaviour, I'd like the ability to determine the bounds of what is good and what is bad. (ie: memory limits as an example).

Which makes me think that this would work better as a Pod resource, much like memory and CPU.
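
A rough sketch of how that could be inspected per pod today (illustrative only; assumes cgroup v2 and the systemd cgroup driver, so the kubepods.slice path is an assumption):

# List the "sock" counter for every pod/container cgroup, largest values last.
$ find /sys/fs/cgroup/kubepods.slice -name memory.stat \
    -exec grep -H '^sock ' {} + | sort -k2 -n | tail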

@thockin
Member

thockin commented Sep 12, 2024

Is this accounted per-cgroup or just at the root? It looks like cgroup v2 is per-cgroup. Does that get accumulated into the total memory usage of the container?

Obviously, the ideal would be to kill the process/cgroup which is abusive. But it's not easy to know who that is unless it is accounted properly.

Anything with a global (machine-wide) limit which is shared by cgroups is likely to be an isolation problem.

@uablrek
Contributor

uablrek commented Sep 13, 2024

This may be related to #116895.

It's in the same area ("invisible" pod memory causing OOM), but it seems to need privileged: true.

@MartinEmrich

@thockin just a thought: if it is accounted per cgroup, it could be used to evict the culprit pod. But even if not, it could be handled at the node level (like DiskPressure, PIDPressure, ...), which could lead to the node being marked NotReady, or even drained and removed completely. Any well-designed application could then fail over to other pods.

@thockin
Member

thockin commented Sep 16, 2024

If a regular-privilege pod can cause a machine to go NotReady, that's a DoS vector. Now, I know that pods with memory limit > request fall into this category, but that is something an admin can prevent by policy. I am far enough away from kubelet's resource management code now that I am hand-waving. The pattern we want, I think, is:

  1. If we can account it to the cgroup, do that and set sane limits
  2. If we can't manage it by cgroup but can otherwise tell who is using too much, do that and do something when they use too much
  3. If we can't tell who is using too much, but we can limit the usage, do that in hopes of protecting the system "most of the time"
  4. If we can't limit individual usage, but we can measure the total, report that
  5. Otherwise, cry

@adrianmoisey
Member

I'm going to unassign myself from this issue for now. I've got other tasks I'm working on at the moment, and this one seems to be a little complicated for me right now. I'll happily pick it up in the future, if nobody else has done it.

/unassign
