Fixing out of memory issue when scaling doppler instances #366
Conversation
So, chasing down all the layers of stuff, the core problem is that when we get the memory usage we parse /proc/meminfo (via here), which doesn't reflect the container limits. Just bumping it down doesn't really help, since now you can't scale past 4 instances (on the same node). The necessary fix is really to change how we get the available memory limit. Edit: actually, do we even set any resource constraints yet? If not, would doing so get reflected in |
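To make the alternative concrete, here is a minimal sketch of reading the container's own cgroup v1 memory limit instead of parsing /proc/meminfo. This is not the project's actual code; the error handling and fallback behaviour are assumptions for illustration only.

package main

import (
    "fmt"
    "io/ioutil"
    "os"
    "strconv"
    "strings"
)

// readCgroupMemoryLimit reads the memory limit imposed on this container by
// the cgroup v1 memory controller. Unlike /proc/meminfo, this file does
// reflect the Kubernetes resource limit set on the container.
func readCgroupMemoryLimit() (uint64, error) {
    raw, err := ioutil.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
    if err != nil {
        return 0, err
    }
    return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}

func main() {
    limit, err := readCgroupMemoryLimit()
    if err != nil {
        fmt.Fprintln(os.Stderr, "reading cgroup memory limit:", err)
        os.Exit(1)
    }
    // When no limit is configured, the kernel reports a very large number,
    // so callers still need a sanity check against the node's physical memory.
    fmt.Printf("cgroup memory limit: %d bytes\n", limit)
}

A real fix would presumably also have to handle cgroup v2, where the limit lives at /sys/fs/cgroup/memory.max instead.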
@mook-as I think we had it crash also with rather low requests==limits set. |
The pod is terminated if the memory limit is exceeded. |
This mechanism was designed with VMs in mind; the ideal solution would be to fix it upstream by relying on a different mechanism. Also, correct me if I am wrong - |
As reported, this bug only shows up when using more than one instance of log-cache. |
Taking some notes on how the memory limit is used:
So, I think we might be able to get away with mounting over /proc/meminfo: |
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: proc-meminfo
data:
  meminfo: |-
    MemTotal: 256000 kB
    MemFree: 256000 kB
    MemAvailable: 256000 kB
    Buffers: 0
    Cached: 0
    SwapCached: 0
    Active: 0
    Inactive: 0
    Active(anon): 0
    Inactive(anon): 0
    Active(file): 0
    Inactive(file): 0
    Unevictable: 0
    Mlocked: 0
    SwapTotal: 0
    SwapFree: 0
    Dirty: 0
    Writeback: 0
    AnonPages: 0
    Mapped: 0
    Shmem: 0
    Slab: 0
    SReclaimable: 0
    SUnreclaim: 0
    KernelStack: 0
    PageTables: 0
    NFS_Unstable: 0
    Bounce: 0
    WritebackTmp: 0
    CommitLimit: 0
    Committed_AS: 0
    VmallocTotal: 0
    VmallocUsed: 0
    VmallocChunk: 0
    HardwareCorrupted: 0
    AnonHugePages: 0
    ShmemHugePages: 0
    ShmemPmdMapped: 0
    HugePages_Total: 0
    HugePages_Free: 0
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 0
    DirectMap4k: 0
    DirectMap2M: 0
    DirectMap1G: 0
---
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
    - name: opensuse
      image: opensuse/leap:15.1
      command: ["/bin/sh", "-c", "cat /proc/meminfo"]
      volumeMounts:
        - name: proc-meminfo
          mountPath: /proc/meminfo
          subPath: meminfo
  volumes:
    - name: proc-meminfo
      configMap:
        name: proc-meminfo
  restartPolicy: Never |
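If the mount works as hoped, cat /proc/meminfo inside the container should print the ConfigMap contents rather than the node's real values, so anything that sizes itself from /proc/meminfo (such as log-cache) would only ever see the fake 256000 kB total.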
I'm closing this PR since it doesn't introduce a fix. The bug is still tracked in #241. |
Description
fixes: #241
Motivation and Context
log-cache by default is configured to use 50% of the total memory available to the instance (https://github.com/cloudfoundry/log-cache-release/blob/v2.6.4/jobs/log-cache/spec#L35-L37), which exceeds the amount of memory actually available to be used.
Therefore, this fix limits the memory to 25%, which falls within the memory that is actually available.
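As a rough illustration (the numbers here are assumed, not measured): on a node with 8 GiB of RAM, /proc/meminfo reports roughly 8 GiB of total memory, so with the 50% default each log-cache instance tries to claim about 4 GiB; two doppler/log-cache instances on the same node then compete for roughly the node's entire memory and get OOM-killed, while 25% (about 2 GiB each) stays within what the node can actually provide.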
Discussion on CF Slack: https://cloudfoundry.slack.com/archives/CBFB7NP9B/p1580161281014400
How Has This Been Tested?
Tested locally on minikube by scaling the doppler instances to two and running the smoke tests.
Screenshots (if appropriate):
Types of changes
Checklist: