This repository has been archived by the owner on Oct 22, 2021. It is now read-only.

Fixing out of memory issue when scaling doppler instances #366

Closed
wants to merge 2 commits

Conversation

bikramnehra
Member

Description

fixes: #241

Motivation and Context

log-cache is configured by default to use 50% of the total memory available to the instance:

https://github.com/cloudfoundry/log-cache-release/blob/v2.6.4/jobs/log-cache/spec#L35-L37

which exceeds the amount of memory actually available for use:

/:/var/vcap/jobs/log-cache# cat /proc/meminfo   
MemTotal:       15652300 kB
MemFree:          569484 kB
MemAvailable:    9008472 kB

Therefore, this fix limits the memory to 25%, which falls within the memory actually available.
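
Rough arithmetic with the numbers above (my own illustration, assuming each scaled doppler instance runs its own colocated log-cache on the same node): at the 50% default, each log-cache sizes its store against MemTotal, i.e. roughly 7.8 GB of the 15.6 GB node, so two instances together claim about 15.6 GB while only about 9 GB (MemAvailable) is actually free. At 25%, each instance claims about 3.9 GB and two instances fit comfortably; scaling much further on the same node eventually hits the same wall, as noted in the review comments below.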

Discussion on CF Slack: https://cloudfoundry.slack.com/archives/CBFB7NP9B/p1580161281014400

How Has This Been Tested?

Locally on minikube by scaling the doppler instances to two and running smoke-tests.

❯ k logs kubecf-smoke-tests-ca77bb96d4d42c8c-znzvg -n kubecf -c smoke-tests-smoke-tests -f
kubectl logs kubecf-smoke-tests-ca77bb96d4d42c8c-znzvg -n kubecf -c smoke-tests-smoke-tests -f
Running smoke tests...
Running binaries smoke/isolation_segments/isolation_segments.test
smoke/logging/logging.test
smoke/runtime/runtime.test
[1580245198] CF-Isolation-Segment-Smoke-Tests - 4 specs - 4 nodes SSSS SUCCESS! 13.545755112s 
[1580245198] CF-Logging-Smoke-Tests - 2 specs - 4 nodes S• SUCCESS! 1m3.746576247s 
[1580245198] CF-Runtime-Smoke-Tests - 2 specs - 4 nodes S• SUCCESS! 1m4.298290082s 

Ginkgo ran 3 suites in 2m21.645606928s
Test Suite Passed

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code has security implications.
  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@bikramnehra bikramnehra requested a review from f0rmiga January 29, 2020 04:02
@mook-as
Contributor

mook-as commented Jan 29, 2020

So, chasing down all the layers of stuff: the core problem is that when we get the memory usage, we parse /proc/meminfo (via here), which doesn't reflect the container limits?

Just bumping it down doesn't really help, since now you can't scale past 4 instances (on the same node). The necessary fix is really to change how we get the available memory limits.

Edit: Actually, do we even set any resource constraints yet? If not, would doing so get reflected in /proc/meminfo?

@loewenstein

@mook-as I think we had it crash also with rather low requests==limits set.

@f0rmiga
Member

f0rmiga commented Jan 29, 2020

@mook-as I think we had it crash also with rather low requests==limits set.

The pod is terminated if the memory limit is exceeded. /proc/meminfo always reports the total memory in the system, not what the container is allowed to use. Being only able to set a percentage is not really helpful.
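
For illustration only (not something this PR changes): a minimal Go sketch of what reading the container's own limit could look like, using the cgroup v1 file /sys/fs/cgroup/memory/memory.limit_in_bytes. The path and fallback behaviour are assumptions on my part (cgroup v2 exposes /sys/fs/cgroup/memory.max instead), not log-cache's actual code.

package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

// readCgroupV1MemoryLimit returns the memory limit, in bytes, imposed on the
// container's cgroup. Unlike /proc/meminfo, this file is per-cgroup, so it
// reflects a Kubernetes resources.limits.memory setting.
func readCgroupV1MemoryLimit() (uint64, error) {
    raw, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
    if err != nil {
        return 0, err
    }
    return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}

func main() {
    limit, err := readCgroupV1MemoryLimit()
    if err != nil {
        // No cgroup v1 memory controller available; a real implementation
        // would fall back to /proc/meminfo (or try cgroup v2) here.
        fmt.Fprintln(os.Stderr, "could not read cgroup limit:", err)
        return
    }
    // When no limit is set, the kernel reports a huge sentinel value, so a
    // sanity check against physical memory would still be needed.
    fmt.Printf("cgroup memory limit: %d bytes\n", limit)
}

A fix along these lines would have to land upstream in log-cache-release rather than in kubecf.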

@bikramnehra
Member Author

bikramnehra commented Jan 29, 2020

This mechanism was designed with VMs in mind. The ideal solution would be to fix it upstream by relying on a different mechanism, which the log-cache team has tried but hasn't been successful with yet: https://cloudfoundry.slack.com/archives/CBFB7NP9B/p1580254462021300?thread_ts=1580161281.014400&cid=CBFB7NP9B

Also, correct me if I am wrong - log-cache is just an in-memory caching layer for logs, so if there isn't enough memory available for caching, things should still work, just slightly slower. How would this impact scaling?

@f0rmiga
Member

f0rmiga commented Jan 29, 2020

How would this impact scaling?

As reported, this bug only shows up when using more than one instance of log-cache.

@mook-as
Contributor

mook-as commented Jan 30, 2020

Taking some notes on how the memory limit is used:

So, I think we might be able to get away with mounting over /proc/meminfo for the short term; we'll most likely need to experiment.

@f0rmiga
Member

f0rmiga commented Jan 30, 2020

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: proc-meminfo
data:
  meminfo: |-
    MemTotal:         256000 kB
    MemFree:          256000 kB
    MemAvailable:     256000 kB
    Buffers:               0
    Cached:                0
    SwapCached:            0
    Active:                0
    Inactive:              0
    Active(anon):          0
    Inactive(anon):        0
    Active(file):          0
    Inactive(file):        0
    Unevictable:           0
    Mlocked:               0
    SwapTotal:             0
    SwapFree:              0
    Dirty:                 0
    Writeback:             0
    AnonPages:             0
    Mapped:                0
    Shmem:                 0
    Slab:                  0
    SReclaimable:          0
    SUnreclaim:            0
    KernelStack:           0
    PageTables:            0
    NFS_Unstable:          0
    Bounce:                0
    WritebackTmp:          0
    CommitLimit:           0
    Committed_AS:          0
    VmallocTotal:          0
    VmallocUsed:           0
    VmallocChunk:          0
    HardwareCorrupted:     0
    AnonHugePages:         0
    ShmemHugePages:        0
    ShmemPmdMapped:        0
    HugePages_Total:       0
    HugePages_Free:        0
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:          0
    DirectMap4k:           0
    DirectMap2M:           0
    DirectMap1G:           0
---
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
    - name: opensuse
      image: opensuse/leap:15.1
      command: ["/bin/sh", "-c", "cat /proc/meminfo"]
      volumeMounts:
      - name: proc-meminfo
        mountPath: /proc/meminfo
        subPath: meminfo
  volumes:
  - name: proc-meminfo
    configMap:
      name: proc-meminfo
  restartPolicy: Never
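
If we try this, applying the manifest and reading the pod logs (the container just runs cat /proc/meminfo) will show whether the overridden values are what a process inside the container sees. Whether kubelet allows a volume mount over a path under /proc can depend on the cluster setup, so treat this purely as the experiment suggested above, not a verified workaround.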

@fargozhu fargozhu added the Priority: High, Status: Verification Needed, and Type: Bug labels Jan 31, 2020
@fargozhu fargozhu added this to the 1.0.0 milestone Jan 31, 2020
@f0rmiga
Member

f0rmiga commented Feb 7, 2020

I'm closing this PR since it doesn't introduce a fix. The bug is still tracked in #241.

@f0rmiga f0rmiga closed this Feb 7, 2020
@f0rmiga f0rmiga deleted the bisingh/doppler branch February 7, 2020 13:55