This repository has been archived by the owner on Oct 22, 2021. It is now read-only.

Doppler out of memory when scaled #241

Closed
jkbschmid opened this issue Dec 9, 2019 · 8 comments · Fixed by #619
Labels: Priority: High, Size: 8, Status: Verification Needed, SUSE, Type: Bug

Comments

@jkbschmid

Describe the bug
When scaling to two Doppler instances and deploying an application, the memory consumption increases until either the k8s node crashes or Doppler is killed by the OOMKiller.

To Reproduce
Install kubecf master with cf-operator master.
Run cf push with the example Dora app.
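
For context, a minimal reproduction sketch (assuming Helm 3, charts built from the master branches, and the Dora app from cf-acceptance-tests/assets/dora; paths and namespaces are placeholders, not taken from this issue):

    # Install cf-operator and kubecf from locally built master charts
    # (paths are placeholders; operator options such as the watched namespace are omitted).
    kubectl create namespace cf-operator
    helm install cf-operator ./cf-operator.tgz --namespace cf-operator
    kubectl create namespace kubecf
    helm install kubecf ./kubecf.tgz --namespace kubecf

    # After targeting the deployed CF (cf api / cf login), push the example Dora app.
    cd cf-acceptance-tests/assets/dora
    cf push dora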

Expected behavior
Doppler should consume a reasonable amount of memory.

Environment

  • CF Version: v8 and v12
@loewenstein

@viovanov blocker for 1.0?

@f0rmiga added the Priority: Critical and Type: Bug labels on Dec 30, 2019
@f0rmiga added this to the 0.2.0 milestone on Dec 30, 2019
@f0rmiga self-assigned this on Jan 6, 2020
@fargozhu added the SUSE label on Jan 7, 2020
@f0rmiga (Member) commented Jan 23, 2020

The problem here seems to be with log-cache. In my initial investigation, I couldn't find the exact cause. For the 0.2 release, I'm going to restrict HA for doppler. We should get back to it for 1.0, though.

@f0rmiga modified the milestones: 0.2.0, 1.0.0 on Jan 23, 2020
@f0rmiga removed their assignment on Jan 23, 2020
@bikramnehra self-assigned this on Jan 24, 2020
@f0rmiga (Member) commented Jan 24, 2020

I changed the priority from Critical to High given that there is a temporary fix in place already.

@loewenstein

I wouldn't call "don't scale it" a temporary fix for "this can't be scaled".
What does it mean for logs if Doppler is a singleton? Potential delay, or loss of logs? If it is the latter, I would still judge it critical, I guess. Unless there are other issues left that are comparable to completely failing k8s nodes, of course.

@f0rmiga (Member) commented Jan 25, 2020

@loewenstein Do you have spare cycles to help debug this?

@bikramnehra (Member)

It seems the originally proposed solution is not going to work out. An issue has been filed upstream, though it might take some time to fix since the root cause hasn't been identified yet.

Since the problematic part is the log-cache job, one potential fix, as proposed by @f0rmiga, is to leave the log-cache job outside of the doppler instance group and avoid scaling it altogether.
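
For illustration, a workaround along those lines could pin doppler to a single replica via the Helm sizing values; the sizing.doppler.instances key below is an assumption about kubecf's values layout, not something confirmed in this thread:

    # Hypothetical override (key name assumed): keep doppler as a singleton
    # until log-cache is split out of the doppler instance group.
    helm upgrade kubecf ./kubecf.tgz --namespace kubecf \
      --set "sizing.doppler.instances=1"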

@f0rmiga self-assigned this on Mar 2, 2020
@f0rmiga (Member) commented Mar 6, 2020

As discussed with @viovanov, we will keep it as is for now.

@f0rmiga assigned viovanov and unassigned f0rmiga on Mar 6, 2020
@fargozhu modified the milestones: 1.0.0, 1.1.0 on Mar 12, 2020
@fargozhu modified the milestones: 1.0.1, 1.2.0 on Mar 19, 2020
@viovanov (Member) commented Mar 26, 2020

Based on discussions, upstream's VM deployments don't seem to be affected by this issue.

We should implement the split of log-cache from doppler.

@aduffeck self-assigned this on Mar 30, 2020
@f0rmiga mentioned this issue on Apr 3, 2020
@fargozhu modified the milestones: Next Release, 2.0.0 on Apr 16, 2020