
Why does Flux consume so much memory? answer: > 200k resources in the cluster #2633

Closed
hiddeco opened this issue Nov 26, 2019 · 19 comments

@hiddeco
Member

hiddeco commented Nov 26, 2019

Describe the bug
The Flux daemon uses an increasing amount of RAM (up to 2.5 GB before getting OOM-killed, for Flux 1.16) with --sync-garbage-collection=true and --manifest-generation=true set, and just 1.9M of resource files on disk.

To Reproduce
Steps to reproduce the behaviour:

  1. Spin up a Flux daemon with --sync-garbage-collection=true and --manifest-generation=true set (a minimal launch sketch follows the config below). The .flux.yaml file in this setup is:

    version: 1
    patchUpdated:
      generators:
      - command: cat *.yaml
      patchFile: flux-patch.conf
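
    For reference, a minimal daemon invocation with the two flags in question might look like this (a sketch; the Git URL, branch and path are placeholders, and other flags are omitted):

    fluxd \
      --git-url=git@example.com:org/repo.git \
      --git-branch=master \
      --git-path=rendered/environments/dev \
      --manifest-generation=true \
      --sync-garbage-collection=true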

Expected behavior
A more reasonable amount of memory being used.

Additional context

  • Flux version: behaviour has been reported for versions 1.15.0 and 1.16.0
@hiddeco hiddeco added bug and blocked-needs-validation labels Nov 26, 2019
@hiddeco
Member Author

hiddeco commented Nov 26, 2019

heap and goroutine dumps

The user was asked to provide heap and goroutine dumps; the easiest way to do this is by making two curl calls:

/home/flux # curl -sK -v http://localhost:3030/debug/pprof/heap > heap.out
/home/flux # curl -sK -v http://localhost:3030/debug/pprof/goroutines > goroutines.out
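
The dumps can then be inspected offline with go tool pprof (a sketch, assuming a local Go toolchain; pointing pprof at the fluxd binary as well improves symbol resolution):

go tool pprof -inuse_space -top heap.out    # live heap (what the inuse_space figure below refers to)
go tool pprof -alloc_space -top heap.out    # cumulative allocations since process start
go tool pprof -top goroutines.out           # goroutine counts grouped by stack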

No spurious number of goroutines can be observed, and the reported inuse_space totals only about 7-9 MB.

/tmp

Garbage collection of temporary resources seems to be okay:

/home/flux # ls /tmp | wc -l
5
/home/flux # ls /tmp
flux-gitclone236288646  flux-working285393832   flux-working842178345   goroutine-2.4g          heap-2.4g

top

top shows a total VSZ (virtual set size) of 2495m for the fluxd process:

PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    7     1 root     S    2495m  16%   3   0% fluxd --memcached-hostname=<snip> --listen-metrics=:3031 --git-ci-skip --ssh-keygen-dir=/var/fluxd/keygen --k8s-secret-name=f

pmap

pmap 7                       
7: fluxd --memcached-hostname=<snip> --listen-metrics=:3031 --git-ci-skip --ssh-keygen-dir=/var/fluxd/keygen --k8s-secret-name=flux-git-deploy --git-url=<snip> --git-bra                                                            
0000000000400000   16780K r-xp  /usr/local/bin/fluxd                                                      
0000000001463000   17104K r--p  /usr/local/bin/fluxd                                                      
0000000002517000     512K rw-p  /usr/local/bin/fluxd                                                      
0000000002597000     204K rw-p    [ anon ]                                                                
000000c000000000   67584K rw-p    [ anon ]                                                                
000000c004200000   20480K rw-p    [ anon ]                   
...
... TONS OF ANON
00007fca40213000     960K rw-p    [ anon ]
00007fca40307000   24172K rw-p    [ anon ]
00007fca41aaa000   41096K rw-p    [ anon ]
00007ffd66deb000     132K rw-p  [stack]
00007ffd66e0f000      12K r--p  [vvar]
00007ffd66e12000       8K r-xp  [vdso]
ffffffffff600000       4K r-xp  [vsyscall]
mapped: 2555840K
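
As a rough cross-check of all the anonymous mappings above, they can be summed straight from the pmap output (a sketch against the column layout shown here; BusyBox and procps pmap formats differ slightly):

pmap 7 | awk '/anon/ { gsub("K", "", $2); sum += $2 } END { print sum " KB anonymous" }'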

@2opremio
Contributor

It seems like the memory consumption peaks almost immediately after Flux 1.16 starts up, and then Flux gets killed.

Also, it seems like Flux is no longer killed when setting GOGC=25, and the memory consumption is stable (no build-up in memory usage, indicating no leaks are happening).

This points to the garbage collector not running often enough to satisfy both:

  • The memory limit of the pod
  • The frequency and size of Flux's allocations

If the memory consumption of Flux remains stable (i.e. indicating no leaks) while running with GOGC=25, I am not sure what else we can do (or rather what's worth doing) to fix this.

However, 2.5 GB seems like a huge amount of memory for a program like Flux on a normal-sized cluster.

So I think we should at least gather more information to fully explain why this is happening, e.g. the number of resources in the cluster, the size of the Git repo, etc.

@nabadger

nabadger commented Nov 26, 2019

Hi, I work with Francesco, who reported this issue via Slack.

We've been running v1.16.0 with GOGC=25 for about an hour, with only 1 OOMKilled event, which is not bad.

[screenshot: 2019-11-26_16-30]

The Git repo which Flux references is 77M - this is a monorepo, so it contains kustomize and rendered manifests for all environments.

The Git path that Flux points to inside this repo is 1.8M (the rendered manifests for a specific environment).

We deploy 236 Kubernetes resources (where a resource can be something like a ClusterRole, Deployment, DaemonSet, Ingress, ConfigMap etc).

Our flux.yaml Deployment manifest (we also run fluxcloud/filebeat sidecars, but I've removed those):

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
  labels:
    instance: flux-apps
    name: flux
  name: flux-apps
  namespace: flux-apps
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  selector:
    matchLabels:
      instance: flux-apps-fluxd
      name: flux
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        instance: flux-apps-fluxd
        name: flux
    spec:
      automountServiceAccountToken: false
      containers:
      - args:
        - --memcached-hostname=flux-apps-memcached.flux-apps.svc.cluster.local
        - --listen-metrics=:3031
        - --git-ci-skip
        - --ssh-keygen-dir=/var/fluxd/keygen
        - --k8s-secret-name=flux-git-deploy
        - --git-url=<redacted>
        - --git-branch=master
        - --git-label=flux-dev-nb1-sync
        - --git-poll-interval=1m
        - --sync-interval=5m
        - --sync-garbage-collection=true
        - --git-path=rendered/environments/dev/nb1
        - --manifest-generation=true
        env:
        - name: GOGC
          value: "25"
        image: fluxcd/flux:1.16.0
        imagePullPolicy: IfNotPresent
        name: flux
        ports:
        - containerPort: 3030
          name: api
          protocol: TCP
        - containerPort: 3031
          name: metrics
          protocol: TCP
        resources:
          limits:
            cpu: 750m
            memory: 1500Mi
          requests:
            cpu: 250m
            memory: 512Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/fluxd/ssh
          mountPropagation: None
          name: git-key
          readOnly: true
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          mountPropagation: None
          name: flux-apps-token-sc2w7
          readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: flux-apps
      serviceAccountName: flux-apps
      shareProcessNamespace: false
      terminationGracePeriodSeconds: 30
      volumes:
      - name: git-key
        secret:
          defaultMode: 256
          optional: false
          secretName: flux-git-deploy
      - name: flux-apps-token-sc2w7
        secret:
          defaultMode: 420
          optional: false
          secretName: flux-apps-token-sc2w7

I'll leave this cluster running overnight so we can see if there are any more OOM events :)

Tomorrow I'll try pointing flux at another Git repo which only contains the 1.8M of manifests (to see whether the size of the repo it's fetching has an impact here).
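
In the meantime, a quick way to keep an eye on the pod's actual memory usage (a sketch; assumes the metrics API is available and uses the name=flux label from the Deployment above):

kubectl -n flux-apps top pod -l name=flux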

@2opremio
Contributor

Thanks for all the info @nabadger !!

for about an hour, with only 1 OOMKilled event which is not bad.

Thanks for being positive :) , but we are not happy about Flux getting killed without a clear explanation. Let's see if we can pinpoint the cause!

I'll leave this cluster running overnight so we can see if there are any more OOM events :)

Thanks. While you are at it, can you also get us:

  • Some Go memory graphs:
    1. The size of the reachable objects: go_memstats_heap_alloc_bytes
    2. The size of the reachable and unreachable objects (yet to be freed by the GC): go_memstats_alloc_bytes.
    3. The allocation rate over time: go_memstats_alloc_bytes_total
      If it's easier, use a pre-baked dashboard like https://grafana.com/grafana/dashboards/240, which contains some of these metrics (or see the quick curl check after this list).
  • The number of Kubernetes resources reachable by Flux in the cluster (from the Flux Deployment it seems like Flux can reach them all, but maybe you have some RBAC on top).
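
As a quick alternative to a full dashboard, the same Go runtime metrics can be read straight from the daemon's Prometheus endpoint (a sketch; 3031 is the --listen-metrics port from the Deployment above):

curl -s http://localhost:3031/metrics | grep -E 'go_memstats_(heap_alloc_bytes|alloc_bytes|alloc_bytes_total)'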

@nabadger

[screenshot: grafana]

@nabadger

[screenshot: go_memstats_heap_alloc_bytes]

@nabadger

[screenshot: go_memstats_alloc_bytes]

@nabadger

Also confirming that there is no ResourceQuota affecting CPU/RAM in the flux-apps namespace where this pod runs. There is also no LimitRange set in this namespace.

@nabadger

nabadger commented Nov 26, 2019

A bit more info (it's a GKE cluster on 1.13.11):

gke-nb1-nb1-standard-43bb2336-nhmk   Ready    <none>   11h   v1.13.11-gke.14   10.0.0.32                   Container-Optimized OS from Google   4.14.138+        docker://18.9.7
gke-nb1-nb1-standard-835fd619-2rmx   Ready    <none>   11h   v1.13.11-gke.14   10.0.0.31                   Container-Optimized OS from Google   4.14.138+        docker://18.9.7
gke-nb1-nb1-standard-d6ffcf1d-3wm5   Ready    <none>   11h   v1.13.11-gke.14   10.0.0.33                   Container-Optimized OS from Google   4.14.138+        docker://18.9.7
gke-nb1-nb1-standard-d6ffcf1d-p092   Ready    <none>   11h   v1.13.11-gke.14   10.0.0.30                   Container-Optimized OS from Google   4.14.138+        docker://18.9.7

@nabadger

Some more info

/proc/6 # cat /proc/6/status 
Name:	fluxd
Umask:	0022
State:	S (sleeping)
Tgid:	6
Ngid:	0
Pid:	6
PPid:	1
TracerPid:	0
Uid:	0	0	0	0
Gid:	0	0	0	0
FDSize:	64
Groups:	0 1 2 3 4 6 10 11 20 26 27 
NStgid:	6
NSpid:	6
NSpgid:	6
NSsid:	1
VmPeak:	 1995196 kB
VmSize:	 1995196 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	 1543992 kB
VmRSS:	 1418272 kB
RssAnon:	 1390584 kB
RssFile:	   27688 kB
RssShmem:	       0 kB
VmData:	 1961220 kB
VmStk:	     132 kB
VmExe:	   16748 kB
VmLib:	       8 kB
VmPTE:	    3776 kB
VmPMD:	      20 kB
VmSwap:	       0 kB
HugetlbPages:	       0 kB
Threads:	16
SigQ:	0/59997
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000000000000
SigCgt:	fffffffe7fc1feff
CapInh:	00000000a80425fb
CapPrm:	00000000a80425fb
CapEff:	00000000a80425fb
CapBnd:	00000000a80425fb
CapAmb:	0000000000000000
NoNewPrivs:	0
Seccomp:	0
Speculation_Store_Bypass:	thread vulnerable
Cpus_allowed:	f
Cpus_allowed_list:	0-3
Mems_allowed:	00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	1427571
nonvoluntary_ctxt_switches:	402216
Mem: 6807336K used, 8590980K free, 1980K shrd, 233828K buff, 3519764K cached
CPU:   2% usr   2% sys   0% nic  95% idle   0% io   0% irq   0% sirq
Load average: 0.18 0.28 0.51 1/929 4633
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    6     1 root     S    1948m  12%   0   0% fluxd 

@nabadger

golang/go#16843 might be of interest.

@2opremio
Contributor

@nabadger Thanks a lot for all the information!

After looking at the graphs, I am now pretty sure that the problem doesn't come from a memory leak, but from the fact that Flux has huge peaks of memory allocations. Without setting GOGC, it gets an implicit value of 100, meaning the garbage collector lets the heap grow by the same amount (100%) as the live data left after the previous collection before it runs again. So, what I think happens is that Flux gets OOM-killed while the garbage piles up (the heap doubles) before the GC kicks in (which is why things are better with GOGC=25).

@nabadger Are things going well with GOGC=25 after running Flux with it for a while? Did the CPU consumption rise a lot?

It would be interesting to assess the nature of those peaks. My guess is that a peak happens on every sync with the cluster. During a sync, all the manifests in the repo are read and all the resources in the cluster are read too, which may explain the peak. Interestingly, you have set the sync interval to 5 minutes, but the peaks happen every 6 minutes, which I can't explain.
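
To make the GOGC arithmetic concrete, here is a rough sketch (illustrative numbers only, not measurements; the GC's next target is roughly live heap * (1 + GOGC/100)):

# Illustrative only: heap target ~ live heap * (1 + GOGC/100)
live_mb=1000   # hypothetical live/transient heap during a sync peak
for gogc in 100 25; do
  echo "GOGC=$gogc -> next GC target ~ $(( live_mb * (100 + gogc) / 100 )) MB"
done

With a hypothetical 1 GB peak, GOGC=100 lets the heap grow towards ~2 GB before collecting, while GOGC=25 caps it around ~1.25 GB, which is consistent with the OOM kills going away under the pod limits above once GOGC=25 was set.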

@primeroz

A little update on the situation, with a bit of context:

  • We run 2 Flux instances on each cluster, one called bootstrap and one called apps.
  • Both instances are cluster-wide in terms of Kubernetes namespaces.
  • The bootstrap one is much smaller than the apps one, since it just creates namespaces, PSPs and a few other things.

After raising the memory limit to 2 GB and setting GOGC=25 (on both instances), they have been stable for a whole night.


FLUX-APPS
[screenshot: flux-apps-memory-usage-1.16-stable]


FLUX-BOOTSTRAP
[screenshot: flux-bootstrap-memory-usage-1.16-stable]


Looking at the pattern of memory growth, I believe that the memory consumption comes from

clusterResources, err := c.getAllowedResourcesBySelector("")
when Flux collects the resources it is supposed to manage from Kubernetes.

I also believe that our OOM-killing problems started because we added more resources to the cluster, pushing the memory usage of Flux above a limit that we had set and that had worked before.

Both our Flux instances watch cluster-wide resources with no namespace filtering, so it makes sense that the memory usage of both instances increased, even though the bootstrap instance is much smaller in terms of managed manifests (and uses raw manifests).

During a sync all the manifests in the repo are read and all the resources in the cluster are read too, which may explain the peak. Interestingly, you have set the sync time to 5 minutes, but the peaks happen every 6 minutes, which I can't explain.

My main concern now is what kind of growth in memory usage is to be expected when I add new resources to my cluster and the amount of data read during a sync increases.

In the example graph I posted here, the growth in memory consumption happened at bigger intervals than the 5m sync, so that is very weird to me.

@primeroz

primeroz commented Nov 28, 2019

I think we found our issue: we have way too many transient resources because of Dex and this issue, which is still open.

We can't even run "kubectl get authrequests.dex.coreos.com" most of the time, because it times out due to the massive number of them; when I get it to succeed I can see 268K of them:

$ kubectl get authrequests.dex.coreos.com --all-namespaces | wc -l
268451

Without counting those, we still have plenty of resources though:

Number of alertmanagers.monitoring.coreos.com : 1
Number of globalfelixconfigs.crd.projectcalico.org : 0
Number of podsecuritypolicies.extensions : 6
Number of apiservices.apiregistration.k8s.io : 35
Number of globalnetworkpolicies.crd.projectcalico.org : 0
Number of podsecuritypolicies.policy : 6
Number of authcodes.dex.coreos.com : 1
Number of globalnetworksets.crd.projectcalico.org : 0
Number of pods.metrics.k8s.io : 82
Number of horizontalpodautoscalers.autoscaling : 0
Number of podtemplates : 0
Number of bgpconfigurations.crd.projectcalico.org : 0
Number of hostendpoints.crd.projectcalico.org : 0
Number of priorityclasses.scheduling.k8s.io : 4
Number of capacityrequests.internal.autoscaling.k8s.io : 0
Number of ingresses.extensions : 10
Number of prometheuses.monitoring.coreos.com : 1
Number of certificaterequests.cert-manager.io : 4
Number of ippools.crd.projectcalico.org : 0
Number of prometheusrules.monitoring.coreos.com : 3
Number of certificates.cert-manager.io : 4
Number of issuers.cert-manager.io : 2
Number of redisfailovers.databases.spotahome.com : 0
Number of certificatesigningrequests.certificates.k8s.io : 0
Number of jobs.batch : 12
Number of refreshtokens.dex.coreos.com : 0
Number of challenges.acme.cert-manager.io : 0
Number of leases.coordination.k8s.io : 0
Number of replicasets.apps : 63
Number of clusterinformations.crd.projectcalico.org : 1
Number of limitranges : 14
Number of replicasets.extensions : 63
Number of clusterissuers.cert-manager.io : 1
Number of managedcertificates.networking.gke.io : 0
Number of replicationcontrollers : 0
Number of clusterrolebindings.rbac.authorization.k8s.io : 80
Number of mutatingwebhookconfigurations.admissionregistration.k8s.io : 0
Number of resourcequotas : 32
Number of clusterroles.rbac.authorization.k8s.io : 97
Number of namespaces : 14
Number of rolebindings.rbac.authorization.k8s.io : 39
Number of componentstatuses : 4
Number of networkpolicies.crd.projectcalico.org : 0
Number of roles.rbac.authorization.k8s.io : 22
Number of configmaps : 37
Number of networkpolicies.extensions : 45
Number of sealedsecrets.bitnami.com : 14
Number of connectors.dex.coreos.com : 0
Number of networkpolicies.networking.k8s.io : 45
Number of secrets : 109
Number of controllerrevisions.apps : 14
Number of nodes : 4
Number of serviceaccounts : 72
Number of cronjobs.batch : 2
Number of nodes.metrics.k8s.io : 4
Number of servicemonitors.monitoring.coreos.com : 18
Number of customresourcedefinitions.apiextensions.k8s.io : 34
Number of oauth2clients.dex.coreos.com : 5
Number of services : 41
Number of daemonsets.apps : 7
Number of offlinesessionses.dex.coreos.com : 0
Number of signingkeies.dex.coreos.com : 1
Number of daemonsets.extensions : 7
Number of orders.acme.cert-manager.io : 1
Number of statefulsets.apps : 4
TOTAL: 1065

So the real total is 1065 + 268,000 ... that is probably too many :)
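
For reference, counts like the ones above can be gathered with a small kubectl loop along these lines (a sketch; listing every resource type can be slow on a busy cluster):

total=0
for r in $(kubectl api-resources --verbs=list -o name); do
  n=$(kubectl get "$r" --all-namespaces --no-headers --ignore-not-found 2>/dev/null | wc -l)
  echo "Number of $r : $n"
  total=$((total + n))
done
echo "TOTAL: $total"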

I wonder if something like this filter could help:

// exclude the *.metrics.k8s.io resources to avoid querying the cluster metrics

If we could add an arbitrary filter to Flux, I think it would be great.

I will try to manually clean up the authrequests and see if it helps.
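
A bulk cleanup can presumably be done with something like this (the namespace is a placeholder for wherever Dex stores its objects; --wait=false avoids waiting on each deletion):

kubectl -n <dex-namespace> delete authrequests.dex.coreos.com --all --wait=false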

@2opremio
Contributor

2opremio commented Nov 28, 2019

flux-debug-log.gz

After taking a look at the debug logs above (using GODEBUG=gctrace=1) I am pretty sure that the peaks come from syncing (I still don't understand why they happen every 6 minutes and not 5, but I can live with that).

Also, the logs seem to explain why the resident memory usage of the process stays up (instead of showing peaks, like the Go allocation graphs): the OS is not taking back the memory released by the GC right away. For instance:

flux-apps-668f7459cf-d4s7l flux scvg: inuse: 13, idle: 1265, sys: 1278, released: 1264, consumed: 14 (MB)

which means that the Go program has only 14 MB allocated, but that the OS has 1.3 GB of memory to take back.

After reading the runtime docs and this blog post, it seems like using GODEBUG=madvdontneed=1 should release the memory back to the OS immediately.

I don't think this will fully fix the problem; the culprit is the peaks, and the memory limit is enforced by cgroups, which behave like the global memory manager, so I think the cgroup implementation would reclaim the released memory before resorting to killing the process. But it may help us understand the problem better.
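
A quick way to experiment with both settings (together with the gctrace logging mentioned above) on a running instance could be something like this, using the flux-apps Deployment names from earlier (adjust as needed):

kubectl -n flux-apps set env deployment/flux-apps GODEBUG=gctrace=1,madvdontneed=1 GOGC=25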

@2opremio
Contributor

2opremio commented Nov 28, 2019

we can't even "kubectl get authrequests.dex.coreos.com" because it times out due to the massive amount of them most of the time , when i get it to succed i can see 268K of them

Wow. Yes, that would explain it.

For the record, I was just looking at the alloc_space heap profile (the total memory allocated, broken down by the program's call graph) from
heap-2.4g.gz ...

[image: profile001, alloc_space profile graph]

... and I was going to remind you to give me the resource count in the cluster (the accumulated memory allocated by the cluster-resource decoding path was huge).

I will try to manually cleanup the authrequests and see if it helps

I assume most of the authrequests.dex.coreos.com resources can be considered leaks? If not, I think there is a big design problem with DEX.

this filter could help
If we could add an arbitrary filter to flux i think it would be great

Yes, that would help. I think that makes sense.

So, things we can do:

  • Add a --k8s-exclude-resources flag defaulting to the metrics we already exclude
  • Print out the number of resources found in the cluster on every sync (this would have helped diagnose the problem earlier).
  • Avoid reading all the resources from the cluster. It kind of makes sense to do that during garbage collection (since we need to find unreferenced resources, but even then we could make sure that we only access resources with the Flux tag). However, it doesn't make sense to do that when we are not garbage-collecting.

@primeroz

I assume most of the authrequests.dex.coreos.com resources can be considered leaks? If not, I think there is a big design problem with DEX.

It is a bug indeed, see dexidp/dex#1292.
I am not sure anyone has posted a good solution yet, so we will probably just stop using Kubernetes as the storage backend for Dex and maybe use a dedicated etcd. That way we will be able to control the situation better and also not affect Kubernetes with this issue.

Add a --k8s-exclude-resources flag defaulting to the metrics we already exclude
Print out the amount of resources in the cluster found in every sync (this would had helped diagnose the problem earlier).

Those would both be great for us.
The filter would be useful anyway when you, as a cluster operator, know that some CRDs have nothing to do with Flux.

The diagnostic would be great as well; it would have made this problem very obvious :)

Thanks for your help on this. In the meantime, I am still deleting resources! :)

@2opremio 2opremio added question and removed blocked-needs-validation, bug labels Nov 28, 2019
@primeroz

@2opremio @hiddeco This is now solved.
Since deleting the 260K objects from Kubernetes and fixing their cause, Flux is back to a normal memory usage pattern.

160 MB for our big one and 40 MB for the small one.

Feel free to close this issue; I will keep an eye out for the proposed change of adding a filter argument.

Thanks again for staying with us on this!

@2opremio 2opremio self-assigned this Nov 29, 2019
@2opremio 2opremio changed the title from "Flux daemon uses a high amount of not directly identifiable memory" to "Why does Flux consume so much memory? answer: > 200k resources in the cluster" Nov 29, 2019
@2opremio
Contributor

@primeroz Great!

I have created #2642, #2643 and #2645 as follow-ups.

i will keep an eye out for the proposed changes of adding a filter argument

Now I am thinking that we could do that without extra flags, using RBAC (i.e. forbidding Flux from listing or getting that resource), see #2642 (comment)

Let's continue the conversation there.
