Why does Flux consume so much memory? answer: > 200k resources in the cluster #2633
Comments
heap and goroutine dumps

The user was asked to provide heap and goroutine dumps; the easiest way to do this is by making two requests against the daemon's pprof endpoint:

```
/home/flux # curl -sK -v http://localhost:3030/debug/pprof/heap > heap.out
/home/flux # curl -sK -v http://localhost:3030/debug/pprof/goroutines > goroutines.out
```

No spurious amount of goroutines can be observed, and the reported …
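For context on where those endpoints come from: Go programs get them essentially for free by importing net/http/pprof, which registers the /debug/pprof/* handlers on the default HTTP mux. A minimal sketch of that mechanism (not Flux's actual wiring; the port just mirrors the API port used in the curl commands above):

```go
package main

import (
	"log"
	"net/http"
	// The blank import registers the /debug/pprof/* handlers on
	// http.DefaultServeMux, which is what the curl commands above hit.
	_ "net/http/pprof"
)

func main() {
	// Serving the default mux exposes the profiling endpoints; the saved
	// profiles can then be inspected offline with `go tool pprof heap.out`.
	log.Fatal(http.ListenAndServe("localhost:3030", nil))
}
```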
It seems like the memory consumption peaks almost immediately (after Flux 1.16 starts up) and then Flux gets killed. Also, it seems like Flux is not killed anymore when setting GOGC=25. This points to the garbage collector not running often enough to satisfy both:

If the memory consumption of Flux keeps being stable (i.e. indicating no leaks) while running with GOGC=25 … However, 2.5 GB of memory seems like a huge amount of memory for a program like Flux on a normal-sized cluster. So I think that, at the very least, we should gather more information to fully explain why that is happening, e.g. the number of resources in the cluster, the size of the Git repo, etc.
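For readers unfamiliar with the knob being discussed: GOGC controls how far the heap may grow past the live data before the next collection is triggered, so lowering it makes the GC run more often and flattens allocation peaks at the cost of CPU. A minimal sketch (not Flux code) of the equivalent programmatic setting:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to starting the process with GOGC=25: a collection is
	// triggered once the heap grows 25% beyond the live data left by the
	// previous GC, instead of the default 100%. More frequent collections
	// mean lower peak memory usage but more CPU spent on GC.
	previous := debug.SetGCPercent(25)
	fmt.Printf("GC target lowered from %d%% to 25%%\n", previous)
}
```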
Hi, I work with Francesco who reported this issue via Slack.

We've been running …

The Git repo which Flux references is …

The Git path that Flux points to inside this repo is `rendered/environments/dev/nb1`.

We deploy …

Our flux.yaml (we also run fluxcloud/filebeat sidecars, but I've removed those):

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
  labels:
    instance: flux-apps
    name: flux
  name: flux-apps
  namespace: flux-apps
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  selector:
    matchLabels:
      instance: flux-apps-fluxd
      name: flux
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        instance: flux-apps-fluxd
        name: flux
    spec:
      automountServiceAccountToken: false
      containers:
      - args:
        - --memcached-hostname=flux-apps-memcached.flux-apps.svc.cluster.local
        - --listen-metrics=:3031
        - --git-ci-skip
        - --ssh-keygen-dir=/var/fluxd/keygen
        - --k8s-secret-name=flux-git-deploy
        - --git-url=<redacted>
        - --git-branch=master
        - --git-label=flux-dev-nb1-sync
        - --git-poll-interval=1m
        - --sync-interval=5m
        - --sync-garbage-collection=true
        - --git-path=rendered/environments/dev/nb1
        - --manifest-generation=true
        env:
        - name: GOGC
          value: "25"
        image: fluxcd/flux:1.16.0
        imagePullPolicy: IfNotPresent
        name: flux
        ports:
        - containerPort: 3030
          name: api
          protocol: TCP
        - containerPort: 3031
          name: metrics
          protocol: TCP
        resources:
          limits:
            cpu: 750m
            memory: 1500Mi
          requests:
            cpu: 250m
            memory: 512Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/fluxd/ssh
          mountPropagation: None
          name: git-key
          readOnly: true
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          mountPropagation: None
          name: flux-apps-token-sc2w7
          readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: flux-apps
      serviceAccountName: flux-apps
      shareProcessNamespace: false
      terminationGracePeriodSeconds: 30
      volumes:
      - name: git-key
        secret:
          defaultMode: 256
          optional: false
          secretName: flux-git-deploy
      - name: flux-apps-token-sc2w7
        secret:
          defaultMode: 420
          optional: false
          secretName: flux-apps-token-sc2w7
```

I'll leave this cluster running overnight so we can see if there are any more OOM events :) Tomorrow I'll try pointing Flux at another Git repo which only contains the …
Thanks for all the info @nabadger !!
Thanks for being positive :), but we are not happy about Flux getting killed without a clear explanation. Let's see if we can pinpoint the cause!
Thanks. While you are at it, can you also get us:
Also confirming that there is no …
Bit more info (it's a GKE cluster on 1.13.11)
Some more info
golang/go#16843 might be of interest.
@nabadger Thanks a lot for all the information! After looking at the graphs, I am now pretty sure that the problem doesn't come from a memory leak, but from the fact that Flux has huge peaks of memory allocations. Without setting GOGC=25 …

@nabadger Are things going well with GOGC=25?

It would be interesting to assess the nature of those peaks. My guess is that a peak happens on every sync with the cluster. During a sync, all the manifests in the repo are read and all the resources in the cluster are read too, which may explain the peak. Interestingly, you have set the sync interval to 5 minutes, but the peaks happen every 6 minutes, which I can't explain.
A little update on the situation, with a bit of context.

After raising the memory limit to 2 GB and setting GOGC=25 (on both instances) they have been stable for a whole night. Looking at the pattern of memory growth, I believe that the memory consumption comes from flux/pkg/cluster/kubernetes/sync.go Line 57 in ae9da52 (roughly sketched below).

I also believe that our OOM-killing problems started because we added more resources to the cluster and so pushed the memory usage of Flux above the limit that we had set and that had worked before. Both our Flux instances watch cluster-wide resources with no namespace filtering, so it makes sense that the memory usage of both instances increased, even though the bootstrap instance is much smaller in terms of managed manifests (and uses raw manifests).

My main concern now is: what kind of growth in memory usage should be expected when I add new resources to my cluster and the amount of data read during a sync increases? In the example graph I posted here, the growth in memory consumption happened at bigger intervals than the 5m sync ... so that is very weird to me.
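To illustrate why memory scales with the number of objects in the cluster, here is a rough, illustrative sketch of what a cluster-wide sync has to do (this is not Flux's actual sync.go code, and it assumes a recent client-go): enumerate every served resource type and list every object of each type into memory.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	disco, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Every API group/version the cluster serves, with its resource types.
	_, resourceLists, _ := disco.ServerGroupsAndResources()

	total := 0
	for _, list := range resourceLists {
		gv, err := schema.ParseGroupVersion(list.GroupVersion)
		if err != nil {
			continue
		}
		for _, r := range list.APIResources {
			// Skip subresources such as "pods/status".
			if strings.Contains(r.Name, "/") {
				continue
			}
			// Listing pulls every object of this type into memory at once;
			// with ~268k authrequests.dex.coreos.com objects this alone is huge.
			objs, err := dyn.Resource(gv.WithResource(r.Name)).List(context.TODO(), metav1.ListOptions{})
			if err != nil {
				continue
			}
			total += len(objs.Items)
		}
	}
	fmt.Println("objects read by a full cluster sync:", total)
}
```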
I think we found our issue: we have way too many transient resources because of Dex and this issue, which is still open. We can't even run "kubectl get authrequests.dex.coreos.com" because it times out due to the massive number of them most of the time; when I get it to succeed I can see 268K of them.

Without counting those we still have plenty of resources though:

So the real total is 1065 + 268000 ... that is probably too many :) I wonder if something like flux/pkg/cluster/kubernetes/sync.go Line 218 in ae9da52 …

If we could add an arbitrary filter to Flux I think it would be great (something along the lines of the sketch below). I will try to manually clean up the authrequests and see if it helps.
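As a purely hypothetical illustration of the kind of filter being asked for (no such flag exists in Flux; the excluded group name comes from the authrequests.dex.coreos.com resource mentioned above), an exclude list could be applied to the discovered resource types before anything is listed:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// excludedGroups is an illustrative exclude list: dex.coreos.com is the API
// group whose transient authrequests objects piled up to ~268k here.
var excludedGroups = map[string]bool{
	"dex.coreos.com": true,
}

// filterResourceLists drops every resource type whose API group is excluded,
// so a sync would never list (and hold in memory) objects of those types.
func filterResourceLists(lists []*metav1.APIResourceList) []*metav1.APIResourceList {
	var kept []*metav1.APIResourceList
	for _, list := range lists {
		gv, err := schema.ParseGroupVersion(list.GroupVersion)
		if err != nil || excludedGroups[gv.Group] {
			continue
		}
		kept = append(kept, list)
	}
	return kept
}

func main() {
	// Tiny demonstration with two hand-made resource lists.
	lists := []*metav1.APIResourceList{
		{GroupVersion: "apps/v1"},
		{GroupVersion: "dex.coreos.com/v1"},
	}
	for _, l := range filterResourceLists(lists) {
		fmt.Println("would sync:", l.GroupVersion)
	}
}
```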
After taking a look at the debug logs above (using …) …

Also, the logs seem to explain why the resident memory usage of the process stays up (instead of showing peaks, like the Go allocation graphs): the OS is not taking back the memory released by the GC right away. For instance:

…

which means that the Go program has only 14 MB allocated, but that the OS has 1.3 GB of memory to take back. After reading the runtime docs and this blog post, it seems like using …

I don't think this will fully fix the problem; the culprit is the peaks, and the memory limit is enforced by cgroups, which behave like the global memory manager, so I think the cgroup implementation would reclaim the released memory before resorting to killing the process. But it may help understand the problem better.
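That split between what the Go program has allocated and what the OS still considers resident is visible in the runtime's own accounting. A small sketch (independent of Flux) showing the relevant MemStats fields, plus debug.FreeOSMemory, which forces a GC and attempts to return freed memory to the OS eagerly:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// printHeap shows the difference between memory the program is really using
// (HeapAlloc) and memory the runtime is still holding without having returned
// it to the OS yet (HeapIdle - HeapReleased). The latter is what keeps the
// resident set high even after the GC has freed the Go objects.
func printHeap(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%s: alloc=%d MiB, held-but-unreleased=%d MiB\n",
		label,
		m.HeapAlloc/1024/1024,
		(m.HeapIdle-m.HeapReleased)/1024/1024)
}

func main() {
	// Allocate, touch, and drop a large slice to create a gap between what
	// the Go program needs and what the OS still considers resident.
	big := make([]byte, 1<<30) // 1 GiB
	for i := range big {
		big[i] = 1
	}
	big = nil

	runtime.GC()
	printHeap("after GC")

	// Forces a GC and eagerly returns as much memory as possible to the OS.
	debug.FreeOSMemory()
	printHeap("after FreeOSMemory")
}
```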
Wow. Yes, that would explain it. For the record, I was just looking at the ... and I was going to remind you to give me the resource-count in the cluster (the accumulated memory allocated by the cluster resource decoding path was huge).
I assume most of the …
Yes, that would help. I think that makes sense. So, things we can do:
It is a bug indeed, see dexidp/dex#1292.

Those would both be great for us. The diagnostic would be great as well; it would have made this problem very obvious :) Thanks for your help on this ... in the meantime, I am still deleting resources! :)
@2opremio @hiddeco This is now solved: 160 MB for our big instance and 40 MB for the small one. Feel free to close this case; I will keep an eye out for the proposed changes of adding a filter argument. Thanks again for staying with us on this!
@primeroz Great! I have created #2642, #2643 and #2645 as follow-ups.

Now I am thinking that we could do that without extra flags, using RBAC (i.e. forbidding Flux from listing or getting that resource); see #2642 (comment). Let's continue the conversation there.
Describe the bug

The Flux daemon uses an increasing amount of RAM (up to 2.5 GB before getting OOM-killed for Flux 1.16) with `--sync-garbage-collection=true` and `--manifest-generation=true` set, and just 1.9M of resource files on disk.

To Reproduce

Steps to reproduce the behaviour:

Spin up a Flux daemon with `--sync-garbage-collection=true` and `--manifest-generation=true` set. The `.flux.yaml` file on this setup is:

Expected behavior

A more graceful amount of memory being used.

Additional context

1.15.0 and 1.16.0