
source-controller pod restarting (OOMKilled) #192

Closed
avacaru opened this issue Nov 2, 2020 · 47 comments
@avacaru

avacaru commented Nov 2, 2020

I have noticed that the source-controller pod of my gotk deployment has been restarting a huge number of times over the weekend (148 times on version 0.1.1). I've re-deployed a newer version (0.2.1) but the restarts keep happening (about two every half hour).

$> k describe po -n gotk-system source-controller-5cc54c757c-ccwz8
Name:         source-controller-5cc54c757c-ccwz8
Namespace:    gotk-system
Priority:     0
Node:         my-node/10.0.10.11
Start Time:   Mon, 02 Nov 2020 13:57:18 +0000
Labels:       app=source-controller
              pod-template-hash=5cc54c757c
Annotations:  prometheus.io/port: 8080
              prometheus.io/scrape: true
Status:       Running
IP:           10.0.10.12
IPs:
  IP:           10.0.10.12
Controlled By:  ReplicaSet/source-controller-5cc54c757c
Containers:
  manager:
    Container ID:  docker://6b4a1a89311360cb832fe1d540b4f4cb96c9b8a6591fb01349390ffcdfc99b90
    Image:         my-registry.com/fluxcd/source-controller:v0.2.1
    Image ID:      docker-pullable://my-registry.com/fluxcd/source-controller@sha256:e8b708159f6d651a9577695af14bf3291ef844ca5cd7e85f182416b76561d27c
    Ports:         9090/TCP, 8080/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --events-addr=
      --watch-all-namespaces=true
      --log-level=info
      --log-json
      --enable-leader-election
      --storage-path=/data
    State:          Running
      Started:      Mon, 02 Nov 2020 14:34:16 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 02 Nov 2020 14:13:59 +0000
      Finished:     Mon, 02 Nov 2020 14:34:15 +0000
    Ready:          True
    Restart Count:  2
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      50m
      memory:   64Mi
    Liveness:   http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RUNTIME_NAMESPACE:  gotk-system (v1:metadata.namespace)
      HTTPS_PROXY:        http://http.my-proxy.com:8000
      NO_PROXY:           10.0.0.0/8,172.0.0.0/8
    Mounts:
      /data from data (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bfr47 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  default-token-bfr47:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-bfr47
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/arch=amd64
                 kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age                 From                                       Message
  ----    ------     ----                ----                                       -------
  Normal  Scheduled  <unknown>           default-scheduler                          Successfully assigned gotk-system/source-controller-5cc54c757c-ccwz8 to my-node
  Normal  Pulled     4m6s (x3 over 41m)  kubelet, my-node  Container image "my-registry.com/fluxcd/source-controller:v0.2.1" already present on machine
  Normal  Created    4m6s (x3 over 41m)  kubelet, my-node  Created container manager
  Normal  Started    4m6s (x3 over 41m)  kubelet, my-node  Started container manager

This causes the helm-controller to not be able to reconcile HelmReleases:

$> k get hr --all-namespaces
NAMESPACE                NAME                                       READY   STATUS                                                                                                                                                                                                                                      AGE
namespace1           chart1        False   Get "http://source-controller.gotk-system/helmchart/namespace1/chart1/chart1-v0.15.5.tgz": dial tcp 172.20.225.87:80: connect: connection refused              2d18h
( . . .)
( . . .)
( . . .)
namespace2          chart11       False   Get "http://source-controller.gotk-system/helmchart/namespace2/chart11/chart11-v0.1.3.tgz": dial tcp 172.20.225.87:80: connect: connection refused              2d18h

The source controller manages one GitRepository and two HelmRepositories.
The helm controller takes care of 11 HelmReleases, each with similar configuration:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-release
  namespace: namespace1
spec:
  install:
    remediation:
      retries: -1
  upgrade:
    remediation:
      retries: -1
  interval: 1m0s
  releaseName: my-release
  chart:
    spec:
      version: 1.0.2
      chart: my-chart
      sourceRef:
        kind: HelmRepository
        name: my-repository
        namespace: namespace1
  valuesFrom:
  - kind: ConfigMap
    name: my-values
    valuesKey: environment
    targetPath: myEnv
  values:
    my-value: 30

While writing up this issue the source-controller restarted 3 more times.
Logs from the source-controller don't indicate any errors:

{"level":"info","ts":"2020-11-02T15:00:17.646Z","logger":"controllers.HelmChart","msg":"Reconciliation finished in 364.572799ms, next run in 1m0s","controller":"helmchart","request":"namespace1/chart1"}
( . . . )
( . . . )
( . . . )
{"level":"info","ts":"2020-11-02T15:00:17.646Z","logger":"controllers.HelmChart","msg":"Reconciliation finished in 364.572799ms, next run in 1m0s","controller":"helmchart","request":"namespace1/chart11"}
{"level":"info","ts":"2020-11-02T15:01:12.527Z","logger":"controllers.GitRepository","msg":"Reconciliation finished in 1.631398488s, next run in 3m0s","controller":"gitrepository","request":"namespace1/my-git-repo"}
{"level":"info","ts":"2020-11-02T15:01:12.870Z","logger":"controllers.HelmRepository","msg":"Reconciliation finished in 1.165803995s, next run in 3m0s","controller":"helmrepository","request":"namespace1/my-repository"}
@stefanprodan
Member

stefanprodan commented Nov 2, 2020

Can you post the kubelet error here? It should be under the ReplicaSet or Pod description. Can you also post what interval you are using in the GitRepository?

@avacaru
Author

avacaru commented Nov 2, 2020

The pod description doesn't show any error, just that the pod was Terminated with reason OOMKilled.
Here's the ReplicaSet description:

$> k describe replicasets.apps -n gotk-system source-controller-5cc54c757c
Name:           source-controller-5cc54c757c
Namespace:      gotk-system
Selector:       app=source-controller,pod-template-hash=5cc54c757c
Labels:         app=source-controller
                pod-template-hash=5cc54c757c
Annotations:    deployment.kubernetes.io/desired-replicas: 1
                deployment.kubernetes.io/max-replicas: 2
                deployment.kubernetes.io/revision: 1
Controlled By:  Deployment/source-controller
Replicas:       1 current / 1 desired
Pods Status:    1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       app=source-controller
                pod-template-hash=5cc54c757c
  Annotations:  prometheus.io/port: 8080
                prometheus.io/scrape: true
  Containers:
   manager:
    Image:       my-registry.com/fluxcd/source-controller:v0.2.1
    Ports:       9090/TCP, 8080/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --events-addr=
      --watch-all-namespaces=true
      --log-level=info
      --log-json
      --enable-leader-election
      --storage-path=/data
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      50m
      memory:   64Mi
    Liveness:   http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RUNTIME_NAMESPACE:   (v1:metadata.namespace)
      HTTPS_PROXY:        http://http.my-proxy.com:8000
      NO_PROXY:           10.0.0.0/8,172.0.0.0/8
    Mounts:
      /data from data (rw)
      /tmp from tmp (rw)
  Volumes:
   data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
   tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
Events:         <none>

And here are the events in gotk-system:

$> k get events -n gotk-system
LAST SEEN   TYPE      REASON           OBJECT                                   MESSAGE
60m         Normal    LeaderElection   configmap/305740c0.fluxcd.io             source-controller-5cc54c757c-ccwz8_37899998-7d0f-4107-8ec2-cfb907cfe7a3 became leader
39m         Normal    LeaderElection   configmap/305740c0.fluxcd.io             source-controller-5cc54c757c-ccwz8_99debb56-a4d6-43d7-90c1-d44ee06d08e1 became leader
29m         Normal    LeaderElection   configmap/305740c0.fluxcd.io             source-controller-5cc54c757c-ccwz8_5c11d40a-c436-4281-97c3-66205ab3d16d became leader
19m         Normal    LeaderElection   configmap/305740c0.fluxcd.io             source-controller-5cc54c757c-ccwz8_1a1aa41c-5a5e-440e-941b-498aaf80d59a became leader
18m         Normal    LeaderElection   configmap/305740c0.fluxcd.io             source-controller-5cc54c757c-ccwz8_c48a4f03-14f8-4d66-87dc-a38fdb8cc608 became leader
16m         Normal    LeaderElection   configmap/305740c0.fluxcd.io             source-controller-5cc54c757c-ccwz8_8f619f94-12ba-4acd-ae8c-346325b1e77a became leader
10m         Normal    LeaderElection   configmap/305740c0.fluxcd.io             source-controller-5cc54c757c-ccwz8_16f7528e-336f-4f78-9ce5-c6cc424658b5 became leader
70s         Normal    LeaderElection   configmap/305740c0.fluxcd.io             source-controller-5cc54c757c-ccwz8_bf7cc933-c400-46b6-a7d4-fcbf30e6fd49 became leader
16m         Normal    Pulled           pod/source-controller-5cc54c757c-ccwz8   Container image "my-registry.com/fluxcd/source-controller:v0.2.1" already present on machine
16m         Normal    Created          pod/source-controller-5cc54c757c-ccwz8   Created container manager
16m         Normal    Started          pod/source-controller-5cc54c757c-ccwz8   Started container manager
4m8s        Warning   BackOff          pod/source-controller-5cc54c757c-ccwz8   Back-off restarting failed container
16m         Warning   Unhealthy        pod/source-controller-5cc54c757c-ccwz8   Readiness probe failed: Get http://10.0.10.11:9090/: dial tcp 10.0.10.11:9090: connect: connection refused
16m         Warning   Unhealthy        pod/source-controller-5cc54c757c-ccwz8   Liveness probe failed: Get http://10.0.10.11:9090/: dial tcp 10.0.10.11:9090: connect: connection refused

This is how the GitRepository is defined:

apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: my-git-repo
spec:
  url: https://my-git-server.com/scm/my-repo
  secretRef:
    name: git-secret
  interval: 3m
  timeout: 60s
  ref:
    branch: my-branch

@stefanprodan
Member

Are all your HelmReleases coming from HelmRepositories, or do you have charts in GitRepositories?

@avacaru
Author

avacaru commented Nov 2, 2020

All the HelmReleases have a HelmRepository as their source (the same repository reference).

@hiddeco
Member

hiddeco commented Nov 2, 2020

Can you share the sizes of the .tgz files as produced by the source-controller for the HelmChart resources, the size of the YAML files produced for the HelmRepository resources, and the size of the artifact created for the GitRepository?

Also: note that the interval you have set for the HelmRelease is extremely low, and it is inherited by the template for the HelmChart defined in spec.chart.spec. This means the source-controller will load (parts of) the chart into memory every minute to make observations, in addition to the index file for the repository (times 11 in a short time span, though not simultaneously given the limited number of workers).
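If the inherited interval is the problem, the chart source check can be given its own, longer interval. A hedged sketch (assuming the v2beta1 spec.chart.spec.interval field overrides the inherited value; all values are illustrative):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-release
spec:
  interval: 5m0s          # how often the release itself is reconciled
  chart:
    spec:
      chart: my-chart
      version: 1.0.2
      interval: 30m0s     # assumption: decouples the HelmChart check from the release interval
      sourceRef:
        kind: HelmRepository
        name: my-repository
```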

@avacaru
Author

avacaru commented Nov 2, 2020

I'm not sure if this is exactly what you're asking for, but here are the sizes on the source-controller pod:

data/helmchart: 304K
data/helmrepository: 44.4M
data/gitrepository: 40K
data: 44.7M

@stefanprodan
Member

stefanprodan commented Nov 2, 2020

A 44MB index would explain the OOM: every minute the index is loaded into memory and parsed for each release. With the default number of workers that means 3 * 4 * 44MB = 528MB, and if GC is slow for some reason (busy Kubernetes node), at the 2nd run it will OOM.
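The estimate above as a quick shell check (the 3 and 4 factors are taken from the comment and are assumptions about concurrent loads, not measured values):

```shell
# 44 MB index, loaded by roughly 3 * 4 concurrent reconciliations
echo "$(( 3 * 4 * 44 )) MB"   # 528 MB, over half of the 1Gi limit
```

Two such bursts overlapping before GC frees the first would exceed the 1Gi limit.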

@avacaru
Author

avacaru commented Nov 2, 2020

I've just increased the interval on the HelmReleases to 3m0s; I will update if I still see the issue.

@avacaru
Author

avacaru commented Nov 3, 2020

Increasing the interval from 1m to 3m in the HelmReleases didn't solve the issue. There have been over 70 restarts in 17h. When doing kubectl get hr --all-namespaces, all of them are Ready=False with the error message: Get "http://source-controller.gotk-system/helmchart/namespace1/chart1/chart1-v0.15.5.tgz": dial tcp 172.70.139.122:80: connect: connection refused.

@stefanprodan
Member

Yeah, that's expected; the interval doesn't matter if it's the same for all HRs. You either increase the memory limit or you trim down the 44MB index. For reference, the stable Helm repository index is 7MB.

@brianpham

We are experiencing a similar issue too. The repo that we clone is pretty big (443.7M for a15bf30c07b0378b262003cd99ce4c3fb19f0c8a.tar.gz), which causes us to see this error every once in a while:

failed to download artifact from http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/firespotter-bpham/41b16bd8b344a40836eec3ced8b8a031d78c7c4c.tar.gz, error: Get "http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/firespotter-bpham/41b16bd8b344a40836eec3ced8b8a031d78c7c4c.tar.gz": dial tcp 10.239.249.31:80: connect: connection refused

Is the only way to fix this to increase the limit on the source-controller? What did you end up setting your limit to? @avacaru

@stefanprodan
Member

You can change any field of the Flux manifests with Kustomize patches without interfering with bootstrap; please read the docs: https://toolkit.fluxcd.io/guides/installation/#customize-flux-manifests
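For example, raising the source-controller memory limit could look like this (a hedged sketch following the pattern in the linked guide; the file names and the 2Gi value are illustrative, not prescribed):

```yaml
# kustomization.yaml, next to the bootstrap manifests
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patchesStrategicMerge:
  - source-controller-patch.yaml
---
# source-controller-patch.yaml (a separate file, shown inline for brevity)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: source-controller
  namespace: flux-system
spec:
  template:
    spec:
      containers:
        - name: manager
          resources:
            limits:
              memory: 2Gi   # illustrative value; size to your largest artifact
```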

@stefanprodan
Member

@brianpham make sure you use .sourceignore and exclude everything but the YAML manifests, or consider keeping the manifests in a dedicated branch.
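A .sourceignore at the repository root could look like this (a sketch assuming gitignore-style patterns, which .sourceignore follows; the directory name is hypothetical):

```
# exclude everything from the artifact...
/*
# ...except the directory holding the manifests the cluster needs
!/manifests/
```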

@billimek

billimek commented Feb 6, 2021

I believe that I'm encountering this issue as well with source-controller (v0.7.4).

From the latest OOM kill last night:
(screenshot: memory usage graph)

Last 5 OOM kills:
(screenshot: list of recent OOM kill events)

Interestingly, the memory usage 'spike' coincides with a bunch of errors logged from source-controller, but I'm not certain whether the errors are the cause or a symptom of the memory issue:

2021-02-06T06:07:18.849711004Z stderr F {"level":"info","ts":"2021-02-06T06:07:18.845Z","logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
2021-02-06T06:07:13.474216905Z stderr F {"level":"info","ts":"2021-02-06T06:07:13.474Z","logger":"controller.helmchart","msg":"Reconciliation finished in 12.27240148s, next run in 5m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-coredns","namespace":"flux-system"}
2021-02-06T06:07:13.082799388Z stderr F {"level":"info","ts":"2021-02-06T06:07:13.082Z","logger":"controller.helmchart","msg":"Reconciliation finished in 12.153198745s, next run in 5m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-nfs-client-provisioner","namespace":"flux-system"}
2021-02-06T06:07:01.201897549Z stderr F {"level":"info","ts":"2021-02-06T06:07:01.201Z","logger":"controller.helmchart","msg":"Reconciliation finished in 6.532376284s, next run in 5m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-metallb","namespace":"flux-system"}
2021-02-06T06:07:00.931256162Z stderr F {"level":"info","ts":"2021-02-06T06:07:00.929Z","logger":"controller.helmchart","msg":"Reconciliation finished in 6.283877764s, next run in 5m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-oauth2-proxy","namespace":"flux-system"}
2021-02-06T06:06:54.664863416Z stderr F {"level":"error","ts":"2021-02-06T06:06:54.664Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-coredns","namespace":"flux-system","error":"Get \"https://charts.helm.sh/stable/packages/coredns-1.13.8.tgz\": dial tcp: lookup charts.helm.sh on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:54.657228588Z stderr F {"level":"error","ts":"2021-02-06T06:06:54.656Z","logger":"controller.helmrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"kubernetes-stable-charts","namespace":"flux-system","error":"failed to download repository index: Get \"https://charts.helm.sh/stable/index.yaml\": dial tcp: lookup charts.helm.sh on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:54.645472257Z stderr F {"level":"error","ts":"2021-02-06T06:06:54.645Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-nfs-client-provisioner","namespace":"flux-system","error":"Get \"https://charts.helm.sh/stable/packages/nfs-client-provisioner-1.2.11.tgz\": dial tcp: lookup charts.helm.sh on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:50.463077309Z stderr F {"level":"error","ts":"2021-02-06T06:06:50.462Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-metallb","namespace":"flux-system","error":"Get \"https://charts.bitnami.com/bitnami/metallb-1.1.0.tgz\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:50.462720564Z stderr F {"level":"error","ts":"2021-02-06T06:06:50.462Z","logger":"controller.helmrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"bitnami-charts","namespace":"flux-system","error":"failed to download repository index: Get \"https://charts.bitnami.com/bitnami/index.yaml\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:48.402854269Z stderr F {"level":"info","ts":"2021-02-06T06:06:48.402Z","logger":"controller.gitrepository","msg":"Reconciliation finished in 6.915541887s, next run in 1m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system"}
2021-02-06T06:06:46.600008065Z stderr F {"level":"error","ts":"2021-02-06T06:06:46.599Z","logger":"controller.helmrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"kubernetes-stable-charts","namespace":"flux-system","error":"failed to download repository index: Get \"https://charts.helm.sh/stable/index.yaml\": dial tcp: lookup charts.helm.sh on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:46.597701935Z stderr F {"level":"error","ts":"2021-02-06T06:06:46.597Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-coredns","namespace":"flux-system","error":"Get \"https://charts.helm.sh/stable/packages/coredns-1.13.8.tgz\": dial tcp: lookup charts.helm.sh on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:41.158465791Z stderr F {"level":"error","ts":"2021-02-06T06:06:41.158Z","logger":"controller.helmrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"bitnami-charts","namespace":"flux-system","error":"failed to download repository index: Get \"https://charts.bitnami.com/bitnami/index.yaml\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:41.158411393Z stderr F {"level":"error","ts":"2021-02-06T06:06:41.158Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-metallb","namespace":"flux-system","error":"Get \"https://charts.bitnami.com/bitnami/metallb-1.1.0.tgz\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:40.212160284Z stderr F {"level":"error","ts":"2021-02-06T06:06:40.207Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/billimek/k8s-gitops', error: dial tcp: lookup github.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:38.560144001Z stderr F {"level":"error","ts":"2021-02-06T06:06:38.560Z","logger":"controller.helmrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"kubernetes-stable-charts","namespace":"flux-system","error":"failed to download repository index: Get \"https://charts.helm.sh/stable/index.yaml\": dial tcp: lookup charts.helm.sh on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:38.549495113Z stderr F {"level":"error","ts":"2021-02-06T06:06:38.548Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-coredns","namespace":"flux-system","error":"Get \"https://charts.helm.sh/stable/packages/coredns-1.13.8.tgz\": dial tcp: lookup charts.helm.sh on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:32.486828454Z stderr F {"level":"error","ts":"2021-02-06T06:06:32.486Z","logger":"controller.helmrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"bitnami-charts","namespace":"flux-system","error":"failed to download repository index: Get \"https://charts.bitnami.com/bitnami/index.yaml\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:32.485786292Z stderr F {"level":"error","ts":"2021-02-06T06:06:32.485Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-metallb","namespace":"flux-system","error":"Get \"https://charts.bitnami.com/bitnami/metallb-1.1.0.tgz\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:31.450189766Z stderr F {"level":"error","ts":"2021-02-06T06:06:31.450Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/billimek/k8s-gitops', error: dial tcp: lookup github.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:30.517170051Z stderr F {"level":"error","ts":"2021-02-06T06:06:30.517Z","logger":"controller.helmrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"kubernetes-stable-charts","namespace":"flux-system","error":"failed to download repository index: Get \"https://charts.helm.sh/stable/index.yaml\": dial tcp: lookup charts.helm.sh on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:30.512759542Z stderr F {"level":"error","ts":"2021-02-06T06:06:30.512Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-coredns","namespace":"flux-system","error":"Get \"https://charts.helm.sh/stable/packages/coredns-1.13.8.tgz\": dial tcp: lookup charts.helm.sh on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:24.129828108Z stderr F {"level":"error","ts":"2021-02-06T06:06:24.129Z","logger":"controller.helmrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"bitnami-charts","namespace":"flux-system","error":"failed to download repository index: Get \"https://charts.bitnami.com/bitnami/index.yaml\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:24.12976421Z stderr F {"level":"error","ts":"2021-02-06T06:06:24.129Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-metallb","namespace":"flux-system","error":"Get \"https://charts.bitnami.com/bitnami/metallb-1.1.0.tgz\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:23.108242718Z stderr F {"level":"error","ts":"2021-02-06T06:06:23.108Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/billimek/k8s-gitops', error: dial tcp: lookup github.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:20.840575366Z stderr F {"level":"error","ts":"2021-02-06T06:06:20.840Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-coredns","namespace":"flux-system","error":"Get \"https://charts.helm.sh/stable/packages/coredns-1.13.8.tgz\": dial tcp: lookup charts.helm.sh on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:15.933484828Z stderr F {"level":"error","ts":"2021-02-06T06:06:15.933Z","logger":"controller.helmrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"bitnami-charts","namespace":"flux-system","error":"failed to download repository index: Get \"https://charts.bitnami.com/bitnami/index.yaml\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:15.909249047Z stderr F {"level":"error","ts":"2021-02-06T06:06:15.908Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-metallb","namespace":"flux-system","error":"Get \"https://charts.bitnami.com/bitnami/metallb-1.1.0.tgz\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:14.926537152Z stderr F {"level":"error","ts":"2021-02-06T06:06:14.926Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/billimek/k8s-gitops', error: dial tcp: lookup github.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:11.211943496Z stderr F {"level":"error","ts":"2021-02-06T06:06:11.211Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-coredns","namespace":"flux-system","error":"Get \"https://charts.helm.sh/stable/packages/coredns-1.13.8.tgz\": dial tcp: lookup charts.helm.sh on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:07.791478986Z stderr F {"level":"error","ts":"2021-02-06T06:06:07.791Z","logger":"controller.helmrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"bitnami-charts","namespace":"flux-system","error":"failed to download repository index: Get \"https://charts.bitnami.com/bitnami/index.yaml\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:07.790164871Z stderr F {"level":"error","ts":"2021-02-06T06:06:07.790Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-metallb","namespace":"flux-system","error":"Get \"https://charts.bitnami.com/bitnami/metallb-1.1.0.tgz\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:06:06.79442691Z stderr F {"level":"error","ts":"2021-02-06T06:06:06.794Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/billimek/k8s-gitops', error: dial tcp: lookup github.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:05:59.722096551Z stderr F {"level":"error","ts":"2021-02-06T06:05:59.721Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-metallb","namespace":"flux-system","error":"Get \"https://charts.bitnami.com/bitnami/metallb-1.1.0.tgz\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:05:59.720861799Z stderr F {"level":"error","ts":"2021-02-06T06:05:59.720Z","logger":"controller.helmrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"bitnami-charts","namespace":"flux-system","error":"failed to download repository index: Get \"https://charts.bitnami.com/bitnami/index.yaml\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:05:58.71582877Z stderr F {"level":"error","ts":"2021-02-06T06:05:58.715Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/billimek/k8s-gitops', error: dial tcp: lookup github.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:05:51.674075128Z stderr F {"level":"error","ts":"2021-02-06T06:05:51.673Z","logger":"controller.helmrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"bitnami-charts","namespace":"flux-system","error":"failed to download repository index: Get \"https://charts.bitnami.com/bitnami/index.yaml\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:05:51.674032835Z stderr F {"level":"error","ts":"2021-02-06T06:05:51.673Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-metallb","namespace":"flux-system","error":"Get \"https://charts.bitnami.com/bitnami/metallb-1.1.0.tgz\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:05:50.652427545Z stderr F {"level":"error","ts":"2021-02-06T06:05:50.652Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/billimek/k8s-gitops', error: dial tcp: lookup github.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:05:43.640698151Z stderr F {"level":"error","ts":"2021-02-06T06:05:43.640Z","logger":"controller.helmrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"bitnami-charts","namespace":"flux-system","error":"failed to download repository index: Get \"https://charts.bitnami.com/bitnami/index.yaml\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
2021-02-06T06:05:43.640360015Z stderr F {"level":"error","ts":"2021-02-06T06:05:43.640Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-metallb","namespace":"flux-system","error":"Get \"https://charts.bitnami.com/bitnami/metallb-1.1.0.tgz\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
  |   | 2021-02-06 01:05:42 | 2021-02-06T06:05:42.617590202Z stderr F {"level":"error","ts":"2021-02-06T06:05:42.617Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/billimek/k8s-gitops', error: dial tcp: lookup github.com on 10.43.0.10:53: server misbehaving"}
  |   | 2021-02-06 01:05:35 | 2021-02-06T06:05:35.610319682Z stderr F {"level":"error","ts":"2021-02-06T06:05:35.609Z","logger":"controller.helmrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"bitnami-charts","namespace":"flux-system","error":"failed to download repository index: Get \"https://charts.bitnami.com/bitnami/index.yaml\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
  |   | 2021-02-06 01:05:35 | 2021-02-06T06:05:35.608077914Z stderr F {"level":"error","ts":"2021-02-06T06:05:35.607Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"kube-system-metallb","namespace":"flux-system","error":"Get \"https://charts.bitnami.com/bitnami/metallb-1.1.0.tgz\": dial tcp: lookup charts.bitnami.com on 10.43.0.10:53: server misbehaving"}
  |   | 2021-02-06 01:05:34 | 2021-02-06T06:05:34.588944813Z stderr F {"level":"error","ts":"2021-02-06T06:05:34.588Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/billimek/k8s-gitops', error: dial tcp: lookup github.com on 10.43.0.10:53: server misbehaving"}
  |   | 2021-02-06 01:05:01 | 2021-02-06T06:05:01.722442354Z stderr F {"level":"info","ts":"2021-02-06T06:05:01.722Z","logger":"controller.helmrepository","msg":"Reconciliation finished in 6.71510648s, next run in 10m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"banzaicloud-charts","namespace":"flux-system"}
  |   | 2021-02-06 01:04:50 | 2021-02-06T06:04:50.10747884Z stderr F {"level":"info","ts":"2021-02-06T06:04:50.106Z","logger":"controller.helmchart","msg":"Reconciliation finished in 738.712206ms, next run in 5m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"cert-manager-cert-manager","namespace":"flux-system"}
  |   | 2021-02-06 01:04:44 | 2021-02-06T06:04:44.664699112Z stderr F {"level":"info","ts":"2021-02-06T06:04:44.664Z","logger":"controller.helmchart","msg":"Reconciliation finished in 423.6548ms, next run in 5m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"default-ser2sock","namespace":"flux-system"}

Is the appropriate remedy to increase the memory limit for source controller? It's currently set to 1Gi.

@hiddeco
Copy link
Member

hiddeco commented Feb 6, 2021

@billimek if you have many Helm-related resources in your cluster, you may want to try this, as for some operations we need to read e.g. whole repository indexes into memory.

@billimek
Copy link

billimek commented Feb 6, 2021

@billimek if you have many Helm-related resources in your cluster, you may want to try this, as for some operations we need to read e.g. whole repository indexes into memory.

Thanks @hiddeco, I believe there are probably a lot in this case. I bumped the limit to 2Gi. Appreciate the super fast response!

❯ k get helmreleases.helm.toolkit.fluxcd.io -A | wc -l
44

❯ k get helmcharts.source.toolkit.fluxcd.io -A | wc -l
44

❯ k get helmrepositories.source.toolkit.fluxcd.io -A | wc -l
24
/data $ du -hs /data/*
124.0K  /data/gitrepository
1.6M    /data/helmchart
19.9M   /data/helmrepository


@hiddeco
Copy link
Member

hiddeco commented Feb 6, 2021

@billimek the files are not as enormous as I would have expected (I have seen indexes of ~50MiB).

I have created a PR to enable pprof endpoints on the metrics server so that we can get a better insight into the resource consumption of your controller.

@Ayatallah
Copy link

You can change any field of Flux manifests with Kustomize patches without interfering with bootstrap, please read the docs https://toolkit.fluxcd.io/guides/installation/#customize-flux-manifests

Should I add the following part to the kustomization.yaml that is in the same directory as gotk-sync.yaml and gotk-components.yaml?

patchesStrategicMerge:
- flux-patch.yaml

Because I did that, and the Flux instance in my cluster still did not reflect the customization I wrote in flux-patch.yaml.

@stefanprodan

@onedr0p
Copy link
Contributor

onedr0p commented Apr 5, 2021

@Ayatallah see my example here and here.

@Ayatallah
Copy link

@Ayatallah see my example here and here.

Thank you! I did almost the same, but the source-controller memory limit is still unchanged. Do you do anything specific for the Flux instance to sync these customizations, or do you just commit and push them to Git and it syncs automatically?!

@onedr0p

@onedr0p
Copy link
Contributor

onedr0p commented Apr 5, 2021

@Ayatallah see my example here and here.

Thank you! I almost did the same and source-controller memory limit still the same, do you do anything specific for the flux instance to sync with these kustomization or just commit and push them to git and it sync automatically?!

@onedr0p

IIRC it was synced automatically.

@Ayatallah
Copy link

Ayatallah commented Apr 5, 2021

@Ayatallah see my example here and here.

Thank you! I almost did the same and source-controller memory limit still the same, do you do anything specific for the flux instance to sync with these kustomization or just commit and push them to git and it sync automatically?!

@onedr0p

IIRC it was synced automatically.

Okay, can you let me know if I'm missing anything:
-- I bootstrapped a Flux instance called staging using the bootstrap command, and got the following directory created automatically
staging/
gotk-sync.yaml
gotk-components.yaml
kustomization.yaml

I added the following to kustomization.yaml:
patchesStrategicMerge:
- gotk-patches.yaml

so it now looks like this:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- gotk-components.yaml
-gotk-sync.yaml
patchesStrategicMerge:
-gotk-patches.yam

and gotk-patches.yaml content is as follows:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: source-controller
  namespace: staging
spec:
  template:
    spec:
      containers:
      - name: manager
        resources:
          limits:
            memory: 1.3Gi

so staging directory now contains 4 files:
staging/
gotk-sync.yaml
gotk-components.yaml
kustomization.yaml
gotk-patches.yaml

then committed and pushed to Git, but no automatic sync is happening!

@stefanprodan
Copy link
Member

@Ayatallah many things look wrong in there: there is a typo in the patch file name, and the namespace is wrong; it should be flux-system. Please use code blocks and paste the YAML inside them.
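For reference, a corrected layout might look like this (a sketch assuming the default flux-system namespace; the memory value is illustrative):

```yaml
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- gotk-components.yaml
- gotk-sync.yaml
patchesStrategicMerge:
- gotk-patches.yaml
```

```yaml
# gotk-patches.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: source-controller
  namespace: flux-system
spec:
  template:
    spec:
      containers:
      - name: manager
        resources:
          limits:
            memory: 2Gi
```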

@Ayatallah
Copy link

@Ayatallah many things look wrong in there, there is a typo in the patch file name, also the namespace is wrong, should be flux-system. Please use code blocks and paste the YAML inside them.

Should I use flux-system as the namespace even if it's not the namespace I bootstrapped the Flux instance in?!

flux bootstrap gitlab --owner=name --repository=name --branch=name --path=path/to/ --token-auth --namespace=staging

@Ayatallah
Copy link

Is it a must to use namespace=flux-system?
Like, if I need multiple Flux environments in my cluster, one per namespace, should I not create a Flux instance in each of those namespaces, but instead create a single Flux instance in the flux-system namespace and replicate its behavior with Kustomization resources, giving each Kustomization a different path to track?

@stefanprodan
Copy link
Member

@Ayatallah Flux v2 is not meant to be installed more than once per cluster. See https://github.com/fluxcd/flux2-multi-tenancy on how to do multi-tenancy if that's what you're after.

@Ayatallah
Copy link

Ayatallah commented Apr 6, 2021

@Ayatallah Flux v2 is not meant to be installed more than once per cluster. See https://github.com/fluxcd/flux2-multi-tenancy on how to do multi-tenancy if that's what you're after.
@stefanprodan
Yes, I'm looking to do multi-tenancy: I have one cluster with several namespaces (prod, dev, etc.), and the dev namespace should run the dev instances of APP1 and APP2, and likewise for prod. But my tenant repository architecture is not:
Base/
       app1/
       app2/
Overlay/
       app1/
       app2/
However, its the following
App1/
       base/
       overlays/
             dev/
             prod/
App2/
       base/
       overlays/
             dev/
             prod/
That's why I thought of creating more than one Flux instance, one per namespace (prod, dev, etc.). Would that be manageable with one Flux instance in the cluster? It seems the steps in the link are based on a specific architecture for the tenant repository.

Or, for Flux v2 multi-tenancy to be applied properly, do I have to re-structure my repo?

@hiddeco
Copy link
Member

hiddeco commented Apr 6, 2021

No, you can create multiple Kustomization resources (in different namespaces) that all select a different (environment) folder.
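A sketch of that, using hypothetical names and the repository layout described above (one Kustomization per environment, both pointing at the same GitRepository):

```yaml
---
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: app1-dev
  namespace: dev
spec:
  interval: 5m
  path: ./App1/overlays/dev
  prune: true
  sourceRef:
    kind: GitRepository
    name: tenant-repo
    namespace: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: app1-prod
  namespace: prod
spec:
  interval: 5m
  path: ./App1/overlays/prod
  prune: true
  sourceRef:
    kind: GitRepository
    name: tenant-repo
    namespace: flux-system
```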

@gdoctor
Copy link

gdoctor commented Apr 27, 2021

[screenshots: repository sizes and source-controller memory usage]

Nothing in my Git or Helm repositories is particularly large (screenshots); yet during source-controller pod startup, the pod spikes past 2.5GB of memory. What known implementation decisions would cause a large memory spike on startup? After a couple of minutes it comes back down to around 1GB, where it seems to stay. I do have fairly fast intervals on my GitRepository (2 objects, 1 min interval each) and HelmRepository (3 objects, 1 min interval each) resources. But I also have the same configuration running in about 8 other clusters with no issues.

So my question is: what is causing this in only one cluster and not elsewhere? I suspect it could be related to the way my Helm charts are configured in this cluster. I have 3 charts being fetched and packaged directly from a GitRepository. I am not doing this in other clusters, so I am guessing that could be the root cause. Are there known performance trade-offs to using a GitRepository as a Helm chart source?
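For context, fetching and packaging a chart from Git looks roughly like this (names and path are hypothetical):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmChart
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 1m
  chart: ./charts/my-app  # path to the chart inside the Git repository
  sourceRef:
    kind: GitRepository
    name: charts-repo
```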

@stefanprodan
Copy link
Member

I have 3 charts being fetched and packaged directly from a GitRepository

Do you have .helmignore files in there that exclude .git/?

@gdoctor
Copy link

gdoctor commented Apr 28, 2021

I have 3 charts being fetched and packaged directly from a GitRepository

Do you have .helmignore files in there that exclude .git/?

Yes some of my charts have a .helmignore that excludes .git/; however, that is only true for the charts that have a Source of HelmRepository. None of the charts that have a Source of GitRepository have that condition. Could that still create issues?

Edit: I should add that ALL of the charts are stored in the same git repo. So there are definitely charts that have a .helmignore excluding .git/ in that git repo.

@kingdonb
Copy link
Member

Is .helmignore actually honored? I don't find any references to it in the source or documentation, so I would assume it is still a helm-client-only feature and not something you can count on Flux to honor when using a GitRepository Helm source. (This is perhaps unfortunate, but it looks like this feature hasn't been requested before!)

Documentation at helm.sh about the purpose of this file indicates that .helmignore is for use by helm chart packagers, this is ostensibly something that source-controller should consider doing, but I don't think it is implemented currently at all.

What would be really nice is if you could forward .helmignore to source-controller (since that's where the OOM condition is encountered), but at that point source-controller doesn't really even know if it is being used to carry a Helm chart; to it, it is simply a Git repository.

If source-controller GitRepositories could somehow know that they'll be used for serving a Helm chart, and honor .helmignore as an equivalent of .sourceignore, that would make it possible to use repos like this one as an upstream, which is currently not possible according to a report from a Slack user: https://github.com/neo4j-contrib/neo4j-helm

Right now I think the only other way to accomplish this installation is to fork the repo and add a .sourceignore, manually copying the content from the .helmignore file, or to write that content into the spec of a GitRepository source at spec.ignore and keep it up to date somehow.
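That last workaround could be sketched as follows (repository URL from the example above; the ignore patterns are illustrative entries copied by hand from the chart's .helmignore):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: neo4j-helm
  namespace: flux-system
spec:
  interval: 5m
  url: https://github.com/neo4j-contrib/neo4j-helm
  ref:
    branch: master
  # acts like .sourceignore; copy the chart's .helmignore entries here
  ignore: |
    .git/
    *.tgz
```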

@rajivchirania
Copy link

@stefanprodan I'm also facing the same OOM issue with source-controller.
However, in our case Flux v2 is installed using the Terraform provider.
Can you please tell me how to set the limit for source-controller in this case?

@stefanprodan
Copy link
Member

@rajivchirania see fluxcd/terraform-provider-flux#178

@rajivchirania
Copy link

@rajivchirania see fluxcd/terraform-provider-flux#178

@stefanprodan

So I made the changes, but it still does not apply the limit or the request that I have set for source-controller.

This is my kustomization template file

apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: ${sync_name}
  namespace: flux-system
spec:
  force: false
  interval: ${interval}
  path: ./${target}
  prune: true
  sourceRef:
    kind: GitRepository
    name: ${sync_name}
  validation: client
  patchesStrategicMerge:
  - apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: source-controller
      namespace: flux-system
    spec:
      template:
        spec:
          containers:
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
              requests:
                cpu: 50m
                memory: 256Mi

My kustomization.tf file is like this where i apply this above template

locals {
  kustomization_template = templatefile("${path.module}/values/kustomization.yaml.tpl", {
    sync_name = var.sync_name
    target    = var.flux_target_path
    interval  = var.interval_default_kustomization
  })
}

resource "kubectl_manifest" "kustomization" {
  yaml_body = local.kustomization_template
}

Please let me know if i am doing something wrong
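One issue visible in the template above: in a strategic-merge patch, containers must be a list, and each entry needs a name so the patch can be matched against the existing container. A sketch of the corrected section (same values as above):

```yaml
  patchesStrategicMerge:
  - apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: source-controller
      namespace: flux-system
    spec:
      template:
        spec:
          containers:
          - name: manager
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
              requests:
                cpu: 50m
                memory: 256Mi
```

Note also that patchesStrategicMerge on a Flux Kustomization only patches the manifests rendered from its spec.path; if the source-controller Deployment is not part of that path, the patch will have no effect.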

@matt-woodruff-f3
Copy link

Hey all. I'm currently experiencing the same issue, along with @ekosov-form3, at work. This thread has been helpful, and increasing the memory limit is our solution for now.

However, we'd like to understand more about the source controller's memory requirements so we can see why usage is so high and look at alternative solutions like reducing the index size.

We'd like to know

  • roughly how to calculate the memory requirements for a reconciliation. @stefanprodan you touched on it here but we were unable to understand the significance of the 3 in the equation. Also we're assuming the 4 relates to the number of workers.
  • is the index loaded for every reconciliation or is it cached for other reconciliations using the same source?
  • does the reconciliation interval have any effect on memory? Maybe if it's too frequent, garbage collection may not trigger in time?

Currently we're seeing memory fluctuate roughly between 600MB and 1200MB with the following resources and a default installation except for the source controller's memory limit.

All helm charts have 1m interval. All git/helm repositories used by helm releases have 5m interval.

73	helm charts
14	git repositories
5	helm repositories

Total sizes in /data
21M	gitrepository
804K	helmchart
21M	helmrepository

1 helm chart    -> helm repo 1 [9.4MB]
1 helm chart    -> helm repo 2 [9.4MB]
1 helm chart    -> helm repo 3 [1.5MB]
1 helm chart    -> helm repo 4 [148k]
1 helm chart    -> helm repo 5 [8k]

2 helm charts	-> git repo 1	[9.7MB]
1 helm chart	-> git repo 2	[548k]
2 helm charts	-> git repo 3	[548k]
1 helm chart	-> git repo 4	[548k]
1 helm chart	-> git repo 5	[548k]
1 helm chart	-> git repo 6	[548k]
1 helm chart	-> git repo 7	[548k]
5 helm charts	-> git repo 8	[188k]
20 helm charts  -> git repo 9	[188k]
5 helm charts	-> git repo 10	[188k]
22 helm charts  -> git repo 11	[188k]
3 helm charts	-> git repo 12	[188k]
4 helm charts	-> git repo 13	[188k]

1 kustomization -> git repo 14  [2.2MB, 30s interval]

Thanks

@hiddeco
Copy link
Member

hiddeco commented Nov 17, 2021

@matt-woodruff-f3 given you have collected such detailed statistics about your Helm usage, would you be willing to give an image based on #485 a spin? This is getting into a shape that it'll likely end up in a release soon, and will greatly affect the answers to your questions (and should heavily improve performance).

If so, please reach out to me on Slack (@hidde), or comment here.

@hiddeco
Copy link
Member

hiddeco commented Nov 18, 2021

Release candidate for the above PR has been made available, and instructions are added to the PR for testing purposes. It would be great if some of you could try this out and share results, as simulating real-world Helm setups has proven to be extremely difficult.

@kingdonb
Copy link
Member

kingdonb commented Dec 7, 2021

I believe these changes are in source-controller 0.19.0 and Flux 0.24.0, so this issue can be closed out now.

(Is that correct?)

@hiddeco
Copy link
Member

hiddeco commented Dec 8, 2021

The changes have indeed been released in 0.19.x, but I would like to see a confirmation from e.g. @matt-woodruff-f3 around resource usage reduction before I think this can be closed.

@matt-woodruff-f3
Copy link

matt-woodruff-f3 commented Dec 8, 2021

@hiddeco Thanks for the update! We've been running 0.19.0 in 3 of our environments for a few days now and can report no OOM issues. We've even reverted the memory limit back to the default, from 2Gi to 1Gi.

@kingdonb
Copy link
Member

kingdonb commented Dec 8, 2021

Awesome. Thanks for the confirmation @matt-woodruff-f3 – I'll close this now, based on your confirmation!

@kingdonb kingdonb closed this as completed Dec 8, 2021