
image-automation-controller CrashLoopBack >= 0.20.2 #339

Closed
saltyblu opened this issue Mar 30, 2022 · 71 comments · Fixed by #411
Labels: bug Something isn't working

@saltyblu

saltyblu commented Mar 30, 2022

Issue

Since upgrading to 0.20.2, the image-automation-controller has stopped working.

flux reconcile image update runs forever.

The image-automation-controller fails periodically.
/readyz does not return 200, but also provides no information.

The service produces no logs for the failures.
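
One way to check the probe endpoint directly is to port-forward to the controller and query it (a sketch, assuming the default health port 9440):

$ kubectl -n flux-system port-forward deploy/image-automation-controller 9440:9440
$ curl -i http://localhost:9440/readyz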

How to reproduce

In our case: update the image-automation-controller to > 0.20.1.
Trigger an image update.

@pjbgf
Member

pjbgf commented Mar 30, 2022

@saltyblu thank you very much for reporting this. Would you be able to share with us the output of the commands below:

kubectl logs -n flux-system -l app=image-automation-controller

And the versions of your other controllers via:

flux version

I noticed that you are not running the latest version of the image-automation-controller. Would you be able to run against the latest version and see if the error persists, please?

@saltyblu
Author

saltyblu commented Mar 30, 2022

I've upgraded the image-automation-controller back to the latest version, I believe.

flux version

$ flux version 
flux: v0.28.4
helm-controller: v0.18.2
image-automation-controller: v0.21.2
image-reflector-controller: v0.17.1
kustomize-controller: v0.22.2
notification-controller: v0.23.1
source-controller: v0.22.4

The image-automation-controller has already restarted:

$ kubectl get pods -n flux-system
NAME                                          READY   STATUS    RESTARTS       AGE
helm-controller-bdf5dcf88-xbrdw               1/1     Running   0              47h
image-automation-controller-849f7965f-g7kqs   1/1     Running   2 (7m2s ago)   21m
image-reflector-controller-65d8b46696-zql8j   1/1     Running   1 (2d1h ago)   4d4h
kustomize-controller-7c9d88bcf7-zmc5k         1/1     Running   0              47h
notification-controller-665b85d44c-2b585      1/1     Running   1 (2d1h ago)   4d4h
source-controller-7c5cb99569-4zj48            1/1     Running   0              152m
$ kubectl logs -n flux-system -l app=image-automation-controller


{"level":"info","ts":"2022-03-30T09:53:13.002Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
{"level":"info","ts":"2022-03-30T09:53:13.002Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
I0330 09:53:13.102182       1 leaderelection.go:248] attempting to acquire leader lease flux-system/image-automation-controller-leader-election...
I0330 09:53:52.171972       1 leaderelection.go:258] successfully acquired lease flux-system/image-automation-controller-leader-election
{"level":"info","ts":"2022-03-30T09:53:52.172Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImageUpdateAutomation"}
{"level":"info","ts":"2022-03-30T09:53:52.173Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta2.GitRepository"}
{"level":"info","ts":"2022-03-30T09:53:52.173Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImagePolicy"}
{"level":"info","ts":"2022-03-30T09:53:52.173Z","logger":"controller.imageupdateautomation","msg":"Starting Controller","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation"}
{"level":"info","ts":"2022-03-30T09:53:52.274Z","logger":"controller.imageupdateautomation","msg":"Starting workers","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","worker count":4}
{"level":"info","ts":"2022-03-30T09:54:43.342Z","logger":"controller.imageupdateautomation","msg":"pushed commit to origin","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","revision":"44bdac384742c82f85a49a1aa8a27be0d5d69393","branch":"main"}
$ kubectl describe pod -n flux-system image-automation-controller-849f7965f-g7kqs
Name:                 image-automation-controller-849f7965f-g7kqs
Namespace:            flux-system
Priority:             100000
Priority Class Name:  default
Node:                 gke-pt-dev-default-090510cc-brz7/10.5.0.44
Start Time:           Wed, 30 Mar 2022 11:38:38 +0200
Labels:               app=image-automation-controller
                      pod-template-hash=849f7965f
Annotations:          container.seccomp.security.alpha.kubernetes.io/manager: runtime/default
                      prometheus.io/port: 8080
                      prometheus.io/scrape: true
Status:               Running
IP:                   10.6.0.61
IPs:
  IP:           10.6.0.61
Controlled By:  ReplicaSet/image-automation-controller-849f7965f
Containers:
  manager:
    Container ID:  containerd://25f0fe05107b604b1bb55209891c247d4549a342ac91521b573c64a65ada94d6
    Image:         fluxcd/image-automation-controller:v0.21.2
    Image ID:      docker.io/fluxcd/image-automation-controller@sha256:b25892f1829261f2175f36773a742bc67e5be16edb6884676a21136f74a31e98
    Ports:         8080/TCP, 9440/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --events-addr=http://notification-controller.flux-system.svc.cluster.local./
      --watch-all-namespaces
      --log-level=info
      --log-encoding=json
      --enable-leader-election
    State:          Running
      Started:      Wed, 30 Mar 2022 11:53:12 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    139
      Started:      Wed, 30 Mar 2022 11:51:06 +0200
      Finished:     Wed, 30 Mar 2022 11:52:53 +0200
    Ready:          True
    Restart Count:  2
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   64Mi
    Liveness:   http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RUNTIME_NAMESPACE:  flux-system (v1:metadata.namespace)
    Mounts:
      /tmp from temp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h4pgk (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  temp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-h4pgk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  22m                    default-scheduler  Successfully assigned flux-system/image-automation-controller-849f7965f-g7kqs to gke-pt-dev-default-090510cc-brz7
  Normal   Pulling    22m                    kubelet            Pulling image "fluxcd/image-automation-controller:v0.21.2"
  Normal   Pulled     22m                    kubelet            Successfully pulled image "fluxcd/image-automation-controller:v0.21.2" in 4.261787274s
  Warning  BackOff    8m24s (x3 over 8m30s)  kubelet            Back-off restarting failed container
  Normal   Created    8m12s (x3 over 22m)    kubelet            Created container manager
  Normal   Pulled     8m12s (x2 over 10m)    kubelet            Container image "fluxcd/image-automation-controller:v0.21.2" already present on machine
  Normal   Started    8m11s (x3 over 22m)    kubelet            Started container manager
  Warning  Unhealthy  8m11s (x2 over 10m)    kubelet            Readiness probe failed: Get "http://10.6.0.61:9440/readyz": dial tcp 10.6.0.61:9440: connect: connection refused
$ kubectl logs -n flux-system image-automation-controller-849f7965f-g7kqs
{"level":"info","ts":"2022-03-30T09:53:13.000Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2022-03-30T09:53:13.001Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2022-03-30T09:53:13.002Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
{"level":"info","ts":"2022-03-30T09:53:13.002Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
I0330 09:53:13.102182       1 leaderelection.go:248] attempting to acquire leader lease flux-system/image-automation-controller-leader-election...
I0330 09:53:52.171972       1 leaderelection.go:258] successfully acquired lease flux-system/image-automation-controller-leader-election
{"level":"info","ts":"2022-03-30T09:53:52.172Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImageUpdateAutomation"}
{"level":"info","ts":"2022-03-30T09:53:52.173Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta2.GitRepository"}
{"level":"info","ts":"2022-03-30T09:53:52.173Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImagePolicy"}
{"level":"info","ts":"2022-03-30T09:53:52.173Z","logger":"controller.imageupdateautomation","msg":"Starting Controller","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation"}
{"level":"info","ts":"2022-03-30T09:53:52.274Z","logger":"controller.imageupdateautomation","msg":"Starting workers","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","worker count":4}
{"level":"info","ts":"2022-03-30T09:54:43.342Z","logger":"controller.imageupdateautomation","msg":"pushed commit to origin","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","revision":"44bdac384742c82f85a49a1aa8a27be0d5d69393","branch":"main"}

$ kubectl logs -n flux-system image-automation-controller-849f7965f-g7kqs --previous
{"level":"info","ts":"2022-03-30T09:51:07.020Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2022-03-30T09:51:07.023Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2022-03-30T09:51:07.023Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":"2022-03-30T09:51:07.024Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
I0330 09:51:07.124255       1 leaderelection.go:248] attempting to acquire leader lease flux-system/image-automation-controller-leader-election...
I0330 09:51:43.246877       1 leaderelection.go:258] successfully acquired lease flux-system/image-automation-controller-leader-election
{"level":"info","ts":"2022-03-30T09:51:43.247Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImageUpdateAutomation"}
{"level":"info","ts":"2022-03-30T09:51:43.247Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta2.GitRepository"}
{"level":"info","ts":"2022-03-30T09:51:43.247Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImagePolicy"}
{"level":"info","ts":"2022-03-30T09:51:43.247Z","logger":"controller.imageupdateautomation","msg":"Starting Controller","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation"}
{"level":"info","ts":"2022-03-30T09:51:43.349Z","logger":"controller.imageupdateautomation","msg":"Starting workers","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","worker count":4}

@saltyblu
Author

The image-automation-controller is in CrashLoopBackOff:

$ kubectl get pods -n flux-system
NAME                                          READY   STATUS             RESTARTS       AGE
helm-controller-bdf5dcf88-xbrdw               1/1     Running            0              2d1h
image-automation-controller-849f7965f-g7kqs   0/1     CrashLoopBackOff   7 (82s ago)    178m
image-reflector-controller-65d8b46696-zql8j   1/1     Running            1 (2d4h ago)   4d7h
kustomize-controller-7c9d88bcf7-zmc5k         1/1     Running            0              2d1h
notification-controller-665b85d44c-2b585      1/1     Running            1 (2d4h ago)   4d7h
source-controller-7c5cb99569-4zj48            1/1     Running            0              5h8m

The pod logs are still not very informative:

$ kubectl logs -n flux-system -l app=image-automation-controller
{"level":"info","ts":"2022-03-30T12:38:11.936Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2022-03-30T12:38:11.937Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2022-03-30T12:38:11.938Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":"2022-03-30T12:38:11.938Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
I0330 12:38:12.039777       1 leaderelection.go:248] attempting to acquire leader lease flux-system/image-automation-controller-leader-election...
I0330 12:38:49.663889       1 leaderelection.go:258] successfully acquired lease flux-system/image-automation-controller-leader-election
{"level":"info","ts":"2022-03-30T12:38:49.664Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImageUpdateAutomation"}
{"level":"info","ts":"2022-03-30T12:38:49.664Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta2.GitRepository"}
{"level":"info","ts":"2022-03-30T12:38:49.664Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImagePolicy"}
{"level":"info","ts":"2022-03-30T12:38:49.665Z","logger":"controller.imageupdateautomation","msg":"Starting Controller","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation"}
$ kubectl describe pod -n flux-system image-automation-controller-849f7965f-g7kqs

<snip>
Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Normal   Started    12m (x6 over 3h1m)     kubelet  Started container manager
  Warning  Unhealthy  12m (x5 over 168m)     kubelet  Readiness probe failed: Get "http://10.6.0.61:9440/readyz": dial tcp 10.6.0.61:9440: connect: connection refused
  Normal   Created    9m36s (x7 over 3h1m)   kubelet  Created container manager
  Normal   Pulled     9m36s (x6 over 169m)   kubelet  Container image "fluxcd/image-automation-controller:v0.21.2" already present on machine
  Warning  BackOff    4m41s (x24 over 167m)  kubelet  Back-off restarting failed container

@agrib-01

Same problem.

@saltyblu
Author

  • The issue still occurs with the latest Flux version.
  • The debug output shows nothing more.
  • The experimental git transport feature also does not solve the issue.

The image-automation-controller fails silently with: Readiness probe failed: Get "http://10.6.0.61:9440/readyz": dial tcp

@pjbgf
Member

pjbgf commented Mar 31, 2022

@saltyblu thank you very much for the information. Would you mind dialling up the log verbosity to trace level and sharing the logs again, please?

That can be configured via the controller's deployment arguments with --log-level=trace. Set directly in your kustomization.yaml, it would look like this:

patches:
- patch: |
    - op: replace
      path: /spec/template/spec/containers/0/args/2
      value: --log-level=trace
  target:
    kind: Deployment
    name: "image-automation-controller"

It would be great if you could also provide the results of: kubectl get events -n flux-system

@saltyblu
Author

saltyblu commented Mar 31, 2022

I did.
Sadly, there is nothing new in the logs :( (the entries below are newest first, as exported from our log viewer):

I0331 13:31:27.761838       1 leaderelection.go:248] attempting to acquire leader lease flux-system/image-automation-controller-leader-election...
{"level":"info","ts":"2022-03-31T13:31:27.660Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":"2022-03-31T13:31:27.660Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
{"level":"info","ts":"2022-03-31T13:31:27.660Z","logger":"setup","msg":"starting manager"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/goroutine"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/heap"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/trace"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/block"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/threadcreate"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/symbol"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/profile"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/cmdline"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/mutex"}
{"level":"info","ts":"2022-03-31T13:31:27.659Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"debug","ts":"2022-03-31T13:31:25.629Z","logger":"controller.imageupdateautomation","msg":"ran updates to working dir","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","working":"/tmp/flux-system-flux-system678623668"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:25.629Z","logger":"controller.imageupdateautomation","msg":"applying setter","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","setter":"playground:example-nginx:tag","old":"main-1647936368","new":"main-1647936369"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:25.629Z","logger":"controller.imageupdateautomation","msg":"found schema extension","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","path":".spec.images.newTag"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:25.628Z","logger":"controller.imageupdateautomation","msg":"applying setter","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","setter":"playground:example-nginx:name","old":"europe-west1-docker.pkg.dev/pt-cicd/docker-snapshots/example-nginx","new":"europe-west1-docker.pkg.dev/pt-cicd/docker-snapshots/example-nginx"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:25.628Z","logger":"controller.imageupdateautomation","msg":"found schema extension","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","path":".spec.images.newName"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:25.628Z","logger":"controller.imageupdateautomation","msg":"reading file","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","path":"component.yaml"}

The service fails directly after trying to update an image tag, without any logs :(

We are using a GKE cluster and the GCP registry.

$ kubectl get events -n flux-system
LAST SEEN   TYPE      REASON                       OBJECT                                                  MESSAGE
3m13s       Normal    ReconciliationSucceeded      kustomization/flux-alertmanager-notifications           (combined from similar events): Reconciliation finished in 351.269513ms, next run in 5m0s
42m         Normal    Progressing                  kustomization/flux-system                               Deployment/flux-system/image-automation-controller configured...
2m44s       Normal    ReconciliationSucceeded      kustomization/flux-system                               (combined from similar events): Reconciliation finished in 8.447039598s, next run in 5m0s
45m         Normal    NewArtifact                  gitrepository/flux-system                               stored artifact for commit 'europe-west1-docker.pkg.dev/pt-cicd/docker-snapsho...'
50m         Normal    ArtifactUpToDate             gitrepository/flux-system                               artifact up-to-date with remote revision: 'main/5c750a9720c0fd1ee4c0bd7bf869692813609a32'
47m         Normal    Progressing                  kustomization/flux-system                               Kustomization/**/** configured
6m25s       Normal    Progressing                  kustomization/flux-system                               Kustomization/**/** configured
42m         Normal    NewArtifact                  gitrepository/flux-system                               stored artifact for commit 'SPI-138: add trace logging'
13m         Normal    GarbageCollectionSucceeded   gitrepository/flux-system                               garbage collected old artifacts
15m         Normal    ArtifactUpToDate             gitrepository/flux-system                               artifact up-to-date with remote revision: 'main/a10c49d6b65c81bc12541530a8148bfdb750e8ab'
14m         Normal    NewArtifact                  gitrepository/flux-system                               stored artifact for commit 'Update Flux to 0.28.5 (#642)'
6m42s       Normal    ArtifactUpToDate             gitrepository/flux-system                               artifact up-to-date with remote revision: 'main/32bb06318e1ae5c76775aa7c2a6b081f11d62fd9'
98s         Normal    ArtifactUpToDate             gitrepository/flux-system                               artifact up-to-date with remote revision: 'main/21bc2f10fa8e587788ffa3231772724d3f04dccd'
42m         Normal    Killing                      pod/image-automation-controller-546f4c8c46-m9k6r        Stopping container manager
42m         Normal    SuccessfulDelete             replicaset/image-automation-controller-546f4c8c46       Deleted pod: image-automation-controller-546f4c8c46-m9k6r
31s         Normal    Scheduled                    pod/image-automation-controller-6dc48cb94-pwg76         Successfully assigned flux-system/image-automation-controller-6dc48cb94-pwg76 to gke-pt-dev-default-090510cc-b305
29s         Normal    Pulled                       pod/image-automation-controller-6dc48cb94-pwg76         Container image "fluxcd/image-automation-controller:v0.20.1" already present on machine
29s         Normal    Created                      pod/image-automation-controller-6dc48cb94-pwg76         Created container manager
29s         Normal    Started                      pod/image-automation-controller-6dc48cb94-pwg76         Started container manager
27s         Warning   Unhealthy                    pod/image-automation-controller-6dc48cb94-pwg76         Readiness probe failed: Get "http://10.6.5.100:9440/readyz": dial tcp 10.6.5.100:9440: connect: connection refused
31s         Normal    SuccessfulCreate             replicaset/image-automation-controller-6dc48cb94        Created pod: image-automation-controller-6dc48cb94-pwg76
42m         Normal    Scheduled                    pod/image-automation-controller-78948c9bcf-hwhl2        Successfully assigned flux-system/image-automation-controller-78948c9bcf-hwhl2 to gke-pt-dev-default-d26ad449-4w1l
42m         Normal    Pulling                      pod/image-automation-controller-78948c9bcf-hwhl2        Pulling image "fluxcd/image-automation-controller:v0.21.3"
42m         Normal    Pulled                       pod/image-automation-controller-78948c9bcf-hwhl2        Successfully pulled image "fluxcd/image-automation-controller:v0.21.3" in 4.212085528s
3m37s       Normal    Created                      pod/image-automation-controller-78948c9bcf-hwhl2        Created container manager
3m36s       Normal    Started                      pod/image-automation-controller-78948c9bcf-hwhl2        Started container manager
7m43s       Warning   Unhealthy                    pod/image-automation-controller-78948c9bcf-hwhl2        Readiness probe failed: Get "http://10.6.3.188:9440/readyz": dial tcp 10.6.3.188:9440: connect: connection refused
3m37s       Normal    Pulled                       pod/image-automation-controller-78948c9bcf-hwhl2        Container image "fluxcd/image-automation-controller:v0.21.3" already present on machine
3m48s       Warning   BackOff                      pod/image-automation-controller-78948c9bcf-hwhl2        Back-off restarting failed container
42m         Normal    SuccessfulCreate             replicaset/image-automation-controller-78948c9bcf       Created pod: image-automation-controller-78948c9bcf-hwhl2
21s         Normal    SuccessfulDelete             replicaset/image-automation-controller-78948c9bcf       Deleted pod: image-automation-controller-78948c9bcf-hwhl2
42m         Normal    LeaderElection               configmap/image-automation-controller-leader-election   image-automation-controller-78948c9bcf-hwhl2_a49c08ff-4432-4c7f-89ab-71f28e4e7278 became leader
42m         Normal    LeaderElection               lease/image-automation-controller-leader-election       image-automation-controller-78948c9bcf-hwhl2_a49c08ff-4432-4c7f-89ab-71f28e4e7278 became leader
8m11s       Normal    LeaderElection               configmap/image-automation-controller-leader-election   image-automation-controller-78948c9bcf-hwhl2_19ed182f-c0d3-4adc-b59f-29a4f0395c36 became leader
8m11s       Normal    LeaderElection               lease/image-automation-controller-leader-election       image-automation-controller-78948c9bcf-hwhl2_19ed182f-c0d3-4adc-b59f-29a4f0395c36 became leader
7m1s        Normal    LeaderElection               configmap/image-automation-controller-leader-election   image-automation-controller-78948c9bcf-hwhl2_f2daa8e9-d97d-415c-b53a-a1224b96f50a became leader
7m1s        Normal    LeaderElection               lease/image-automation-controller-leader-election       image-automation-controller-78948c9bcf-hwhl2_f2daa8e9-d97d-415c-b53a-a1224b96f50a became leader
4m50s       Normal    LeaderElection               configmap/image-automation-controller-leader-election   image-automation-controller-78948c9bcf-hwhl2_fd07af93-0494-494a-88be-4e8e576bc123 became leader
4m50s       Normal    LeaderElection               lease/image-automation-controller-leader-election       image-automation-controller-78948c9bcf-hwhl2_fd07af93-0494-494a-88be-4e8e576bc123 became leader
2m58s       Normal    LeaderElection               configmap/image-automation-controller-leader-election   image-automation-controller-78948c9bcf-hwhl2_59ae06da-75ab-4eea-9f94-8605cc6c6304 became leader
2m58s       Normal    LeaderElection               lease/image-automation-controller-leader-election       image-automation-controller-78948c9bcf-hwhl2_59ae06da-75ab-4eea-9f94-8605cc6c6304 became leader
42m         Normal    ScalingReplicaSet            deployment/image-automation-controller                  Scaled up replica set image-automation-controller-78948c9bcf to 1
42m         Normal    ScalingReplicaSet            deployment/image-automation-controller                  Scaled down replica set image-automation-controller-546f4c8c46 to 0
31s         Normal    ScalingReplicaSet            deployment/image-automation-controller                  Scaled up replica set image-automation-controller-6dc48cb94 to 1
21s         Normal    ScalingReplicaSet            deployment/image-automation-controller                  Scaled down replica set image-automation-controller-78948c9bcf to 0
42m         Normal    Killing                      pod/source-controller-7644467db-5pkqm                   Stopping container manager
42m         Normal    SuccessfulDelete             replicaset/source-controller-7644467db                  Deleted pod: source-controller-7644467db-5pkqm
42m         Normal    Scheduled                    pod/source-controller-95684c946-6fwh5                   Successfully assigned flux-system/source-controller-95684c946-6fwh5 to gke-pt-dev-default-090510cc-brz7
42m         Normal    Pulling                      pod/source-controller-95684c946-6fwh5                   Pulling image "fluxcd/source-controller:v0.22.5"
42m         Normal    Pulled                       pod/source-controller-95684c946-6fwh5                   Successfully pulled image "fluxcd/source-controller:v0.22.5" in 4.206627015s
42m         Normal    Created                      pod/source-controller-95684c946-6fwh5                   Created container manager
42m         Normal    Started                      pod/source-controller-95684c946-6fwh5                   Started container manager
42m         Warning   Unhealthy                    pod/source-controller-95684c946-6fwh5                   Readiness probe failed: Get "http://10.6.0.76:9090/": dial tcp 10.6.0.76:9090: connect: connection refused
42m         Normal    SuccessfulCreate             replicaset/source-controller-95684c946                  Created pod: source-controller-95684c946-6fwh5
42m         Normal    LeaderElection               configmap/source-controller-leader-election             source-controller-95684c946-6fwh5_45489842-c150-48bd-bb8e-8c82ee5ca62d became leader
42m         Normal    LeaderElection               lease/source-controller-leader-election                 source-controller-95684c946-6fwh5_45489842-c150-48bd-bb8e-8c82ee5ca62d became leader
42m         Normal    ScalingReplicaSet            deployment/source-controller                            Scaled up replica set source-controller-95684c946 to 1
42m         Normal    ScalingReplicaSet            deployment/source-controller                            Scaled down replica set source-controller-7644467db to 0

I've upgraded and downgraded the service during this time period; we can't leave the service like this because it blocks our development :)

@pjbgf
Member

pjbgf commented Mar 31, 2022

@saltyblu Thanks again for the quick reply.

We are investigating this and still trying to reproduce. Can you please provide more details about your cluster? What version of Kubernetes are you running on?

It seems that the controller is working properly, but the probes are misbehaving for some reason in your setup. Do you mind disabling both livenessProbe and readinessProbe temporarily in your deployment to confirm the theory?
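
A minimal patch for that could look like the sketch below, in the same style as the trace-level patch above (assuming the probes sit on the first container of the stock deployment):

patches:
- patch: |
    - op: remove
      path: /spec/template/spec/containers/0/livenessProbe
    - op: remove
      path: /spec/template/spec/containers/0/readinessProbe
  target:
    kind: Deployment
    name: "image-automation-controller"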

@saltyblu
Author

Kubernetes version: v1.22.7-gke.300, with 6 nodes.

62 Flux Kustomizations
31 image updates/repositories
31 Git sources
1 flux-system Git repository that takes care of all image update automations.

@pjbgf
Member

pjbgf commented Mar 31, 2022

Do you have any additional NetworkPolicy applied to that namespace that could potentially be blocking the access to that endpoint?

@saltyblu
Author

We have three in place, but no new ones:

NAMESPACE     NAME             POD-SELECTOR                  AGE
flux-system   allow-egress     <none>                        181d
flux-system   allow-scraping   <none>                        181d
flux-system   allow-webhooks   app=notification-controller   181d

@saltyblu
Author

I've also disabled the liveness and readiness probes; directly after triggering an image update automation, the pod dies and gets recreated :(

@saltyblu
Author

saltyblu commented Apr 5, 2022

After some debugging, the container seems to silently fail somewhere below this line of code...

var signingEntity *openpgp.Entity

Even when debugging the controller with Delve, we are not able to pinpoint a specific line of code where the controller stops working.

@brettstewart

We are having a similar issue. Any time there is a NEW change that needs to be committed, the automation controller crashes and ends up in a crash loop until we suspend the automation.

This is the last line we get in our logs before the crash: https://github.com/fluxcd/image-automation-controller/blob/main/controllers/imageupdateautomation_controller.go#L340

We DO NOT make it to this line: https://github.com/fluxcd/image-automation-controller/blob/main/controllers/imageupdateautomation_controller.go#L371

We think the failure is occurring around here, since the crash only happens when there is a new change: https://github.com/fluxcd/image-automation-controller/blob/main/controllers/imageupdateautomation_controller.go#L378

Images

helm-controller               ghcr.io/fluxcd/helm-controller:v0.18.2
image-automation-controller   ghcr.io/fluxcd/image-automation-controller:v0.21.3
image-reflector-controller    ghcr.io/fluxcd/image-reflector-controller:v0.17.1
kustomize-controller          ghcr.io/fluxcd/kustomize-controller:v0.22.3
notification-controller       ghcr.io/fluxcd/notification-controller:v0.23.2
source-controller             ghcr.io/fluxcd/source-controller:v0.22.5

Env

- name: EXPERIMENTAL_GIT_TRANSPORT
  value: "true"

Scale
Kustomizations: 80
Helm Releases: 166

Additional Information based on previous comments

  • No NetworkPolicy would be blocking this.

@pjbgf pjbgf self-assigned this Apr 8, 2022
@pjbgf pjbgf added the bug Something isn't working label Apr 8, 2022
@pjbgf
Member

pjbgf commented Apr 8, 2022

We have recently merged some fixes in source-controller, but they haven't made their way into image-automation-controller yet. I managed to reproduce some of the issues you mentioned, so I am assuming this may also resolve your situation.

The version below is working quite well with most Git providers when using SSH, with the exception of Bitbucket and Azure DevOps. On that point, would you be able to confirm which Git providers you are using, @saltyblu @brettstewart?

If you could test ahead of the release and confirm whether it fixes or improves your issue, that would be great.
In the meantime, I created a draft PR, so you can either:

  1. Wait until the changes are released (at some point next week).
  2. Test changes using my image (see the sketch below): quay.io/paulinhu/image-automation-controller:selfheal@sha256:0c78d0ff5aec404c724b623e9d4ae00d5c0348320c397626900d1e073e1d7ac5
  3. Create your own image using the PR and running: IMG=repo/and/image-name TAG=your-tag BUILD_PLATFORMS="linux/amd64" BUILD_ARGS="--push" make docker-build
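
For option 2, one way to point the controller at the test image is an images override in the flux-system kustomization.yaml (a sketch; it assumes the manifests reference the default fluxcd/image-automation-controller image name):

images:
- name: fluxcd/image-automation-controller
  newName: quay.io/paulinhu/image-automation-controller
  digest: sha256:0c78d0ff5aec404c724b623e9d4ae00d5c0348320c397626900d1e073e1d7ac5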

@brettstewart

@pjbgf GitHub. I reviewed the draft PR; we will get an image spun up and report back.

Thanks for the quick update!

@saltyblu
Author

Hi,
we'll also use the draft image to check if the issue still exists.

Currently we are using only GitHub Enterprise as our Git provider.

Thanks for the updates

@saltyblu saltyblu changed the title from "image-automation-controller CrashLoopBack 0.20.2" to "image-automation-controller CrashLoopBack >= 0.20.2" on Apr 11, 2022
@saltyblu
Author

We have used the updated image for half a day now, and the issue does not seem to be solved. The controller is able to update one or two ImageUpdateAutomations; after a random time period it fails silently again (i.e. without any error logs).

@thedarkwriter

thedarkwriter commented Apr 11, 2022

I can concur; I'm getting the exact same behavior.
The only strange thing to add: if I change the logging from info to debug, the very next run updates the image to the newest tag, then the crash loop starts again.
Nothing in the logs indicates the cause of the failure.

I also tried the @pjbgf container quay.io/paulinhu/image-automation-controller:selfheal@sha256:0c78d0ff5aec404c724b623e9d4ae00d5c0348320c397626900d1e073e1d7ac5

Same behavior. This was the result I saw:

  1. New pod ran fine for 5 minutes, did the one update.
  2. I pushed an image update; it crashed almost immediately.
  3. Now it is in CrashLoopBackOff:
    image-automation-controller-84f4ffbd58-68qrd 0/1 CrashLoopBackOff 1 6m19s
  4. It's now running again... let's see if it deploys the new tag.
  5. Nope, crashed.
  6. It logged this, then crashed again:
{"level":"info","ts":"2022-04-11T15:51:22.660Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2022-04-11T15:51:22.660Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2022-04-11T15:51:22.661Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
{"level":"info","ts":"2022-04-11T15:51:22.661Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
I0411 15:51:22.761525       1 leaderelection.go:248] attempting to acquire leader lease flux-system/image-automation-controller-leader-election...
I0411 15:52:05.948297       1 leaderelection.go:258] successfully acquired lease flux-system/image-automation-controller-leader-election
{"level":"debug","ts":"2022-04-11T15:52:05.948Z","logger":"events","msg":"Normal","object":{"kind":"ConfigMap","namespace":"flux-system","name":"image-automation-controller-leader-election","uid":"b9d0bcfc-7f0e-4121-850f-615af567a1a8","apiVersion":"v1","resourceVersion":"11041528"},"reason":"LeaderElection","message":"image-automation-controller-84f4ffbd58-68qrd_bb77fce1-771f-45c3-b1bf-4724c0f55d2e became leader"}
{"level":"debug","ts":"2022-04-11T15:52:05.948Z","logger":"events","msg":"Normal","object":{"kind":"Lease","namespace":"flux-system","name":"image-automation-controller-leader-election","uid":"77a4a51c-22d3-49f4-ba90-0763a1e7d7f1","apiVersion":"coordination.k8s.io/v1","resourceVersion":"11041529"},"reason":"LeaderElection","message":"image-automation-controller-84f4ffbd58-68qrd_bb77fce1-771f-45c3-b1bf-4724c0f55d2e became leader"}
{"level":"info","ts":"2022-04-11T15:52:05.948Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImageUpdateAutomation"}
{"level":"info","ts":"2022-04-11T15:52:05.948Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta2.GitRepository"}
{"level":"info","ts":"2022-04-11T15:52:05.948Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImagePolicy"}
{"level":"info","ts":"2022-04-11T15:52:05.948Z","logger":"controller.imageupdateautomation","msg":"Starting Controller","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation"}
{"level":"info","ts":"2022-04-11T15:52:06.049Z","logger":"controller.imageupdateautomation","msg":"Starting workers","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","worker count":4}
{"level":"debug","ts":"2022-04-11T15:52:06.050Z","logger":"controller.imageupdateautomation","msg":"fetching git repository","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"redis-temp-image-update-automation","namespace":"flux-system","gitrepository":{"namespace":"flux-system","name":"flux-system"}}
{"level":"debug","ts":"2022-04-11T15:52:06.050Z","logger":"controller.imageupdateautomation","msg":"attempting to clone git repository","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"redis-temp-image-update-automation","namespace":"flux-system","gitrepository":{"namespace":"flux-system","name":"flux-system"},"ref":{"branch":"main"},"working":"/tmp/flux-system-flux-system2014439465"}
{"level":"debug","ts":"2022-04-11T15:52:06.402Z","logger":"controller.imageupdateautomation","msg":"updating with setters according to image policies","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"redis-temp-image-update-automation","namespace":"flux-system","count":1,"manifests-path":"/tmp/flux-system-flux-system2014439465/clusters/apps/dev/redis"}
{"level":"debug","ts":"2022-04-11T15:52:06.404Z","logger":"controller.imageupdateautomation","msg":"ran updates to working dir","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"redis-temp-image-update-automation","namespace":"flux-system","working":"/tmp/flux-system-flux-system2014439465"}

It's now cycling through this last behavior, which is what I was seeing before changing image-automation-controller images.

@annaken

annaken commented Apr 13, 2022

We were seeing exactly this behaviour in one of our clusters, and it seemed like the pod wasn't able to burst high enough, so we increased the limits on the image-automation-controller to

limits:
  cpu: 1.5
  memory: 2.5Gi

and also increased the Git source interval from 2m to 5m. We haven't had any crashes since then.

NB this is only happening in the cluster that has multiple Git sources, including "include" sources. The cluster that pulls everything from a single Git source is fine.
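
As a kustomize patch against the flux-system deployment, that limits change might look roughly like this (a sketch; the container index assumes the stock layout):

patches:
- patch: |
    - op: replace
      path: /spec/template/spec/containers/0/resources/limits
      value:
        cpu: 1.5
        memory: 2.5Gi
  target:
    kind: Deployment
    name: "image-automation-controller"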

@thedarkwriter

Just to offer a second test: I'm using a single-repo cluster and upped my resources like @annaken. Still a lot of crashes:

image-automation-controller-665dfcf789-xv6kx   1/1     Running     18         4h24m

But the images are eventually updated, so the image-automation-controller is still working, just not stably.

@saltyblu
Author

We tested the image-automation-controller in a freshly created GKE cluster with one additional repo besides the flux-system one, and it is still failing.

In our case the controller often misses updates; we downgraded the controller back to 0.20.1 for now. :(

@leehanel

I added some tracing and built an image using the above method. It seems to be silently crashing in:

if err := index.AddByPath(rel); err != nil {

Specifically, I added these traces around it:

tracelog.Info("Fixing to try index.AddByPath(rel)", "rel", rel)
if err := index.AddByPath(rel); err != nil {
    tracelog.Info("adding file", "file", rel)
    return err
}
tracelog.Info("Got error adding file", "file", rel)
tracelog.Info("Got error adding file", "error", err) 

I set the log level to trace, and the last log I got from it was as follows:

{"level":"Level(-2)","ts":"2022-04-14T15:49:36.428Z","logger":"controller.imageupdateautomation","msg":"Fixing to try index.AddByPath(rel)","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"demo","namespace":"demo","rel":".editorconfig"}

We never get to the "adding file" tracelog line inside the if, nor to the additional traces I added after it.

To note: .editorconfig is the first file in an ls of the Git repo.

We have reverted to 0.20.1 as well, and that seems stable so far.

@saltyblu
Author

I've already tested the new image-automation-controller, and it did not solve the problem. The controller still crashes.

@pjbgf
Member

pjbgf commented May 6, 2022

@saltyblu I am working on a few changes; some are already merged into source-controller (panic recovery), others are in an open PR.

Would you mind testing again with the following images and seeing whether you still experience the crashes?

source-controller: quay.io/paulinhu/source-controller:v0.24.4-cacheless@sha256:61930cad1da900f209b396f20c2f7740ff32b5cf1bb4ab7892200790c00a5f4b

image-automation-controller: quay.io/paulinhu/image-automation-controller:v0.22.2-cacheless@sha256:87823667cfc4c6e395d996ceaee92a1b5059a8950884f2c3aa49488dcbed81f5

More details on:
fluxcd/source-controller#713 (comment)

UPDATE: changed IAC image to v0.22.2-cacheless

@saltyblu
Author

saltyblu commented May 10, 2022

Both images are source-controller images.

On startup, the "image-automation-controller" image appears to be a source-controller build; could you check the uploaded images?

@robertscherbarth

Is there any news on this bug to share?

@pjbgf
Member

pjbgf commented Jul 12, 2022

@robertscherbarth I am working on an RC version with a potential fix, aiming to release it today.

pjbgf pushed a commit to fluxcd/golang-with-libgit2 that referenced this issue Jul 12, 2022
Some users are experiencing some segfault issues in Flux controllers
which seem to be related to internal git2go state processed within a
background thread.
Due to the shift into Managed Transport only, the concerns
around multi-threading from a libgit2 seem to be less problematic
than otherwise.

fluxcd/image-automation-controller#339
https://github.com/libgit2/libgit2/blob/main/docs/threading.md

Signed-off-by: Paulo Gomes <paulo.gomes@weave.works>

pjbgf pushed commits with the same message to fluxcd/source-controller, pjbgf/source-controller, and twice to this repository on Jul 12, 2022.
@pjbgf
Member

pjbgf commented Jul 12, 2022

Here's the release candidate image: ghcr.io/fluxcd/image-automation-controller:rc-41c0ddc4.
It would be great if folks experiencing this issue could test it and let us know whether or not it fixes it.

@pjbgf pjbgf added this to the GA milestone Jul 12, 2022
@pjbgf
Member

pjbgf commented Jul 12, 2022

For everybody else's benefit, two users experiencing this issue have already confirmed the RC version worked. So we will start working towards getting this merged. Based on the upstream dependencies this may take some time to make its way into the official releases.

Please report any issues or confirm (✔️) that this is working for you.

@BertelBB

✔️ ghcr.io/fluxcd/image-automation-controller:rc-41c0ddc4 works 👍

@sohahm

sohahm commented Jul 13, 2022

@pjbgf Thank you very much, Paulo. The RC version rc-41c0ddc4 is working.

@saltyblu
Author

Looks good. The controller is still crashing sometimes, but that's maybe another topic.

goroutine 423 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc00048da68, 0x5a5)
        runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0xc001633a20?)
        sync/cond.go:56 +0x8c
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0xc00048da40, 0xc0009856d0)
        k8s.io/client-go@v0.24.1/tools/cache/delta_fifo.go:527 +0x22e
k8s.io/client-go/tools/cache.(*controller).processLoop(0xc000ad23f0)
        k8s.io/client-go@v0.24.1/tools/cache/controller.go:184 +0x36
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x40e8c5?)
        k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xf94ce5?, {0x20724a0, 0xc000830780}, 0x1, 0xc000b6a0c0)
        k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000ad2458?, 0x3b9aca00, 0x0, 0x80?, 0x7fa87d0e4040?)
        k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
        k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*controller).Run(0xc000ad23f0, 0xc000b6a0c0)
        k8s.io/client-go@v0.24.1/tools/cache/controller.go:155 +0x2c5
k8s.io/client-go/tools/cache.(*sharedIndexInformer).Run(0xc0009f2aa0, 0x0?)
        k8s.io/client-go@v0.24.1/tools/cache/shared_informer.go:443 +0x47c
created by sigs.k8s.io/controller-runtime/pkg/cache/internal.(*specificInformersMap).Start.func1
        sigs.k8s.io/controller-runtime@v0.11.2/pkg/cache/internal/informers_map.go:152 +0x10c

@pjbgf
Member

pjbgf commented Jul 18, 2022

@saltyblu thanks for sharing this. The log above looks like the end of a goroutine dump. Do you have access to the entire logs, which may provide further insight into the error?

@saltyblu
Author

image-automation-controller.log

pjbgf pushed a commit to pjbgf/source-controller that referenced this issue Jul 31, 2022
Use of MUSL was a temporary solution to mitigate cross-platform
issues while building openssl and libssh2. Since Unmanaged transport has
been deprecated, openssl and libssh2 dependencies are no longer required
and by extension MUSL.

Enables libgit2 threadless support and provides a regression assurance
for fluxcd/image-automation-controller#339.

Signed-off-by: Paulo Gomes <paulo.gomes@weave.works>

pjbgf pushed further commits with the same message to pjbgf/source-controller and (several times) to pjbgf/image-automation-controller on Jul 31, 2022.
@pjbgf
Member

pjbgf commented Aug 10, 2022

@saltyblu this has now been fixed as of yesterday's release: ghcr.io/fluxcd/image-automation-controller:v0.24.0.

The second issue you mentioned will be tracked separately as part of #415.

Thank you for reporting and testing the changes. 🙇

souleb pushed a commit with the same message to souleb/image-automation-controller that referenced this issue Mar 12, 2024.