
image-automation-controller CrashLoopBack >= 0.20.2 #339

Closed
saltyblu opened this issue Mar 30, 2022 · 71 comments · Fixed by #411
Labels: bug Something isn't working

@saltyblu

saltyblu commented Mar 30, 2022

Issue

Since upgrading to 0.20.2, the image-automation-controller has stopped working.

flux reconcile image update runs forever.

The image-automation-controller fails periodically.
/readyz does not return 200, but also provides no information.

The service produces no logs for the failures.
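
One way to check the probe endpoint directly is to port-forward to the controller and query it (a sketch, assuming the default health port 9440):

$ kubectl -n flux-system port-forward deploy/image-automation-controller 9440:9440
$ curl -i http://localhost:9440/readyz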

How to reproduce

In our case: update the image-automation-controller to > 0.20.1.
Trigger an image update.

@pjbgf
Member

pjbgf commented Mar 30, 2022

@saltyblu thank you very much for reporting this. Would you be able to share with us the output of the commands below:

kubectl logs -n flux-system -l app=image-automation-controller

And the versions of your other controllers via:

flux version

I noticed that you are not running the latest version of the image-automation-controller. Would you be able to run against the latest version and see if the error persists, please?

@saltyblu
Author

saltyblu commented Mar 30, 2022

I've upgraded the image-automation-controller back to the latest version, I believe.

flux version

$ flux version 
flux: v0.28.4
helm-controller: v0.18.2
image-automation-controller: v0.21.2
image-reflector-controller: v0.17.1
kustomize-controller: v0.22.2
notification-controller: v0.23.1
source-controller: v0.22.4

The image-automation-controller has already restarted:

$ kubectl get pods -n flux-system
NAME                                          READY   STATUS    RESTARTS       AGE
helm-controller-bdf5dcf88-xbrdw               1/1     Running   0              47h
image-automation-controller-849f7965f-g7kqs   1/1     Running   2 (7m2s ago)   21m
image-reflector-controller-65d8b46696-zql8j   1/1     Running   1 (2d1h ago)   4d4h
kustomize-controller-7c9d88bcf7-zmc5k         1/1     Running   0              47h
notification-controller-665b85d44c-2b585      1/1     Running   1 (2d1h ago)   4d4h
source-controller-7c5cb99569-4zj48            1/1     Running   0              152m
$ kubectl logs -n flux-system -l app=image-automation-controller


{"level":"info","ts":"2022-03-30T09:53:13.002Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
{"level":"info","ts":"2022-03-30T09:53:13.002Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
I0330 09:53:13.102182       1 leaderelection.go:248] attempting to acquire leader lease flux-system/image-automation-controller-leader-election...
I0330 09:53:52.171972       1 leaderelection.go:258] successfully acquired lease flux-system/image-automation-controller-leader-election
{"level":"info","ts":"2022-03-30T09:53:52.172Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImageUpdateAutomation"}
{"level":"info","ts":"2022-03-30T09:53:52.173Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta2.GitRepository"}
{"level":"info","ts":"2022-03-30T09:53:52.173Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImagePolicy"}
{"level":"info","ts":"2022-03-30T09:53:52.173Z","logger":"controller.imageupdateautomation","msg":"Starting Controller","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation"}
{"level":"info","ts":"2022-03-30T09:53:52.274Z","logger":"controller.imageupdateautomation","msg":"Starting workers","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","worker count":4}
{"level":"info","ts":"2022-03-30T09:54:43.342Z","logger":"controller.imageupdateautomation","msg":"pushed commit to origin","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","revision":"44bdac384742c82f85a49a1aa8a27be0d5d69393","branch":"main"}
$ kubectl describe pod -n flux-system image-automation-controller-849f7965f-g7kqs
Name:                 image-automation-controller-849f7965f-g7kqs
Namespace:            flux-system
Priority:             100000
Priority Class Name:  default
Node:                 gke-pt-dev-default-090510cc-brz7/10.5.0.44
Start Time:           Wed, 30 Mar 2022 11:38:38 +0200
Labels:               app=image-automation-controller
                      pod-template-hash=849f7965f
Annotations:          container.seccomp.security.alpha.kubernetes.io/manager: runtime/default
                      prometheus.io/port: 8080
                      prometheus.io/scrape: true
Status:               Running
IP:                   10.6.0.61
IPs:
  IP:           10.6.0.61
Controlled By:  ReplicaSet/image-automation-controller-849f7965f
Containers:
  manager:
    Container ID:  containerd://25f0fe05107b604b1bb55209891c247d4549a342ac91521b573c64a65ada94d6
    Image:         fluxcd/image-automation-controller:v0.21.2
    Image ID:      docker.io/fluxcd/image-automation-controller@sha256:b25892f1829261f2175f36773a742bc67e5be16edb6884676a21136f74a31e98
    Ports:         8080/TCP, 9440/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --events-addr=http://notification-controller.flux-system.svc.cluster.local./
      --watch-all-namespaces
      --log-level=info
      --log-encoding=json
      --enable-leader-election
    State:          Running
      Started:      Wed, 30 Mar 2022 11:53:12 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    139
      Started:      Wed, 30 Mar 2022 11:51:06 +0200
      Finished:     Wed, 30 Mar 2022 11:52:53 +0200
    Ready:          True
    Restart Count:  2
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   64Mi
    Liveness:   http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RUNTIME_NAMESPACE:  flux-system (v1:metadata.namespace)
    Mounts:
      /tmp from temp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h4pgk (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  temp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-h4pgk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  22m                    default-scheduler  Successfully assigned flux-system/image-automation-controller-849f7965f-g7kqs to gke-pt-dev-default-090510cc-brz7
  Normal   Pulling    22m                    kubelet            Pulling image "fluxcd/image-automation-controller:v0.21.2"
  Normal   Pulled     22m                    kubelet            Successfully pulled image "fluxcd/image-automation-controller:v0.21.2" in 4.261787274s
  Warning  BackOff    8m24s (x3 over 8m30s)  kubelet            Back-off restarting failed container
  Normal   Created    8m12s (x3 over 22m)    kubelet            Created container manager
  Normal   Pulled     8m12s (x2 over 10m)    kubelet            Container image "fluxcd/image-automation-controller:v0.21.2" already present on machine
  Normal   Started    8m11s (x3 over 22m)    kubelet            Started container manager
  Warning  Unhealthy  8m11s (x2 over 10m)    kubelet            Readiness probe failed: Get "http://10.6.0.61:9440/readyz": dial tcp 10.6.0.61:9440: connect: connection refused
$ kubectl logs -n flux-system image-automation-controller-849f7965f-g7kqs
{"level":"info","ts":"2022-03-30T09:53:13.000Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2022-03-30T09:53:13.001Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2022-03-30T09:53:13.002Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
{"level":"info","ts":"2022-03-30T09:53:13.002Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
I0330 09:53:13.102182       1 leaderelection.go:248] attempting to acquire leader lease flux-system/image-automation-controller-leader-election...
I0330 09:53:52.171972       1 leaderelection.go:258] successfully acquired lease flux-system/image-automation-controller-leader-election
{"level":"info","ts":"2022-03-30T09:53:52.172Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImageUpdateAutomation"}
{"level":"info","ts":"2022-03-30T09:53:52.173Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta2.GitRepository"}
{"level":"info","ts":"2022-03-30T09:53:52.173Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImagePolicy"}
{"level":"info","ts":"2022-03-30T09:53:52.173Z","logger":"controller.imageupdateautomation","msg":"Starting Controller","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation"}
{"level":"info","ts":"2022-03-30T09:53:52.274Z","logger":"controller.imageupdateautomation","msg":"Starting workers","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","worker count":4}
{"level":"info","ts":"2022-03-30T09:54:43.342Z","logger":"controller.imageupdateautomation","msg":"pushed commit to origin","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","revision":"44bdac384742c82f85a49a1aa8a27be0d5d69393","branch":"main"}

$ kubectl logs -n flux-system image-automation-controller-849f7965f-g7kqs --previous
{"level":"info","ts":"2022-03-30T09:51:07.020Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2022-03-30T09:51:07.023Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2022-03-30T09:51:07.023Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":"2022-03-30T09:51:07.024Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
I0330 09:51:07.124255       1 leaderelection.go:248] attempting to acquire leader lease flux-system/image-automation-controller-leader-election...
I0330 09:51:43.246877       1 leaderelection.go:258] successfully acquired lease flux-system/image-automation-controller-leader-election
{"level":"info","ts":"2022-03-30T09:51:43.247Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImageUpdateAutomation"}
{"level":"info","ts":"2022-03-30T09:51:43.247Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta2.GitRepository"}
{"level":"info","ts":"2022-03-30T09:51:43.247Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImagePolicy"}
{"level":"info","ts":"2022-03-30T09:51:43.247Z","logger":"controller.imageupdateautomation","msg":"Starting Controller","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation"}
{"level":"info","ts":"2022-03-30T09:51:43.349Z","logger":"controller.imageupdateautomation","msg":"Starting workers","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","worker count":4}

@saltyblu
Author

The image-automation-controller is in CrashLoopBackOff:

$ kubectl get pods -n flux-system
NAME                                          READY   STATUS             RESTARTS       AGE
helm-controller-bdf5dcf88-xbrdw               1/1     Running            0              2d1h
image-automation-controller-849f7965f-g7kqs   0/1     CrashLoopBackOff   7 (82s ago)    178m
image-reflector-controller-65d8b46696-zql8j   1/1     Running            1 (2d4h ago)   4d7h
kustomize-controller-7c9d88bcf7-zmc5k         1/1     Running            0              2d1h
notification-controller-665b85d44c-2b585      1/1     Running            1 (2d4h ago)   4d7h
source-controller-7c5cb99569-4zj48            1/1     Running            0              5h8m

The pod logs are still not very informative:

$ kubectl logs -n flux-system -l app=image-automation-controller
{"level":"info","ts":"2022-03-30T12:38:11.936Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2022-03-30T12:38:11.937Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2022-03-30T12:38:11.938Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":"2022-03-30T12:38:11.938Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
I0330 12:38:12.039777       1 leaderelection.go:248] attempting to acquire leader lease flux-system/image-automation-controller-leader-election...
I0330 12:38:49.663889       1 leaderelection.go:258] successfully acquired lease flux-system/image-automation-controller-leader-election
{"level":"info","ts":"2022-03-30T12:38:49.664Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImageUpdateAutomation"}
{"level":"info","ts":"2022-03-30T12:38:49.664Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta2.GitRepository"}
{"level":"info","ts":"2022-03-30T12:38:49.664Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImagePolicy"}
{"level":"info","ts":"2022-03-30T12:38:49.665Z","logger":"controller.imageupdateautomation","msg":"Starting Controller","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation"}
$ kubectl describe pod -n flux-system image-automation-controller-849f7965f-g7kqs

<snip>
Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Normal   Started    12m (x6 over 3h1m)     kubelet  Started container manager
  Warning  Unhealthy  12m (x5 over 168m)     kubelet  Readiness probe failed: Get "http://10.6.0.61:9440/readyz": dial tcp 10.6.0.61:9440: connect: connection refused
  Normal   Created    9m36s (x7 over 3h1m)   kubelet  Created container manager
  Normal   Pulled     9m36s (x6 over 169m)   kubelet  Container image "fluxcd/image-automation-controller:v0.21.2" already present on machine
  Warning  BackOff    4m41s (x24 over 167m)  kubelet  Back-off restarting failed container

@agrib-01

Same problem.

@saltyblu
Author

  • The issue still occurs with the latest Flux version.
  • The debug output shows nothing more.
  • The experimental git transport feature also does not solve the issue.

The image-automation-controller fails silently with: Readiness probe failed: Get "http://10.6.0.61:9440/readyz": dial tcp

@pjbgf
Member

pjbgf commented Mar 31, 2022

@saltyblu thank you very much for the information. Would you mind dialling up the log verbosity to trace level and sharing the logs again, please?

That can be configured via the controller's deployment arguments with --log-level=trace. Set directly in your kustomization.yaml, it would look like this:

patches:
- patch: |
    - op: replace
      path: /spec/template/spec/containers/0/args/2
      value: --log-level=trace
  target:
    kind: Deployment
    name: "image-automation-controller"

It would be great if you could also provide the results of: kubectl get events -n flux-system

@saltyblu
Author

saltyblu commented Mar 31, 2022

I did.
Sadly, there is nothing new in the logs :( (the entries below are newest first, as exported from our log viewer):

I0331 13:31:27.761838       1 leaderelection.go:248] attempting to acquire leader lease flux-system/image-automation-controller-leader-election...
{"level":"info","ts":"2022-03-31T13:31:27.660Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":"2022-03-31T13:31:27.660Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
{"level":"info","ts":"2022-03-31T13:31:27.660Z","logger":"setup","msg":"starting manager"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/goroutine"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/heap"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/trace"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/block"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/threadcreate"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/symbol"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/profile"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/cmdline"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:27.659Z","msg":"Registering metrics http server extra handler","path":"/debug/pprof/mutex"}
{"level":"info","ts":"2022-03-31T13:31:27.659Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"debug","ts":"2022-03-31T13:31:25.629Z","logger":"controller.imageupdateautomation","msg":"ran updates to working dir","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","working":"/tmp/flux-system-flux-system678623668"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:25.629Z","logger":"controller.imageupdateautomation","msg":"applying setter","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","setter":"playground:example-nginx:tag","old":"main-1647936368","new":"main-1647936369"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:25.629Z","logger":"controller.imageupdateautomation","msg":"found schema extension","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","path":".spec.images.newTag"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:25.628Z","logger":"controller.imageupdateautomation","msg":"applying setter","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","setter":"playground:example-nginx:name","old":"europe-west1-docker.pkg.dev/pt-cicd/docker-snapshots/example-nginx","new":"europe-west1-docker.pkg.dev/pt-cicd/docker-snapshots/example-nginx"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:25.628Z","logger":"controller.imageupdateautomation","msg":"found schema extension","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","path":".spec.images.newName"}
{"level":"Level(-2)","ts":"2022-03-31T13:31:25.628Z","logger":"controller.imageupdateautomation","msg":"reading file","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"example-nginx","namespace":"playground","path":"component.yaml"}

The service fails directly after trying to update an image tag, without any logs :(

We are using a GKE cluster and the GCP registry.

$ kubectl get events -n flux-system
LAST SEEN   TYPE      REASON                       OBJECT                                                  MESSAGE
3m13s       Normal    ReconciliationSucceeded      kustomization/flux-alertmanager-notifications           (combined from similar events): Reconciliation finished in 351.269513ms, next run in 5m0s
42m         Normal    Progressing                  kustomization/flux-system                               Deployment/flux-system/image-automation-controller configured...
2m44s       Normal    ReconciliationSucceeded      kustomization/flux-system                               (combined from similar events): Reconciliation finished in 8.447039598s, next run in 5m0s
45m         Normal    NewArtifact                  gitrepository/flux-system                               stored artifact for commit 'europe-west1-docker.pkg.dev/pt-cicd/docker-snapsho...'
50m         Normal    ArtifactUpToDate             gitrepository/flux-system                               artifact up-to-date with remote revision: 'main/5c750a9720c0fd1ee4c0bd7bf869692813609a32'
47m         Normal    Progressing                  kustomization/flux-system                               Kustomization/**/** configured
6m25s       Normal    Progressing                  kustomization/flux-system                               Kustomization/**/** configured
42m         Normal    NewArtifact                  gitrepository/flux-system                               stored artifact for commit 'SPI-138: add trace logging'
13m         Normal    GarbageCollectionSucceeded   gitrepository/flux-system                               garbage collected old artifacts
15m         Normal    ArtifactUpToDate             gitrepository/flux-system                               artifact up-to-date with remote revision: 'main/a10c49d6b65c81bc12541530a8148bfdb750e8ab'
14m         Normal    NewArtifact                  gitrepository/flux-system                               stored artifact for commit 'Update Flux to 0.28.5 (#642)'
6m42s       Normal    ArtifactUpToDate             gitrepository/flux-system                               artifact up-to-date with remote revision: 'main/32bb06318e1ae5c76775aa7c2a6b081f11d62fd9'
98s         Normal    ArtifactUpToDate             gitrepository/flux-system                               artifact up-to-date with remote revision: 'main/21bc2f10fa8e587788ffa3231772724d3f04dccd'
42m         Normal    Killing                      pod/image-automation-controller-546f4c8c46-m9k6r        Stopping container manager
42m         Normal    SuccessfulDelete             replicaset/image-automation-controller-546f4c8c46       Deleted pod: image-automation-controller-546f4c8c46-m9k6r
31s         Normal    Scheduled                    pod/image-automation-controller-6dc48cb94-pwg76         Successfully assigned flux-system/image-automation-controller-6dc48cb94-pwg76 to gke-pt-dev-default-090510cc-b305
29s         Normal    Pulled                       pod/image-automation-controller-6dc48cb94-pwg76         Container image "fluxcd/image-automation-controller:v0.20.1" already present on machine
29s         Normal    Created                      pod/image-automation-controller-6dc48cb94-pwg76         Created container manager
29s         Normal    Started                      pod/image-automation-controller-6dc48cb94-pwg76         Started container manager
27s         Warning   Unhealthy                    pod/image-automation-controller-6dc48cb94-pwg76         Readiness probe failed: Get "http://10.6.5.100:9440/readyz": dial tcp 10.6.5.100:9440: connect: connection refused
31s         Normal    SuccessfulCreate             replicaset/image-automation-controller-6dc48cb94        Created pod: image-automation-controller-6dc48cb94-pwg76
42m         Normal    Scheduled                    pod/image-automation-controller-78948c9bcf-hwhl2        Successfully assigned flux-system/image-automation-controller-78948c9bcf-hwhl2 to gke-pt-dev-default-d26ad449-4w1l
42m         Normal    Pulling                      pod/image-automation-controller-78948c9bcf-hwhl2        Pulling image "fluxcd/image-automation-controller:v0.21.3"
42m         Normal    Pulled                       pod/image-automation-controller-78948c9bcf-hwhl2        Successfully pulled image "fluxcd/image-automation-controller:v0.21.3" in 4.212085528s
3m37s       Normal    Created                      pod/image-automation-controller-78948c9bcf-hwhl2        Created container manager
3m36s       Normal    Started                      pod/image-automation-controller-78948c9bcf-hwhl2        Started container manager
7m43s       Warning   Unhealthy                    pod/image-automation-controller-78948c9bcf-hwhl2        Readiness probe failed: Get "http://10.6.3.188:9440/readyz": dial tcp 10.6.3.188:9440: connect: connection refused
3m37s       Normal    Pulled                       pod/image-automation-controller-78948c9bcf-hwhl2        Container image "fluxcd/image-automation-controller:v0.21.3" already present on machine
3m48s       Warning   BackOff                      pod/image-automation-controller-78948c9bcf-hwhl2        Back-off restarting failed container
42m         Normal    SuccessfulCreate             replicaset/image-automation-controller-78948c9bcf       Created pod: image-automation-controller-78948c9bcf-hwhl2
21s         Normal    SuccessfulDelete             replicaset/image-automation-controller-78948c9bcf       Deleted pod: image-automation-controller-78948c9bcf-hwhl2
42m         Normal    LeaderElection               configmap/image-automation-controller-leader-election   image-automation-controller-78948c9bcf-hwhl2_a49c08ff-4432-4c7f-89ab-71f28e4e7278 became leader
42m         Normal    LeaderElection               lease/image-automation-controller-leader-election       image-automation-controller-78948c9bcf-hwhl2_a49c08ff-4432-4c7f-89ab-71f28e4e7278 became leader
8m11s       Normal    LeaderElection               configmap/image-automation-controller-leader-election   image-automation-controller-78948c9bcf-hwhl2_19ed182f-c0d3-4adc-b59f-29a4f0395c36 became leader
8m11s       Normal    LeaderElection               lease/image-automation-controller-leader-election       image-automation-controller-78948c9bcf-hwhl2_19ed182f-c0d3-4adc-b59f-29a4f0395c36 became leader
7m1s        Normal    LeaderElection               configmap/image-automation-controller-leader-election   image-automation-controller-78948c9bcf-hwhl2_f2daa8e9-d97d-415c-b53a-a1224b96f50a became leader
7m1s        Normal    LeaderElection               lease/image-automation-controller-leader-election       image-automation-controller-78948c9bcf-hwhl2_f2daa8e9-d97d-415c-b53a-a1224b96f50a became leader
4m50s       Normal    LeaderElection               configmap/image-automation-controller-leader-election   image-automation-controller-78948c9bcf-hwhl2_fd07af93-0494-494a-88be-4e8e576bc123 became leader
4m50s       Normal    LeaderElection               lease/image-automation-controller-leader-election       image-automation-controller-78948c9bcf-hwhl2_fd07af93-0494-494a-88be-4e8e576bc123 became leader
2m58s       Normal    LeaderElection               configmap/image-automation-controller-leader-election   image-automation-controller-78948c9bcf-hwhl2_59ae06da-75ab-4eea-9f94-8605cc6c6304 became leader
2m58s       Normal    LeaderElection               lease/image-automation-controller-leader-election       image-automation-controller-78948c9bcf-hwhl2_59ae06da-75ab-4eea-9f94-8605cc6c6304 became leader
42m         Normal    ScalingReplicaSet            deployment/image-automation-controller                  Scaled up replica set image-automation-controller-78948c9bcf to 1
42m         Normal    ScalingReplicaSet            deployment/image-automation-controller                  Scaled down replica set image-automation-controller-546f4c8c46 to 0
31s         Normal    ScalingReplicaSet            deployment/image-automation-controller                  Scaled up replica set image-automation-controller-6dc48cb94 to 1
21s         Normal    ScalingReplicaSet            deployment/image-automation-controller                  Scaled down replica set image-automation-controller-78948c9bcf to 0
42m         Normal    Killing                      pod/source-controller-7644467db-5pkqm                   Stopping container manager
42m         Normal    SuccessfulDelete             replicaset/source-controller-7644467db                  Deleted pod: source-controller-7644467db-5pkqm
42m         Normal    Scheduled                    pod/source-controller-95684c946-6fwh5                   Successfully assigned flux-system/source-controller-95684c946-6fwh5 to gke-pt-dev-default-090510cc-brz7
42m         Normal    Pulling                      pod/source-controller-95684c946-6fwh5                   Pulling image "fluxcd/source-controller:v0.22.5"
42m         Normal    Pulled                       pod/source-controller-95684c946-6fwh5                   Successfully pulled image "fluxcd/source-controller:v0.22.5" in 4.206627015s
42m         Normal    Created                      pod/source-controller-95684c946-6fwh5                   Created container manager
42m         Normal    Started                      pod/source-controller-95684c946-6fwh5                   Started container manager
42m         Warning   Unhealthy                    pod/source-controller-95684c946-6fwh5                   Readiness probe failed: Get "http://10.6.0.76:9090/": dial tcp 10.6.0.76:9090: connect: connection refused
42m         Normal    SuccessfulCreate             replicaset/source-controller-95684c946                  Created pod: source-controller-95684c946-6fwh5
42m         Normal    LeaderElection               configmap/source-controller-leader-election             source-controller-95684c946-6fwh5_45489842-c150-48bd-bb8e-8c82ee5ca62d became leader
42m         Normal    LeaderElection               lease/source-controller-leader-election                 source-controller-95684c946-6fwh5_45489842-c150-48bd-bb8e-8c82ee5ca62d became leader
42m         Normal    ScalingReplicaSet            deployment/source-controller                            Scaled up replica set source-controller-95684c946 to 1
42m         Normal    ScalingReplicaSet            deployment/source-controller                            Scaled down replica set source-controller-7644467db to 0

I've upgraded and downgraded the service during this time period; we can't leave the service like this because it blocks our development :)

@pjbgf
Member

pjbgf commented Mar 31, 2022

@saltyblu Thanks again for the quick reply.

We are investigating this and still trying to reproduce. Can you please provide more details about your cluster? What version of Kubernetes are you running on?

It seems that the controller is working properly, but the probes are misbehaving for some reason in your setup. Do you mind disabling both livenessProbe and readinessProbe temporarily in your deployment to confirm the theory?
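
A minimal patch for that could look like the sketch below, in the same style as the trace-level patch above (assuming the probes sit on the first container of the stock deployment):

patches:
- patch: |
    - op: remove
      path: /spec/template/spec/containers/0/livenessProbe
    - op: remove
      path: /spec/template/spec/containers/0/readinessProbe
  target:
    kind: Deployment
    name: "image-automation-controller"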

@saltyblu
Author

Kubernetes version: v1.22.7-gke.300, with 6 nodes.

62 Flux Kustomizations
31 image updates/repositories
31 Git sources
1 flux-system Git repository that takes care of all image update automations.

@pjbgf
Member

pjbgf commented Mar 31, 2022

Do you have any additional NetworkPolicy applied to that namespace that could potentially be blocking the access to that endpoint?

@saltyblu
Author

We have three in place, but no new ones:

NAMESPACE     NAME             POD-SELECTOR                  AGE
flux-system   allow-egress     <none>                        181d
flux-system   allow-scraping   <none>                        181d
flux-system   allow-webhooks   app=notification-controller   181d

@saltyblu
Author

I've also disabled the liveness and readiness probes; directly after triggering an image update automation, the pod dies and gets recreated :(

@saltyblu
Author

saltyblu commented Apr 5, 2022

After some debugging, the container seems to silently fail somewhere below this line of code...

var signingEntity *openpgp.Entity

Even when debugging the controller with Delve, we are not able to pinpoint a specific line of code where the controller stops working.

@brettstewart

We are having a similar issue. Any time there is a NEW change that needs to be committed, the automation controller crashes and ends up in a crash loop until we suspend the automation.

This is the last line we get in our logs before the crash: https://github.com/fluxcd/image-automation-controller/blob/main/controllers/imageupdateautomation_controller.go#L340

We DO NOT make it to this line: https://github.com/fluxcd/image-automation-controller/blob/main/controllers/imageupdateautomation_controller.go#L371

We think the failure is occurring around here, since the crash only happens when there is a new change: https://github.com/fluxcd/image-automation-controller/blob/main/controllers/imageupdateautomation_controller.go#L378

Images

helm-controller               ghcr.io/fluxcd/helm-controller:v0.18.2
image-automation-controller   ghcr.io/fluxcd/image-automation-controller:v0.21.3
image-reflector-controller    ghcr.io/fluxcd/image-reflector-controller:v0.17.1
kustomize-controller          ghcr.io/fluxcd/kustomize-controller:v0.22.3
notification-controller       ghcr.io/fluxcd/notification-controller:v0.23.2
source-controller             ghcr.io/fluxcd/source-controller:v0.22.5

Env

- name: EXPERIMENTAL_GIT_TRANSPORT
  value: "true"

Scale
Kustomizations: 80
Helm Releases: 166

Additional Information based on previous comments

  • No NetworkPolicy would be blocking this.

@pjbgf pjbgf self-assigned this Apr 8, 2022
@pjbgf pjbgf added the bug Something isn't working label Apr 8, 2022
@pjbgf
Member

pjbgf commented Apr 8, 2022

We have recently merged some fixes in source-controller, but they haven't made their way into image-automation-controller yet. I managed to reproduce some of the issues you mentioned, so I am assuming this may also resolve your situation.

The version below is working quite well with most Git providers when using SSH, with the exception of Bitbucket and Azure DevOps. On that point, would you be able to confirm which Git providers you are using, @saltyblu @brettstewart?

If you could test ahead of the release and confirm whether it fixes or improves your issue, that would be great.
In the meantime, I created a draft PR, so you can either:

  1. Wait until the changes are released (at some point next week).
  2. Test changes using my image (see the sketch below): quay.io/paulinhu/image-automation-controller:selfheal@sha256:0c78d0ff5aec404c724b623e9d4ae00d5c0348320c397626900d1e073e1d7ac5
  3. Create your own image using the PR and running: IMG=repo/and/image-name TAG=your-tag BUILD_PLATFORMS="linux/amd64" BUILD_ARGS="--push" make docker-build
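
For option 2, one way to point the controller at the test image is an images override in the flux-system kustomization.yaml (a sketch; it assumes the manifests reference the default fluxcd/image-automation-controller image name):

images:
- name: fluxcd/image-automation-controller
  newName: quay.io/paulinhu/image-automation-controller
  digest: sha256:0c78d0ff5aec404c724b623e9d4ae00d5c0348320c397626900d1e073e1d7ac5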

@brettstewart

@pjbgf GitHub. I reviewed the draft PR; we will get an image spun up and report back.

Thanks for the quick update!

@saltyblu
Author

Hi,
we'll also use the draft image to check if the issue still exists.

Currently we are using only GitHub Enterprise as our Git provider.

Thanks for the updates

@saltyblu saltyblu changed the title from "image-automation-controller CrashLoopBack 0.20.2" to "image-automation-controller CrashLoopBack >= 0.20.2" on Apr 11, 2022
@saltyblu
Author

We have used the updated image for half a day now, and the issue does not seem to be solved. The controller is able to update one or two ImageUpdateAutomations; after a random time period it fails silently again (i.e. without any error logs).

@thedarkwriter

thedarkwriter commented Apr 11, 2022

I can concur; I'm getting the exact same behavior.
The only strange thing to add: if I change the logging from info to debug, the very next run updates the image to the newest tag, then the crash loop starts again.
Nothing in the logs indicates the cause of the failure.

I also tried the @pjbgf container quay.io/paulinhu/image-automation-controller:selfheal@sha256:0c78d0ff5aec404c724b623e9d4ae00d5c0348320c397626900d1e073e1d7ac5

Same behavior. This was the result I saw:

  1. New pod ran fine for 5 minutes, did the one update.
  2. I pushed an image update; it crashed almost immediately.
  3. Now it is in CrashLoopBackOff:
    image-automation-controller-84f4ffbd58-68qrd 0/1 CrashLoopBackOff 1 6m19s
  4. It's now running again... let's see if it deploys the new tag.
  5. Nope, crashed.
  6. It logged this, then crashed again:
{"level":"info","ts":"2022-04-11T15:51:22.660Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2022-04-11T15:51:22.660Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2022-04-11T15:51:22.661Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
{"level":"info","ts":"2022-04-11T15:51:22.661Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
I0411 15:51:22.761525       1 leaderelection.go:248] attempting to acquire leader lease flux-system/image-automation-controller-leader-election...
I0411 15:52:05.948297       1 leaderelection.go:258] successfully acquired lease flux-system/image-automation-controller-leader-election
{"level":"debug","ts":"2022-04-11T15:52:05.948Z","logger":"events","msg":"Normal","object":{"kind":"ConfigMap","namespace":"flux-system","name":"image-automation-controller-leader-election","uid":"b9d0bcfc-7f0e-4121-850f-615af567a1a8","apiVersion":"v1","resourceVersion":"11041528"},"reason":"LeaderElection","message":"image-automation-controller-84f4ffbd58-68qrd_bb77fce1-771f-45c3-b1bf-4724c0f55d2e became leader"}
{"level":"debug","ts":"2022-04-11T15:52:05.948Z","logger":"events","msg":"Normal","object":{"kind":"Lease","namespace":"flux-system","name":"image-automation-controller-leader-election","uid":"77a4a51c-22d3-49f4-ba90-0763a1e7d7f1","apiVersion":"coordination.k8s.io/v1","resourceVersion":"11041529"},"reason":"LeaderElection","message":"image-automation-controller-84f4ffbd58-68qrd_bb77fce1-771f-45c3-b1bf-4724c0f55d2e became leader"}
{"level":"info","ts":"2022-04-11T15:52:05.948Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImageUpdateAutomation"}
{"level":"info","ts":"2022-04-11T15:52:05.948Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta2.GitRepository"}
{"level":"info","ts":"2022-04-11T15:52:05.948Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImagePolicy"}
{"level":"info","ts":"2022-04-11T15:52:05.948Z","logger":"controller.imageupdateautomation","msg":"Starting Controller","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation"}
{"level":"info","ts":"2022-04-11T15:52:06.049Z","logger":"controller.imageupdateautomation","msg":"Starting workers","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","worker count":4}
{"level":"debug","ts":"2022-04-11T15:52:06.050Z","logger":"controller.imageupdateautomation","msg":"fetching git repository","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"redis-temp-image-update-automation","namespace":"flux-system","gitrepository":{"namespace":"flux-system","name":"flux-system"}}
{"level":"debug","ts":"2022-04-11T15:52:06.050Z","logger":"controller.imageupdateautomation","msg":"attempting to clone git repository","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"redis-temp-image-update-automation","namespace":"flux-system","gitrepository":{"namespace":"flux-system","name":"flux-system"},"ref":{"branch":"main"},"working":"/tmp/flux-system-flux-system2014439465"}
{"level":"debug","ts":"2022-04-11T15:52:06.402Z","logger":"controller.imageupdateautomation","msg":"updating with setters according to image policies","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"redis-temp-image-update-automation","namespace":"flux-system","count":1,"manifests-path":"/tmp/flux-system-flux-system2014439465/clusters/apps/dev/redis"}
{"level":"debug","ts":"2022-04-11T15:52:06.404Z","logger":"controller.imageupdateautomation","msg":"ran updates to working dir","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"redis-temp-image-update-automation","namespace":"flux-system","working":"/tmp/flux-system-flux-system2014439465"}

It's now cycling through this last behavior, which is what I was seeing before changing image-automation-controller images.

@annaken

annaken commented Apr 13, 2022

We were seeing exactly this behaviour in one of our clusters, and it seemed like the pod wasn't able to burst high enough, so we increased the limits on the image-automation-controller to

limits:
  cpu: 1.5
  memory: 2.5Gi

and also increased the Git source interval from 2m to 5m. We haven't had any crashes since then.

NB this is only happening in the cluster that has multiple Git sources, including "include" sources. The cluster that pulls everything from a single Git source is fine.
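
As a kustomize patch against the flux-system deployment, that limits change might look roughly like this (a sketch; the container index assumes the stock layout):

patches:
- patch: |
    - op: replace
      path: /spec/template/spec/containers/0/resources/limits
      value:
        cpu: 1.5
        memory: 2.5Gi
  target:
    kind: Deployment
    name: "image-automation-controller"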

@thedarkwriter

Just to offer a second test: I'm using a single-repo cluster and upped my resources like @annaken. Still a lot of crashes:

image-automation-controller-665dfcf789-xv6kx   1/1     Running     18         4h24m

But the images are eventually updated, so the image-automation-controller is still working, just not stably.

@saltyblu
Author

We tested the image-automation-controller in a freshly created GKE cluster with one additional repo besides the flux-system one, and it is still failing.

In our case the controller often misses updates; we downgraded the controller back to 0.20.1 for now. :(

@leehanel

I added some tracing and built an image using the above method. It seems to be silently crashing in:

if err := index.AddByPath(rel); err != nil {

Specifically, I added these traces around it:

tracelog.Info("Fixing to try index.AddByPath(rel)", "rel", rel)
if err := index.AddByPath(rel); err != nil {
    tracelog.Info("adding file", "file", rel)
    return err
}
tracelog.Info("Got error adding file", "file", rel)
tracelog.Info("Got error adding file", "error", err) 

I set the log level to trace, and the last log I got from it was as follows:

{"level":"Level(-2)","ts":"2022-04-14T15:49:36.428Z","logger":"controller.imageupdateautomation","msg":"Fixing to try index.AddByPath(rel)","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"demo","namespace":"demo","rel":".editorconfig"}

We never get to the "adding file" tracelog line inside the if, nor to the additional traces I added after it.

To note: .editorconfig is the first file in an ls of the Git repo.

We have reverted to 0.20.1 as well, and that seems stable so far.

@saltyblu
Author

I've already tested the new image-automation-controller, and it did not solve the problem. The controller still crashes.

@pjbgf
Member

pjbgf commented May 6, 2022

@saltyblu I am working on a few changes; some are already merged into source-controller (panic recovery), others are in an open PR.

Would you mind testing again with the following images and seeing whether you still experience the crashes?

source-controller: quay.io/paulinhu/source-controller:v0.24.4-cacheless@sha256:61930cad1da900f209b396f20c2f7740ff32b5cf1bb4ab7892200790c00a5f4b

image-automation-controller: quay.io/paulinhu/image-automation-controller:v0.22.2-cacheless@sha256:87823667cfc4c6e395d996ceaee92a1b5059a8950884f2c3aa49488dcbed81f5

More details on:
fluxcd/source-controller#713 (comment)

UPDATE: changed IAC image to v0.22.2-cacheless

@saltyblu
Author

saltyblu commented May 10, 2022

Both images are source-controller images.

On startup, the "image-automation-controller" image appears to be a source-controller build; could you check the uploaded images?

@robertscherbarth

Is there any news on this bug to share?

@pjbgf
Member

pjbgf commented Jul 12, 2022

@robertscherbarth I am working on an RC version with a potential fix, aiming to release it today.

pjbgf pushed a commit to fluxcd/golang-with-libgit2 that referenced this issue Jul 12, 2022
Some users are experiencing some segfault issues in Flux controllers
which seem to be related to internal git2go state processed within a
background thread.
Due to the shift into Managed Transport only, the concerns
around multi-threading from a libgit2 seem to be less problematic
than otherwise.

fluxcd/image-automation-controller#339
https://github.com/libgit2/libgit2/blob/main/docs/threading.md

Signed-off-by: Paulo Gomes <paulo.gomes@weave.works>

pjbgf pushed commits with the same message to fluxcd/source-controller, pjbgf/source-controller, and twice to this repository on Jul 12, 2022.
@pjbgf
Member

pjbgf commented Jul 12, 2022

Here's the release candidate image: ghcr.io/fluxcd/image-automation-controller:rc-41c0ddc4.
It would be great if folks experiencing this issue could test it and let us know whether or not it fixes it.

@pjbgf pjbgf added this to the GA milestone Jul 12, 2022
@pjbgf
Member

pjbgf commented Jul 12, 2022

For everybody else's benefit, two users experiencing this issue have already confirmed the RC version worked. So we will start working towards getting this merged. Based on the upstream dependencies this may take some time to make its way into the official releases.

Please report any issues or confirm (✔️) that this is working for you.

@BertelBB

✔️ ghcr.io/fluxcd/image-automation-controller:rc-41c0ddc4 works 👍

@sohahm

sohahm commented Jul 13, 2022

@pjbgf Thank you very much, Paulo. The RC version rc-41c0ddc4 is working.

@saltyblu
Author

Looks good. The controller is still crashing sometimes, but that's maybe another topic.

goroutine 423 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc00048da68, 0x5a5)
        runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0xc001633a20?)
        sync/cond.go:56 +0x8c
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0xc00048da40, 0xc0009856d0)
        k8s.io/client-go@v0.24.1/tools/cache/delta_fifo.go:527 +0x22e
k8s.io/client-go/tools/cache.(*controller).processLoop(0xc000ad23f0)
        k8s.io/client-go@v0.24.1/tools/cache/controller.go:184 +0x36
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x40e8c5?)
        k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xf94ce5?, {0x20724a0, 0xc000830780}, 0x1, 0xc000b6a0c0)
        k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000ad2458?, 0x3b9aca00, 0x0, 0x80?, 0x7fa87d0e4040?)
        k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
        k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*controller).Run(0xc000ad23f0, 0xc000b6a0c0)
        k8s.io/client-go@v0.24.1/tools/cache/controller.go:155 +0x2c5
k8s.io/client-go/tools/cache.(*sharedIndexInformer).Run(0xc0009f2aa0, 0x0?)
        k8s.io/client-go@v0.24.1/tools/cache/shared_informer.go:443 +0x47c
created by sigs.k8s.io/controller-runtime/pkg/cache/internal.(*specificInformersMap).Start.func1
        sigs.k8s.io/controller-runtime@v0.11.2/pkg/cache/internal/informers_map.go:152 +0x10c

@pjbgf
Member

pjbgf commented Jul 18, 2022

@saltyblu thanks for sharing this. The log above looks like the end of a goroutine dump. Do you have access to the entire logs, which may provide further insight into the error?

@saltyblu
Author

image-automation-controller.log

pjbgf pushed a commit to pjbgf/source-controller that referenced this issue Jul 31, 2022
Use of MUSL was a temporary solution to mitigate cross-platform
issues while building openssl and libssh2. Since Unmanaged transport has
been deprecated, openssl and libssh2 dependencies are no longer required
and by extension MUSL.

Enables libgit2 threadless support and provides a regression assurance
for fluxcd/image-automation-controller#339.

Signed-off-by: Paulo Gomes <paulo.gomes@weave.works>

pjbgf pushed further commits with the same message to pjbgf/source-controller and (several times) to pjbgf/image-automation-controller on Jul 31, 2022.
@pjbgf
Member

pjbgf commented Aug 10, 2022

@saltyblu this has now been fixed as of yesterday's release: ghcr.io/fluxcd/image-automation-controller:v0.24.0.

The second issue you mentioned will be tracked separately as part of #415.

Thank you for reporting and testing the changes. 🙇

souleb pushed a commit with the same message to souleb/image-automation-controller that referenced this issue Mar 12, 2024.