Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

image-automation-controller not reconnecting after operation timed out #209

Closed
1 task done
Tracked by #2593
minh-nguyenquang opened this issue Aug 3, 2021 · 36 comments
Closed
1 task done
Tracked by #2593
Assignees
Labels
bug Something isn't working
Milestone

Comments

@minh-nguyenquang
Copy link

Describe the bug

image-automation-controller doesn't reconnect to github after operation timed out. I have to delete the pod to restart.
Below is the log from image-automation-controller.

{"level":"error","ts":"2021-07-22T14:57:58.859Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"Reconciler error","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","error":"unable to clone 'https://github.com/CoverGo/k8s-fleet', error: failed to connect to github.com: Operation timed out"}
{"level":"error","ts":"2021-07-22T15:03:04.528Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"Reconciler error","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","error":"unable to clone 'https://github.com/CoverGo/k8s-fleet', error: failed to connect to github.com: Operation timed out"}

Steps to reproduce

I don't know how to reproduce because operation timed out can happen anytime

Expected behavior

image-automation-controller can reconnect automatically.

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

0.16.1

Flux check

► checking prerequisites
✔ kubectl 1.21.0 >=1.18.0-0
✔ Kubernetes 1.18.8-aliyun.1 >=1.16.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.11.1
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.14.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.11.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.13.2
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.15.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.15.3
✔ all checks passed

Git provider

github

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@stefanprodan stefanprodan transferred this issue from fluxcd/flux2 Aug 3, 2021
@stefanprodan
Copy link
Member

@minh-nguyenquang if this happens again can you please take a profile snapshot as described here: https://fluxcd.io/docs/gitops-toolkit/debugging/#endpoints I assume there is some socket leak and the container hits the file descriptors limit. Having a Go profile would helps us pin point to the library that has this leak. Thanks

@minh-nguyenquang
Copy link
Author

@stefanprodan Here is the attached file for profiles. Thanks
endpoints.zip

@stefanprodan
Copy link
Member

@minh-nguyenquang did you took the profile when the controller was stuck?

@minh-nguyenquang
Copy link
Author

@stefanprodan I took profiles after 10-15m when controller was stuck

@minh-nguyenquang
Copy link
Author

@stefanprodan this time, I got issues on source-controller, I attach profiles here
endpoint_source_controller.zip

@minh-nguyenquang
Copy link
Author

Here is the log

{"level":"error","ts":"2021-08-05T20:09:27.157Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/hiddlingrepo/k8s-fleet.git', error: dial tcp 13.229.188.59:22: connect: connection refused"}
{"level":"error","ts":"2021-08-06T00:16:57.202Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"client-releases","namespace":"production","error":"unable to clone 'ssh://git@github.com/hiddlingrepo/k8s-client-releases.git', error: ssh: handshake failed: read tcp 10.0.88.30:35114->13.229.188.59:22: read: connection timed out"}
{"level":"error","ts":"2021-08-06T00:17:04.627Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"client-releases","namespace":"staging","error":"unable to clone 'ssh://git@github.com/hiddlingrepo/k8s-client-releases.git', error: ssh: handshake failed: read tcp 10.0.88.30:52664->13.229.188.59:22: read: connection timed out"}
{"level":"error","ts":"2021-08-06T00:27:40.894Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"client-monitoring-prometheus","namespace":"client-monitoring","error":"Get \"https://github.com/prometheus-community/helm-charts/releases/download/kube-prometheus-stack-15.4.2/kube-prometheus-stack-15.4.2.tgz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2021-08-06T01:25:37.525Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"client-releases","namespace":"staging","error":"unable to clone 'ssh://git@github.com/hiddlingrepo/k8s-client-releases.git', error: ssh: handshake failed: read tcp 10.0.88.30:59908->13.229.188.59:22: read: connection reset by peer"}
{"level":"error","ts":"2021-08-06T01:26:07.219Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/hiddlingrepo/k8s-fleet.git', error: ssh: handshake failed: read tcp 10.0.88.30:59896->13.229.188.59:22: read: connection reset by peer"}
{"level":"error","ts":"2021-08-06T13:38:48.053Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/hiddlingrepo/k8s-fleet.git', error: ssh: handshake failed: read tcp 10.0.88.30:34542->13.229.188.59:22: read: connection reset by peer"}
{"level":"error","ts":"2021-08-06T13:38:52.916Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"client-releases","namespace":"staging","error":"unable to clone 'ssh://git@github.com/hiddlingrepo/k8s-client-releases.git', error: ssh: handshake failed: read tcp 10.0.88.30:39566->13.229.188.59:22: read: connection reset by peer"}

@stefanprodan
Copy link
Member

@minh-nguyenquang can you please post here the GitRepository object?

@minh-nguyenquang
Copy link
Author

Here is the GitRepository object

---
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: client-releases
  namespace: staging
spec:
  interval: 1m0s
  ref:
    branch: master
  secretRef:
    name: client-releases
  url: ssh://git@github.com/CoverGo/k8s-client-releases.git 

@demisx
Copy link

demisx commented Dec 27, 2021

We ran into a similar issue today. After running Flux CD v2 for 24 days, the image-automation-controller could no longer connect to the GitHub repo and update the image tags. Killing both, image-automation-controller and source-controller pods, fixed the issue (most likely temporarily).

My heap files are attached (added ".zip" extension, so I can add them to GitHub)

$ flux version
---------------
flux: v0.24.0
helm-controller: v0.14.0
image-automation-controller: v0.18.0
image-reflector-controller: v0.14.0
kustomize-controller: v0.18.1
notification-controller: v0.19.0
source-controller: v0.19.0

@squaremo squaremo self-assigned this Jan 5, 2022
@demisx
Copy link

demisx commented Jan 6, 2022

Same issue with v0.24.1. Fails once a week in all environments. Restarting image-automation-controller and source-controller pods temporarily fixes the problem.

$ flux version
---
flux: v0.24.1
helm-controller: v0.14.1
image-automation-controller: v0.18.0
image-reflector-controller: v0.14.0
kustomize-controller: v0.18.2
notification-controller: v0.19.0
source-controller: v0.19.2

@stefanprodan
Copy link
Member

@demisx why would you restart source-controller? Image automation does not uses source-controller to push changes to Git. Does source-controller stops cloning? Can you please do #282 (comment)

@stefanprodan
Copy link
Member

@demisx can you post your GitRepository and ImageUpdateAutomation manifests here please.

@demisx
Copy link

demisx commented Jan 7, 2022

@demisx why would you restart source-controller? Image automation does not uses source-controller to push changes to Git. Does source-controller stops cloning?

You know, I don’t recall why exactly. Maybe I read it in some comment or restarting image-automation-controller alone didn’t work for me and I thought source-controller needed to be restarted as well. I can try restarting only image-source-controller next time.

Can you please do #282 (comment)

Should I do this now or right after it stops working?

@demisx
Copy link

demisx commented Jan 7, 2022

@demisx can you post your GitRepository and ImageUpdateAutomation manifests here please.

Here they are:

$ kubectl get GitRepository -n flux-system -oyaml
---
apiVersion: v1
items:
- apiVersion: source.toolkit.fluxcd.io/v1beta1
  kind: GitRepository
  metadata:
    annotations:
      reconcile.fluxcd.io/requestedAt: "2021-12-28T11:06:55.648944-08:00"
    creationTimestamp: "2021-12-03T16:00:21Z"
    finalizers:
    - finalizers.fluxcd.io
    generation: 1
    labels:
      kustomize.toolkit.fluxcd.io/name: flux-system
      kustomize.toolkit.fluxcd.io/namespace: flux-system
    name: flux-system
    namespace: flux-system
    resourceVersion: "20344562"
    uid: c9b44abf-e2c8-476a-81d9-98fc7147acbb
  spec:
    gitImplementation: go-git
    interval: 1m0s
    ref:
      branch: main
    secretRef:
      name: flux-system
    timeout: 20s
    url: ssh://git@github.com/ChoiHoldings/infra
  status:
    artifact:
      checksum: fed591398523fc39be00246e8ef4ec702e93e053837a3e33f6f3d4e4de0f2e37
      lastUpdateTime: "2022-01-07T04:40:44Z"
      path: gitrepository/flux-system/flux-system/0ab92a793f666e8043bbc90e44f3384304c8eee8.tar.gz
      revision: main/0ab92a793f666e8043bbc90e44f3384304c8eee8
      url: http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/flux-system/0ab92a793f666e8043bbc90e44f3384304c8eee8.tar.gz
    conditions:
    - lastTransitionTime: "2022-01-06T18:03:59Z"
      message: 'Fetched revision: main/0ab92a793f666e8043bbc90e44f3384304c8eee8'
      reason: GitOperationSucceed
      status: "True"
      type: Ready
    lastHandledReconcileAt: "2021-12-28T11:06:55.648944-08:00"
    observedGeneration: 1
    url: http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/flux-system/latest.tar.gz
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
$ kubectl get ImageUpdateAutomation -n flux-system -oyaml
---
apiVersion: v1
items:
- apiVersion: image.toolkit.fluxcd.io/v1beta1
  kind: ImageUpdateAutomation
  metadata:
    creationTimestamp: "2021-12-03T16:00:35Z"
    generation: 1
    labels:
      kustomize.toolkit.fluxcd.io/name: flux-system
      kustomize.toolkit.fluxcd.io/namespace: flux-system
    name: image-update-automation
    namespace: flux-system
    resourceVersion: "20348437"
    uid: 392ea6d4-063d-43aa-8f3a-ba6aef4af4ec
  spec:
    git:
      checkout:
        ref:
          branch: main
      commit:
        author:
          email: fluxcdbot@users.noreply.github.com
          name: fluxcdbot
        messageTemplate: '{{range .Updated.Images}}{{println .}}{{end}}'
      push:
        branch: main
    interval: 1m
    sourceRef:
      kind: GitRepository
      name: flux-system
    update:
      path: ./k8s/stg
      strategy: Setters
  status:
    conditions:
    - lastTransitionTime: "2022-01-07T03:36:20Z"
      message: no updates made; last commit 0ab92a7 at 2022-01-07T04:39:39Z
      reason: ReconciliationSucceeded
      status: "True"
      type: Ready
    lastAutomationRunTime: "2022-01-07T04:49:39Z"
    lastPushCommit: 0ab92a793f666e8043bbc90e44f3384304c8eee8
    lastPushTime: "2022-01-07T04:39:39Z"
    observedGeneration: 1
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

@demisx
Copy link

demisx commented Jan 7, 2022

Can you please do #282 (comment)

@stefanprodan Here is what I see now after the pod has been restarted (11 hours ago):

$ kubectl exec -it -n flux-system deploy/image-automation-controller -- sh
$ ls -lah /tmp
---
drwxrwsrwx 2 root 1337  6 Jan  7 05:02 .
drwxr-xr-x 1 root root 17 Jan  6 18:03 ..

$ du -sh /tmp/*
du: cannot access '/tmp/*': No such file or directory

@hiddeco
Copy link
Member

hiddeco commented Jan 11, 2022

@demisx thanks for providing us with all the details thus far. Would you be able to provide a trace profile as well next time you notice the freeze (and before killing the container)? This can be done by running the following:

$ kubectl port-forward -n <namespace> deploy/<component> 8080
$ curl -Sk -v http://localhost:8080/debug/pprof/trace?seconds=10 > trace.out

@demisx
Copy link

demisx commented Jan 11, 2022

@hiddeco My pleasure. I will do that as soon as I notice the freeze again. Just to clarify, is the <namespace> = flux-system and deploy/<component> = deploy/image-automation-controller?

@hiddeco
Copy link
Member

hiddeco commented Jan 11, 2022

That's correct!

@uqix
Copy link

uqix commented Jan 20, 2022

Our heap and trace of image-automation-controller:
Archive.zip

See also: #296 (comment)

@uqix
Copy link

uqix commented Jan 20, 2022

@minh-nguyenquang if this happens again can you please take a profile snapshot as described here: https://fluxcd.io/docs/gitops-toolkit/debugging/#endpoints I assume there is some socket leak and the container hits the file descriptors limit. Having a Go profile would helps us pin point to the library that has this leak. Thanks

FYI, we experienced privoxy Too many open connections error after using fluxcd source-controller with https_proxy env var.

@demisx
Copy link

demisx commented Mar 18, 2022

Happened to us in production again. Had to bounce image-automation-controller in oder to for the new images to be pulled and deployed. I am attaching the heap dumps. We don't have this issue with Flux v1 running in older clusters. Seems solid there.

heap.image-automation-controller.out.zip
heap.source-controller.out.zip

$ flux version
---------------
flux: v0.24.1
helm-controller: v0.14.1
image-automation-controller: v0.18.0
image-reflector-controller: v0.14.0
kustomize-controller: v0.18.2
notification-controller: v0.19.0
source-controller: v0.19.2

@hiddeco
Copy link
Member

hiddeco commented Mar 18, 2022

It is worth keeping an eye out for #326, which if all goes according to plan, will be out next week.

@pjbgf
Copy link
Member

pjbgf commented Mar 22, 2022

The image-automation controller version v0.21.0 introduces an experimental transport that fixes the issue in which the controller stops working in some specific scenarios.

The experimental transport needs to be opted-in by setting the environment variable EXPERIMENTAL_GIT_TRANSPORT to true in the controller's Deployment.

This will require a redeploy of all components so I would recommend doing so via flux bootstrap using the flux cli version v0.28.0 which will be released tomorrow.

Can you test it again with the experimental transport enabled and let us know how you get on please?

@pjbgf pjbgf added the bug Something isn't working label Mar 22, 2022
@demisx
Copy link

demisx commented Mar 22, 2022

We'll give it a try once v0.28 is posted: https://github.com/fluxcd/flux2. Thank you guys for such an awesome solution!

@pjbgf
Copy link
Member

pjbgf commented Mar 29, 2022

@demisx we have now released the official version, and followed it up with a few patches. Please give it a try against version v0.28.4 or newer.

@demisx
Copy link

demisx commented Mar 29, 2022

@pjbgf Thank you. I just upgraded to the latest in our environments. I'll let you guys know if I run into any issues. This is what we are currently running after the upgrade:

flux version
flux: v0.28.4
helm-controller: v0.18.2
image-automation-controller: v0.21.2
image-reflector-controller: v0.17.1
kustomize-controller: v0.22.2
notification-controller: v0.23.1
source-controller: v0.22.4

@demisx
Copy link

demisx commented Mar 30, 2022

I've noticed this error popping up in the upgraded image-automation-controller log, even though the new images seem to get pulled and deployed in the cluster:

{
  "level":"error",
  "ts":"2022-03-30T00:40:29.636Z",
  "logger":"controller.imageupdateautomation",
  "msg":"Reconciler error",
  "reconciler group":"image.toolkit.fluxcd.io",
  "reconciler kind":"ImageUpdateAutomation",
  "name":"image-update-automation",
  "namespace":"flux-system",
  "error":"unable to clone 'ssh://git@github.com/<org>/<repo-name>': SSH could not read data: Error waiting on socket"
}

@pjbgf
Copy link
Member

pjbgf commented Mar 30, 2022

@demisx thank you for reporting back. Do you mind confirming whether controller's deployment had the environment variable EXPERIMENTAL_GIT_TRANSPORT set to true?

@demisx
Copy link

demisx commented Mar 31, 2022

@pjbgf Oh, no. I missed that part. My understanding I should set the EXPERIMENTAL_GIT_TRANSPORT env var per these instructions, right? Let me do that and get back to you.

@demisx
Copy link

demisx commented Mar 31, 2022

Just to make sure I enable it the right way. I am planning to edit flux-system/kustomization.yaml file in the same dir as gotk-components.yaml and gotk-sync.yaml files. The contents of the file will be:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- gotk-components.yaml
- gotk-sync.yaml
patches:
- patch: |
    - op: add
      path: /spec/template/spec/containers/0/env/0
      value:
        name: EXPERIMENTAL_GIT_TRANSPORT
        value: "true"
  target:
    kind: Deployment
    name: "(source-controller|image-automation-controller)"

Am I missing anything? Do I need to do anything else besides this?

@pjbgf
Copy link
Member

pjbgf commented Mar 31, 2022

@demisx that should be all, you can confirm that your kustomize is being applied correctly by leveraging the command below:

kustomize build path/to/your/flux-system | grep -B11 -A 2 'EXPERIMENTAL_GIT_TRANSPORT'

Once that is committed and pushed your into repository, also ensure that the correct version of your GitRepository source was reconciled into the cluster - just to be certain that the latest was applied. For that check whether the git hash of your repository matches the revision returned by:

flux get source git flux-system

And finally, we have released a new patch yesterday for both source-controller and image-automation-controller, so I would also recommend trying using the very latest version if possible.

@demisx
Copy link

demisx commented Apr 2, 2022

Done. Thank you very much for the detail instructions. I see most controller pods have been restarted. I will let you know if I run into any issues. This is what I have right now:

$ flux version 
---
flux: v0.28.5
helm-controller: v0.18.2
image-automation-controller: v0.21.3
image-reflector-controller: v0.17.1
kustomize-controller: v0.22.3
notification-controller: v0.23.2
source-controller: v0.22.5

@pjbgf pjbgf added this to the GA milestone Apr 2, 2022
@pjbgf pjbgf self-assigned this Apr 2, 2022
@pjbgf
Copy link
Member

pjbgf commented May 3, 2022

@demisx have you experienced the issue again since the upgrade?

@demisx
Copy link

demisx commented May 3, 2022

@pjbgf So far, so good. 🤞🏻 If I notice any issues, I'll make sure to post here right away.

@pjbgf
Copy link
Member

pjbgf commented May 27, 2022

We have a new release candidate that further improve the controller:
ghcr.io/fluxcd/image-automation-controller:rc-48bcca59

Two important changes a) Managed Transport is enabled by default and context timeouts are now enforced.

@pjbgf
Copy link
Member

pjbgf commented Jun 8, 2022

Closing this based on similar reports from users that confirmed this is no longer happening. If that changes we can always reopen the issue.

@pjbgf pjbgf closed this as completed Jun 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Done
Development

No branches or pull requests

7 participants