"unable to clone: SSH could not read data: Error waiting on socket" after updating to v0.19 #293

ilya-git · 2022-01-17T14:19:09Z

I have updated flux from version 0.18 to the latest 0.25.2 which included update of image-automation-controller from 0.15 to 0.19.

After that there was one successful commit to the repository. After that commit everything stopped and killing the pod does not help. It is still the same message all the time. I have turned on debug level but it did not provide anything more useful than this error:

{
  "level": "error",
  "ts": "2022-01-17T14:09:25.041Z",
  "logger": "controller.imageupdateautomation",
  "msg": "Reconciler error",
  "reconciler group": "image.toolkit.fluxcd.io",
  "reconciler kind": "ImageUpdateAutomation",
  "name": "my-gitops",
  "namespace": "flux-system",
  "error": "unable to clone: SSH could not read data: Error waiting on socket",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227"
}

I have tried to downgrade to 0.15 and then a weird thing happened: all the commits went through, but right after that a similar error appeared:

{
  "level": "error",
  "ts": "2022-01-17T14:13:17.047Z",
  "logger": "controller-runtime.manager.controller.imageupdateautomation",
  "msg": "Reconciler error",
  "reconciler group": "image.toolkit.fluxcd.io",
  "reconciler kind": "ImageUpdateAutomation",
  "name": "my-gitops",
  "namespace": "flux-system",
  "error": "unable to clone 'ssh://git@bitbucket.org/myorg/my-gitops', error: SSH could not read data: Error waiting on socket"
}

Source controller does not have this trouble. I am using Bitbucket Cloud. I have seen a similar error posted elsewhere, but that one seems to be resolved by restarting the pod each time while I am not so lucky, so it's probably best to have it as a separate error.

Is there any workaround I can use, automation all but stopped for us because of it.

The text was updated successfully, but these errors were encountered:

hiddeco · 2022-01-17T14:29:35Z

Can you trace the application while it is in a "locked state", as I described here: #209 (comment)?

In case of concerns about sensitive data, you can send it to hidde <at> weave <dot> works.

ilya-git · 2022-01-17T14:36:55Z

I can send it to you, but could you please describe what type of sensitive data is there? I hope there is no dump of k8s secrets or something in there :)
I tried to check myself but it seems to be binary

hiddeco · 2022-01-17T14:50:00Z

The execution trace captures a wide range of execution events such as goroutine creation/blocking/unblocking, syscall enter/exit/block, GC-related events, changes of heap size, processor start/stop, etc. A precise nanosecond-precision timestamp and a stack trace is captured for most events. The generated trace can be interpreted using go tool trace.
— https://pkg.go.dev/runtime/trace#hdr-Tracing_runtime_activities

It tells us what the application is doing at the moment the snapshot is taken. Nothing sensitive directly, and you can check the contents yourself using go tool trace <file>.out. I mostly include this message because people are often hesitant about sharing it in public.

Looking at a 10s snapshot I took myself however, it may be more useful to have a 30 or 60 second version (by changing the GET parameter value in the URL).

ilya-git · 2022-01-17T15:13:52Z

Thanks for details, I have sent you the traces with 60 seconds. Do you by any chance have an idea what can we do in a meantime (any workaround)?

ilya-git · 2022-01-18T10:26:02Z

I have started to see this error from time to time as well (don't know if it is relevant or not to this case, but for the sake of completeness will post it here):

{
  "level": "error",
  "ts": "2022-01-18T09:18:14.258Z",
  "logger": "controller.imageupdateautomation",
  "msg": "Reconciler error",
  "reconciler group": "image.toolkit.fluxcd.io",
  "reconciler kind": "ImageUpdateAutomation",
  "name": "my-gitops",
  "namespace": "flux-system",
  "error": "unable to clone: User"
}

ilya-git · 2022-01-19T09:13:58Z

I have now downgraded to 0.15.0 and it seems to work fine again! I don't know why it did not work before, perhaps bitbucket had issues at just that time.

image: ghcr.io/fluxcd/image-automation-controller:v0.15.0

So it seems very likely that one of the versions after 0.15.0 introduced something that causes this behavior.

ilya-git · 2022-02-03T12:13:54Z

I have updated to the latest flux v0.26.0 and image-automation-controller v0.20.0. The issue seems to fixed, but instead I am getting this error in the logs and as an alert in slack:

unable to clone: User

Everything seems to work though, just generating some spam

hiddeco · 2022-02-03T12:16:59Z

Any signs it has to e.g. reconcile twice before it succeeds?

ilya-git · 2022-02-03T12:23:59Z

How do I detect that? Same reconciliation notification twice in slack e.g.? I don't see anything like that, but just noticed this error does not happen for each of the reconciliations, some go through just fine

stefanprodan · 2022-02-03T12:29:28Z

Try increasing the timeout in your GitRepository, I guess that error hides a timeout, by default it's set to 20s and that may be too low for your setup.

ilya-git · 2022-02-03T12:50:47Z

Thanks, I will try it this way:

spec:
  interval: 1m
  timeout: 1m

Do you know if interval that is equal to timeout can cause any troubles or it looks fine this way?

stefanprodan · 2022-02-03T12:58:10Z

I would set the timeout to 50s, always less than the interval.

hiddeco · 2022-02-03T13:05:04Z

Having a timeout higher than the interval does theoretically not really matter because it is used to schedule the next reconciliation while completing the reconcile of the resource, which means that e.g. the following would just result in a reconciliation every 6 minutes in the worst case scenario:

spec:
  interval: 1m
  timeout: 5m

ilya-git · 2022-02-03T13:09:54Z

Thanks for help! I'll run it for some time with this set up and see what happens

ilya-git · 2022-02-04T11:11:29Z

Looks good so far!

stefanprodan · 2022-02-04T12:29:19Z

I think we need to change the default GitRepository timeout in source-controller, 20s it's way to low for libgit2.

mbrancato · 2022-02-05T03:04:09Z

We've been running Flux v0.24.1 / image-automation-controller v0.18.0 for a little while now and just started to see this as well.

I did notice this is intermittent, but it did put my ImageUpdateAutomation object into a non-ready state and never recovered:

NAME                             	READY	MESSAGE                                                          	LAST RUN                 	SUSPENDED 
imageupdateautomation/mygit		False	unable to clone: SSH could not read data: Error waiting on socket	2022-02-03T13:43:54-05:00	False

After killing the pod, it restarted and went back into a ready state after one initial failure. I had updated the interval and increased the timeout on the GitRepository but that didn't seem to help given the initial failure on startup. And after restarting, it pushed an image update to a git branch which I was expecting.

NAME                             	READY	MESSAGE                                                     	LAST RUN                 	SUSPENDED 
imageupdateautomation/mygit		True 	no updates made; last commit aa63386 at 2022-02-05T03:25:49Z	2022-02-04T22:25:52-05:00	False

ilya-git · 2022-02-05T08:50:29Z

You can try updating to 0.26.0, that helped a lot (at least in my case) + the timeout

gws · 2022-03-02T22:25:34Z

This issue had been resolved for us up through Flux v0.27.2, but after updating to Flux v0.27.3 / image-automation-controller v0.20.1, we're seeing this issue intermittently once again. Could be related to the fix for #316?

mkoertgen mentioned this issue Feb 5, 2022

"SSH could not read data: Error waiting on socket" when using libgit2 fluxcd/source-controller#439

Closed

stefanprodan mentioned this issue Feb 7, 2022

Increase default timeout to 60s fluxcd/source-controller#570

Merged

stefanprodan closed this as completed in fluxcd/source-controller#570 Feb 7, 2022

pjbgf mentioned this issue Mar 4, 2022

Remove direct dependency to go-git #324

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"unable to clone: SSH could not read data: Error waiting on socket" after updating to v0.19 #293

"unable to clone: SSH could not read data: Error waiting on socket" after updating to v0.19 #293

ilya-git commented Jan 17, 2022 •

edited

Loading

hiddeco commented Jan 17, 2022

ilya-git commented Jan 17, 2022 •

edited

Loading

hiddeco commented Jan 17, 2022

ilya-git commented Jan 17, 2022

ilya-git commented Jan 18, 2022 •

edited

Loading

ilya-git commented Jan 19, 2022

ilya-git commented Feb 3, 2022

hiddeco commented Feb 3, 2022

ilya-git commented Feb 3, 2022 •

edited

Loading

stefanprodan commented Feb 3, 2022

ilya-git commented Feb 3, 2022

stefanprodan commented Feb 3, 2022

hiddeco commented Feb 3, 2022 •

edited

Loading

ilya-git commented Feb 3, 2022

ilya-git commented Feb 4, 2022

stefanprodan commented Feb 4, 2022

mbrancato commented Feb 5, 2022 •

edited

Loading

ilya-git commented Feb 5, 2022

gws commented Mar 2, 2022 •

edited

Loading

"unable to clone: SSH could not read data: Error waiting on socket" after updating to v0.19 #293

"unable to clone: SSH could not read data: Error waiting on socket" after updating to v0.19 #293

Comments

ilya-git commented Jan 17, 2022 • edited Loading

hiddeco commented Jan 17, 2022

ilya-git commented Jan 17, 2022 • edited Loading

hiddeco commented Jan 17, 2022

ilya-git commented Jan 17, 2022

ilya-git commented Jan 18, 2022 • edited Loading

ilya-git commented Jan 19, 2022

ilya-git commented Feb 3, 2022

hiddeco commented Feb 3, 2022

ilya-git commented Feb 3, 2022 • edited Loading

stefanprodan commented Feb 3, 2022

ilya-git commented Feb 3, 2022

stefanprodan commented Feb 3, 2022

hiddeco commented Feb 3, 2022 • edited Loading

ilya-git commented Feb 3, 2022

ilya-git commented Feb 4, 2022

stefanprodan commented Feb 4, 2022

mbrancato commented Feb 5, 2022 • edited Loading

ilya-git commented Feb 5, 2022

gws commented Mar 2, 2022 • edited Loading

ilya-git commented Jan 17, 2022 •

edited

Loading

ilya-git commented Jan 17, 2022 •

edited

Loading

ilya-git commented Jan 18, 2022 •

edited

Loading

ilya-git commented Feb 3, 2022 •

edited

Loading

hiddeco commented Feb 3, 2022 •

edited

Loading

mbrancato commented Feb 5, 2022 •

edited

Loading

gws commented Mar 2, 2022 •

edited

Loading