"SSH could not read data: Error waiting on socket" when using libgit2 #439

Closed
Tracked by #2593
squaremo opened this issue Jul 13, 2021 · 108 comments · Fixed by #570
Labels: area/git (Git related issues and pull requests)


@squaremo
Member

squaremo commented Jul 13, 2021

https://cloud-native.slack.com/archives/CLAJ40HV3/p1625133279255100

This is reported in the logs:

{
  "level": "error",
  "ts": "2021-07-01T07:44:10.656Z",
  "logger": "controller-runtime.manager.controller.imageupdateautomation",
  "msg": "Reconciler error",
  "reconciler group": "image.toolkit.fluxcd.io",
  "reconciler kind": "ImageUpdateAutomation",
  "name": "redacted",
  "namespace": "flux-system",
  "error": "unable to clone 'ssh://git@github.com/redacted/redacted.git', error: SSH could not read data: Error waiting on socket"
}

.. though apparently not all the time, as

After adding [update markers], Image Automation controller started to update files for me.

Source controller reportedly manages to clone the repo (all the time?) when set to use libgit2, and changing to an RSA key didn't stop the error messages. EDIT: no, not all the time -- source-controller also fails intermittently, indicating that the problem is in the code in source-controller/pkg that source-controller and image-automation-controller both use.

@rjhenry

rjhenry commented Jul 16, 2021

I've found a solution that worked for me, at least: the GitRepository definition had a URL ending in .git; removing that suffix made it work.
Before: .spec.url: ssh://git@github.com/orgname/repo.git
After: .spec.url: ssh://git@github.com/orgname/repo

Why it worked, I'm not sure - but I happened to stumble across the difference between a working and a non-working cluster.

@morganchristiansson

morganchristiansson commented Jul 29, 2021

I am getting this cloning from localhost git repo with openssh+gitolite.

It's not transient: I get "Error waiting on socket" every time.

The same GitRepository works perfectly from source-controller. In image-automation-controller I get this error:

{
	"level": "error",
	"ts": "2021-07-29T15:54:57.687Z",
	"logger": "controller-runtime.manager.controller.imageupdateautomation",
	"msg": "Reconciler error",
	"reconciler group": "image.toolkit.fluxcd.io",
	"reconciler kind": "ImageUpdateAutomation",
	"name": "flux-system",
	"namespace": "flux-system",
	"error": "unable to clone 'ssh://git@morgan-server.lan/k3s.git', error: SSH could not read data: Error waiting on socket"
}

@rjhenry

rjhenry commented Aug 31, 2021

I've found a solution that worked for me, at least: the GitRepository definition had a URL ending in .git; removing that suffix made it work.
Before: .spec.url: ssh://git@github.com/orgname/repo.git
After: .spec.url: ssh://git@github.com/orgname/repo

Why it worked, I'm not sure - but I happened to stumble across the difference between a working and a non-working cluster.

I think this is a significant red herring; a few instances with the .git-less syntax now still hang. The repository is about 20M and the cluster has a gig pipe to the internet, so it shouldn't be a timeout when trying to clone.

Is it possible to get the image-automation-controller to die in the event of these errors so the scheduler can simply recreate it?

@rjhenry

rjhenry commented Sep 3, 2021

For the record, this occurs with both the libgit2 and go-git gitImplementations on the GitRepository resource; I've switched back to go-git as the source controller seemed much happier with that.

@morganchristiansson

Are there any suggestions on how to debug this further? Any flags to enable debug logging? Or maybe tcpdump?

@ozlotusflare

ozlotusflare commented Sep 7, 2021

Hey everyone!
After upgrading Flux v2 to the latest version I ran into the same issue :(
I checked the GitRepository CRD and the default gitImplementation is go-git. I tried changing it to libgit2, but I still get errors:

{
  "level": "error",
  "ts": "2021-09-07T11:35:38.559Z",
  "logger": "controller-runtime.manager.controller.imageupdateautomation",
  "msg": "Reconciler error",
  "reconciler group": "image.toolkit.fluxcd.io",
  "reconciler kind": "ImageUpdateAutomation",
  "name": "image-update-automation-myrepo",
  "namespace": "test-ns",
  "error": "unable to clone ssh://git@github.com/repo/myrepo, error: SSH could not read data: Error waiting on socket"
}

Any thoughts on how to fix this?
Thanks 🙏

@rjhenry

rjhenry commented Sep 8, 2021

Are there any suggestions on how to debug this further? Any flags to enable debug logging? Or maybe tcpdump?

I was taking a look at this, and it appears that the only way to enable debug logging is to edit the deployment of the image-automation-controller. It's a relatively simple change: in your gotk-components.yaml, make the following change:

@@ -5235,7 +5235,7 @@ spec:
       - args:
         - --events-addr=http://notification-controller/
         - --watch-all-namespaces=true
-        - --log-level=info
+        - --log-level=debug
         - --log-encoding=json
         - --enable-leader-election
         env:

I'm now running this in a couple of clusters, so hopefully will get more information that can be contributed to this issue.

@rjhenry

rjhenry commented Sep 9, 2021

I've noted a failure to clone, within only a few minutes of starting one of the pods:

{"level":"info","ts":"2021-09-08T08:58:41.670Z","logger":"setup","msg":"starting manager"}
<..>
{"level":"debug","ts":"2021-09-08T09:02:02.075Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"ran updates to working dir","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","working":"/tmp/flux-system-flux-system244015386"}
{"level":"debug","ts":"2021-09-08T09:02:02.097Z","logger":"controller-runtime.manager.events","msg":"Normal","object":{"kind":"ImageUpdateAutomation","namespace":"flux-system","name":"flux-system","uid":"c9b53014-a6e7-4e65-87f6-7d254c417baa","apiVersion":"image.toolkit.fluxcd.io/v1beta1","resourceVersion":"52286285"},"reason":"info","message":"no updates made"}
{"level":"debug","ts":"2021-09-08T09:02:02.098Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"no changes made in working directory; no commit","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system"}
{"level":"debug","ts":"2021-09-08T09:03:02.134Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"fetching git repository","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","gitrepository":{"namespace":"flux-system","name":"flux-system"}}
{"level":"debug","ts":"2021-09-08T09:03:02.134Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"using git repository ref from .spec.git.checkout","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","ref":{"branch":"main"}}
{"level":"debug","ts":"2021-09-08T09:03:02.134Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"using push branch from .spec.push.branch","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","branch":"main"}
{"level":"debug","ts":"2021-09-08T09:03:02.134Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"attempting to clone git repository","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","gitrepository":{"namespace":"flux-system","name":"flux-system"},"ref":{"branch":"main"},"working":"/tmp/flux-system-flux-system796571057"}
{"level":"debug","ts":"2021-09-08T09:03:03.695Z","logger":"controller-runtime.manager.events","msg":"Normal","object":{"kind":"ImageUpdateAutomation","namespace":"flux-system","name":"flux-system","uid":"c9b53014-a6e7-4e65-87f6-7d254c417baa","apiVersion":"image.toolkit.fluxcd.io/v1beta1","resourceVersion":"52286696"},"reason":"error","message":"unable to clone 'ssh://git@github.com/<GITHUB_GROUP>/<GITOPS_REPO>', error: SSH could not read data: Error waiting on socket"}
{"level":"error","ts":"2021-09-08T09:03:03.709Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"Reconciler error","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/<GITHUB_GROUP>/<GITOPS_REPO>', error: SSH could not read data: Error waiting on socket","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.5/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.5/pkg/internal/controller/controller.go:214"}

Interestingly, there were only two time windows during which this complaint was made:

$ grep 'unable to clone' flux-iac-broken-debug.log | jq '.ts'
"2021-09-08T09:03:03.695Z"
"2021-09-08T09:03:03.709Z"
"2021-09-08T10:46:06.474Z"
"2021-09-08T10:46:06.491Z"
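
The grep and jq pipeline above lists raw timestamps; grouping them makes the "two time windows" observation explicit. The following is a minimal Python sketch, not part of the controllers. The function name clone_failure_windows, the five-minute gap, and the abbreviated sample lines are all assumptions made for illustration (the samples keep only the ts and error/message fields, and the two 10:46 entries are assumed to have the same shape as the 09:03 ones).

```python
import json
from datetime import datetime, timedelta


def parse_ts(ts: str) -> datetime:
    # Timestamps in the logs above look like 2021-09-08T09:03:03.695Z.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def clone_failure_windows(lines, gap=timedelta(minutes=5)):
    """Group 'unable to clone' entries into windows separated by more than `gap`.

    Returns a list of (first_timestamp, last_timestamp, count) tuples.
    """
    stamps = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip elided or truncated lines such as "<..>"
        if not isinstance(entry, dict):
            continue
        # Events carry the text in "message", reconciler errors in "error".
        text = f"{entry.get('error', '')} {entry.get('message', '')}"
        if "unable to clone" in text and "ts" in entry:
            stamps.append(parse_ts(entry["ts"]))

    stamps.sort()
    windows = []
    for ts in stamps:
        if windows and ts - windows[-1][-1] <= gap:
            windows[-1].append(ts)  # close enough to the previous failure
        else:
            windows.append([ts])  # start a new window
    return [(w[0], w[-1], len(w)) for w in windows]


# Abbreviated, illustrative log lines (assumed shape, see the note above).
SAMPLE_LINES = [
    '{"level":"info","ts":"2021-09-08T08:58:41.670Z","msg":"starting manager"}',
    "<..>",
    '{"level":"debug","ts":"2021-09-08T09:03:03.695Z","message":"unable to clone ..."}',
    '{"level":"error","ts":"2021-09-08T09:03:03.709Z","error":"unable to clone ..."}',
    '{"level":"debug","ts":"2021-09-08T10:46:06.474Z","message":"unable to clone ..."}',
    '{"level":"error","ts":"2021-09-08T10:46:06.491Z","error":"unable to clone ..."}',
]

if __name__ == "__main__":
    for first, last, count in clone_failure_windows(SAMPLE_LINES):
        print(first.isoformat(), last.isoformat(), count)
```

Against a real log file, the same function could be called as clone_failure_windows(open("flux-iac-broken-debug.log")), since it only needs an iterable of lines.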

@squaremo
Member Author

squaremo commented Sep 9, 2021

For the record, this occurs with both the libgit2 and go-git gitImplementations on the GitRepository

image-automation-controller always uses libgit2 for cloning, mainly because the go-git implementation does a shallow clone (and this makes branching difficult to impossible). Did source-controller have any problems when you switched the implementation to libgit2? I gather not -- that's what makes this truly mysterious, because it is running the same code.

@rjhenry

rjhenry commented Sep 9, 2021

I thought there were issues with both the image automation controller and the source controller, but I've dug through our logging server and found records at the time the change was made. It looks like the problems were with the image-automation-controller, which was immediately failing to clone the repository just minutes after the change was merged. It appears to be the same issue as before, just immediate rather than after some delay.

@squaremo
Member Author

squaremo commented Sep 9, 2021

which was immediately failing to clone the repository just minutes after the change was merged.

Which change was merged -- did I miss a vital clue?

@rjhenry

rjhenry commented Sep 10, 2021

which was immediately failing to clone the repository just minutes after the change was merged.

Which change was merged -- did I miss a vital clue?

Sorry, that's my own terrible wording - the change was from go-git to libgit2 for the gitImplementation on my gitRepository objects.

@rjhenry

rjhenry commented Sep 15, 2021

@squaremo If it's any use, I can provide debug-level logs from a controller pod - I'd prefer to do so directly via Slack, though, as I'm not entirely comfortable posting unredacted logs on a public issue.

@squaremo
Member Author

If it's any use, I can provide debug-level logs from a controller pod - I'd prefer to do so directly via Slack, though, as I'm not entirely comfortable posting unredacted logs on a public issue.

Worth a try! I'm Michael Bridgen in CNCF slack.

@squaremo
Member Author

squaremo commented Sep 16, 2021

OK, thanks to lots of digging through logs from @rjhenry, I think we have established that this happens for source-controller as well, when the GitRepository object has gitImplementation: libgit2[1]. It seems to be less frequent in the source-controller, but it's already intermittent with image-automation-controller[2] so difficult to judge.

[1] the same "SSH could not read data: Error waiting on socket" error message appears in the source-controller logs, inside a period when the GitRepository was known (in git history!) to have that setting
[2] the image-automation logs show a failure, then a success, without a config change in the meantime

@squaremo squaremo changed the title "SSH could not read data: Error waiting on socket" "SSH could not read data: Error waiting on socket" when using libgit2 Sep 16, 2021
@squaremo squaremo transferred this issue from fluxcd/image-automation-controller Sep 16, 2021
@morganchristiansson

morganchristiansson commented Sep 22, 2021

It's started working for me.. debug logging was really helpful.

I then ran into the next issue: resources have to be in the same namespace as the GitRepository (fluxcd/image-automation-controller#159).

@squaremo
Member Author

squaremo commented Sep 23, 2021

It's started working for me.. debug logging was really helpful.

What changed so that it started working? How was the debug logging helpful? Don't leave us hanging @morganchristiansson :-P

@morganchristiansson

morganchristiansson commented Sep 23, 2021

I'm still sporadically getting the same error.

{"level":"error","ts":"2021-09-23T09:38:29.217Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@192.168.1.190/k3s.git', error: SSH could not read data: Error waiting on socket"}
{"level":"info","ts":"2021-09-23T09:38:32.986Z","logger":"controller.gitrepository","msg":"Reconciliation finished in 3.763027654s, next run in 1m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system"}

Because of my namespace issue linked above, the automations were correct but not applying anything. With log level debug I figured this out, though I should have been able to see it at the default info log level too.

I also looked with wireshark/tcpdump but I was only seeing successful ssh connections...

So I haven't really solved the problem, I've just discovered that it's working despite the error...

@rjhenry

rjhenry commented Oct 4, 2021

Very similar situation here; on a different cluster I administer we had a successful automation. Taking a look at the logs shows (neatly formatted for your viewing pleasure):

{
  "level": "error",
  "ts": "2021-10-04T07:12:30.488Z",
  "logger": "controller-runtime.manager.controller.imageupdateautomation",
  "msg": "Reconciler error",
  "reconciler group": "image.toolkit.fluxcd.io",
  "reconciler kind": "ImageUpdateAutomation",
  "name": "flux-system",
  "namespace": "flux-system",
  "error": "unable to clone 'ssh://git@github.com/<redacted>/<redacted>-gitops', error: SSH could not read data: Error waiting on socket"
}
{
  "level": "info",
  "ts": "2021-10-04T07:56:18.308Z",
  "logger": "controller-runtime.manager.controller.imageupdateautomation",
  "msg": "pushed commit to origin",
  "reconciler group": "image.toolkit.fluxcd.io",
  "reconciler kind": "ImageUpdateAutomation",
  "name": "flux-system",
  "namespace": "flux-system",
  "revision": "afc8f467fb59361c55a3f80c94e31cefe96e43d1",
  "branch": "main"
}
{
  "level": "error",
  "ts": "2021-10-04T07:56:26.055Z",
  "logger": "controller-runtime.manager.controller.imageupdateautomation",
  "msg": "Reconciler error",
  "reconciler group": "image.toolkit.fluxcd.io",
  "reconciler kind": "ImageUpdateAutomation",
  "name": "flux-system",
  "namespace": "flux-system",
  "error": "unable to clone 'ssh://git@github.com/<redacted>/<redacted>-gitops', error: SSH could not read data: Error waiting on socket"
}

No changes were made to the controller, GitRepository, or really anything else in that time.

@hiddeco
Member

hiddeco commented Oct 9, 2021

You all may want to try out the latest releases of the source-controller (v0.16.0) and image-automation-controller (v0.15.0), as these contain libgit2 linked against OpenSSL and LibSSH2. Based on my research and extensive testing, this should solve most issues around private key formats and/or SSH transports, with an exception for ECDSA* related host key issues (which will be solved once we depend on libgit2 >=1.2.0).

@timja

timja commented Oct 13, 2021

@hiddeco

We're still seeing this with:

ghcr.io/fluxcd/helm-controller:v0.12.0
ghcr.io/fluxcd/image-automation-controller:v0.15.0
ghcr.io/fluxcd/image-reflector-controller:v0.12.0
ghcr.io/fluxcd/kustomize-controller:v0.15.5
ghcr.io/fluxcd/notification-controller:v0.17.1
ghcr.io/fluxcd/source-controller:v0.16.0

ssh destination is github.com

@hiddeco
Member

hiddeco commented Oct 14, 2021

@timja we have a suspicion this is due to underlying C objects not getting freed in time; but given the undocumented state of libgit2/git2go, this is very much guess work.

I created #452, which most importantly frees the Repository object which is responsible for the lifetime of the connection, which should ensure any previous connections are properly cleaned up.

It would be great if you could give this a test run, as the issue is kind of hard to replicate for us. (If you prefer testing it using the image-automation-controller, please shout, as I can port the changes to such a test image as well).

@rjhenry

rjhenry commented Oct 18, 2021

@hiddeco If you could port the same changes over to the image-automation-controller, that'd be great - I'd be keen to test this out for you.

@hiddeco
Member

hiddeco commented Oct 18, 2021

@rjhenry thanks a lot! 🙇🥇 Did a quick port, and the image details can be found at fluxcd/image-automation-controller#238

@timja

timja commented Oct 19, 2021

Same error:

{"level":"error","ts":"2021-10-19T09:29:41.416Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"Reconciler error","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"imageautomation","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/hmcts/cnp-flux-config', error: SSH could not read data: Error waiting on socket"}

Running:

docker.io/hiddeco/image-automation-controller:libgit2-free-e9bdffc
docker.io/hiddeco/source-controller:libgit2-free-a62cfe8

Can we provide anything else to help?

@hiddeco
Member

hiddeco commented Oct 19, 2021

After having cloned the repository myself, I now have the suspicion it's due to size. Even on my 1Gbps connection, it seems to be relatively slow (but I think the default timeout is 30s 🤔).

$ git clone ssh://git@github.com/hmcts/cnp-flux-config
Cloning into 'cnp-flux-config'...
remote: Enumerating objects: 675113, done.
remote: Counting objects: 100% (1374/1374), done.
remote: Compressing objects: 100% (501/501), done.
remote: Total 675113 (delta 991), reused 1220 (delta 861), pack-reused 673739
Receiving objects: 100% (675113/675113), 88.49 MiB | 8.00 MiB/s, done.
Resolving deltas: 100% (550836/550836), done.
~/Projects took 16s
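
As a rough check on the size theory, the numbers in the clone output above can be turned into a back-of-the-envelope estimate. This Python sketch assumes the 30 s default timeout mentioned above (stated there as a guess, not confirmed) and treats the clone as a pure transfer at a constant rate, ignoring the SSH handshake and delta resolution. The function and constant names are hypothetical.

```python
def transfer_seconds(size_mib: float, rate_mib_s: float) -> float:
    """Time to move size_mib at a constant rate_mib_s (transfer only)."""
    if rate_mib_s <= 0:
        raise ValueError("rate must be positive")
    return size_mib / rate_mib_s


def min_rate_for_timeout(size_mib: float, timeout_s: float) -> float:
    """Lowest sustained rate (MiB/s) that still finishes within timeout_s."""
    if timeout_s <= 0:
        raise ValueError("timeout must be positive")
    return size_mib / timeout_s


def exceeds_timeout(size_mib: float, rate_mib_s: float, timeout_s: float) -> bool:
    """True if the transfer alone would take longer than timeout_s."""
    return transfer_seconds(size_mib, rate_mib_s) > timeout_s


# Values taken from the "Receiving objects" line in the clone output above.
PACK_SIZE_MIB = 88.49
OBSERVED_RATE_MIB_S = 8.00
# Assumption: the 30 s default timeout is a guess from the comment above.
ASSUMED_TIMEOUT_S = 30.0

if __name__ == "__main__":
    print(round(transfer_seconds(PACK_SIZE_MIB, OBSERVED_RATE_MIB_S), 2))
    print(round(min_rate_for_timeout(PACK_SIZE_MIB, ASSUMED_TIMEOUT_S), 2))
    print(exceeds_timeout(PACK_SIZE_MIB, OBSERVED_RATE_MIB_S, ASSUMED_TIMEOUT_S))
```

At the observed 8 MiB/s the transfer alone takes about 11 s, so under these assumptions a 30 s limit would only be hit if sustained throughput fell below roughly 3 MiB/s.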

@timja

timja commented Oct 19, 2021

I was also wondering if that's an issue. On Flux v1 we had to increase some timeouts, so I will try that out, thanks!

@uderik

uderik commented Mar 30, 2022

Hi, no errors for almost 2 days

@uderik

uderik commented Apr 3, 2022

After some time, image-automation-controller:0.21.3 can't connect to GitLab and fails with this error:
unable to clone 'ssh://git@gitlabhost/gitops/my-gitops': transport close (potentially due to a timeout)
After a pod restart it returns to normal operation.

@pjbgf
Member

pjbgf commented Apr 4, 2022

@uderik did it stop working after the error? IAC should be able to self-heal from such an issue and just carry on with business as usual.

What's the interval you currently have set for the repository that had the issue?

@peterfication
Contributor

I had the same issue and tried the new version with the EXPERIMENTAL_GIT_TRANSPORT set to true and now there are no errors for me anymore as well 💪

source-controller: 0.22.5
image-automation-controller: 0.21.3

@uderik

uderik commented Apr 4, 2022

@pjbgf on two clusters (us-east-1 and ap-southeast-1) I see the same issue: many "transport close (potentially due to a timeout)" errors and no updates.
On another cluster (eu-central-1) I see pod restarts, with this as the last error:

[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x16f19c0]
runtime stack:
runtime.throw({0x1da3c46, 0xc0008dc8f8})
        runtime/panic.go:1198 +0x71
runtime.sigpanic()

Maybe it has something to do with the location of the clusters; GitLab is located in eu-central-1.

The GitRepository interval is 1min and the timeout is 2min.

See the attached pod crash log:
pod_crash.log

update:
log with first issue
image-controller.log

@mkoertgen

mkoertgen commented Apr 4, 2022

Same for me as observed by @uderik

@uderik

uderik commented May 11, 2022

problem still present

@pjbgf
Member

pjbgf commented May 11, 2022

@uderik would you mind testing the versions below and confirm whether that fixes your issue?

- source-controller: 
quay.io/paulinhu/source-controller:v0.24.4-cacheless@sha256:61930cad1da900f209b396f20c2f7740ff32b5cf1bb4ab7892200790c00a5f4b

- image-automation-controller:
quay.io/paulinhu/image-automation-controller:v0.22.2-cacheless@sha256:87823667cfc4c6e395d996ceaee92a1b5059a8950884f2c3aa49488dcbed81f5

A user with a similar issue found that this resolved it. This is related to the in-flight PR: #713

Test images based on version 830771f.

@uderik

uderik commented May 12, 2022

@pjbgf the CrashLoopBackOff log is attached. It works, but it has already crashed 4 times (in 13 hours) with these errors, on all 3 envs where I updated the images for the source/image-automation deployments.
image-automation-controller.log

@pjbgf
Member

pjbgf commented May 12, 2022

@uderik thanks again for sharing. I managed to reproduce, and in my environment the changes seem to have fixed the problem. The new version has some changes we recently merged into main, which recover from git2go/libgit2 panics. Therefore, if this happens again you will see errors in the logs, but no crashes/restarts.

The PR is now updated and a new image created for source-controller:
ghcr.io/fluxcd/source-controller:rc-6d517589

Please let me know how you get on.

xref: #713 (comment)

UPDATE: changed the image with an official source-controller release candidate.

@Nosmoht

Nosmoht commented May 17, 2022

Using

ghcr.io/fluxcd/source-controller:v0.24.4
ghcr.io/fluxcd/image-automation-controller:v0.22.1

on GKE results in the same error. Surprisingly, it was working for some days without any change. The source is GitLab.

@pjbgf pjbgf added this to the GA milestone May 17, 2022
@pjbgf
Member

pjbgf commented May 17, 2022

@Nosmoht occurrences of this issue tend to be intermittent and resolve themselves. We have a few changes that should be released soon and should decrease their likelihood. Once they are merged/released I will share on this thread.

@Nosmoht

Nosmoht commented May 23, 2022

@pjbgf any idea when a new release will be published?

@pjbgf
Member

pjbgf commented May 24, 2022

@Nosmoht We are aiming to have a release done between the end of this week and beginning of next.

@pjbgf
Member

pjbgf commented May 27, 2022

@uderik @Nosmoht here's the release candidate for source controller: ghcr.io/fluxcd/source-controller:rc-4b3e0f9a

Can you please give it a go and let me know whether it resolved your issues?

@pjbgf pjbgf removed the blocked/upstream Blocked by an upstream dependency or issue label May 27, 2022
@Nosmoht

Nosmoht commented May 30, 2022

@pjbgf I have the error within the image-automation-controller. Does it help if I use another source-controller image?

@uderik

uderik commented May 30, 2022

@pjbgf image-automation-controller (image-automation-controller:v0.22.2-cacheless and "Enabling experimental managed transport")
source-controller tag rc-4b3e0f9a
still restarting with:

{"level":"info","ts":"2022-05-30T12:54:37.672Z","logger":"controller.imageupdateautomation","msg":"Starting workers","reconciler group":"image.toolkit.fluxc
panic: runtime error: invalid memory address or nil pointer dereference
panic: invalid pointer handle: 0x7f97cb5a6c80
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x14d6754]
goroutine 529 [running]:
github.com/libgit2/git2go/v33.(*HandleList).Get(0x8, 0x7f97cb5a6c80)
        github.com/libgit2/git2go/v33@v33.0.9/handles.go:64 +0x12e
github.com/libgit2/git2go/v33.credentialsCallback(0x7f97cddb4c10, 0xc000d32010, 0x2200000000001b, 0x1d85cc0, 0x12, 0x13)
        github.com/libgit2/git2go/v33@v33.0.9/remote.go:373 +0x45
github.com/libgit2/git2go/v33._Cfunc_git_transport_smart_credentials(0xc000d32010, 0x7f97ccfa7130, 0x0, 0x40)
        _cgo_gotypes.go:8811 +0x4c

@serbaut

serbaut commented Jun 1, 2022

quay.io/paulinhu/image-automation-controller:v0.22.2-cacheless seems to have resolved the issue for me.

@Nosmoht

Nosmoht commented Jun 2, 2022

Just tried with quay.io/paulinhu/image-automation-controller:v0.22.2-cacheless but still get the same issue.

unable to clone '#####'
SSH could not read data: Error waiting on socket

@aryan9600
Member

Hi @Nosmoht, we are in the process of getting some improvements merged, which should fix this. Could you try this image ghcr.io/fluxcd/image-automation-controller:rc-48bcca59 and confirm whether it solves the issue for you?

@Nosmoht

Nosmoht commented Jun 2, 2022

Hi @aryan9600,

I tried it; the SSH error is replaced by the following:

{"level":"error","ts":"2022-06-02T08:22:41.221Z","logger":"controller.imageupdateautomation","msg":"Reconciler error","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"app","namespace":"flux-system","error":"unable to fetch-connect to remote 'ssh://git@gitlab.com/app/app.git': ssh: handshake failed: hostkey could not be verified"}

It seems like I need GitLab's SSH key inside the container. Is there an easy way to do that? I bootstrap Flux with Terraform using the Flux provider.

@aryan9600
Member

I'm not familiar with the terraform provider, but the GitRepository source needs to have a .spec.secretRef, which provides a reference to the secret containing the private key and the password. You can create one using the cli: https://fluxcd.io/docs/cmd/flux_create_secret_git/

@Nosmoht

Nosmoht commented Jun 2, 2022

Hi @aryan9600,

as it is the flux-system GitRepository, it already uses the flux-system secret, which contains identity, identity.pub and known_hosts.

@jakubhajek

Hello Flux maintainers!

Thanks a lot for your great work!

I wanted to let you know that this issue can be closed; it is solved with the latest Flux 0.31.0 release. I have already tested the latest release and it seems that I no longer experience the issue in any of my environments.

Repository owner moved this from In Progress to Done in Maintainers' Focus Jun 7, 2022
@uderik

uderik commented Jun 9, 2022

Looks like it's really fixed now, many thanks
