Controller stops reconciling, needs restart #282
Comments
This coincided with outages at AWS in us-west-2, where I'm running Kubernetes on EKS. My entire cluster went down that day for about 10 minutes, and when it came back up I ran into this issue. |
Hi, another observation is that in most cases restarting image-automation-controller is sufficient, but there were two times when we needed to also restart source-controller. |
@bondido how about something like this: kubectl get pod -n flux-system
NAME READY STATUS RESTARTS AGE
helm-controller-96dc99bfd-p9g4q 1/1 Running 0 9d
image-automation-controller-64c955c59-ckbft 1/1 Running 0 9d <<< ------
image-reflector-controller-55fb7f765d-cr8zn 1/1 Running 0 9d
kustomize-controller-7bc878f768-xf2xb 1/1 Running 0 9d
notification-controller-945795558-q8nht 1/1 Running 0 9d
source-controller-65665fd68f-n7qqz 1/1 Running 0 9d |
@jwerre |
Ah yes... I had the same problem; you need to restart the pod, e.g.:
|
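For reference, a restart along these lines should do it (the flux-system namespace and the deployment/label names are assumptions based on a standard Flux install):

```sh
# Restart the image-automation-controller Deployment (flux-system namespace assumed)
kubectl -n flux-system rollout restart deployment image-automation-controller

# Or delete the pod directly and let the ReplicaSet recreate it
kubectl -n flux-system delete pod -l app=image-automation-controller
```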
Sure @jwerre, I know. Thanks :-) As a temporary kind of "automatic mitigation", we introduced a scheduled external script checking .status.lastAutomationRunTime of the ImageUpdateAutomation resource (image-automation-controller/config/crd/bases/image.toolkit.fluxcd.io_imageupdateautomations.yaml, line 170 in 041018f).
We'd love to see this bug fixed soon, anyway :-) |
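A minimal sketch of such a check, assuming GNU date, a 30-minute staleness threshold, and an ImageUpdateAutomation named flux-system (all of which are assumptions, not the script described above):

```sh
#!/usr/bin/env sh
# Restart the controller if the automation has not run for more than 30 minutes.
# Resource name, namespace and threshold are assumptions; adjust to your setup.
LAST_RUN=$(kubectl -n flux-system get imageupdateautomation flux-system \
  -o jsonpath='{.status.lastAutomationRunTime}')
NOW=$(date -u +%s)
THEN=$(date -u -d "$LAST_RUN" +%s)   # GNU date parses the RFC3339 timestamp
if [ $((NOW - THEN)) -gt 1800 ]; then
  kubectl -n flux-system rollout restart deployment image-automation-controller
fi
```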
@bondido, I misunderstood your question. Sorry for the confusion. |
@bondido Do you mean that source-controller has the same problem; or, that to get image-automation-controller to start working again, you needed to restart source-controller? |
We have to restart source-controller for image-automation-controller to start working. In fact we restart both: first image-automation-controller, and if we don't see any improvement within a couple of minutes, source-controller. So far, we haven't tried restarting just source-controller. |
@bondido Thanks for elaborating! On the face of it, I wouldn't expect restarting source-controller to have any effect on image-automation-controller, because it works independently: it only coincidentally refers to the same GitRepository objects, and doesn't alter anything at the upstream git repository (that might "unlock" image-automation-controller). Do you have good evidence that restarting source-controller is exactly what unblocks image-automation-controller; or could it be a sort of "reliable coincidence"? |
I can't be 100% sure, as I couldn't get to any logs or metrics confirming what was actually happening. In the first two cases, restarting image-automation-controller was enough and new images were applied to the cluster just seconds after the restart. The situation repeated exactly like the above one more time. |
|
I should mention that I haven't had any problems since I restarted the controller pod the first time. |
Hello,
In my case, on the stuck controller, there is a directory in /tmp named after the GitRepository source of the frozen ImageUpdateAutomation. A simple restart of the automation controller is enough to unblock the frozen ImageUpdateAutomation. |
I've gone to some lengths to try reproducing this issue. I ran image-automation-controller with a larger than average git repo (stuffed with several mp4 video files), ramped up all of the unfavorable network conditions (packet loss, latency) with Chaos Mesh, and reconfigured liveness checks so that image-automation-controller wouldn't be restarted due to network reasons (which was tricky, because it actually needs the network in order to perform the leader election). With all webhooks configured as receivers for image and git events to make sure everything happens quickly after each commit/image release, I ran this for several hours with updates every 45 seconds, and I wasn't able to get the image-automation-controller into any stuck or hanging state. I was able to cause it to stop working due to heavy packet loss, but nothing I did seemed to induce any sort of hanging behavior. (When the unfavorable conditions abated, the controller always recovered and went back to committing and pushing changes for me.) If anyone knows what type of network issue or abnormal response from GitHub triggers the condition, then surely I can reproduce it and make progress on this issue, but right now I have not made significant progress. |
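For anyone who wants to repeat that kind of experiment, a Chaos Mesh NetworkChaos manifest roughly like this injects packet loss into the controller pod (the selector, percentages and duration are illustrative, not the exact values used above):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: iac-packet-loss
  namespace: flux-system
spec:
  action: loss             # drop a percentage of packets
  mode: all                # apply to every pod matched by the selector
  selector:
    namespaces:
      - flux-system
    labelSelectors:
      app: image-automation-controller
  loss:
    loss: "30"             # 30% packet loss
    correlation: "25"
  duration: "10m"
```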
After more than 16 days since the last problems, the controller has just gotten "stuck" on one of our clusters.
|
Another example: on a cluster with 16 different ImageUpdateAutomation resources, I have 3 of them "stuck".
Can the image from #297 resolve this issue? |
Hello,
|
With the release of Flux |
@hiddeco I installed the new Flux yesterday and pushed an image today; it seems to also work for me now. |
I managed to reproduce this locally; I am running against the latest IAC version. Just to help further investigations, I will relay some of my observations/assumptions here as I progress. By analysing the pprof endpoint, I noticed that the time elapsed in minutes since the last IAC reconciliation log message seems to match the running time of the thread/goroutine below, leading me to think that libgit2 is where the reconciliation is blocking.
The container is still operational (I can exec into it) and other goroutines seem to be working as expected. For an automatic restart, users could leverage the
|
The image-automation-controller version v0.21.0 introduces an experimental transport that fixes the issue in which the controller stops working in some specific scenarios. The experimental transport needs to be opted in to by setting the environment variable EXPERIMENTAL_GIT_TRANSPORT to "true". Due to changes in other Flux components, it is recommended that all components are deployed at their latest versions; the recommended approach is via the usual Flux upgrade procedure. It would be great if users experiencing this issue could test it again with the experimental transport enabled and let us know whether the issue persists. |
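For example, a strategic-merge patch added to the flux-system Kustomization can set the variable on the controller Deployment (a sketch; the container name "manager" matches the jsonpath used later in this thread, everything else assumes the standard install layout):

```yaml
# Patch enabling the experimental transport on the controller Deployment (sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: image-automation-controller
  namespace: flux-system
spec:
  template:
    spec:
      containers:
        - name: manager
          env:
            - name: EXPERIMENTAL_GIT_TRANSPORT
              value: "true"
```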
Hi @pjbgf, thank you for the update. I deployed it, but the controller appears to be stuck again.

Sample metrics:

workqueue_longest_running_processor_seconds{name="imageupdateautomation"} 3302.405028565
workqueue_queue_duration_seconds_bucket{le="+Inf", name="imageupdateautomation"} 3

The experimental transport is enabled:

$ kubectl --namespace=flux-system exec -ti image-automation-controller-7995f48c77-g99qd -- \
    printenv EXPERIMENTAL_GIT_TRANSPORT
true

Version:

$ kubectl --namespace=flux-system get pod image-automation-controller-7995f48c77-g99qd \
    --output=jsonpath='{.spec.containers[?(@.name=="manager")].image}'
ghcr.io/fluxcd/image-automation-controller:v0.21.0

Nothing specific in the logs. Our interval is |
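Given those metrics, a Prometheus alerting rule along these lines can at least flag a stalled controller (the 30-minute threshold and the rule/label names are assumptions):

```yaml
groups:
  - name: flux-image-automation
    rules:
      - alert: ImageUpdateAutomationStalled
        # Fires when a single work item has been processed for more than 30 minutes
        expr: workqueue_longest_running_processor_seconds{name="imageupdateautomation"} > 1800
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: image-automation-controller appears stuck on a single reconciliation
```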
@maxbrunet thank you for the quick response. Would you be able to collect a profile and share it either here or on Slack, please? |
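For reference, a goroutine dump can be collected roughly like this (that pprof is served on the metrics port 8080 is an assumption; adjust to your deployment flags):

```sh
# Port-forward to the controller and grab a full goroutine dump from the pprof handler
kubectl -n flux-system port-forward deploy/image-automation-controller 8080:8080 &
curl -s "http://localhost:8080/debug/pprof/goroutine?debug=2" > goroutines.txt
```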
Here is the output of |
@maxbrunet thank you again for testing and providing the details so promptly. Here's more information on how to test: fluxcd/source-controller#636 (comment) |
Hi @pjbgf, I have tried to deploy the latest versions of image-automation-controller and source-controller, and both ended up panicking; traces below.

image-automation-controller - panic trace
source-controller - panic trace
GitRepository + Secret:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: my-repo
  namespace: flux-system
spec:
  gitImplementation: libgit2
  interval: 1m0s
  ref:
    branch: master
  url: ssh://git@bitbucket.org/my-workspace/my-repo.git
  secretRef:
    name: flux-git-credentials
---
apiVersion: v1
kind: Secret
metadata:
  name: flux-git-credentials
  namespace: flux-system
stringData:
  identity: |
    -----BEGIN OPENSSH PRIVATE KEY-----
    ...
    -----END OPENSSH PRIVATE KEY-----
  known_hosts: bitbucket.org ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAubiN81eDcafrgMeLzaFPsw2kNvEcqTKl/VqLat/MaB33pZy0y3rJZtnqwR2qOOvbwKZYKiEO1O6VqNEBxKvJJelCq0dTXWT5pbO2gDXC6h6QDXCaHo6pOHGPUy+YBaGQRGuSusMEASYiWunYN0vCAI8QaXnWMXNMdFP3jHAJH0eDsoiGnLPBlBp4TNm6rYI74nMzgz3B9IikW4WVK+dc8KZJZWYjAuORU3jc1c/NPskD2ASinf8v3xnfXeukU0sJ5N6m5E8VLjObPEO+mN2t/FZTMZLiFqPWc/ALSqnMnnhwrNi2rbfg/rd/IpL8Le3pSBne8+seeFVBoGqzHM9yXw==
type: Opaque

FYI I had not realized I needed to change |
@maxbrunet we have made some improvements that may fix the issue you are experiencing. I have a release candidate for source-controller (below) that is based on a pending PR. Can you please test the image below and let us know whether that fixes your problem? |
Hey @pjbgf, no, sorry, I used Flux with my previous employer, and I am not working with it at the moment |
@maxbrunet no worries, thank you for all the help so far. |
This should be fixed as part of the managed transport improvements and the enforcement of context timeouts.
|
Closing this for lack of activity. Similarly reported issues have been confirmed to be fixed. Now with Managed Transport enforcing timeouts for Git operations, this should be resolved. If it reoccurs, given the sheer amount of changes that happened on the Git implementation in the last 6 months, we are better off creating a new issue, linking back to this one. |
@maxbrunet how do you obtain these panic traces? I've port-forwarded the pprof endpoints and can get debug info on request, but how do you get information when the process panics? |
Stack traces are dumped in the logs when the process panics; you can get the logs from the last restart with |
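For example (namespace and workload name assumed from a standard Flux install):

```sh
# Logs from the previous (crashed/restarted) container instance
kubectl -n flux-system logs deploy/image-automation-controller --previous
```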
Ah, ok, thank you. |
Reported here: fluxcd/flux2#2219
Having an automation that should reconcile every 7 minutes:
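For illustration only (the original manifest is not reproduced here; every name and path below is an assumption), an automation with a 7-minute interval looks roughly like this:

```yaml
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 7m0s            # reconcile every 7 minutes
  sourceRef:
    kind: GitRepository
    name: flux-system
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: fluxcdbot
        email: fluxcdbot@users.noreply.github.com
    push:
      branch: main
  update:
    path: ./clusters/my-cluster
    strategy: Setters
```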
The reconciliation stopped two days ago for unknown reasons: