ArgoCD stuck on "Refresh" #4044
Comments
Could you provide logs of the controller when it's in this state?
We've restarted the argocd-application-controller pod and now the problem seems to be fixed. Applications can be created and are no longer stuck on refresh. Please let me know if you want to investigate this issue any further and maybe I can offer additional info!
Yep, this also happened to me. I fixed it multiple times by restarting the argocd-application-controller.
We see the same behavior with 1.5.4 and 1.6.2. The application keeps "refreshing" indefinitely (every second) and we don't see the replica sets and pods of the deployments, although they are present on the cluster. If we delete the application, the replica sets and pods appear until everything is deleted. Also, the deployments do not show any info and we get the "Unable to load data: cache: key is missing" error when we click on them. We tried to delete the application and resync. The issue reappears after a while and is sporadic. Not all apps are affected. We are running ArgoCD 1.5.4 and 1.6.2 on OpenShift 3.11 / Kubernetes 1.11 clusters. Restarting redis, argocd-server or the controller doesn't help. There are no errors in any of the logs.
I was able to get into this state during an upgrade from 1.5.1 to 1.6.2, specifically when updating the argocd-server and argocd-application-controller.
[error from UI]
We figured out what was causing the issue in our case. We use a secret generator operator for generating randomized secrets. In our kustomizations, there are some secrets with an empty data key.
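For illustration, a minimal sketch (not taken from the original comment; the name and shape are hypothetical) of the kind of manifest being described, a Secret whose data key is present but empty:

```yaml
# Hypothetical Secret with an empty data key, the pattern reported above
# to leave the application stuck on refresh.
apiVersion: v1
kind: Secret
metadata:
  name: generated-credentials   # placeholder name
type: Opaque
data: {}                        # empty; a secret generator operator fills this in later
```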
This happened to me, too. I'm setting up a clean k8s cluster (1.17.9) on AKS and a clean ArgoCD v1.6.2+3d1f37b. The application is cert-manager, described by a simple kustomization file:

resources:
- https://github.com/jetstack/cert-manager/releases/download/v0.16.1/cert-manager.yaml

The application is set up with:

ignoreDifferences:
- group: apiextensions.k8s.io
  jsonPointers:
  - /status
  kind: CustomResourceDefinition

After the initial sync, the application seems synced with no problem. ArgoCD also reports the last sync operation as successful. However, the health remains missing and the sync status remains out of sync. If I click refresh, it gets stuck in the refreshing state.
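For context, the ignoreDifferences block above belongs in the Application spec. A minimal sketch of such an Application, with a hypothetical name and repository URL (only the ignoreDifferences section mirrors the comment above):

```yaml
# Illustrative Application manifest showing where ignoreDifferences sits.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager                         # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/infra.git   # placeholder repository
    path: cert-manager
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager
  ignoreDifferences:
  - group: apiextensions.k8s.io
    kind: CustomResourceDefinition
    jsonPointers:
    - /status
```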
We looked into this a bit more. We see the same behavior on Kubernetes v1.15, v1.16, and v1.17 using ArgoCD v1.5 and v1.6.
Tried upgrading again, this time from 1.5.1 to 1.7.1, and experienced the same behavior. I am upgrading individual components one at a time, and everything works until I sync the …
1.7.2 appears to have fixed this refresh loop for me, but not without some hand-holding. I'm going to keep a close eye on it. This is what worked for me:
It's not clear to me which of these steps actually solved the problem.
I think we're seeing something similar on 1.7.6. Refreshing is taking longer than it used to and App Diffs (either the diff for the entire app or individual resources) are not returning any results. When trying to view diffs the UI throws this error:
And in argocd-server logs I see this message:
Edit: Restarting both the argocd-application-controller and argocd-server pods had no effect. Restarting the Redis pod is what ultimately fixed it for me.
Hello there, we are facing the same issue: as stated by @stefanhenseler, the issue (in our case) was also an empty secret file (without actual data). Deleting it "solved" the issue. Restarting the pods (controller, redis, server) didn't help.
Just another bump on this thread. I too have been experimenting with ArgoCD and ran into this issue. Running v1.8.0+fdb5ada, which is pretty fresh.
Restarting Redis had no effect. For me, …
For me it was the application controller running :latest while the rest were running a fixed version. Removing "image: argoproj/argocd:latest" and "imagePullPolicy: Always" from the application controller manifest solved the issue.
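As a rough sketch of the fix described above (the container excerpt and version tag are illustrative, not the commenter's exact manifest):

```yaml
# Illustrative container spec for the application controller with a pinned image
# instead of :latest, so all ArgoCD components run the same fixed version.
containers:
- name: argocd-application-controller
  image: argoproj/argocd:v1.8.1    # placeholder tag; pin to the version used by the other components
  imagePullPolicy: IfNotPresent    # avoid Always, which re-pulls :latest on every restart
```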
Still encountering this issue in v2.4.8.
I am also facing this issue with v2.4.12. Restarting the application controller statefulset seems to resolve the issue, and it can also be recreated consistently when the cluster cache invalidation sequence is triggered on the application controller replicas. One way to trigger the sequence is to restart/delete one of the two deployed argocd-server pods (we are running with 2 argocd-server instances). That triggers cluster cache invalidation and reinitialization in all the application controller instances (we are running with 3 instances), and one of the three replicas shows the hang/refresh-stuck issue. The applications handled by this problematic replica are stuck at refresh indefinitely, with minimal logging, a drop in CPU usage (almost flat to zero), and constant memory thereafter. I am not sure about the root cause for this issue so I opened another one, #10842, with all the details and logs.
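For context, a sketch of the kind of scaled-out controller setup described above; the replica count mirrors the report, and the env-var-based sharding follows the upstream HA manifests (treat the exact fields as an assumption, this is only an excerpt, not a complete StatefulSet):

```yaml
# Illustrative excerpt of the argocd-application-controller StatefulSet scaled to
# three replicas; each replica handles a shard of the managed clusters.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: argocd-application-controller
        env:
        - name: ARGOCD_CONTROLLER_REPLICAS
          value: "3"   # should match spec.replicas so every cluster is assigned to a shard
```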
The root cause for my case turns out to be an extra-large list of CRDs caused by a bug in cert-manager. Cluster cache initialization/refresh was blocked on listing all the resources. |
For me, it seems that a recent commit introduced an RLock() in controller/cache/cache.go which is leading to a deadlock scenario (argo-cd/controller/cache/cache.go, line 375 in b0dab38).
Without releasing the write lock taken above, the code tries to acquire the RLock here (argo-cd/controller/cache/cache.go, line 404 in b0dab38); in Go, taking a read lock on a sync.RWMutex that is already write-locked blocks until the write lock is released, which never happens on this path.
Since the above change is recent and this issue might be a different one (it was opened before the above change), I have updated the detailed analysis with the goroutine stack under #10842, with the comment …
We are on …; please just bear in mind that not everyone would have access to the underlying infrastructure to restart the pod, so this could be a big issue for some and a smaller one for others. I'll add this same comment on #10842 for completeness.
Had the same issue when upgrading from 2.6.0 to 2.6.1; had to restart the argocd-application-controller.
Fixed by #13636
Anyone facing this issue with 2.7.9 or the v2.8.0-rc5 pre-release?
@alexandresavicki could you open a new issue with full details?
Just happened to face the same issue as well, related to upgrading from v2.7.9 to v2.8.0. One of the Helm charts had an invalid reference, and the self-managed ArgoCD controller got stuck in a loop: even when all pods are recreated, it remains stuck in the same state.

time="2023-09-03T17:54:20Z" level=error msg="Failed to cache app resources: error getting resource tree: failed to get namespace top-level resources: error synchronizing cache state : failed to sync cluster https://10.43.0.1:443: failed to load initial state of resource Redis.redis.redis.opstreelabs.in: conversion webhook for redis.redis.opstreelabs.in/v1beta1, Kind=Redis failed

Deleting the invalid CRD and manually recreating both the operator resources and the CRD itself does not work. If anybody has a debugging suggestion other than reinstating an ETCD backup, it would be very helpful.
Hi! We are using ArgoCD v1.5.4+36bade7 to orchestrate our applications in our EKS k8s cluster. We tried editing the argocd-server ConfigMap to make it ignore differences on deployment replicas, following the documentation (https://argoproj.github.io/argo-cd/user-guide/diffing/). We changed the ConfigMap using:
After that, I redeployed the deployment using
kubectl rollout restart deployment argocd-server -n argocd
it worked, but it wasn't what we were expecting, so we removed the resource.customizations from the ConfigMap and redeployed argocd-server again. Now, after the redeployment, every application is stuck on refresh and I cannot see the pods (shown in the attached photo), but they are present in the cluster, as I can see them with the kubectl get pods command... Is this something that has to do with restarting the deployment? I can see the GitHub repositories where we keep the charts, but maybe we lost the connection with the cluster?
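For reference, a sketch of the kind of resource.customizations entry the diffing documentation describes for ignoring deployment replicas. The actual ConfigMap change from the report above was not captured, so this is an assumption, not the reporter's exact configuration:

```yaml
# Hypothetical argocd-cm snippet (not the reporter's actual change) telling
# ArgoCD to ignore spec.replicas when diffing Deployments.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  resource.customizations: |
    apps/Deployment:
      ignoreDifferences: |
        jsonPointers:
        - /spec/replicas
```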