
ArgoCD stuck on "Refresh" #4044

Closed
razvanonet opened this issue Aug 4, 2020 · 25 comments
Labels
bug/severity:major (Malfunction in one of the core component, impacting a majority of users), bug (Something isn't working)

Comments

@razvanonet

Hi! We are using ArgoCD v1.5.4+36bade7 to orchestrate our applications in our EKS Kubernetes cluster. We tried editing the argocd-server ConfigMap to make it ignore differences in Deployment replicas, following the diffing documentation (https://argoproj.github.io/argo-cd/user-guide/diffing/). We changed the ConfigMap using:

data:
  resource.customizations: |
    apps/Deployment:
      ignoreDifferences: |
        jsonPointers:
        - /spec/replicas

After I restarted the deployment with kubectl rollout restart deployment argocd-server -n argocd it worked, but it wasn't what we were expecting, so we removed resource.customizations from the ConfigMap and redeployed argocd-server again. Now, after the redeployment, every application is stuck on refresh, and I cannot see pods in the UI (see photo) even though they are present in the cluster and visible with kubectl get pods. Is this something to do with restarting the deployment? I can still see the GitHub repositories where we keep the charts, but maybe we lost the connection to the cluster?
[Screenshot: 2020-08-04 at 13:33:58]
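
For reference, the diffing documentation linked above describes this system-level customization as living in the argocd-cm ConfigMap. A minimal sketch of the full ConfigMap, assuming the default argocd-cm name in the argocd namespace (the wrapper below is not part of the original report):

# Sketch only: system-level diff customization per the diffing docs.
# Assumes the default argocd-cm ConfigMap in the argocd namespace.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations: |
    apps/Deployment:
      ignoreDifferences: |
        jsonPointers:
        - /spec/replicas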

@jessesuen added the bug/severity:major, more-information-needed, and bug labels on Aug 4, 2020
@jessesuen
Member

Could you provide logs of the controller when it's in this state?

@razvanonet
Author

here are the logs from argocd-application-controller:

W0804 18:34:02.314991       1 reflector.go:299] github.com/argoproj/argo-cd/controller/appcontroller.go:414: watch of *v1alpha1.AppProject ended with: too old resource version: 22031003 (22078637)
time="2020-08-04T18:41:13Z" level=info msg="Alloc=14421 TotalAlloc=1125656414 Sys=764540 NumGC=111906 Goroutines=279"
time="2020-08-04T18:51:13Z" level=info msg="Alloc=15906 TotalAlloc=1125676576 Sys=764540 NumGC=111911 Goroutines=279"
time="2020-08-04T19:01:13Z" level=info msg="Alloc=21287 TotalAlloc=1125695928 Sys=764540 NumGC=111916 Goroutines=279"
W0804 19:09:36.307976       1 reflector.go:299] github.com/argoproj/argo-cd/util/settings/settings.go:645: watch of *v1.Secret ended with: too old resource version: 22021300 (22085836)
time="2020-08-04T19:11:13Z" level=info msg="Alloc=14540 TotalAlloc=1125709011 Sys=764540 NumGC=111921 Goroutines=279"
time="2020-08-04T19:21:13Z" level=info msg="Alloc=14424 TotalAlloc=1125723510 Sys=764540 NumGC=111926 Goroutines=279"
time="2020-08-04T19:31:13Z" level=info msg="Alloc=14411 TotalAlloc=1125741860 Sys=764540 NumGC=111931 Goroutines=279"
time="2020-08-04T19:41:13Z" level=info msg="Alloc=17281 TotalAlloc=1125764512 Sys=764540 NumGC=111936 Goroutines=279"
time="2020-08-04T19:51:13Z" level=info msg="Alloc=14419 TotalAlloc=1125780779 Sys=764540 NumGC=111941 Goroutines=279"
time="2020-08-04T20:01:13Z" level=info msg="Alloc=14325 TotalAlloc=1125800102 Sys=764540 NumGC=111946 Goroutines=279"
time="2020-08-04T20:11:13Z" level=info msg="Alloc=14327 TotalAlloc=1125814483 Sys=764540 NumGC=111951 Goroutines=279"
time="2020-08-04T20:21:13Z" level=info msg="Alloc=17796 TotalAlloc=1125830563 Sys=764540 NumGC=111956 Goroutines=279"
time="2020-08-04T20:31:13Z" level=info msg="Alloc=14277 TotalAlloc=1125843125 Sys=764540 NumGC=111961 Goroutines=279"
time="2020-08-04T20:41:13Z" level=info msg="Alloc=14365 TotalAlloc=1125856453 Sys=764540 NumGC=111966 Goroutines=279"
time="2020-08-04T20:51:13Z" level=info msg="Alloc=14274 TotalAlloc=1125871398 Sys=764540 NumGC=111971 Goroutines=279"

and here are the logs from argocd-server:

time="2020-08-04T20:54:53Z" level=info msg="Requested app 'grafana' refresh"
time="2020-08-04T20:55:01Z" level=info msg="client watch grpc context closed"
time="2020-08-04T20:55:01Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=Watch grpc.service=application.ApplicationService grpc.start_time="2020-08-04T20:54:01Z" grpc.time_ms=60000.727 span.kind=server system=grpc
time="2020-08-04T20:55:01Z" level=info msg="k8s application watch event channel closed"

Also, I forgot to mention that when I try to create an app, its state shows as Unknown (see picture):
[Screenshot: 2020-08-04 at 23:57:06]

The no-response bot removed the more-information-needed label on Aug 4, 2020
@razvanonet
Author

We've restarted the argocd-application-controller pod and the problem now seems to be fixed. Applications can be created and are no longer stuck on refresh. Please let me know if you want to investigate this issue any further; I can offer additional info!

@imrenagi
Contributor

imrenagi commented Aug 6, 2020

Yep, this also happened to me. I've fixed it multiple times by restarting the argocd-application-controller.

@stefanhenseler

stefanhenseler commented Aug 17, 2020

We see the same behavior with 1.5.4 and 1.6.2. The application keeps "refreshing" indefinitely (every second), and we don't see the ReplicaSets and Pods of the Deployments even though they are present in the cluster. If we delete the application, the ReplicaSets and Pods appear until everything is deleted. The Deployments also show no info, and we get "Unable to load data: cache: key is missing" when we click on them. We tried deleting the application and resyncing; the issue reappears after a while and is sporadic. Not all apps are affected. We are running ArgoCD 1.5.4 and 1.6.2 on OpenShift 3.11 / Kubernetes 1.11 clusters. Restarting redis, argocd-server, or the controller doesn't help, and there are no errors in any of the logs.

@igaskin
Member

igaskin commented Aug 20, 2020

I was able to get into this state during an upgrade from 1.5.1 to 1.6.2; specifically, updating the argocd-server deployment is what triggered the bad state. I believe the load was enough to knock over the application controller.

argocd-server

argocd-server-58f6bb7cf-hb8qp argocd-server time="2020-08-20T00:22:09Z" level=error msg="finished unary call with code Unknown" error="cache: key is missing" grpc.code=Unknown grpc.method=ResourceTree grpc.service=application.ApplicationService grpc.start_time="2020-08-20T00:22:02Z" grpc.time_ms=6992.517 span.kind=server system=grpc

argocd-application-controller

  Warning  Unhealthy  16m (x2 over 20m)  kubelet, sjc04p1kubhv45  Liveness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  16m (x4 over 25m)  kubelet, sjc04p1kubhv45  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  16m (x5 over 25m)  kubelet, sjc04p1kubhv45  Liveness probe failed: Get http://10.233.84.229:8082/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  13m (x3 over 25m)  kubelet, sjc04p1kubhv45  Readiness probe failed: Get http://10.233.84.229:8082/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

error from UI

"https://10.88.0.1:443/apis/argoproj.io/v1alpha1/namespaces/argocd/applications/foobar-baz": dial tcp 10.88.0.1:443: connect: connection refused

@stefanhenseler

stefanhenseler commented Aug 20, 2020

We figured out what was causing the issue in our case. We use a secret generator operator to generate randomized secrets, and in our kustomizations there are some Secrets with an empty data key (like this: data: {}). In that case, OpenShift 3.11 (K8s v1.11) returns a null value when the resource is applied, which seems to cause the behavior. We were able to clearly repro and narrow the issue down to this, and it only seems to be a problem on older versions of K8s. As a workaround, we just had to add a dummy key to our empty Secrets. This isn't an issue for us because we ignore differences on the data property anyway (due to the secret generator we use).
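
A minimal sketch of that dummy-key workaround, with a hypothetical Secret name (none of the names below come from the thread):

# Hypothetical Secret illustrating the workaround: a dummy key keeps `data`
# from being empty, so the API server never returns data: null on apply.
apiVersion: v1
kind: Secret
metadata:
  name: generated-credentials   # placeholder; real values come from the secret generator
type: Opaque
data:
  placeholder: cGxhY2Vob2xkZXI=   # base64 of "placeholder"; diffs on data are ignored anyway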

@pinkplus

This happened to me, too.

I'm setting up a clean k8s cluster (1.17.9) on AKS and a clean ArgoCD v1.6.2+3d1f37b.

The application is cert-manager, described by a simple kustomization file:

resources:
  - https://github.com/jetstack/cert-manager/releases/download/v0.16.1/cert-manager.yaml

The application is set up with

  ignoreDifferences:
    - group: apiextensions.k8s.io
      jsonPointers:
        - /status
      kind: CustomResourceDefinition

After the initial sync, the application appears synced with no problem, and ArgoCD also reports the last sync operation as successful. However, health remains Missing and the sync status remains OutOfSync. If I click refresh, it gets stuck in the refreshing state.
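
For context, a minimal sketch of a full Application manifest combining that kustomization source with the ignoreDifferences block above; the repo URL, path, and project are placeholders, not taken from the comment:

# Hypothetical Application wiring the kustomization and ignoreDifferences together.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/my-org/gitops.git   # placeholder repo
    targetRevision: HEAD
    path: cert-manager                               # placeholder path to the kustomization
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager
  ignoreDifferences:
    - group: apiextensions.k8s.io
      kind: CustomResourceDefinition
      jsonPointers:
        - /status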

@stefanhenseler

stefanhenseler commented Aug 25, 2020

(Quoting my earlier comment from Aug 20 about the empty-secret workaround.)

We looked into this a bit more. We see the same behavior on Kubernetes v1.15, v1.16, and v1.17 using ArgoCD v1.5 and v1.6.

@igaskin
Member

igaskin commented Aug 27, 2020

Tried upgrading again, this time from 1.5.1 to 1.7.1, and experienced the same behavior. I am upgrading one component at a time, and everything works until I sync the argocd-server deployment, at which point all applications get stuck in a refresh loop. In the UI this is surfaced as a "Failed to load resource: the server responded with a status of 500 ()" error, specifically for the /api/v1/applications/<app-name>/resource-tree endpoint. Attached are some screenshots, including graphs showing an increase in goroutines. Restarting the application-controller does not resolve the refresh loop.

[Screenshots from 2020-08-26, including the goroutine graphs]

@igaskin
Member

igaskin commented Sep 1, 2020

1.7.2 appears to have fixed this refresh loop for me, but not without some hand-holding. I'm going to keep a close eye on it. This is what worked for me:

  • upgrade all components to 1.7.2 (saving argocd-server for last)
  • the system is now stuck in a "refresh" state
  • delete the argocd-application-controller
  • encountered Unable to load data: key is missing error
    • edit argocd-server deployment to use non-ha redis; --redis argocd-redis:6379
  • delete the argocd-application-controller
  • the system is no longer in a "refresh" state

It's not clear to me which of these steps actually solved the problem.
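
A minimal sketch of the non-HA Redis step above, assuming the default argocd-server Deployment and the argocd-redis Service from the non-HA install manifests; only the relevant container args are shown, and any other existing args are omitted:

# Sketch only: point argocd-server at the non-HA redis service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-server
          command:
            - argocd-server
            - --redis
            - argocd-redis:6379   # non-HA redis service, as in the step above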

@caseyclarkjamf

caseyclarkjamf commented Oct 21, 2020

I think we're seeing something similar on 1.7.6. Refreshing is taking longer than it used to and App Diffs (either the diff for the entire app or individual resources) are not returning any results. When trying to view diffs the UI throws this error:

Unable to load data: cache: key is missing

And in argocd-server logs I see this message:

argocd-server-75b6f956c-2wljl server time="2020-10-21T16:35:46Z" level=error msg="finished unary call with code Unknown" error="cache: key is missing" grpc.code=Unknown grpc.method=ManagedResources grpc.service=application.ApplicationService grpc.start_time="2020-10-21T16:35:02Z" grpc.time_ms=43928.277 span.kind=server system=grpc

Edit - Restarting both the argocd-application-controller and argocd-server pods had no effect. Restarting the Redis pod is what ultimately fixed it for me.

@daufinsyd

Hello there,

We are facing the same issue: as stated by @stefanhenseler, in our case it was also an empty secret (one without actual data). Deleting it "solved" the issue.

Restarting the pods (controller, redis, server) didn't help.
We are running on OpenShift 4.4.30 with argocd v1.6.2+3d1f37b.

@jshin47

jshin47 commented Dec 9, 2020

Just another bump on this thread. I too have been experimenting with ArgoCD, and ran into this issue.

Running v1.8.0+fdb5ada , which is pretty fresh.

ed unary call with code Unknown" error="cache: key is missing" grpc.code=Unknown grpc.method=ManagedResources grpc.service=application.ApplicationService grpc.start_time="2020-12-09T01:52:50Z" grpc.time_ms=238.234 span.kind=server system=grpc

Restarting Redis had no effect.

For me, kubectl -n argocd delete pod argocd-application-controller-0 "fixed" the issue, but before I stake the future of my ops on ArgoCD I would love to understand why or how this happens.

@644755

644755 commented Apr 13, 2021

For me, it was the application controller running :latest while the rest of the components ran a fixed version. Removing "image: argoproj/argocd:latest" and "imagePullPolicy: Always" from the application controller manifest solved the issue.
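
A minimal sketch of that pinned-image fix, assuming the controller runs as the argocd-application-controller StatefulSet; the tag below is purely illustrative:

# Sketch only: pin the controller image to the same fixed tag as the other
# components instead of :latest. StatefulSet name and tag are illustrative.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          image: argoproj/argocd:v2.0.0   # example tag; match the rest of the install
          imagePullPolicy: IfNotPresent   # instead of Always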

@yujunz
Contributor

yujunz commented Aug 11, 2022

Still encountering this issue in v2.4.8:

time="2022-08-11T03:22:42Z" level=error msg="finished unary call with code Unknown" error="error getting cached app state: error getting application by query: application refresh de
adline exceeded" grpc.code=Unknown grpc.method=ResourceTree grpc.service=application.ApplicationService grpc.start_time="2022-08-11T03:19:42Z" grpc.time_ms=180002.36 span.kind=serve
r system=grpc

@rahul-mourya

rahul-mourya commented Oct 7, 2022

I am also facing this issue with v2.4.12. Restarting the application controller StatefulSet seems to resolve it, and I can recreate it consistently by triggering the cluster-cache invalidation sequence on the application controller replicas. One way to trigger it is to restart/delete one of the two deployed argocd-server pods (we run 2 argocd-server instances); that triggers cluster cache invalidation and reinitialization in all of the application controller instances (we run 3), and one of the three replicas then shows the hang/stuck-refresh issue. The applications handled by that problematic replica stay stuck on refresh indefinitely, with minimal logging, a drop in CPU usage (almost flat at zero), and constant memory thereafter.

I am not sure about the root cause, so I opened another issue, #10842, with all the details and logs.

@yujunz
Contributor

yujunz commented Oct 8, 2022

Following up on my earlier comment that I was still encountering this issue in v2.4.8:

The root cause in my case turned out to be an extra-large list of CRDs caused by a bug in cert-manager. Cluster cache initialization/refresh was blocked listing all of those resources.
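
Not something the comment above says was done, but as an illustration of keeping the cluster cache away from a problematic API group while it is cleaned up, argocd-cm supports resource.exclusions; the group name below is a placeholder:

# Sketch only: exclude a noisy/broken API group from cluster caching.
# The group is a placeholder, not taken from this thread.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.exclusions: |
    - apiGroups:
        - "problematic.example.com"
      kinds:
        - "*"
      clusters:
        - "*"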

@rahul-mourya

For me, it seems that a recent commit introduced an RLock() in controller/cache/cache.go that leads to a deadlock scenario.
The write lock is taken here:

c.lock.Lock()

Without releasing that write lock, the code then tries to acquire the read lock here:

c.lock.RLock()

Since that change is recent and this issue might be a different one (it was opened before that change), I have posted the detailed analysis with the goroutine stack under #10842 in a comment.

@decodingahmed

decodingahmed commented Jan 9, 2023

We are on v2.5.2+148d8da and we have also experienced this. Restarting the argocd-application-controller pod seems to have done the trick.

Please bear in mind that not everyone has access to the underlying infrastructure to restart the pod, so this could be a big issue for some and a smaller one for others.

I'll add this same comment on #10842 for completeness.

@cheskayang

Had the same issue when upgrading from 2.6.0 to 2.6.1; had to restart the argocd-application-controller.

@alexmt
Collaborator

alexmt commented Jul 25, 2023

Fixed by #13636

@alexmt closed this as completed on Jul 25, 2023
@alexandresavicki

Is anyone facing this issue with 2.7.9 or the v2.8.0-rc5 pre-release?
I only see this problem when we run with the sidecar plugin; once we remove it, the problem is gone.

@crenshaw-dev
Member

@alexandresavicki could you open a new issue with full details?

@gruberdev

gruberdev commented Sep 3, 2023

Just happened to face the same issue as well, related to upgrading from v2.7.9 to v2.8.0.

One of the Helm charts had an invalid reference, and the self-managed ArgoCD controller got stuck in a loop: even when all pods are recreated, it stays in the same state.

time="2023-09-03T17:54:20Z" level=error msg="Failed to cache app resources: error getting resource tree: failed to get namespace top-level resources: error synchronizing cache state : failed to sync cluster https://10.43.0.1:443: failed to load initial state of resource Redis.redis.redis.opstreelabs.in: conversion webhook for redis.redis.opstreelabs.in/v1beta1, Kind=Redis failed

Deleting the invalid CRD and manually recreating both the operator resources and the CRD itself does not work. If anybody has a debugging suggestion other than restoring an etcd backup, it would be very helpful.
