
ArgoCD stuck on "Refresh" #4044

Closed
razvanonet opened this issue Aug 4, 2020 · 25 comments
Labels
bug/severity:major (Malfunction in one of the core component, impacting a majority of users), bug (Something isn't working)

Comments

@razvanonet

Hi! We are using ArgoCD v1.5.4+36bade7 to orchestrate our applications in our EKS Kubernetes cluster. We tried editing the argocd-server ConfigMap to make it ignore differences in Deployment replicas, following the diffing documentation (https://argoproj.github.io/argo-cd/user-guide/diffing/). We changed the ConfigMap using:

data:
  resource.customizations: |
    apps/Deployment:
      ignoreDifferences: |
        jsonPointers:
        - /spec/replicas

After I restarted the deployment with kubectl rollout restart deployment argocd-server -n argocd it worked, but it wasn't what we were expecting, so we removed resource.customizations from the ConfigMap and redeployed argocd-server again. Now, after the redeployment, every application is stuck on refresh, and I cannot see pods in the UI (see photo) even though they are present in the cluster and visible with kubectl get pods. Is this something to do with restarting the deployment? I can still see the GitHub repositories where we keep the charts, but maybe we lost the connection to the cluster?
[Screenshot: 2020-08-04 at 13:33:58]
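
For reference, the diffing documentation linked above describes this system-level customization as living in the argocd-cm ConfigMap. A minimal sketch of the full ConfigMap, assuming the default argocd-cm name in the argocd namespace (the wrapper below is not part of the original report):

# Sketch only: system-level diff customization per the diffing docs.
# Assumes the default argocd-cm ConfigMap in the argocd namespace.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations: |
    apps/Deployment:
      ignoreDifferences: |
        jsonPointers:
        - /spec/replicas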

@jessesuen added the bug/severity:major, more-information-needed, and bug labels on Aug 4, 2020
@jessesuen
Member

Could you provide logs of the controller when it's in this state?

@razvanonet
Author

here are the logs from argocd-application-controller:

W0804 18:34:02.314991       1 reflector.go:299] github.com/argoproj/argo-cd/controller/appcontroller.go:414: watch of *v1alpha1.AppProject ended with: too old resource version: 22031003 (22078637)
time="2020-08-04T18:41:13Z" level=info msg="Alloc=14421 TotalAlloc=1125656414 Sys=764540 NumGC=111906 Goroutines=279"
time="2020-08-04T18:51:13Z" level=info msg="Alloc=15906 TotalAlloc=1125676576 Sys=764540 NumGC=111911 Goroutines=279"
time="2020-08-04T19:01:13Z" level=info msg="Alloc=21287 TotalAlloc=1125695928 Sys=764540 NumGC=111916 Goroutines=279"
W0804 19:09:36.307976       1 reflector.go:299] github.com/argoproj/argo-cd/util/settings/settings.go:645: watch of *v1.Secret ended with: too old resource version: 22021300 (22085836)
time="2020-08-04T19:11:13Z" level=info msg="Alloc=14540 TotalAlloc=1125709011 Sys=764540 NumGC=111921 Goroutines=279"
time="2020-08-04T19:21:13Z" level=info msg="Alloc=14424 TotalAlloc=1125723510 Sys=764540 NumGC=111926 Goroutines=279"
time="2020-08-04T19:31:13Z" level=info msg="Alloc=14411 TotalAlloc=1125741860 Sys=764540 NumGC=111931 Goroutines=279"
time="2020-08-04T19:41:13Z" level=info msg="Alloc=17281 TotalAlloc=1125764512 Sys=764540 NumGC=111936 Goroutines=279"
time="2020-08-04T19:51:13Z" level=info msg="Alloc=14419 TotalAlloc=1125780779 Sys=764540 NumGC=111941 Goroutines=279"
time="2020-08-04T20:01:13Z" level=info msg="Alloc=14325 TotalAlloc=1125800102 Sys=764540 NumGC=111946 Goroutines=279"
time="2020-08-04T20:11:13Z" level=info msg="Alloc=14327 TotalAlloc=1125814483 Sys=764540 NumGC=111951 Goroutines=279"
time="2020-08-04T20:21:13Z" level=info msg="Alloc=17796 TotalAlloc=1125830563 Sys=764540 NumGC=111956 Goroutines=279"
time="2020-08-04T20:31:13Z" level=info msg="Alloc=14277 TotalAlloc=1125843125 Sys=764540 NumGC=111961 Goroutines=279"
time="2020-08-04T20:41:13Z" level=info msg="Alloc=14365 TotalAlloc=1125856453 Sys=764540 NumGC=111966 Goroutines=279"
time="2020-08-04T20:51:13Z" level=info msg="Alloc=14274 TotalAlloc=1125871398 Sys=764540 NumGC=111971 Goroutines=279"

and here are the logs from argocd-server:

time="2020-08-04T20:54:53Z" level=info msg="Requested app 'grafana' refresh"
time="2020-08-04T20:55:01Z" level=info msg="client watch grpc context closed"
time="2020-08-04T20:55:01Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=Watch grpc.service=application.ApplicationService grpc.start_time="2020-08-04T20:54:01Z" grpc.time_ms=60000.727 span.kind=server system=grpc
time="2020-08-04T20:55:01Z" level=info msg="k8s application watch event channel closed"

Also, I forgot to mention that when I try to create an app, its state shows as Unknown (see picture):
[Screenshot: 2020-08-04 at 23:57:06]

The no-response bot removed the more-information-needed label on Aug 4, 2020
@razvanonet
Author

We've restarted the argocd-application-controller pod and the problem now seems to be fixed. Applications can be created and are no longer stuck on refresh. Please let me know if you want to investigate this issue any further; I can offer additional info!

@imrenagi
Contributor

imrenagi commented Aug 6, 2020

Yep, this also happened to me. I've fixed it multiple times by restarting the argocd-application-controller.

@stefanhenseler

stefanhenseler commented Aug 17, 2020

We see the same behavior with 1.5.4 and 1.6.2. The application keeps "refreshing" indefinitely (every second), and we don't see the ReplicaSets and Pods of the Deployments even though they are present in the cluster. If we delete the application, the ReplicaSets and Pods appear until everything is deleted. The Deployments also show no info, and we get "Unable to load data: cache: key is missing" when we click on them. We tried deleting the application and resyncing; the issue reappears after a while and is sporadic. Not all apps are affected. We are running ArgoCD 1.5.4 and 1.6.2 on OpenShift 3.11 / Kubernetes 1.11 clusters. Restarting redis, argocd-server, or the controller doesn't help, and there are no errors in any of the logs.

@igaskin
Member

igaskin commented Aug 20, 2020

I was able to get into this state during an upgrade from 1.5.1 to 1.6.2; specifically, updating the argocd-server deployment is what triggered the bad state. I believe the load was enough to knock over the application controller.

argocd-server

argocd-server-58f6bb7cf-hb8qp argocd-server time="2020-08-20T00:22:09Z" level=error msg="finished unary call with code Unknown" error="cache: key is missing" grpc.code=Unknown grpc.method=ResourceTree grpc.service=application.ApplicationService grpc.start_time="2020-08-20T00:22:02Z" grpc.time_ms=6992.517 span.kind=server system=grpc

argocd-application-controller

  Warning  Unhealthy  16m (x2 over 20m)  kubelet, sjc04p1kubhv45  Liveness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  16m (x4 over 25m)  kubelet, sjc04p1kubhv45  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  16m (x5 over 25m)  kubelet, sjc04p1kubhv45  Liveness probe failed: Get http://10.233.84.229:8082/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  13m (x3 over 25m)  kubelet, sjc04p1kubhv45  Readiness probe failed: Get http://10.233.84.229:8082/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

error from UI

"https://10.88.0.1:443/apis/argoproj.io/v1alpha1/namespaces/argocd/applications/foobar-baz": dial tcp 10.88.0.1:443: connect: connection refused

@stefanhenseler

stefanhenseler commented Aug 20, 2020

We figured out what was causing the issue in our case. We use a secret generator operator to generate randomized secrets, and in our kustomizations there are some Secrets with an empty data key (like this: data: {}). In that case, OpenShift 3.11 (K8s v1.11) returns a null value when the resource is applied, which seems to cause the behavior. We were able to clearly repro and narrow the issue down to this, and it only seems to be a problem on older versions of K8s. As a workaround, we just had to add a dummy key to our empty Secrets. This isn't an issue for us because we ignore differences on the data property anyway (due to the secret generator we use).
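
A minimal sketch of that dummy-key workaround, with a hypothetical Secret name (none of the names below come from the thread):

# Hypothetical Secret illustrating the workaround: a dummy key keeps `data`
# from being empty, so the API server never returns data: null on apply.
apiVersion: v1
kind: Secret
metadata:
  name: generated-credentials   # placeholder; real values come from the secret generator
type: Opaque
data:
  placeholder: cGxhY2Vob2xkZXI=   # base64 of "placeholder"; diffs on data are ignored anyway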

@pinkplus

This happened to me, too.

I'm setting up a clean k8s cluster (1.17.9) on AKS and a clean ArgoCD v1.6.2+3d1f37b.

The application is cert-manager, described by a simple kustomization file:

resources:
  - https://github.com/jetstack/cert-manager/releases/download/v0.16.1/cert-manager.yaml

The application is set up with

  ignoreDifferences:
    - group: apiextensions.k8s.io
      jsonPointers:
        - /status
      kind: CustomResourceDefinition

After the initial sync, the application appears synced with no problem, and ArgoCD also reports the last sync operation as successful. However, health remains Missing and the sync status remains OutOfSync. If I click refresh, it gets stuck in the refreshing state.
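
For context, a minimal sketch of a full Application manifest combining that kustomization source with the ignoreDifferences block above; the repo URL, path, and project are placeholders, not taken from the comment:

# Hypothetical Application wiring the kustomization and ignoreDifferences together.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/my-org/gitops.git   # placeholder repo
    targetRevision: HEAD
    path: cert-manager                               # placeholder path to the kustomization
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager
  ignoreDifferences:
    - group: apiextensions.k8s.io
      kind: CustomResourceDefinition
      jsonPointers:
        - /status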

@stefanhenseler

stefanhenseler commented Aug 25, 2020

(Quoting my earlier comment from Aug 20 about the empty-secret workaround.)

We looked into this a bit more. We see the same behavior on Kubernetes v1.15, v1.16, and v1.17 using ArgoCD v1.5 and v1.6.

@igaskin
Member

igaskin commented Aug 27, 2020

Tried upgrading again, this time from 1.5.1 to 1.7.1, and experienced the same behavior. I am upgrading one component at a time, and everything works until I sync the argocd-server deployment, at which point all applications get stuck in a refresh loop. In the UI this is surfaced as a "Failed to load resource: the server responded with a status of 500 ()" error, specifically for the /api/v1/applications/<app-name>/resource-tree endpoint. Attached are some screenshots, including graphs showing an increase in goroutines. Restarting the application-controller does not resolve the refresh loop.

[Screenshots from 2020-08-26, including the goroutine graphs]

@igaskin
Member

igaskin commented Sep 1, 2020

1.7.2 appears to have fixed this refresh loop for me, but not without some hand-holding. I'm going to keep a close eye on it. This is what worked for me:

  • upgrade all components to 1.7.2 (saving argocd-server for last)
  • the system is now stuck in a "refresh" state
  • delete the argocd-application-controller
  • encountered Unable to load data: key is missing error
    • edit argocd-server deployment to use non-ha redis; --redis argocd-redis:6379
  • delete the argocd-application-controller
  • the system is no longer in a "refresh" state

It's not clear to me which of these steps actually solved the problem.
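
A minimal sketch of the non-HA Redis step above, assuming the default argocd-server Deployment and the argocd-redis Service from the non-HA install manifests; only the relevant container args are shown, and any other existing args are omitted:

# Sketch only: point argocd-server at the non-HA redis service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-server
          command:
            - argocd-server
            - --redis
            - argocd-redis:6379   # non-HA redis service, as in the step above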

@caseyclarkjamf

caseyclarkjamf commented Oct 21, 2020

I think we're seeing something similar on 1.7.6. Refreshing is taking longer than it used to and App Diffs (either the diff for the entire app or individual resources) are not returning any results. When trying to view diffs the UI throws this error:

Unable to load data: cache: key is missing

And in argocd-server logs I see this message:

argocd-server-75b6f956c-2wljl server time="2020-10-21T16:35:46Z" level=error msg="finished unary call with code Unknown" error="cache: key is missing" grpc.code=Unknown grpc.method=ManagedResources grpc.service=application.ApplicationService grpc.start_time="2020-10-21T16:35:02Z" grpc.time_ms=43928.277 span.kind=server system=grpc

Edit - Restarting both the argocd-application-controller and argocd-server pods had no effect. Restarting the Redis pod is what ultimately fixed it for me.

@daufinsyd

Hello there,

We are facing the same issue: as stated by @stefanhenseler, in our case it was also an empty secret (one without actual data). Deleting it "solved" the issue.

Restarting the pods (controller, redis, server) didn't help.
We are running on OpenShift 4.4.30 with argocd v1.6.2+3d1f37b.

@jshin47

jshin47 commented Dec 9, 2020

Just another bump on this thread. I too have been experimenting with ArgoCD, and ran into this issue.

Running v1.8.0+fdb5ada , which is pretty fresh.

ed unary call with code Unknown" error="cache: key is missing" grpc.code=Unknown grpc.method=ManagedResources grpc.service=application.ApplicationService grpc.start_time="2020-12-09T01:52:50Z" grpc.time_ms=238.234 span.kind=server system=grpc

Restarting Redis had no effect.

For me, kubectl -n argocd delete pod argocd-application-controller-0 "fixed" the issue, but before I stake the future of my ops on ArgoCD I would love to understand why or how this happens.

@644755

644755 commented Apr 13, 2021

For me, it was the application controller running :latest while the rest of the components ran a fixed version. Removing "image: argoproj/argocd:latest" and "imagePullPolicy: Always" from the application controller manifest solved the issue.
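
A minimal sketch of that pinned-image fix, assuming the controller runs as the argocd-application-controller StatefulSet; the tag below is purely illustrative:

# Sketch only: pin the controller image to the same fixed tag as the other
# components instead of :latest. StatefulSet name and tag are illustrative.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          image: argoproj/argocd:v2.0.0   # example tag; match the rest of the install
          imagePullPolicy: IfNotPresent   # instead of Always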

@yujunz
Contributor

yujunz commented Aug 11, 2022

Still encountering this issue in v2.4.8:

time="2022-08-11T03:22:42Z" level=error msg="finished unary call with code Unknown" error="error getting cached app state: error getting application by query: application refresh de
adline exceeded" grpc.code=Unknown grpc.method=ResourceTree grpc.service=application.ApplicationService grpc.start_time="2022-08-11T03:19:42Z" grpc.time_ms=180002.36 span.kind=serve
r system=grpc

@rahul-mourya

rahul-mourya commented Oct 7, 2022

I am also facing this issue with v2.4.12. Restarting the application controller StatefulSet seems to resolve it, and I can recreate it consistently by triggering the cluster-cache invalidation sequence on the application controller replicas. One way to trigger it is to restart/delete one of the two deployed argocd-server pods (we run 2 argocd-server instances); that triggers cluster cache invalidation and reinitialization in all of the application controller instances (we run 3), and one of the three replicas then shows the hang/stuck-refresh issue. The applications handled by that problematic replica stay stuck on refresh indefinitely, with minimal logging, a drop in CPU usage (almost flat at zero), and constant memory thereafter.

I am not sure about the root cause, so I opened another issue, #10842, with all the details and logs.

@yujunz
Contributor

yujunz commented Oct 8, 2022

Following up on my earlier comment that I was still encountering this issue in v2.4.8:

The root cause in my case turned out to be an extra-large list of CRDs caused by a bug in cert-manager. Cluster cache initialization/refresh was blocked listing all of those resources.
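
Not something the comment above says was done, but as an illustration of keeping the cluster cache away from a problematic API group while it is cleaned up, argocd-cm supports resource.exclusions; the group name below is a placeholder:

# Sketch only: exclude a noisy/broken API group from cluster caching.
# The group is a placeholder, not taken from this thread.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.exclusions: |
    - apiGroups:
        - "problematic.example.com"
      kinds:
        - "*"
      clusters:
        - "*"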

@rahul-mourya

For me, it seems that a recent commit introduced an RLock() in controller/cache/cache.go that leads to a deadlock scenario.
The write lock is taken here:

c.lock.Lock()

Without releasing that write lock, the code then tries to acquire the read lock here:

c.lock.RLock()

Since that change is recent and this issue might be a different one (it was opened before that change), I have posted the detailed analysis with the goroutine stack under #10842 in a comment.

@decodingahmed

decodingahmed commented Jan 9, 2023

We are on v2.5.2+148d8da and we have also experienced this. Restarting the argocd-application-controller pod seems to have done the trick.

Please bear in mind that not everyone has access to the underlying infrastructure to restart the pod, so this could be a big issue for some and a smaller one for others.

I'll add this same comment on #10842 for completeness.

@cheskayang

Had the same issue when upgrading from 2.6.0 to 2.6.1; had to restart the argocd-application-controller.

@alexmt
Collaborator

alexmt commented Jul 25, 2023

Fixed by #13636

@alexmt closed this as completed on Jul 25, 2023
@alexandresavicki

Is anyone facing this issue with 2.7.9 or the v2.8.0-rc5 pre-release?
I only see this problem when we run with the sidecar plugin; once we remove it, the problem is gone.

@crenshaw-dev
Member

@alexandresavicki could you open a new issue with full details?

@gruberdev

gruberdev commented Sep 3, 2023

Just happened to face the same issue as well, related to upgrading from v2.7.9 to v2.8.0.

One of the Helm charts had an invalid reference, and the self-managed ArgoCD controller got stuck in a loop: even when all pods are recreated, it stays in the same state.

time="2023-09-03T17:54:20Z" level=error msg="Failed to cache app resources: error getting resource tree: failed to get namespace top-level resources: error synchronizing cache state : failed to sync cluster https://10.43.0.1:443: failed to load initial state of resource Redis.redis.redis.opstreelabs.in: conversion webhook for redis.redis.opstreelabs.in/v1beta1, Kind=Redis failed

Deleting the invalid CRD and manually recreating both the operator resources and the CRD itself does not work. If anybody has a debugging suggestion other than restoring an etcd backup, it would be very helpful.
