Argo UI shows old pods as healthy #9226

Open
0dragosh opened this issue Apr 27, 2022 · 29 comments
Labels
bug Something isn't working component:core Syncing, diffing, cluster state cache

Comments

@0dragosh

0dragosh commented Apr 27, 2022

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

The Argo CD UI is showing old pods (that no longer exist) from old ReplicaSets as healthy. When you open the details for one of those pods, the view is empty, and deleting it errors out because the pod hasn't existed in a while.

It also shows the new ReplicaSet with its new pods correctly, in parallel with the old pods.
I've tried a hard refresh to no avail. We're running Argo CD in HA mode.

To Reproduce

Deploy a new revision, creating a new replicaset.

Expected behavior

Old pods that don't exist in the Kubernetes API should not show up in the UI.

Version

argocd: v2.3.3+07ac038.dirty
  BuildDate: 2022-03-30T05:14:36Z
  GitCommit: 07ac038a8f97a93b401e824550f0505400a8c84e
  GitTreeState: dirty
  GoVersion: go1.18
  Compiler: gc
  Platform: darwin/arm64
argocd-server: v2.3.3+07ac038
  BuildDate: 2022-03-30T00:06:18Z
  GitCommit: 07ac038a8f97a93b401e824550f0505400a8c84e
  GitTreeState: clean
  GoVersion: go1.17.6
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: v4.4.1 2021-11-11T23:36:27Z
  Helm Version: v3.8.0+gd141386
  Kubectl Version: v0.23.1
  Jsonnet Version: v0.18.0

I suspect it's got something to do with improper cache invalidation on the Redis side.

@0dragosh
Author

0dragosh commented May 4, 2022

[Screenshot: old_pod]

Adding a screenshot to better show what I mean, in case it's not clear.

@crenshaw-dev
Member

Looked at this today with @0dragosh. The bug only occurs on instances using Azure AKS's instance start/stop feature. We cleared the resource-tree key for this app in Redis, and the controller re-populated the resource tree with the phantom Pods. We confirmed that kubectl get po shows the Pods as no longer existing on the cluster.

So this looks like a bug in how the Argo CD controller populates the resource tree.
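
For reference, this is roughly how we inspected and cleared the cached tree. A minimal sketch assuming the default non-HA argocd-redis service and that the key name contains "resource-tree"; the exact Redis key layout differs between Argo CD versions and HA installs, so treat it as illustrative:

kubectl -n argocd port-forward svc/argocd-redis 6379:6379 &
# find the app's cached resource-tree key(s); add -a <password> if your install has Redis auth enabled
redis-cli -p 6379 --scan --pattern '*resource*tree*'
# delete the key; the application controller repopulates it on the next reconciliation
redis-cli -p 6379 del '<key-from-previous-command>'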

@crenshaw-dev added the component:core (Syncing, diffing, cluster state cache) label on May 10, 2022
@OmerKahani
Contributor

We had the same thing happen in a very old GKE cluster (it was created a long time ago but has been updated to 1.20). My thought was that Argo CD might use a different API than kubectl get does.

@0dragosh could it be related to an old cluster?

@crenshaw-dev
Member

@alexmt did you have any luck reproducing this?

@rs-cole

rs-cole commented Jul 7, 2022

Running into this as well; each sync creates more and more. revisionHistoryLimit is set to 3, but I currently have 23 ReplicaSets and counting sitting around.

@crenshaw-dev
Member

@rs-cole as a workaround, we found that a Force Refresh cleared the old pods.
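
If you'd rather not click through the UI, a sketch of the same refresh from the CLI (the app name is a placeholder):

# a hard refresh also invalidates the cached target manifests for the app
argocd app get <app-name> --hard-refresh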

@0dragosh
Author

0dragosh commented Jul 8, 2022

@rs-cole as a workaround, we found that a Force Refresh cleared the old pods.

Plus a restart of the argocd application controller.
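
For anyone looking for the exact command, a sketch assuming the default statefulset name and the argocd namespace (adjust for HA installs running multiple controller replicas):

kubectl -n argocd rollout restart statefulset argocd-application-controller
# wait for the controller to come back up
kubectl -n argocd rollout status statefulset argocd-application-controller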

@rs-cole

rs-cole commented Jul 8, 2022

Hitting Refresh -> Hard Refresh and restarting the application controller didn't seem to do the trick. Old, non-existent ReplicaSets continued to grow; the application's revisionHistoryLimit is set to 3. Perhaps I'm using Argo CD incorrectly?

Disregard, my reading comprehension is terrible. Fixed my issue with: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#clean-up-policy
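
For anyone else who lands here: the clean-up policy in that link is the Deployment's own spec.revisionHistoryLimit (the Argo CD Application spec has a field of the same name, but that one controls sync history, not old ReplicaSets). A sketch, with placeholder names and an example value:

kubectl -n <namespace> patch deployment <deployment> --type merge \
  -p '{"spec":{"revisionHistoryLimit":3}}'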

@ilyastoliar

Looked at this today with @0dragosh. The bug only occurs on instances using Azure AKS's instance start/stop feature. We cleared the resource-tree key for this app in Redis, and the controller re-populated the resource tree with the phantom Pods. We confirmed that kubectl get po shows the Pods as no longer existing on the cluster.

So this looks like a bug in how the Argo CD controller populates the resource tree.

We are experiencing it with EKS as well

@crenshaw-dev
Member

Just encountered it at Intuit for (afaik) the first time. A force refresh and controller restart cleared the error.

Argo CD version: v2.4.6
Kubernetes provider: EKS
Kubernetes version: v1.21.12-eks-a64ea69

@yoshigev

We encountered a similar issue with AKS.

We have a CronJob that creates Job + Pod resources. Those resources are automatically deleted due to either ttlSecondsAfterFinished, successfulJobsHistoryLimit or failedJobsHistoryLimit configuration of the CronJob.
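
(For reference, a sketch of where those clean-up fields live on a CronJob spec; the name, namespace, and values here are placeholders, not our actual config:)

kubectl -n <namespace> patch cronjob <cronjob> --type merge \
  -p '{"spec":{"successfulJobsHistoryLimit":3,"failedJobsHistoryLimit":1,"jobTemplate":{"spec":{"ttlSecondsAfterFinished":3600}}}}'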

Quite often, ArgoCD continues displaying those deleted resources for months after they were deleted, both for failed and for successful job executions.

Analysis of the API calls of ArgoCD (from the browser's log):

  • The phantom resources appear on the resource-tree.
  • On call to managed-resources API with the resource name, the response is an empty JSON.
  • On call to resource API with the resource name, the response is 404.

Regarding the Hard Refresh - is it possible to make it work without restarting the controller?

@joseaio

joseaio commented Oct 26, 2022

Same problem here with a start/stop AKS cluster.

Is there a workaround that doesn't require an Argo CD controller restart (via the API or similar) to remove the old/ghost nodes?

@0dragosh
Author

@joseaio I could not figure out another workaround

@crenshaw-dev
Member

Again on 2.5.7.

@h4tt3n

h4tt3n commented Mar 8, 2023

We are seeing the exact same behavior. Argo CD shows ghost resources that were deleted days or weeks ago, and the UI does not expose any means of removing them (delete obviously doesn't work).

@CPlommer

CPlommer commented Mar 13, 2023

We're encountering this as well, specifically with cronjobs.

Argo CD version: v2.5.10
Kubernetes provider: GKE
Kubernetes server version: v1.24.9-gke.3200

The workaround to stop the old pods from surfacing in ArgoCD is, as stated earlier in this thread:

  1. Force Refresh to clear the old pods
  2. Restart the argocd application controller

Sorry for the silly questions, but I'm coming up empty googling ....
Is Force Refresh the same as Hard Refresh, via the ArgoCD UI?
How do you restart the argocd application controller? https://argo-cd.readthedocs.io/en/stable/operator-manual/server-commands/argocd-application-controller/ doesn't say....

@mikesmitty

@crenshaw-dev we're seeing this on a daily basis and would prefer not to resort to cycling the app controller every day. Is there any info we can provide to help diagnose the cause?

@crenshaw-dev
Member

Is Force Refresh the same as Hard Refresh, via the ArgoCD UI?

@CPlommer yep, they're the same!

@mikesmitty honestly, I'm stumped. I mean I know that the app controller is filling Redis with "resource tree" values that include ghost resources. But I have no clue where to start figuring out why the app controller thinks they still exist.

When the app controller starts up, it launches a bunch of watches to keep track of what's happening on the destination cluster (like pods being deleted). So I think either the app controller isn't correctly handling the updates, or k8s isn't correctly sending them. I tend to think it's the former.

And it's possible that the app controller is logging those failures, but I'm not sure what kind of messages to look for. I'd have to start with "any error or warn messages" and see if anything looks suspicious.
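
Something like the following is what I have in mind; a rough sketch assuming the default statefulset name and namespace (the exact log format depends on the controller's --logformat setting):

kubectl -n argocd logs statefulset/argocd-application-controller --since=24h \
  | grep -iE 'level=(error|warning)|"level":"(error|warning)"' \
  | grep -iE 'watch|cluster'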

@mikesmitty

Now that you mention it, our issue is caused by a few particularly large apps that somewhat routinely cause ArgoCD to get throttled by Anthos Connect Gateway when syncing the app to remote clusters. The connect gateway api is essentially a reverse proxy for the control plane that, for better or worse, uses control plane errors to throttle (e.g. "the server has received too many requests and has asked us to try again later" or "the server rejected our request for an unknown reason"). The quota on these clusters is pretty high so the throttling has largely been only a minor nuisance when devs sync a large number of apps at once. Looking for errors around watches in the app controller logs, I did find these messages however:

{"error":"the server rejected our request for an unknown reason", "level":"error", "msg":"Failed to start missing watch", "server":"https://connectgateway.googleapis.com/..."}
{"error":"error getting openapi resources: the server rejected our request for an unknown reason", "level":"error", "msg":"Failed to reload open api schema", "server":"https://connectgateway.googleapis.com/..."}
"Watch failed" err="the server rejected our request for an unknown reason"
sourceLocation: {
    file: "retrywatcher.go"
    line: "130"
}
{"error":"the server has received too many requests and has asked us to try again later", "level":"error", "msg":"Failed to start missing watch", "server":"https://connectgateway.googleapis.com/..."}
{"error":"error getting openapi resources: the server has received too many requests and has asked us to try again later", "level":"error", "msg":"Failed to reload open api schema", "server":"https://connectgateway.googleapis.com/..."}
"Watch failed" err="the server has received too many requests and has asked us to try again later"
sourceLocation: {
    file: "retrywatcher.go"
    line: "130"
}

If you need more verbose logs let me know. I'd prefer to not turn on debug logging on this ArgoCD instance due to volume, but I think I might be able to artificially induce the errors on a test cluster.

While doing some searching around I found this issue that seems like it could be tangentially related as well: #9339

@crenshaw-dev
Member

Latest theory: we miss watch events, and Argo CD goes on blissfully unaware that the resources have been deleted.

I still don't know why the 24hr full-resync doesn't clear the old resources. https://github.com/argoproj/gitops-engine/blob/ed70eac8b7bd6b2f276502398fdbccccab5d189a/pkg/cache/cluster.go#L712

@crenshaw-dev
Member

@ashutosh16 do you have a link to the branch where you modified the code to reproduce this?

@nipun-groww

(quoting @mikesmitty's earlier comment about Connect Gateway throttling and the "Failed to start missing watch" / "Watch failed" errors)

Hi, we are using kube-oidc-proxy and facing a somewhat similar issue. In our case, we see a dip in argocd_app_reconcile_count, and some of the resources (Pods, ReplicaSets) in each application are not shown in the Argo CD UI even though they are present in the cluster; sometimes it shows older data.

Whenever there is a dip in the reconcile count, we have found the below logs in the OIDC proxy pod:
E0613 16:02:42.441439 1 proxy.go:215] unable to authenticate the request via TokenReview due to an error:() rate: Wait(n=1) would exceed context deadline

At the same time, the ArgoCD application-controller throws the below error:
{ "error": "unable to retrieve the complete list of server APIs: apm.k8s.elastic.co/v1: the server has asked for the client to provide credentials, apm.k8s.elastic.co/v1beta1: the server has asked for the client to provide credentials, auto.gke.io/v1: the server has asked for the client to provide credentials, auto.gke.io/v1alpha1: the server has asked for the client to provide credentials, billingbudgets.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, binaryauthorization.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, cloudbuild.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, cloudfunctions.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, cloudscheduler.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, configcontroller.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, core.strimzi.io/v1beta2: the server has asked for the client to provide credentials, custom.metrics.k8s.io/v1beta1: the server has asked for the client to provide credentials, external.metrics.k8s.io/v1beta1: the server has asked for the client to provide credentials, k8s.nginx.org/v1: the server has asked for the client to provide credentials, kafka.strimzi.io/v1alpha1: the server has asked for the client to provide credentials, kafka.strimzi.io/v1beta1: the server has asked for the client to provide credentials, kafka.strimzi.io/v1beta2: the server has asked for the client to provide credentials, keda.sh/v1alpha1: the server has asked for the client to provide credentials, kiali.io/v1alpha1: the server has asked for the client to provide credentials, kibana.k8s.elastic.co/v1: the server has asked for the client to provide credentials, kms.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, logging.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, monitoring.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, monitoring.coreos.com/v1alpha1: the server has asked for the client to provide credentials, networkconnectivity.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, networking.istio.io/v1alpha3: the server has asked for the client to provide credentials, networking.istio.io/v1beta1: the server has asked for the client to provide credentials, networkservices.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, osconfig.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, recaptchaenterprise.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, servicenetworking.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, serviceusage.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, sourcerepo.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, spanner.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, storage.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, storagetransfer.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, vpcaccess.cnrm.cloud.google.com/v1beta1: the server has 
asked for the client to provide credentials, wgpolicyk8s.io/v1alpha2: the server has asked for the client to provide credentials", "level": "error", "msg": "Partial success when performing preferred resource discovery", "server": "https://cluster-1.data.int, "time": "2023-06-13T16:02:53Z" }
Attaching a graph of argocd_app_k8s_request_total: [image]

@mkilchhofer
Member

I saw this issue today for the first time in our setup at @swisspost, on an Aqua Security DaemonSet pod.
There was a "ghost" Pod which I could not remove.

The only thing that helped was to run

$ kubectl rollout restart sts/argocd-application-controller
statefulset.apps/argocd-application-controller restarted

After this the Pod was gone.

Before restarting the application-controller I tried:

  • refresh the app
  • hard refresh the app
  • invalidate cache of the cluster (Settings > Clusters > in-cluster > Button "Invalidate cache")

We are currently on version "v2.8.2+dbdfc71".

@dtrouillet

Hi,

For information, we faced this issue too, on Argo CD v2.10.7+b060053.

@RoyerRamirez

Hi,

We're also seeing this same issue. It's only occurring on deployments that contain a significant number of pods (1k plus). Our other, smaller deployments are fine.

@jmmclean

jmmclean commented Sep 10, 2024

Issue on EKS with Argo CD v2.11.3: the pods shown by kubectl get pods are different from what Argo CD displays. I snooped the EventSource response and saw that it's populated with old, stale data (ghost pods). A hard refresh took an extremely long time and did not help.

@razbomi

razbomi commented Oct 24, 2024

Issue on EKS 1.31 with ArgoCD v2.12.4+27d1e64.

The Pod is not on the cluster when inspected with kubectl.
Argo shows the pods in a Progressing state with a pending deletion reason.

The only way to get rid of them in the Argo UI is to delete the offending ReplicaSet.

@NicholasRaymondiSpot

NicholasRaymondiSpot commented Oct 31, 2024

We're seeing similar issues with Argo CD 2.12.3 and 2.13.0 deployed to our Kubernetes 1.30 clusters. It seems to happen in our noisy clusters that do a lot of scaling: we'll find Argo showing a number of pods as Healthy or in another state even though the pods no longer exist. The screenshots show kubectl listing the actual pods vs. Argo showing "ghost" pods for the aws-node DaemonSet. When you attempt to open any of these pods from the UI, you see the error in the second screenshot and no manifest is shown.

[Screenshots: kubectl pod list vs. Argo CD UI showing ghost aws-node pods, 2024-10-31]

Deleting the old ReplicaSets didn't help, and running kubectl rollout restart sts -n argocd argo-cd-argocd-application-controller like @mkilchhofer suggested didn't seem to help at first either, but refreshing the application view in the UI after the pod came back up cleaned up all of the "ghost" resources.

@daftping
Contributor

We encountered the same issue in version 2.12.3 on a large self-hosted Kubernetes cluster (v1.27). Restarting the application controller did not help. However, adding a dummy annotation to the parent ReplicaSet refreshed the state:
kubectl annotate replicaset <name> foo=bar
