Argo UI shows old pods as healthy #9226

Open
0dragosh opened this issue Apr 27, 2022 · 29 comments
Labels
bug Something isn't working component:core Syncing, diffing, cluster state cache

Comments

@0dragosh

0dragosh commented Apr 27, 2022

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

The Argo CD UI is showing old pods (that no longer exist) from old ReplicaSets as healthy. When you open the details for one of those pods, the view is empty, and deleting it errors out because the pod hasn't existed in a while.

It also shows the new ReplicaSet with its new pods correctly, in parallel with the old pods.
I've tried a hard refresh to no avail. We're running Argo CD in HA mode.

To Reproduce

Deploy a new revision, creating a new replicaset.

Expected behavior

Old pods that don't exist in the Kubernetes API should not show up in the UI.

Version

argocd: v2.3.3+07ac038.dirty
  BuildDate: 2022-03-30T05:14:36Z
  GitCommit: 07ac038a8f97a93b401e824550f0505400a8c84e
  GitTreeState: dirty
  GoVersion: go1.18
  Compiler: gc
  Platform: darwin/arm64
argocd-server: v2.3.3+07ac038
  BuildDate: 2022-03-30T00:06:18Z
  GitCommit: 07ac038a8f97a93b401e824550f0505400a8c84e
  GitTreeState: clean
  GoVersion: go1.17.6
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: v4.4.1 2021-11-11T23:36:27Z
  Helm Version: v3.8.0+gd141386
  Kubectl Version: v0.23.1
  Jsonnet Version: v0.18.0

I suspect it's got something to do with improper cache invalidation on the Redis side.

@0dragosh
Author

0dragosh commented May 4, 2022

[Screenshot: old_pod]

Adding a screenshot to better show what I mean, in case it's not clear.

@crenshaw-dev
Member

Looked at this today with @0dragosh. The bug only occurs on instances using Azure AKS's instance start/stop feature. We cleared the resource-tree key for this app in Redis, and the controller re-populated the resource tree with the phantom Pods. We confirmed that kubectl get po shows the Pods as no longer existing on the cluster.

So this looks like a bug in how the Argo CD controller populates the resource tree.
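
For reference, this is roughly how we inspected and cleared the cached tree. A minimal sketch assuming the default non-HA argocd-redis service and that the key name contains "resource-tree"; the exact Redis key layout differs between Argo CD versions and HA installs, so treat it as illustrative:

kubectl -n argocd port-forward svc/argocd-redis 6379:6379 &
# find the app's cached resource-tree key(s); add -a <password> if your install has Redis auth enabled
redis-cli -p 6379 --scan --pattern '*resource*tree*'
# delete the key; the application controller repopulates it on the next reconciliation
redis-cli -p 6379 del '<key-from-previous-command>'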

@crenshaw-dev added the component:core (Syncing, diffing, cluster state cache) label on May 10, 2022
@OmerKahani
Contributor

We had the same thing happen in a very old GKE cluster (it was created a long time ago but has been updated to 1.20). My thought was that Argo CD might use a different API than kubectl get does.

@0dragosh could it be related to an old cluster?

@crenshaw-dev
Member

@alexmt did you have any luck reproducing this?

@rs-cole

rs-cole commented Jul 7, 2022

Running into this as well; each sync creates more and more. revisionHistoryLimit is set to 3, but I currently have 23 ReplicaSets and counting sitting around.

@crenshaw-dev
Member

@rs-cole as a workaround, we found that a Force Refresh cleared the old pods.
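
If you'd rather not click through the UI, a sketch of the same refresh from the CLI (the app name is a placeholder):

# a hard refresh also invalidates the cached target manifests for the app
argocd app get <app-name> --hard-refresh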

@0dragosh
Author

0dragosh commented Jul 8, 2022

@rs-cole as a workaround, we found that a Force Refresh cleared the old pods.

Plus a restart of the argocd application controller.
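
For anyone looking for the exact command, a sketch assuming the default statefulset name and the argocd namespace (adjust for HA installs running multiple controller replicas):

kubectl -n argocd rollout restart statefulset argocd-application-controller
# wait for the controller to come back up
kubectl -n argocd rollout status statefulset argocd-application-controller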

@rs-cole

rs-cole commented Jul 8, 2022

Hitting Refresh -> Hard Refresh and restarting the application controller didn't seem to do the trick. Old, non-existent ReplicaSets continued to grow; the application's revisionHistoryLimit is set to 3. Perhaps I'm using Argo CD incorrectly?

Disregard, my reading comprehension is terrible. Fixed my issue with: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#clean-up-policy
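
For anyone else who lands here: the clean-up policy in that link is the Deployment's own spec.revisionHistoryLimit (the Argo CD Application spec has a field of the same name, but that one controls sync history, not old ReplicaSets). A sketch, with placeholder names and an example value:

kubectl -n <namespace> patch deployment <deployment> --type merge \
  -p '{"spec":{"revisionHistoryLimit":3}}'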

@ilyastoliar

Looked at this today with @0dragosh. The bug only occurs on instances using Azure AKS's instance start/stop feature. We cleared the resource-tree key for this app in Redis, and the controller re-populated the resource tree with the phantom Pods. We confirmed that kubectl get po shows the Pods as no longer existing on the cluster.

So this looks like a bug in how the Argo CD controller populates the resource tree.

We are experiencing it with EKS as well

@crenshaw-dev
Member

Just encountered it at Intuit for (afaik) the first time. A force refresh and controller restart cleared the error.

Argo CD version: v2.4.6
Kubernetes provider: EKS
Kubernetes version: v1.21.12-eks-a64ea69

@yoshigev

We encountered a similar issue with AKS.

We have a CronJob that creates Job + Pod resources. Those resources are automatically deleted due to either ttlSecondsAfterFinished, successfulJobsHistoryLimit or failedJobsHistoryLimit configuration of the CronJob.
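
(For reference, a sketch of where those clean-up fields live on a CronJob spec; the name, namespace, and values here are placeholders, not our actual config:)

kubectl -n <namespace> patch cronjob <cronjob> --type merge \
  -p '{"spec":{"successfulJobsHistoryLimit":3,"failedJobsHistoryLimit":1,"jobTemplate":{"spec":{"ttlSecondsAfterFinished":3600}}}}'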

Quite often, ArgoCD continues displaying those deleted resources for months after they were deleted, both for failed and for successful job executions.

Analysis of the API calls of ArgoCD (from the browser's log):

  • The phantom resources appear on the resource-tree.
  • On call to managed-resources API with the resource name, the response is an empty JSON.
  • On call to resource API with the resource name, the response is 404.

Regarding the Hard Refresh - is it possible to make it work without restarting the controller?

@joseaio

joseaio commented Oct 26, 2022

Same problem here with a start/stop AKS cluster.

Is there a workaround that doesn't require an Argo CD controller restart (via the API or similar) to remove the old/ghost nodes?

@0dragosh
Author

@joseaio I could not figure out another workaround

@crenshaw-dev
Member

Again on 2.5.7.

@h4tt3n

h4tt3n commented Mar 8, 2023

We are seeing the exact same behavior. Argo CD shows ghost resources that were deleted days or weeks ago, and the UI does not expose any means of removing them (delete obviously doesn't work).

@CPlommer

CPlommer commented Mar 13, 2023

We're encountering this as well, specifically with cronjobs.

Argo CD version: v2.5.10
Kubernetes provider: GKE
Kubernetes server version: v1.24.9-gke.3200

The workaround to stop the old pods from surfacing in ArgoCD is, as stated earlier in this thread:

  1. Force Refresh to clear the old pods
  2. Restart the argocd application controller

Sorry for the silly questions, but I'm coming up empty googling ....
Is Force Refresh the same as Hard Refresh, via the ArgoCD UI?
How do you restart the argocd application controller? https://argo-cd.readthedocs.io/en/stable/operator-manual/server-commands/argocd-application-controller/ doesn't say....

@mikesmitty

@crenshaw-dev we're seeing this on a daily basis and would prefer not to resort to cycling the app controller every day. Is there any info we can provide to help diagnose the cause?

@crenshaw-dev
Member

Is Force Refresh the same as Hard Refresh, via the ArgoCD UI?

@CPlommer yep, they're the same!

@mikesmitty honestly, I'm stumped. I mean I know that the app controller is filling Redis with "resource tree" values that include ghost resources. But I have no clue where to start figuring out why the app controller thinks they still exist.

When the app controller starts up, it launches a bunch of watches to keep track of what's happening on the destination cluster (like pods being deleted). So I think either the app controller isn't correctly handling the updates, or k8s isn't correctly sending them. I tend to think it's the former.

And it's possible that the app controller is logging those failures, but I'm not sure what kind of messages to look for. I'd have to start with "any error or warn messages" and see if anything looks suspicious.
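
Something like the following is what I have in mind; a rough sketch assuming the default statefulset name and namespace (the exact log format depends on the controller's --logformat setting):

kubectl -n argocd logs statefulset/argocd-application-controller --since=24h \
  | grep -iE 'level=(error|warning)|"level":"(error|warning)"' \
  | grep -iE 'watch|cluster'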

@mikesmitty

Now that you mention it, our issue is caused by a few particularly large apps that somewhat routinely cause ArgoCD to get throttled by Anthos Connect Gateway when syncing the app to remote clusters. The connect gateway api is essentially a reverse proxy for the control plane that, for better or worse, uses control plane errors to throttle (e.g. "the server has received too many requests and has asked us to try again later" or "the server rejected our request for an unknown reason"). The quota on these clusters is pretty high so the throttling has largely been only a minor nuisance when devs sync a large number of apps at once. Looking for errors around watches in the app controller logs, I did find these messages however:

{"error":"the server rejected our request for an unknown reason", "level":"error", "msg":"Failed to start missing watch", "server":"https://connectgateway.googleapis.com/..."}
{"error":"error getting openapi resources: the server rejected our request for an unknown reason", "level":"error", "msg":"Failed to reload open api schema", "server":"https://connectgateway.googleapis.com/..."}
"Watch failed" err="the server rejected our request for an unknown reason"
sourceLocation: {
    file: "retrywatcher.go"
    line: "130"
}
{"error":"the server has received too many requests and has asked us to try again later", "level":"error", "msg":"Failed to start missing watch", "server":"https://connectgateway.googleapis.com/..."}
{"error":"error getting openapi resources: the server has received too many requests and has asked us to try again later", "level":"error", "msg":"Failed to reload open api schema", "server":"https://connectgateway.googleapis.com/..."}
"Watch failed" err="the server has received too many requests and has asked us to try again later"
sourceLocation: {
    file: "retrywatcher.go"
    line: "130"
}

If you need more verbose logs let me know. I'd prefer to not turn on debug logging on this ArgoCD instance due to volume, but I think I might be able to artificially induce the errors on a test cluster.

While doing some searching around I found this issue that seems like it could be tangentially related as well: #9339

@crenshaw-dev
Member

Latest theory: we miss watch events, and Argo CD goes on blissfully unaware that the resources have been deleted.

I still don't know why the 24hr full-resync doesn't clear the old resources. https://github.com/argoproj/gitops-engine/blob/ed70eac8b7bd6b2f276502398fdbccccab5d189a/pkg/cache/cluster.go#L712

@crenshaw-dev
Member

@ashutosh16 do you have a link to the branch where you modified the code to reproduce this?

@nipun-groww

(quoting @mikesmitty's earlier comment about Connect Gateway throttling and the "Failed to start missing watch" / "Watch failed" errors)

Hi, we are using kube-oidc-proxy and facing a somewhat similar issue. In our case, we see a dip in argocd_app_reconcile_count, and some of the resources (Pods, ReplicaSets) in each application are not shown in the Argo CD UI even though they are present in the cluster; sometimes it shows older data.

Whenever there is a dip in the reconcile count, we have found the below logs in the OIDC proxy pod:
E0613 16:02:42.441439 1 proxy.go:215] unable to authenticate the request via TokenReview due to an error:() rate: Wait(n=1) would exceed context deadline

At the same time, the ArgoCD application-controller throws the below error:
{ "error": "unable to retrieve the complete list of server APIs: apm.k8s.elastic.co/v1: the server has asked for the client to provide credentials, apm.k8s.elastic.co/v1beta1: the server has asked for the client to provide credentials, auto.gke.io/v1: the server has asked for the client to provide credentials, auto.gke.io/v1alpha1: the server has asked for the client to provide credentials, billingbudgets.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, binaryauthorization.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, cloudbuild.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, cloudfunctions.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, cloudscheduler.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, configcontroller.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, core.strimzi.io/v1beta2: the server has asked for the client to provide credentials, custom.metrics.k8s.io/v1beta1: the server has asked for the client to provide credentials, external.metrics.k8s.io/v1beta1: the server has asked for the client to provide credentials, k8s.nginx.org/v1: the server has asked for the client to provide credentials, kafka.strimzi.io/v1alpha1: the server has asked for the client to provide credentials, kafka.strimzi.io/v1beta1: the server has asked for the client to provide credentials, kafka.strimzi.io/v1beta2: the server has asked for the client to provide credentials, keda.sh/v1alpha1: the server has asked for the client to provide credentials, kiali.io/v1alpha1: the server has asked for the client to provide credentials, kibana.k8s.elastic.co/v1: the server has asked for the client to provide credentials, kms.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, logging.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, monitoring.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, monitoring.coreos.com/v1alpha1: the server has asked for the client to provide credentials, networkconnectivity.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, networking.istio.io/v1alpha3: the server has asked for the client to provide credentials, networking.istio.io/v1beta1: the server has asked for the client to provide credentials, networkservices.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, osconfig.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, recaptchaenterprise.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, servicenetworking.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, serviceusage.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, sourcerepo.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, spanner.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, storage.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, storagetransfer.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, vpcaccess.cnrm.cloud.google.com/v1beta1: the server has 
asked for the client to provide credentials, wgpolicyk8s.io/v1alpha2: the server has asked for the client to provide credentials", "level": "error", "msg": "Partial success when performing preferred resource discovery", "server": "https://cluster-1.data.int, "time": "2023-06-13T16:02:53Z" }
Attaching a graph of argocd_app_k8s_request_total: [image]

@mkilchhofer
Member

I saw this issue today for the first time in our setup at @swisspost, on an Aqua Security DaemonSet pod.
There was a "ghost" Pod which I could not remove.

The only thing that helped was to run

$ kubectl rollout restart sts/argocd-application-controller
statefulset.apps/argocd-application-controller restarted

After this the Pod was gone.

Before restarting the application-controller I tried:

  • refresh the app
  • hard refresh the app
  • invalidate cache of the cluster (Settings > Clusters > in-cluster > Button "Invalidate cache")

We are currently on version "v2.8.2+dbdfc71".

@dtrouillet

Hi,

For information, we faced this issue too, on Argo CD v2.10.7+b060053.

@RoyerRamirez

Hi,

We're also seeing this same issue. It's only occurring on deployments that contain a significant number of pods (1k plus). Our other, smaller deployments are fine.

@jmmclean

jmmclean commented Sep 10, 2024

Issue on EKS with Argo CD v2.11.3: the pods shown by kubectl get pods are different from what Argo CD displays. I snooped the EventSource response and saw that it's populated with old, stale data (ghost pods). A hard refresh took an extremely long time and did not help.

@razbomi

razbomi commented Oct 24, 2024

Issue on EKS 1.31 with ArgoCD v2.12.4+27d1e64.

The Pod is not on the cluster when inspected with kubectl.
Argo shows the pods in a Progressing state with a pending deletion reason.

The only way to get rid of them in the Argo UI is to delete the offending ReplicaSet.

@NicholasRaymondiSpot

NicholasRaymondiSpot commented Oct 31, 2024

We're seeing similar issues with Argo CD 2.12.3 and 2.13.0 deployed to our Kubernetes 1.30 clusters. It seems to happen in our noisy clusters that do a lot of scaling: we'll find Argo showing a number of pods as Healthy or in another state even though the pods no longer exist. The screenshots show kubectl listing the actual pods vs. Argo showing "ghost" pods for the aws-node DaemonSet. When you attempt to open any of these pods from the UI, you see the error in the second screenshot and no manifest is shown.

[Screenshots: kubectl pod list vs. Argo CD UI showing ghost aws-node pods, 2024-10-31]

Deleting the old ReplicaSets didn't help, and running kubectl rollout restart sts -n argocd argo-cd-argocd-application-controller like @mkilchhofer suggested didn't seem to help at first either, but refreshing the application view in the UI after the pod came back up cleaned up all of the "ghost" resources.

@daftping
Contributor

We encountered the same issue in version 2.12.3 on a large self-hosted Kubernetes cluster (v1.27). Restarting the application controller did not help. However, adding a dummy annotation to the parent ReplicaSet refreshed the state:
kubectl annotate replicaset <name> foo=bar
