Argo UI shows old pods as healthy #9226
Comments
Looked at this today with @0dragosh. The bug only occurs on instances using Azure AKS's instance start/stop feature. We cleared the … So this looks like a bug in how the Argo CD controller populates the resource tree.
We had the same use case in a very old GKE cluster (it was created a long time ago but has been updated to 1.20). My thought was that ArgoCD uses a different API than … @0dragosh, can it be related to an old cluster?
@alexmt did you have any luck reproducing this?
Running into this as well; each sync creates more and more. revisionHistoryLimit is set to 3, but I currently have 23 and counting sitting around.
@rs-cole as a workaround, we found that a Force Refresh cleared the old pods.
Plus a restart of the argocd application controller.
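For anyone looking for the concrete commands, a minimal sketch of that workaround, assuming the default `argocd` namespace, the standard controller statefulset name, and a placeholder app name:

```shell
# Hard refresh the app so the controller re-reads live state instead of its cache.
argocd app get my-app --hard-refresh

# If ghost pods still linger, restart the application controller so it
# re-establishes its cluster watches and rebuilds the resource tree.
kubectl -n argocd rollout restart statefulset argocd-application-controller
```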
Disregard, my reading comprehension is terrible. Fixed my issue with: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#clean-up-policy
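For context, a sketch of what that clean-up policy boils down to: capping the Deployment's `revisionHistoryLimit` so Kubernetes garbage-collects old ReplicaSets. Deployment and namespace names below are placeholders; in a GitOps setup you would set the field in the manifest in Git rather than patching the live object.

```shell
# Keep only the 3 most recent old ReplicaSets; anything older is deleted by
# the Deployment controller, so it can no longer linger in the tree.
kubectl -n my-namespace patch deployment my-app \
  --type merge -p '{"spec":{"revisionHistoryLimit":3}}'
```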
We are experiencing it with EKS as well.
Just encountered it at Intuit for (afaik) the first time. A force refresh and controller restart cleared the error. Argo CD version: v2.4.6
We encountered a similar issue with AKS. We have a CronJob that creates Job + Pod resources. Those resources are automatically deleted due to either … Quite often, ArgoCD continues displaying those deleted resources for months after they were deleted, both for failed and for successful job executions. Analysis of the API calls of ArgoCD (from the browser's log): …
Regarding the Hard Refresh - is it possible to make it work without restarting the controller?
Same problem here with an AKS start/stop cluster. Is there a workaround to remove the old/ghost nodes without restarting the ArgoCD controller (via the API or similar)?
@joseaio I could not figure out another workaround.
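For what it's worth, a hard refresh can at least be triggered without the UI by annotating the Application object, though as this thread shows it does not always clear the ghosts. A sketch, assuming Applications live in the `argocd` namespace and `my-app` is a placeholder:

```shell
# The controller picks up the refresh annotation and removes it once the
# hard refresh has been processed.
kubectl -n argocd annotate application my-app \
  argocd.argoproj.io/refresh=hard --overwrite
```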
Again on 2.5.7.
We are seeing the exact same behavior. ArgoCD shows ghost resources that were deleted days or weeks ago, and the UI does not expose any means of removing them (delete obviously doesn't work).
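A quick way to confirm a resource really is a ghost, i.e. gone from the cluster but still present in Argo CD's view of the app (all names below are placeholders, and this assumes `argocd app resources` reflects the same resource tree the UI renders):

```shell
# The pod no longer exists in the Kubernetes API...
kubectl -n my-namespace get pod my-app-5d8c7b9f4d-abcde
# Error from server (NotFound): pods "my-app-5d8c7b9f4d-abcde" not found

# ...yet Argo CD still lists it for the application.
argocd app resources my-app | grep my-app-5d8c7b9f4d-abcde
```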
We're encountering this as well, specifically with cronjobs. Argo CD version: v2.5.10. The workaround to stop the old pods from surfacing in ArgoCD is, as stated earlier in this thread: …
Sorry for the silly questions, but I'm coming up empty googling…
@crenshaw-dev we're seeing this on a daily basis and would prefer not to resort to cycling the app controller every day. Is there any info we can provide to help diagnose the cause?
@CPlommer yep, they're the same! @mikesmitty honestly, I'm stumped. I know the app controller is filling Redis with "resource tree" values that include ghost resources, but I have no clue where to start figuring out why the app controller thinks they still exist. When the app controller starts up, it launches a bunch of watches to keep track of what's happening on the destination cluster (like pods being deleted). So I think either the app controller isn't correctly handling the updates, or k8s isn't correctly sending them. I tend to think it's the former. It's possible that the app controller is logging those failures, but I'm not sure what kind of messages to look for. I'd have to start with "any error or warn messages" and see if anything looks suspicious.
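In case it helps anyone gather that data, a sketch of the log trawl described above, assuming the default install layout and log format:

```shell
# Look for watch-related errors or warnings from the application controller;
# dropped or failed cluster watches would explain missed delete events.
kubectl -n argocd logs statefulset/argocd-application-controller --since=24h \
  | grep -E 'level=(error|warn)' \
  | grep -i watch
```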
Now that you mention it, our issue is caused by a few particularly large apps that somewhat routinely cause ArgoCD to get throttled by Anthos Connect Gateway when syncing the app to remote clusters. The Connect Gateway API is essentially a reverse proxy for the control plane that, for better or worse, uses control plane errors to throttle (e.g. "the server has received too many requests and has asked us to try again later" or "the server rejected our request for an unknown reason"). The quota on these clusters is pretty high, so the throttling has largely been only a minor nuisance when devs sync a large number of apps at once. Looking for errors around watches in the app controller logs, I did, however, find these messages: …
If you need more verbose logs let me know. I'd prefer to not turn on debug logging on this ArgoCD instance due to volume, but I think I might be able to artificially induce the errors on a test cluster. While doing some searching around I found this issue that seems like it could be tangentially related as well: #9339
Latest theory: we miss watch events, and Argo CD goes on blissfully unaware that the resources have been deleted. I still don't know why the 24hr full resync doesn't clear the old resources. https://github.com/argoproj/gitops-engine/blob/ed70eac8b7bd6b2f276502398fdbccccab5d189a/pkg/cache/cluster.go#L712
@ashutosh16 do you have a link to the branch where you modified the code to reproduce this?
Hi, we are using kube-oidc-proxy and facing a somewhat similar issue. In our case, we see a dip in argocd_app_reconcile_count, and some of the resources (Pods, ReplicaSets) in each application are not shown in the ArgoCD UI but are present in the cluster; sometimes it shows older data. Whenever there is a dip in the reconcile count, we have found the below logs in the OIDC proxy pod: … At the same time, the ArgoCD application-controller throws the below error: …
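For anyone trying to correlate the same symptom, a sketch for checking that metric straight from the controller, assuming the default metrics port (8082) and the `argocd` namespace:

```shell
# Expose the application controller's metrics endpoint locally...
kubectl -n argocd port-forward statefulset/argocd-application-controller 8082:8082 &

# ...and check the reconcile counters; a dip or plateau here lines up with
# the stale resource trees described above.
curl -s http://localhost:8082/metrics | grep argocd_app_reconcile
```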
I saw this issue today for the first time in our setup at @swisspost, on a DaemonSet pod of Aquasecurity. The only thing that helped was to run `kubectl rollout restart sts/argocd-application-controller` (output: `statefulset.apps/argocd-application-controller restarted`). After this the Pod was gone. Before restarting the …
We are currently on version v2.8.2+dbdfc71.
Hi, for information, we faced this issue too on ArgoCD v2.10.7+b060053.
Hi, we're also seeing this same issue. It's only occurring on deployments that contain a significant number of pods (1k plus). Our other, smaller deployments are fine.
Issue on EKS with ArgoCD v2.11.3; the pods from …
Issue on EKS 1.31 with ArgoCD v2.12.4+27d1e64. The pod is not on the cluster when inspected with kubectl. The only way to get rid of it in the Argo UI is to delete the offending ReplicaSet.
We're seeing similar issues using ArgoCD 2.12.3 & 2.13.0 deployed to our Kubernetes 1.30 clusters. It seems to happen in our noisy clusters that do a lot of scaling; we'll find Argo showing a number of pods as either Healthy or in another state, but the pods no longer exist. Screenshots are … Deleting the old replicasets didn't help, running …
We encountered the same issue in version 2.12.3 on a large self-hosted Kubernetes cluster (v1.27). Restarting the application controller did not help. However, adding a dummy annotation to the parent ReplicaSet refreshed the state.
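If it helps anyone else, a sketch of that dummy-annotation nudge; the annotation key is arbitrary, and the ReplicaSet and namespace names are placeholders:

```shell
# Touching the parent ReplicaSet emits a watch event, which appears to make
# the controller rebuild that branch of the resource tree.
kubectl -n my-namespace annotate replicaset my-app-5d8c7b9f4d \
  stale-tree-nudge="$(date +%s)" --overwrite
```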
Checklist: I've pasted the output of `argocd version`.
Describe the bug
The ArgoCD UI is showing old pods (that don't exist anymore) from old ReplicaSets as Healthy. When you go to the details of those pods they're empty, and delete errors out, because the pods haven't existed in a while.
It also shows the new ReplicaSet with the new pods correctly, in parallel with the old pods.
I've tried a hard refresh to no avail. We're running ArgoCD HA.
To Reproduce
Deploy a new revision, creating a new replicaset.
Expected behavior
Old pods that don't exist in the Kubernetes API should not show up in the UI.
Version
I suspect it's got something to do with improper cache invalidation on the Redis side.
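On that suspicion, a blunt diagnostic sketch, not a recommended fix: flush the Redis cache and see whether the ghosts survive the rebuild. This assumes the non-HA bundled Redis deployment (HA installs name the Redis components differently) and will cause a burst of refresh activity afterwards.

```shell
# Wipe Argo CD's Redis cache; the controller and repo-server repopulate it.
kubectl -n argocd exec deploy/argocd-redis -- redis-cli FLUSHALL

# If the ghost pods reappear, the stale state is coming from the controller's
# in-memory cluster cache rather than from Redis itself.
```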