Reconciliation loop #8100
Comments
@Funk66 Mind me asking if your ArgoCD setup is in AWS? Did the issue happen around the 4th of January - at least that's what happened with our identical environments
@patrickjahns, yes, this is on AWS. It started in December, on the day we upgraded to v2.2.1, as explained in the description. I have no reason to think that this is related to the underlying infrastructure. If you have any indication to the contrary, please let me know and I'll try reaching out to the AWS support team.
I'm hitting the same problem, but at version
@Funk66 We have several Kubernetes environments in AWS and Azure, with ArgoCD installed locally on each cluster. Three of them are EKS clusters in the same region, on versions 1.18-eks.8 and 1.19-eks.6. We are seeing the issue on those 3 clusters, and it started to surface on the same day (4 January) around the same time (half an hour difference). We increased the logging verbosity to debug/trace but haven't found any further indicators so far. So this is really mind-boggling right now @FatalC
@patrickjahns, did the issue by any chance start after an application controller pod restart? We're on EKS 1.20 and see this happening on every cluster in every region. The only change around the time it started was the ArgoCD upgrade, which is why I'm inclined to think that this problem is caused by ArgoCD being unable to properly keep track of the apps it has already refreshed. That said, I haven't taken the time to look into the code, so that's just an uninformed guess.
We didn't perform any operations on the controllers. By chance, all three controllers must have been restarted around the same time (same day, within 1 hour of each other).
We are seeing this on our k3s cluster (v1.22.4+k3s1) with ArgoCD v2.1.8. CPU usage is generally high too.
Further digging in our environments revealed that the external-secrets controller was permanently updating the status field of the ExternalSecret resources. In our case this was triggered by expired certificates (mTLS authentication of external-secrets) which we hadn't caught. We've resolved the underlying certificate issues and the reconciliation loop stopped. In the ArgoCD documentation we also noticed that one can prevent status changes from triggering reconciliation loops:
https://argo-cd.readthedocs.io/en/stable/user-guide/diffing/#system-level-configuration Maybe this is something people can try to see if that is the trigger in their environments. Something like corneliusweig/ketall#29 would be good for catching this, I suppose. In the team we also discussed how we could have caught the changes more easily, and we came to the conclusion that it would be great if ArgoCD's DEBUG/TRACE logging could include more information on which changes/events triggered the reconciliation. Maybe this is something the ArgoCD maintainers would consider (cc @alexmt, pinging you since this was added to a milestone for investigation).
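For anyone who wants to try that, here is a minimal sketch of the system-level configuration in the argocd-cm ConfigMap. The group/kind shown is an assumption for the external-secrets CRD, so adjust it to whatever your cluster actually uses:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Ignore the status subtree of ExternalSecret resources when diffing,
  # so status-only updates no longer show up as differences.
  # Group/kind below are assumptions; match them to your ExternalSecret CRD.
  resource.customizations.ignoreDifferences.external-secrets.io_ExternalSecret: |
    jsonPointers:
    - /status
```

Whether ignoring the diff alone is enough to stop the refresh churn may depend on the ArgoCD version, so treat this as a starting point rather than a guaranteed fix.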
I agree, this information would be really useful. We had reconciliation loop bugs in the past where it wasn't clear which resource(s) actually triggered the reconciliation, and it took tremendous effort to troubleshoot.
The issue about changing secrets was mentioned in #6108. I have checked all resources being tracked by the corresponding applications and none of them seems to change, or at least not at that rate. The
So I've finally taken some time to have another look at this and here's what I found. First, I can confirm that the issue started with v2.2.0. Reverting the application-controller image to an earlier version makes the problem go away. Furthermore, I think the issue was introduced with commit 05935a9, where an 'if' statement to exclude orphaned resources was removed.
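For anyone who wants to check whether the orphaned-resources path applies to them: orphaned resource monitoring is enabled per project in the AppProject spec, roughly like the sketch below (project name and namespace are assumptions):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: default        # assumed project name
  namespace: argocd
spec:
  # When this block is present, ArgoCD watches managed namespaces for
  # resources that don't belong to any application; removing it disables
  # orphaned resource monitoring for the project.
  orphanedResources:
    warn: true
```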
Running ArgoCD 2.1.3 in EKS and having problems with high CPU usage and throttling of the application controller as well. So I don't think 2.2 is the only culprit.
For what it's worth, I tried the solution suggested by @patrickjahns above and our ArgoCD went from consuming ~1000-1500m to ~20m CPU, i.e. setting this in
Running ArgoCD 2.2.5 in EKS 1.21.
I'm also hit by the high CPU usage caused by the reconciliation loop. Thanks to @Funk66 I verified that it is caused by the leader-election ConfigMaps.
Using the command suggested by @Funk66, I was also able to see that several ConfigMaps keep popping up in the list, and one of them is in a namespace we see many reconciliations for. Is there a workaround?
Tested with version v2.3.3
@Vladyslav-Miletskyi thanks! That did the trick. We were having the exact same problem and now the load is normal.
Is there something other than debug logs that we could use to detect this in a production deployment? Enabling debug in production is not possible for us. I am mainly looking for a way to find resources that are continuously regenerated.
Disabling
The issue is still present in v2.5.1 and the
We are having the same issue with KEDA ScaledObjects. KEDA appears to update the status.lastActiveTime field every few seconds, which in turn appears to trigger a reconciliation. Setting #8100, #8914 and #6108 all appear to be pretty similar and I can't see a workaround in any of them, so I'd appreciate it if anyone can suggest one!
In case it helps anyone else, increasing the ScaledObject pollingInterval made a massive difference to the ArgoCD CPU usage.
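A rough sketch of what that can look like on a ScaledObject (all names and values below are made up; KEDA's default pollingInterval is 30 seconds):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-app               # hypothetical name
spec:
  scaleTargetRef:
    name: my-app             # hypothetical Deployment to scale
  # Poll the scaler far less often; fewer status updates on the
  # ScaledObject means fewer watch events for ArgoCD to react to.
  pollingInterval: 300
  triggers:
  - type: cpu
    metricType: Utilization
    metadata:
      value: "80"
```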
I've been seeing this a lot still on v2.6.2 with two different MetalLB deployments. It constantly loops over them, and orphanedResources is not in the project spec for the default project.
In v2.6.1 with
ArgoCD version:
We even bumped timeout.reconciliation from 30m to 2h, but that didn't help. We ran into this issue when using custom plugins for our applications:
and noticed the following logs in the application controller. With multiple test environments configured to use ArgoCD and hundreds of Argo apps per environment, this crashed our Git servers every couple of days, so we had to add the following dummy var to fix the constant refreshing of the apps:
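For reference, the `timeout.reconciliation` setting mentioned above is a key in the argocd-cm ConfigMap; a minimal sketch with the 2h value (namespace assumed):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # How often apps are refreshed even without detected changes. Raising it
  # only spaces out the periodic refresh; it does not stop refreshes
  # triggered by watch events, which is consistent with it not helping here.
  timeout.reconciliation: 2h
```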
I'm also seeing this issue with
This then triggers a level (1) refresh that takes a long time:
The behavior can be configured in
@Funk66 did you submit a PR for #8100 (comment)?
I tried implementing a fix but couldn't make it work fully. I may try again in the coming weeks, if nobody else does.
Checklist:
argocd version
Describe the bug
Upon upgrading from v2.1.7 to v2.2.1, the argocd application controller started performing continuous reconciliations for every app (about one per second, which is as much as CPU capacity allows).
Issues #3262 and #6108 sound similar but didn't help.
I haven't been able to figure out the reason why a refresh keeps being requested. The log below shows the block that keeps repeating for each app every second.
Expected behavior
The number of reconciliations should be two orders of magnitude lower.
Version
Logs