Reconciliation loop #8100
Comments
@Funk66 Mind me asking if your ArgoCD setup is in AWS? Did the issue happen around the 4th of January - at least that's what happened with our identical environments
@patrickjahns, yes, this is on AWS. It started in December, on the day we upgraded to v2.2.1, as explained in the description. I have no reason to think that this is related to the underlying infrastructure. If you have any indication to the contrary, please let me know and I'll try reaching out to the AWS support team.
I'm hitting the same problem, but at version
@Funk66 We have several Kubernetes environments in AWS and Azure, with ArgoCD installed locally on each cluster. Three of them are EKS clusters in the same region, on versions 1.18-eks.8 and 1.19-eks.6. We are seeing the issue on those 3 clusters, and it started to surface on the same day (4 January) around the same time (half an hour difference). We increased the logging verbosity to debug/trace but haven't found any further indicators so far. So this is really mind-boggling right now @FatalC
@patrickjahns, did the issue by any chance start after an application controller pod restart? We're on EKS 1.20 and see this happening on every cluster in every region. The only change around the time it started was the ArgoCD upgrade, which is why I'm inclined to think that this problem is caused by ArgoCD being unable to properly keep track of the apps it has already refreshed. That said, I haven't taken the time to look into the code, so that's just an uninformed guess.
We didn't perform any operations on the controllers. By chance, all three controllers must have been restarted around the same time (same day, within 1 hour of each other).
We are seeing this on our k3s cluster (v1.22.4+k3s1) with ArgoCD v2.1.8. CPU usage is generally high too.
Further digging in our environments revealed that the external-secrets controller was permanently updating the status field of the ExternalSecret resources. In our case this was triggered by expired certificates (mTLS authentication of external-secrets) which we hadn't caught. We've resolved the underlying certificate issues and the reconciliation loop stopped. In the ArgoCD documentation we also noticed that one can prevent status changes from triggering reconciliation loops:
https://argo-cd.readthedocs.io/en/stable/user-guide/diffing/#system-level-configuration Maybe this is something people can try to see if that is the trigger in their environments. Something like corneliusweig/ketall#29 would be good for catching this, I suppose. In the team we also discussed how we could have caught the changes more easily, and we came to the conclusion that it would be great if ArgoCD's DEBUG/TRACE logging could include more information on which changes/events triggered the reconciliation. Maybe this is something the ArgoCD maintainers would consider (cc @alexmt, pinging you since this was added to a milestone for investigation).
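For anyone who wants to try that, here is a minimal sketch of the system-level configuration in the argocd-cm ConfigMap. The group/kind shown is an assumption for the external-secrets CRD, so adjust it to whatever your cluster actually uses:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Ignore the status subtree of ExternalSecret resources when diffing,
  # so status-only updates no longer show up as differences.
  # Group/kind below are assumptions; match them to your ExternalSecret CRD.
  resource.customizations.ignoreDifferences.external-secrets.io_ExternalSecret: |
    jsonPointers:
    - /status
```

Whether ignoring the diff alone is enough to stop the refresh churn may depend on the ArgoCD version, so treat this as a starting point rather than a guaranteed fix.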
I agree, this information would be really useful. We had reconciliation loop bugs in the past where it wasn't clear which resource(s) actually triggered the reconciliation, and it took tremendous effort to troubleshoot.
The issue about changing secrets was mentioned in #6108. I have checked all resources being tracked by the corresponding applications and none of them seems to change, or at least not at that rate. The
So I've finally taken some time to have another look at this and here's what I found. First, I can confirm that the issue started with v2.2.0. Reverting the application-controller image to an earlier version makes the problem go away. Furthermore, I think the issue was introduced with commit 05935a9, where an 'if' statement to exclude orphaned resources was removed.
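For anyone who wants to check whether the orphaned-resources path applies to them: orphaned resource monitoring is enabled per project in the AppProject spec, roughly like the sketch below (project name and namespace are assumptions):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: default        # assumed project name
  namespace: argocd
spec:
  # When this block is present, ArgoCD watches managed namespaces for
  # resources that don't belong to any application; removing it disables
  # orphaned resource monitoring for the project.
  orphanedResources:
    warn: true
```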
Running ArgoCD 2.1.3 in EKS and having problems with high CPU usage and throttling of the application controller as well. So I don't think 2.2 is the only culprit.
For what it's worth, I tried the solution suggested by @patrickjahns above and our ArgoCD went from consuming ~1000-1500m to ~20m CPU, i.e. setting this in
Running ArgoCD 2.2.5 in EKS 1.21.
I'm also hit by the high CPU usage caused by the reconciliation loop. Thanks to @Funk66 I verified that it is caused by the leader-election ConfigMaps.
Using the command suggested by @Funk66, I was also able to see that several ConfigMaps keep popping up in the list, and one of them is in a namespace we see many reconciliations for. Is there a workaround?
Tested with version v2.3.3
@Vladyslav-Miletskyi thanks! That did the trick. We were having the exact same problem and now the load is normal.
Is there something other than debug logs that we could use to detect this in a production deployment? Enabling debug in production is not possible for us. I am mainly looking for a way to find resources that are continuously regenerated.
Disabling
The issue is still present in v2.5.1 and the
We are having the same issue with KEDA ScaledObjects. KEDA appears to update the status.lastActiveTime field every few seconds, which in turn appears to trigger a reconciliation. Setting #8100, #8914 and #6108 all appear to be pretty similar and I can't see a workaround in any of them, so I'd appreciate it if anyone can suggest one!
In case it helps anyone else, increasing the ScaledObject pollingInterval made a massive difference to the ArgoCD CPU usage.
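A rough sketch of what that can look like on a ScaledObject (all names and values below are made up; KEDA's default pollingInterval is 30 seconds):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-app               # hypothetical name
spec:
  scaleTargetRef:
    name: my-app             # hypothetical Deployment to scale
  # Poll the scaler far less often; fewer status updates on the
  # ScaledObject means fewer watch events for ArgoCD to react to.
  pollingInterval: 300
  triggers:
  - type: cpu
    metricType: Utilization
    metadata:
      value: "80"
```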
I've been seeing this a lot still on v2.6.2 with two different MetalLB deployments. It constantly loops over them, and orphanedResources is not in the project spec for the default project.
In v2.6.1 with
ArgoCD version:
We even bumped timeout.reconciliation from 30m to 2h, but that didn't help. We ran into this issue when using custom plugins for our applications:
and noticed the following logs in the application controller. With multiple test environments configured to use ArgoCD and hundreds of Argo apps per environment, this crashed our Git servers every couple of days, so we had to add the following dummy var to fix the constant refreshing of the apps:
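For reference, the `timeout.reconciliation` setting mentioned above is a key in the argocd-cm ConfigMap; a minimal sketch with the 2h value (namespace assumed):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # How often apps are refreshed even without detected changes. Raising it
  # only spaces out the periodic refresh; it does not stop refreshes
  # triggered by watch events, which is consistent with it not helping here.
  timeout.reconciliation: 2h
```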
I'm also seeing this issue with
This then triggers a level (1) refresh that takes a long time:
The behavior can be configured in
@Funk66 did you submit a PR for #8100 (comment)?
I tried implementing a fix but couldn't make it work fully. I may try again in the coming weeks, if nobody else does.
Checklist:
argocd version
Describe the bug
Upon upgrading from v2.1.7 to v2.2.1, the argocd application controller started performing continuous reconciliations for every app (about one per second, which is as much as CPU capacity allows).
Issues #3262 and #6108 sound similar but didn't help.
I haven't been able to figure out the reason why a refresh keeps being requested. The log below shows the block that keeps repeating for each app every second.
Expected behavior
The number of reconciliations should be two orders of magnitude lower.
Version
Logs