Upgrading to Argo CD 2.10.1 drops reconciliation/sync to nearly zero #17257
Comments
Yeah, the same thing |
This was meant to fix it: #17167. Can you try 2.10.0 and see if you hit the same issue? |
I'm on 2.10.1 and everything is fine for me... but I only have a few clusters and a few Apps |
We are facing a similar issue, but only when we enable dynamic cluster sharding and run application-controller as a deployment with 2 replicas (we need more than 1). I scaled the application-controller StatefulSet to 0 replicas and also tried deleting the StatefulSet itself to ensure the deployment runs properly. The env var 'ARGOCD_ENABLE_DYNAMIC_CLUSTER_DISTRIBUTION' is set to true on the deployment, but syncs stopped working with the error below. Attaching the application-controller deployment pod logs. However, when we run application-controller as a StatefulSet, everything works as expected. We are interested in the dynamic cluster sharding feature, hence we tried running it as a deployment. |
Same behaviour on 2.10.0. When I create 100 Applications, reconciliation drops to zero and recovers after a while. Any mass operation causes this; in bigger environments it is a permanent state. I am trying to reproduce it on a vanilla Argo CD deployed on kind so we can have steps to reproduce. |
I forgot to mention that restarting application-controller helps for a short period; it processes a bunch of operations and then gets stuck again. |
I have repro steps.

Create a cluster and install Argo CD:

```shell
kind create cluster -n argocd
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
```

Create an AppProject with:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
name: default-orphaned-resources
namespace: argocd
spec:
clusterResourceWhitelist:
- group: '*'
kind: '*'
destinations:
- namespace: '*'
server: '*'
sourceRepos:
- '*'
orphanedResources:
    warn: false
```

Create 225 Apps:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: argocd-loadtest
spec:
goTemplate: true
goTemplateOptions: ["missingkey=error"]
generators:
- matrix:
generators:
- list:
elements:
- prefix: a
- prefix: b
- prefix: c
- prefix: d
- prefix: e
- prefix: f
- prefix: g
- prefix: h
- prefix: i
- prefix: j
- prefix: k
- prefix: l
- prefix: m
- prefix: n
- prefix: o
- list:
elements:
- number: 10
- number: 11
- number: 12
- number: 13
- number: 14
- number: 15
- number: 16
- number: 17
- number: 18
- number: 19
- number: 20
- number: 21
- number: 22
- number: 23
- number: 24
template:
metadata:
name: "argocd-loadtest-{{.prefix}}-{{.number}}"
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: default-orphaned-resources
sources:
- repoURL: "https://github.com/argoproj/argocd-example-apps.git"
path: helm-guestbook
helm:
valuesObject:
replicaCount: 0
destination:
name: in-cluster
namespace: default
syncPolicy:
automated:
prune: true
selfHeal: true
```

Go to the UI. Argo CD is stuck at around 80 apps and processes the rest really slowly afterwards; `reconcile_count` is barely changing.
@crenshaw-dev could you please advise what else I can check to help find the root cause? |
Just reading the issue, I am not sure the behavior is expected due to the rate limiter. We added 2 rate limiters: 1) a global bucket limiter and 2) a per-item exponential backoff limiter. By default the exponential limiter is disabled, and I don't see it being turned on, so the only limiter that could be active is the bucket limiter, and the defaults for that are as follows:
So it should at least let 500 items into the queue in the burst and then 50 items per second if the queue gets full. In that case, 225 apps should easily be handled in the burst as it has enough capacity. @crenshaw-dev correct me if I am wrong, but I believe the workqueue dedups items, so it shouldn't be putting the same item in multiple times until it's processed; I am therefore not sure the issue is due to the rate limiter. But I will dig more and get back in case I find anything suspicious with the limiter. |
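For illustration, here is a minimal Go sketch of a token bucket with those defaults (burst 500, 50 adds/second), built from client-go's `workqueue.BucketRateLimiter`. This is not Argo CD's actual controller code; the app keys are made up, and the point is only to show that the delay handed out by the limiter stays at zero until the burst is spent and then grows:

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Token bucket with the defaults discussed above: burst 500, refill 50/s.
	bucket := &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(50), 500)}

	// Ask the limiter when 600 hypothetical app keys could be admitted.
	for i := 0; i < 600; i++ {
		delay := bucket.When(fmt.Sprintf("app-%d", i))
		if i == 0 || i == 499 || i == 500 || i == 599 {
			// Items 0-499 get a zero delay (the burst); item 500 waits ~20ms,
			// item 599 already waits ~2s, and the delay keeps growing from there.
			fmt.Printf("item %3d delayed by %v\n", i, delay)
		}
	}
}
```

Run as-is it just prints the computed delays; everything past the 500-item burst is deferred at roughly 50 items per second.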
Whenever reconciliation drops to zero, controller deadlocks may be culpable. @daftping - when this happens, can we get a stack trace dump of the controller to see if this might be the case?
This is correct. Workqueues will dedup |
That is possible; we had some deadlock issues before that caused the controller to freeze |
@gdsoumya setting bigger values for the global limiter improves reconciliation performance for the report above, so it is somehow related.

```yaml
- name: WORKQUEUE_BUCKET_SIZE
  value: "5000"
- name: WORKQUEUE_BUCKET_QPS
  value: "500"
```

Is it possible to disable all rate limiters altogether to validate this assumption? |
@jessesuen |
@daftping Currently it's not possible to completely disable the limiter; the simplest way to simulate a disabled bucket limiter would be to set a very large value for both QPS and bucket size, like you have done. I would set an even higher limit, something like 10000000000, for both. |
I've set them both to 10000000000 as you suggested, and all 225 apps were created and fully in sync in about 50 seconds. No issues or halts whatsoever.

```yaml
- name: WORKQUEUE_BUCKET_SIZE
  value: "10000000000"
- name: WORKQUEUE_BUCKET_QPS
  value: "10000000000"
```
Thank you for the suggestion @gdsoumya, at least we have a workaround now. |
I did a few more tests with a higher load (520 Apps).

On v2.9.6:

On v2.10.1, even with "10000000000" for WORKQUEUE_BUCKET_SIZE and WORKQUEUE_BUCKET_QPS, I see multiple stalls in the process where reconciliation almost stops and CPU load drops to its average minimum. |
Thanks @daftping for the tests, I will check why this is happening. I didn't expect the bucket limiter to behave like this with such a small number of apps. |
@daftping I set them both to |
@daftping here are the tests I ran:
From this I can conclude that the rate limiter is working as expected. In your case you ran with a heavier load by creating an appset where each generated app overwrites the same resources in the same namespace, which possibly created an avalanche of requeues larger than the default bucket limit, causing the behaviour we saw. The example appset is probably not a valid scenario, though I can see that there might be valid apps with such large requeue volumes too, in which case we have only 2 options:
|
Thanks for the investigation, @gdsoumya! The number of thumbs up on the original issue indicates to me that either this far edge case impacts a surprisingly high number of people or that others are experiencing a completely different issue with similar symptoms. Do we need an option number 3, increase our default workqueue limits? Or will that cause disproportionate performance degradation for the large majority of users? |
To add a little bit more context: the steps to reproduce are just a random, weird setup I came up with. In real production we have ~500 Applications in 150 namespaces on 16 clusters. Most of the Projects have orphaned resources monitoring enabled. Upgrading to 2.10.1 causes a complete halt in this environment, and it never recovers. |
@crenshaw-dev according to @daftping's point, maybe it's somehow specific to orphaned resources then, because in my test with 750 apps, much larger than 500, I did not see any halting. I don't have an in-depth understanding of what orphaned resource monitoring does with respect to requeues; we might want to investigate that a bit.
We can surely increase the numbers; as @alex-souslik-hs pointed out, setting the max value would almost be equivalent to disabling the limiter. We can either do this by modifying the code or just by setting the values in the manifest, whichever is the better option. I don't think it would affect other users, as for them it should behave the same. |
Scalability SIG here: if required, we have a testing environment that can be used to determine the best default settings or to test any code changes that might be required. I was running some scalability testing with 2.10.1 and saw performance degradation compared to 2.8.x. Running a sync test with 4k apps (2 KB ConfigMaps), the first sync test runs as expected, but any subsequent sync test takes significantly longer. I assume this is normal given the rate limiting, as 4k apps is much greater than the 500 bucket limit. You can see that the ops queue is able to load up for the first test, but in every subsequent test I am unable to load up to 500 items because of rate limiting. Of course my setup is not normal, as I'm syncing all 4k apps in one shot. |
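To make that intuition concrete, a back-of-the-envelope calculation (my own numbers, using only the default limits mentioned in this thread, not a measurement of the real controller): with a burst of 500 and 50 adds per second, a wave of 4k requeued apps takes on the order of a minute just to be admitted to the queue, before any reconciliation work happens.

```go
package main

import "fmt"

func main() {
	const (
		apps  = 4000.0 // size of one full requeue wave (illustrative)
		burst = 500.0  // default bucket size
		qps   = 50.0   // default refill rate, items/second
	)
	// The first `burst` adds are admitted immediately; everything beyond
	// that drains at roughly `qps` items per second.
	delayed := apps - burst
	fmt.Printf("last of %v apps admitted ~%.0fs after the wave starts\n", apps, delayed/qps) // ~70s
}
```

If syncs themselves trigger requeues (status changes, hooks, child resource updates), each follow-up wave pays the same price again, which would be consistent with only the first sync test looking normal.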
What's the definition of an "item" in terms of rate limiting? Is it whole

I have one environment with ~100 apps and no issues. However, after seeing previous posts where you all are spinning up apps, I decided to do an

In another environment the apps are closer to 500 and syncs take hours before working. I'm guessing this is a slow build-up, presumably filling up the rate limiting queues, until eventually it just comes to a near halt. Similar to @daftping, I can restart the single application controller I have and work gets done for a while.

I don't think I have many orphaned resources, but I do have a ton of dependent resources from kyverno (sometimes a 1:1 ratio). I've yet to figure out a way to not get them to show up in the UI, so I presume they aren't ignored. |
@snuggie12 by item I meant apps; we don't queue resources, but we do requeue the parent apps if the dependent resources are modified or change state. So it might happen that, given a small number of apps with a large number of managed child resources that can frequently change state (like a Deployment), the number of times an app gets queued could be high.
@Enclavet The 500 burst limit is the max size of the queue, so it can handle at most 500 items in the queue, but when it's filled, any new add() calls to the queue will be delayed, and the delay is calculated on the basis of the QPS, which is 50 by default. So it might happen that new items were requested to be added to the queue, but because the queue was already full the items got delayed for

To add more context to the approach taken for the rate limiter implementation: initially the plan was to just use the default rate limiter provided by k8s client-go (which can be seen here), but we later decided to implement a custom one, as the exponential limiter didn't work for us; we kept the bucket limiter as-is from the default limiter. The default limiter has a similar setup to what we use: one item-based exponential limiter and a bucket limiter with an even smaller bucket size and QPS (100 and 10, respectively). |
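For reference, a sketch of how the client-go default limiter mentioned above is composed (a per-item exponential backoff combined with a 10 QPS / 100 burst bucket), plus a bucket-only variant with the larger limits discussed in this thread. This is illustrative Go, not the Argo CD implementation; the helper names are my own:

```go
package main

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// defaultStyleLimiter mirrors workqueue.DefaultControllerRateLimiter():
// the worst (max) of a per-item exponential backoff and a global bucket.
func defaultStyleLimiter() workqueue.RateLimiter {
	return workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}

// bucketOnlyLimiter builds a bucket-only limiter, e.g. qps=50, burst=500
// to match the defaults discussed in this thread.
func bucketOnlyLimiter(qps rate.Limit, burst int) workqueue.RateLimiter {
	return &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(qps, burst)}
}

func main() {
	q := workqueue.NewRateLimitingQueue(bucketOnlyLimiter(50, 500))
	defer q.ShutDown()
	q.AddRateLimited("my-app") // delayed only once the bucket is exhausted
	_ = defaultStyleLimiter()  // shown only for comparison with the client-go default
}
```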
@gdsoumya Thanks for the quick reply. Does that mean unmanaged dependent resources of managed resources (e.g. a

Additionally, our setup all cascades from a root app called

For any of these new rate limiting features, are there metrics? I can't seem to find any containing words like "rate", "limit", or "shard", though I only run 1:1 cluster to controller, so maybe I won't see shard-based metrics. It seems like it would be helpful to know when rate limiting is occurring. |
@snuggie12 we do not have any metrics for rate limiting yet; that's probably a good point, we should see if we can add some to make it more visible to users. As far as I understand, the deployment itself is the only child resource of the app, but because any change to the child resources of the deployment would eventually lead to a change in state for the deployment itself, the app would refresh in those cases too. I am not sure we can call it an issue, but as you would expect, any change to dependent resources eventually moves up the tree to the root, causing a refresh (if not a sync) of the app. The problem here might be the depth of the tree: if there are a lot of apps that need a refresh due to a change in a leaf resource, that could cause a significant number of items being queued, which would be rate limited according to the limits set. |
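On the metrics point: there is no such metric today (as noted above), but for anyone experimenting, one way it could be surfaced is by wrapping the workqueue rate limiter and recording the delay it hands out. A hypothetical sketch using Prometheus' Go client; the metric name and wiring are made up and are not an existing Argo CD metric:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// instrumentedLimiter wraps any workqueue.RateLimiter and records the delay
// each item is assigned, so operators can see when rate limiting kicks in.
type instrumentedLimiter struct {
	workqueue.RateLimiter
	delay prometheus.Histogram
}

func (l *instrumentedLimiter) When(item interface{}) time.Duration {
	d := l.RateLimiter.When(item)
	l.delay.Observe(d.Seconds())
	return d
}

func main() {
	hist := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "workqueue_ratelimit_delay_seconds", // hypothetical metric name
		Help:    "Delay assigned to items by the workqueue rate limiter.",
		Buckets: prometheus.ExponentialBuckets(0.001, 4, 10),
	})
	prometheus.MustRegister(hist)

	limiter := &instrumentedLimiter{
		RateLimiter: &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(50), 500)},
		delay:       hist,
	}
	q := workqueue.NewRateLimitingQueue(limiter)
	defer q.ShutDown()
	q.AddRateLimited("my-app") // each add observes its assigned delay
}
```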
Just an addition from my side, as I also experienced the same issue with 8,000 apps and lots of clusters. We have orphaned resources disabled but make use of sync waves, and we are also using KEDA scalers and app-of-apps. Bringing this together with the limiter, frequently updating children likely describe the root cause. I know that KEDA scalers trigger updates of resources quite heavily and caused high CPU load in the past (fixed by adding the ignoreupdates feature). But having sync waves or (Helm) hooks somewhere in your resource tree could deadlock your dependency graph if intermediate status updates (cronjobs, deployments, KEDA, etc.) chime in. I can give you an update as soon as I have time to try increasing the rate limit values. |
Even with 200+ apps, I am facing the same issue; refreshes and syncs both get stuck, almost for hours. As mentioned, I applied the workaround on the controller and it looks good now
|
@gdsoumya
Reading the PR comment, the assumption was that this was disabled by default, according to @jessesuen
|
@csantanapr I shall raise a PR to disable the bucket limiter by default too; we can cherry-pick it back into 2.10. In the original PR only the per-item limiter was disabled, as it was expected to interfere in a normal setup, but the bucket limiter wasn't expected to behave like this with the default limits and did not behave this way in any of the tests I conducted. Though, as seen, specific configurations might be spiking the work queue higher than expected. |
Is this fixed in 2.10.2? I've installed 2.10.2 from the newest Helm chart 6.6.0 and am still seeing the issue: the controller first logs "The cluster xyz has no assigned shard" (described here) and then stops processing apps. Restarting the controller fixes the issue for a few seconds, then it stops again. |
https://github.com/argoproj/argo-cd/commits/v2.10.2/ Looks like it needs a new tag |
I don't think this has made it into a new release yet; it should be available when 2.10.3 is released. |
I've upgraded to 2.10.4 and it works fine. |
Thank you folks for the quick turnaround! |
Checklist:

argocd version

Describe the bug

After upgrading Argo CD from 2.9.2 to 2.10.1, it is unable to reconcile almost anything. Most of the Applications are stuck in the Progressing or OutOfSync state, and metrics drop close to zero. Sharding is not used (1 replica in the StatefulSet with the default configuration).
To Reproduce
We are unable to reproduce it in a similar dev environment with a handful of apps and only a few clusters connected. See below: #17257 (comment)
Expected behavior
Argo CD should operate as usual after the upgrade.
Screenshots



At ~9:40 Argo CD was upgraded to 2.10.1; at ~10:45 it was rolled back to 2.9.2.
Application Controller
Repo Server

Redis

Config
Version
Logs
Lots of messages

From 20,000 to 40,000 of the messages below per cluster per 5 minutes