Parallelize ApplicationSet Reconciliation Loop #10952
Hi, we are also experiencing slowness on the ApplicationSet controller side as the number of ApplicationSets increases. We opened #9002 and tried to bring a solution with #9568 for this. I am not fully familiar with the code base, but as mentioned in #9002, the Application controller has some parallelism and sharding support, while the ApplicationSet controller has neither. We could bring the same options to the ApplicationSet controller. By the way, I tried to perform a small experiment with sharding. It may not be an ideal experiment, but it can provide some insights. In an ArgoCD instance at my company, I created ~807 ApplicationSets and ~1614 Applications (2 apps generated by each ApplicationSet). With one ApplicationSet controller pod running, I added the 808th ApplicationSet. It took approximately 30 minutes to create its apps, so the 808th deployment took 30 minutes. In the meantime, the ApplicationSet controller's "workqueue_depth" metric was always high. I then applied the sharding solution with a custom build and deployed it with 9 replicas. I added the 809th ApplicationSet, and it took approximately 1 minute. The "workqueue_depth" metric was zero for each ApplicationSet controller replica. |
Thanks for sharing @hcelaloner! Did you determine why your ApplicationSet controller was taking so long to process the ApplicationSets, i.e., network IO? |
@crenshaw-dev, should we look to make the git requests concurrent first? I am happy to submit a PR/work with @hcelaloner if we have an agreement on the next steps. |
@rumstead I guess it depends on how the git requests are made concurrent. Making them concurrent per-ApplicationSet would probably be relatively easy, but it would only help ApplicationSets that make multiple slow git requests. Sharding is on the other end of the spectrum, parallelizing at the highest level possible. It solves the problem, but at the cost of spinning up a lot of pods. I feel like the best solution is going to be to figure out why the work queue depth is always so big. It looks to me as if the work queue is meant to be processable in parallel, and we simply aren't doing that now. @jgwest since you know the history of this controller, do you know if parallelizing reconciliation was a TBD that just hasn't been done yet? |
MaxConcurrentReconciles looks like it can be leveraged here. https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/controller#Options
|
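For reference, a minimal sketch of how MaxConcurrentReconciles is set through controller-runtime's builder. This is illustrative only, not the actual Argo CD wiring: a ConfigMap stands in for the ApplicationSet CRD and the reconciler is a stub.

package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// demoReconciler is a stand-in for the ApplicationSet reconciler.
type demoReconciler struct{}

func (r *demoReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... generate/update Applications for req.NamespacedName ...
	return ctrl.Result{}, nil
}

func setup(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}). // stand-in; the real controller watches the ApplicationSet CRD
		WithOptions(controller.Options{
			// N workers pull from the same workqueue; the queue guarantees a
			// given object key is never handled by two workers at once.
			MaxConcurrentReconciles: 2,
		}).
		Complete(&demoReconciler{})
}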
No. I did not. Is there a way to investigate that one? I guess the applicationset controller has some metrics exposed by the controller runtime but it does not expose any metrics for git operations as far as I know. As discussed in the old issue, I guess an OpenTelemetry integration could be useful for observability.
I am not too familiar with the internals of controller-runtime. Does this option cause any concurrency problems? For example, if we set it to 2, will two goroutines try to reconcile the same ApplicationSet and create some sort of race condition? Is a code change needed to handle such a possibility? If there is no such race condition possibility, I guess we can just make it configurable via a flag, somewhat similar to the "--operation-processors" or "--status-processors" options in the ArgoCD application controller. Basically, users who need to increase the number of concurrent reconciles can tune it using the flag (or modify a configmap that holds the configuration for the applicationset controller). |
As I said previously, I do not know the internals of controller-runtime, so sorry if my guess is wrong. However, could the Owns statement (which makes the controller watch the Applications it generates) cause unnecessary reconciles?
Let me try to explain with an example. Let's say we have an ApplicationSet that generates an application. Deploying a new image to the application probably causes a change in its status, since the status stores health status, operation history, etc. Would that change in the application status trigger an unnecessary reconcile of the owning ApplicationSet? |
Controllers should be idempotent. The ApplicationSet would drive any changes to the Applications. From my understanding, the queue that backs the controller deduplicates requests for the same object, so repeated events collapse into a single pending reconcile. |
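To illustrate the mechanism being discussed: in controller-runtime, Owns() enqueues the owner whenever an owned object changes, and status-only updates on children can be filtered out with a predicate. A minimal sketch only; the ConfigMap/Deployment types are stand-ins and this is not the actual Argo CD code.

package main

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func setupWithPredicate(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}). // stand-in for the ApplicationSet CRD
		Owns(&appsv1.Deployment{}, // stand-in for the generated Application CRD
			// GenerationChangedPredicate drops update events where only the
			// status changed (status updates do not bump metadata.generation),
			// so health/operation churn on children would not requeue the owner.
			builder.WithPredicates(predicate.GenerationChangedPredicate{}),
		).
		Complete(r)
}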
I wasn't around for the early, early days of the ApplicationSet controller, but I don't recall much discussion of parallelizing it (or otherwise sharding multiple instances of the ApplicationSet controller), and when PRs were merged I don't recall much discussion around ensuring that contributed generators were thread safe. When enabling multithreaded reconciliation, one needs to be careful about locking shared resources and preventing race conditions, deadlocks, etc. For example, ensuring that different goroutines each running Reconcile() (and calling shared generator objects) do not step on each other's toes. |
The |
@hcelaloner did you have to do anything special to see the workqueue_depth metric? |
No, I did not do anything special. Just applied the following ServiceMonitor in our deployments and the metric was available in our Prometheus.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: prometheus-operator
  name: argocd-applicationset-controller-metrics
spec:
  endpoints:
  - port: metrics
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-applicationset-controller
@crenshaw-dev / @jgwest what are your thoughts on someone taking a stab at increasing MaxConcurrentReconciles and making sure the current generators are thread safe? I still think the repo server should probably be used for git operations in the future, but this could help in the meantime. |
@rumstead I'd be interested to know the results... the ApplicationSet controller is using a copy/pasta of old repo-server code, so it might have locking logic in place already. |
Hi everyone, I wanted to share something that may be worth considering for this subject. In our company, we are testing the sharding approach. We deployed 10 ApplicationSet controller pods in one ArgoCD instance to speed up appset processing. Yesterday, this resulted in an increase in requests sent to the kube API server. Some master nodes became unavailable momentarily because of the load. Logs show that the requests target secrets ("argocd.argoproj.io/secret-type=cluster") and that the source is the applicationset controller. I did not check the source code in detail, but I suspect the controller lists the cluster secrets on every reconcile.
This may be something to consider when increasing parallelism. I do not know whether it is something to worry about or not. If it is, maybe the applicationset controller can manage some sort of cache of clusters to reduce the number of requests made for them? |
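One way to picture that caching idea: a small TTL cache in front of the cluster-secret lookup, so parallel reconciles reuse a recent listing instead of each hitting the API server. A rough sketch only, using client-go directly; it does not reflect the actual Argo CD code.

package main

import (
	"context"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// clusterSecretCache memoizes the list of cluster secrets for a short TTL so
// that many concurrent reconciles don't each hit the API server.
type clusterSecretCache struct {
	mu      sync.Mutex
	client  kubernetes.Interface
	ttl     time.Duration
	fetched time.Time
	items   []corev1.Secret
}

func (c *clusterSecretCache) List(ctx context.Context, namespace string) ([]corev1.Secret, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.items != nil && time.Since(c.fetched) < c.ttl {
		return c.items, nil // serve the cached listing
	}

	// Label selector taken from the log lines mentioned above.
	list, err := c.client.CoreV1().Secrets(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "argocd.argoproj.io/secret-type=cluster",
	})
	if err != nil {
		return nil, err
	}
	c.items = list.Items
	c.fetched = time.Now()
	return c.items, nil
}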
Git Generator
The ApplicationSet controller creates a git client here. The git client creates a temporary folder with a directory under it named after the git repo URL. It then does all git operations under that directory.

Multiple Reconciles
It wouldn't be safe to add multiple concurrent Reconciles if multiple ApplicationSets were targeting the same repo with different revisions. The remaining generators use APIs and seem to be stateless.

Repo Server observations
When the repo server creates a git client, it creates a directory using a "hashing strategy", but the hash is tied to the repo URL. However, the repo server uses the Lock struct to handle concurrent actions.
vs the ApplicationSet controller
The repo server also obviously uses a Redis cache. That, more or less, shelters against intermittent, slow IO requests once a repo is cached.

Existing, semi-beneficial locking strategy
We could wrap the git interactions with the below. Though, I don't know how it would work for git interactions that return data, like
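The snippet referenced above did not survive extraction. As a stand-in, here is a rough sketch of what per-repo locking around git operations could look like; the repoLock type and function names are hypothetical, not Argo CD's actual locking code.

package main

import "sync"

// repoLock hands out one mutex per repository URL so that goroutines
// reconciling different ApplicationSets serialize work on the same checkout
// directory while still running in parallel across repos.
type repoLock struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newRepoLock() *repoLock {
	return &repoLock{locks: map[string]*sync.Mutex{}}
}

func (r *repoLock) forRepo(repoURL string) *sync.Mutex {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.locks[repoURL]; !ok {
		r.locks[repoURL] = &sync.Mutex{}
	}
	return r.locks[repoURL]
}

// withRepoLock wraps a git interaction (fetch, checkout, listing files, ...)
// so only one goroutine touches a given repo's working directory at a time,
// while still returning whatever data the interaction produces.
func withRepoLock[T any](r *repoLock, repoURL string, op func() (T, error)) (T, error) {
	m := r.forRepo(repoURL)
	m.Lock()
	defer m.Unlock()
	return op()
}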
New, semi-beneficial locking strategy
I feel like there are two high-level bodies of work.
Would love your thoughts @crenshaw-dev |
Continuing the investigation here, I found that we are being throttled by our Git provider, which makes the Git requests take even more time. I think this points to a need for a cache for the Git-related generators. |
The proper solution is to move to using the repo server instead of adding another cache. |
Hard agree. |
Integrating the ApplicationSet Controller with the repo server would also provide better observability. The metric |
I threw together a POC of having the applicationset controller use the repo server. I would want to add more caching and obviously need tests but wanted to see what folks thought. EDIT: added some caching as well |
The PR is ready for review. E2e tests exercise the applicationset controller using the repo server as a cache. Unit tests have been updated. |
@ishitasequeira just wanted to see if you had any review capacity in the foreseeable future? |
@rumstead Apologies for the delay. I will give it a review today. |
I have a use case where I need to deploy over 10,000 ApplicationSets, all of them using Helm as a source. I faced some of the issues which were discussed and are being fixed here in #12480. The redundant reconciliation was causing great delays, sometimes over 20 minutes to deploy a new ApplicationSet, sometimes even more depending on the type of generators being used. But with the fix, it does come down considerably. Thank you @rumstead. I still see that adding MaxConcurrentReconciles as an option does help in improving the performance. I do understand that it currently causes issues for git generators with the locking of shared resources. But what about the other generators, do they still have any locking issue? In my case, adding MaxConcurrentReconciles made it super quick. |
I have been laser-focused on the git generator. However, I haven't seen anything in the other generators that would lead me to believe that concurrency in ApplicationSet processing would cause any race conditions. I plan to increase the worker threads once this PR is merged as well. IMO, my other PR, which reduces the reconciles, is going to have the largest impact on performance. Out of curiosity, what value did you set MaxConcurrentReconciles to? |
@rumstead Absolutely agree that your PR reducing the redundant reconciliation has improved performance greatly. I deployed a version of ArgoCD built from your PR branch and saw the deployment time go down from 20 minutes to around 2 minutes. My setup had about 200 clusters with 10,000 appsets (targeting one cluster per appset) and 20 appsets deploying to all clusters, with around 10 deployments happening every few seconds, plus some forced failures on some apps causing them to go OutOfSync and trigger reconciles in the appset controller. So the idea was not to keep things calm for the appset controller. Because when it is very calm, with your PR, the appset controller was able to generate the apps in 1 second, pretty good! But when it is not calm and there are reconciliations happening, processing everything sequentially meant I had to wait for my new appset deployment to be picked up in the queue. Which is where MaxConcurrentReconciles helps. But when I added/deleted a cluster, with all apps using the cluster generator, it was mayhem! All 10,020 appsets got triggered! Although increasing MaxConcurrentReconciles helped there as well. |
@naikn excuse the slightly off-topic question, but I'm really curious: what's the use case requiring 10k appsets? 😄 |
Hi @crenshaw-dev we have lots and lots of applications and lots of clusters 😄. And the usecase - well, its mainly to use the generator to identify the clusters rather than setting server url in the argocd application. Most appset will generate one application and will be placed on a single cluster, but the app teams need not know where the cluster is hosted or what its URL is. AppSet's generator is a very good tool for this. Its easy to abstract the cluster placement from the app development teams. |
@naikn yeah, I believe the cluster event handler requeues all appsets with a cluster generator, which, in your case, is all of them. |
feat(appset): applicationset controller use repo server (argoproj#10952) (argoproj#12714) Signed-off-by: rumstead <37445536+rumstead@users.noreply.github.com>
Summary
The ApplicationSet controller appears to do many, if not all, tasks sequentially. This can cause slow ApplicationSet reconciliation when the controller is stuck waiting on slow tasks, like network IO from a Git generator.
For instance,
Motivation
Speeding up the reconciliation loop will allow for ApplicationSet changes to be applied faster providing a quicker feedback loop.
I have an appset controller managing about 100 appsets, generating about 1800 applications. I am seeing large latency between a new ApplicationSet being deployed and the controller picking it up. I do use the app-of-apps pattern for deploying the ApplicationSets. An example timeline:
I am on version 2.4.14
Proposal
I haven't gotten this far yet, but some sort of worker/consumer pattern :).
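As a rough illustration of that idea (purely a sketch, not a proposal for the actual implementation), a fixed pool of workers consuming ApplicationSet keys from a channel could look like this:

package main

import (
	"fmt"
	"sync"
)

// processAppSet stands in for generating/updating the Applications for one
// ApplicationSet; in the real controller this is the slow, IO-bound part.
func processAppSet(name string) {
	fmt.Println("reconciling", name)
}

// runWorkers fans ApplicationSet keys out to a fixed pool of workers so a
// slow item (e.g. a throttled git generator) doesn't block everything
// queued behind it.
func runWorkers(keys []string, workers int) {
	queue := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range queue {
				processAppSet(key)
			}
		}()
	}

	for _, k := range keys {
		queue <- k
	}
	close(queue)
	wg.Wait()
}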
Another idea would be to integrate Git interactions with the repo-server.