Very slow performance of newest argocd versions - plugin + monorepo #17775

Open
3 tasks done
momilo opened this issue Apr 8, 2024 · 4 comments
Labels
bug (Something isn't working); component:cmp (Config Management Plugin related issues); more-information-needed (Further information is requested); type:scalability (Issues related to scalability and performance related issues)

Comments

@momilo

momilo commented Apr 8, 2024

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

Unfortunately, upgrading to the newest versions of ArgoCD (with the cmp migrated to a sidecar) resulted in an order of magnitude degradation of refresh/sync/deployment speeds.

Setup

Monorepo with c. 50 Applications, all defined under a single sub-dir, written in jsonnet + tanka. An Application in our realm is a microservice (deployment + a few helper manifests), nothing extraordinary. We run repo-server with --plugin-tar-exclude set to .git/*.

Bug / Observations

  1. It takes approximately 40 minutes in total (with some parallelism) to invoke app set (to specify image_tag, a plugin variable) and then app sync for all Applications, i.e. c. 3-5 minutes per Application. In contrast, running a diff and applying manually (tk apply . --ext-str=image_tag=MY_TAG) takes a fraction of this time (c. 5 seconds per Application).
  2. It takes approximately 30-60 seconds to invoke a refresh operation on a single Application (via UI).
  3. It takes approximately 20-50 seconds to open Parameters page on a single Application Details view (UI). This is quite interesting, as showing the Diff tab is (usually) almost instantaneous.
  4. The repo server's CPU usage is spiking significantly during the deployment

Normal operation:

NAME                                                        CPU(cores)   MEMORY(bytes)
argo-cd-argocd-application-controller-0                     26m          868Mi
argo-cd-argocd-repo-server-77f98c748c-8w7z9                 2m           705Mi

During a manually-triggered update (app set) + sync:

NAME                                                        CPU(cores)   MEMORY(bytes)
argo-cd-argocd-application-controller-0                     456m         930Mi
argo-cd-argocd-repo-server-77f98c748c-8w7z9                 5220m        769Mi
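
As a rough sanity check of the figures above, the following sketch computes the implied slowdown relative to a manual tk apply and the effective parallelism behind the 40-minute total. It uses only the numbers quoted in this comment; the function names are illustrative, and the "c." values are approximate, so the results are order-of-magnitude estimates only.

```python
# Back-of-the-envelope check of the timings reported in this comment.
APPS = 50              # "c. 50 Applications"
PER_APP_MIN = (3, 5)   # minutes per Application through Argo CD
MANUAL_SEC = 5         # seconds per Application with a manual `tk apply`
TOTAL_MIN = 40         # observed wall-clock time for all Applications


def slowdown(argo_minutes: float, manual_seconds: float) -> float:
    """How many times slower the Argo CD path is than the manual path."""
    return argo_minutes * 60 / manual_seconds


def effective_parallelism(apps: int, per_app_minutes: float, total_minutes: float) -> float:
    """Serial time divided by observed wall-clock time."""
    return apps * per_app_minutes / total_minutes


low, high = (slowdown(m, MANUAL_SEC) for m in PER_APP_MIN)
par_low, par_high = (effective_parallelism(APPS, m, TOTAL_MIN) for m in PER_APP_MIN)

print(f"slowdown: {low:.0f}x to {high:.0f}x")
print(f"effective parallelism: {par_low:.2f} to {par_high:.2f}")
```

With these inputs the slowdown comes out at 36x to 60x, and the 40-minute total corresponds to roughly four to six Applications being processed concurrently.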

Potential Solutions

We see some hope in (and look forward to) the following potential solutions (naturally, it is difficult to gauge a priori to what extent any of them would resolve the issues observed):

  1. Reintroducing the previous way of configuring CMPs (i.e. not via sidecars), perhaps as an alternative, leaving the choice (including any potential security implications) up to the users.
  2. Introducing --plugin-tar-include (i.e. include-only manifests dir) - probably of limited benefit, since we are already passing --plugin-tar-exclude, which excludes the bulk of the repo.
  3. Supporting sparse and/or limited depth checkout (I note the already started, and much appreciated, albeit potentially put on hold, work in feat: add depth flag #16064 and feat: add git sparse checkout #14272)
  4. Independent optimisation / debugging of the Parameters tab (surely, merely viewing the existing inputs should not be very time/resource consuming, perhaps indicating some unnecessary steps in the current implementation?)
  5. ... probably many more, which I cannot think of immediately.

To Reproduce

Store manifests of 50-100 Applications in a mono-repo, together with other code (Go, TS, etc.). Use tanka to apply them, configured via an argocd sidecar plugin (example plugin configuration here).

Expected behavior

A simple update of minimal changes should be relatively fast (taking a bit longer than a manual application, but not 10-50x longer).

Version

argocd-server: v2.10.6+d504d2b

momilo added the bug label Apr 8, 2024
jannfis added the type:scalability and component:cmp labels Apr 8, 2024
@crenshaw-dev
Member

I don't think we have enough information to really brainstorm solutions yet.

The spiked repo-server CPU usage is a good hint. Do you know if any Argo CD component is hitting its CPU limits at any time, i.e. being throttled?

rbreeze added the more-information-needed label Apr 11, 2024
@momilo
Author

momilo commented May 16, 2024

I appreciate I owe you a proper analysis. My apologies - I still haven't had a chance to set up Prometheus to scrape the metrics / traces, set up pprof, etc.

To answer your previous question: no throttling is being observed; the nodes are fairly big, with plenty of RAM and CPU headroom (under normal circumstances).

In the meantime, I wanted to share the following, in case it is useful and perhaps symptomatic of the other issues listed above.

When navigating to Application -> Details -> Params tab (which, in theory, should only show two variables as inputs), Argo CD times out (and does not load the tab).

The following three log lines can be observed (note the 35+ seconds duration before the kill):

repo-server (container: repo-server)

"jsonPayload": {
    "error": "failed to populate plugin app details: error sending file to cmp-server: error sending generate manifest metadata to cmp-server: EOF",
    "grpc.method": "GetAppDetails",
    "system": "grpc",
    "level": "error",
    "msg": "finished unary call with code Unknown",
    "span.kind": "server",
    "grpc.code": "Unknown",
    "grpc.start_time": "2024-05-16T12:06:25Z",
    "grpc.time_ms": 35504.46,
    "grpc.service": "repository.RepoServerService"
  },

repo-server (container: tanka)

"jsonPayload": {
    "grpc.code": "Canceled",
    "level": "info",
    "system": "grpc",
    "msg": "finished streaming call with code Canceled",
    "span.kind": "server",
    "grpc.method": "GetParametersAnnouncement",
    "error": "parameters announcement error receiving stream: error receiving stream header: rpc error: code = Canceled desc = context canceled",
    "grpc.time_ms": 8257.146,
    "grpc.service": "plugin.ConfigManagementPluginService",
    "grpc.start_time": "2024-05-16T12:06:47Z"
},

repo-server (container: tanka)

"jsonPayload": {
    "span.kind": "server",
    "level": "info",
    "grpc.service": "plugin.ConfigManagementPluginService",
    "grpc.method": "MatchRepository",
    "system": "grpc",
    "grpc.time_ms": 20714.465,
    "grpc.code": "OK",
    "msg": "finished streaming call with code OK",
    "grpc.start_time": "2024-05-16T12:06:26Z"
  },
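
To make the ordering of these three calls easier to see, here is a small sketch that places them on a common timeline using only the grpc.start_time and grpc.time_ms values from the log lines above. The start times are logged with one-second resolution, so the offsets are approximate; the helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# (container, grpc.method, grpc.start_time, grpc.time_ms, grpc.code),
# copied from the three log lines above.
CALLS = [
    ("repo-server", "GetAppDetails", "2024-05-16T12:06:25Z", 35504.46, "Unknown"),
    ("tanka", "MatchRepository", "2024-05-16T12:06:26Z", 20714.465, "OK"),
    ("tanka", "GetParametersAnnouncement", "2024-05-16T12:06:47Z", 8257.146, "Canceled"),
]


def parse_utc(ts: str) -> datetime:
    """Parse the second-resolution UTC timestamps used in the logs."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def timeline(calls):
    """Return (begin, end, container, method, code) rows sorted by start time."""
    rows = []
    for container, method, start, ms, code in calls:
        begin = parse_utc(start)
        end = begin + timedelta(milliseconds=ms)
        rows.append((begin, end, container, method, code))
    return sorted(rows)


rows = timeline(CALLS)
t0 = rows[0][0]  # earliest start, used as the zero point
for begin, end, container, method, code in rows:
    print(
        f"{(begin - t0).total_seconds():6.1f}s -> {(end - t0).total_seconds():6.1f}s  "
        f"{container}/{method} ({code})"
    )
```

Read this way, MatchRepository occupies roughly the first 22 seconds of the GetAppDetails call, GetParametersAnnouncement starts right after it and is cancelled about 8 seconds later, and GetAppDetails itself returns with the error at about 35.5 seconds.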

In case it is useful:

  • total repo size: 320 MB (incl. 216 MB for .git)
  • kubernetes folder size (all manifests + libraries): 27 MB
  • single application folder (located in kubernetes/.../someapp) size: 20 KB

repo-server is run with the following args (the exclusions list all folders other than our kubernetes folder):

containers:
  - args:
    - /usr/local/bin/argocd-repo-server
    - --port=8081
    - --metrics-port=8084
    - --plugin-tar-exclude=".git/*"
    - --plugin-tar-exclude="assets/*"
    - --plugin-tar-exclude="bin/*"
    - --plugin-tar-exclude="build/*"
    - --plugin-tar-exclude="docs/*"
    - --plugin-tar-exclude="go/*"
    - --plugin-tar-exclude="js/*"
    - --plugin-tar-exclude="proto/*"
    - --parallelismlimit=10

All applications are generated from AppSets looking at kubernetes/.../mycluster etc.

For the record - running tk show . --ext-str=image_tag=sometag for a given Application (which is what the CMP does for generate; init is just echo "tanka plugin init") takes c. 153ms (I've just checked). All jsonnet libraries are vendored (hence the 27MB size), so nothing is being downloaded on the fly.

It feels suspicious that: (a) sending across 27MB would take over 30 seconds and time out; (b) it is necessary to resolve the manifests to just show the Params (often used e.g. to quickly check the image_tag provided to argo-cd etc.).
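
To put point (a) in numbers, the sketch below derives the transfer rate implied by the sizes and durations reported in this comment. It assumes, purely for illustration, that the whole duration of each call is spent moving the 27MB kubernetes folder to the plugin sidecar; the logs do not confirm how the time is actually split, so treat the output as an upper bound on how slow the transfer could be.

```python
# Sizes (MB) and durations (s) as reported in this comment.
REPO_TOTAL_MB = 320
GIT_DIR_MB = 216
KUBERNETES_MB = 27
MATCH_REPOSITORY_S = 20.714465  # grpc.time_ms of the MatchRepository call
GET_APP_DETAILS_S = 35.50446    # grpc.time_ms of the failed GetAppDetails call


def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Average transfer rate if `size_mb` took `seconds` to send."""
    return size_mb / seconds


# Everything that is neither .git nor the kubernetes folder, i.e. what the
# remaining --plugin-tar-exclude patterns are meant to filter out.
other_code_mb = REPO_TOTAL_MB - GIT_DIR_MB - KUBERNETES_MB

print(f"non-.git, non-kubernetes content: {other_code_mb} MB")
print(f"27 MB over MatchRepository: {throughput_mb_per_s(KUBERNETES_MB, MATCH_REPOSITORY_S):.2f} MB/s")
print(f"27 MB over GetAppDetails:   {throughput_mb_per_s(KUBERNETES_MB, GET_APP_DETAILS_S):.2f} MB/s")
```

Under that assumption the implied rate is on the order of 1 MB/s between two containers in the same pod, which is why the duration looks suspicious.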

(Our main problem is the overall sync of 30-50 Applications on deployment, which can take 30+ minutes just to update image tags, but perhaps the above issue, which is also quite troublesome, is related.)

@crenshaw-dev
Member

Regarding sparse/shallow checkout, let's consolidate that part of the conversation here: #11198

@toyamagu-2021
Member

I think performance will be improved by #18053 (to be released in v2.13), where we can skip MatchRepo if we set pluginName explicitly.
I noticed that MatchRepo takes a very long time for large monorepos.
image
