Gracefully handle k8s resource size limit for large Application CRs #14486

Open
crenshaw-dev opened this issue Jul 12, 2023 · 9 comments
Labels
enhancement New feature or request

Comments

@crenshaw-dev
Member

crenshaw-dev commented Jul 12, 2023

Summary

Intuit had an app fail to sync when it hit ~3k resources managed in a single App. I believe the problem was that it attempted to update the sync status, which contained the status of all 3k resources, and we hit the k8s resource size limit.

We should provide more ways for the user to sacrifice certain features/conveniences to allow the resource to fit within the size limit. Ideas below.

Motivation

3000 really isn't that big a number.

Proposal

  1. Store the status.resources and status.operationState.operation.sync.resources fields (or maybe just the whole status field) as gzip/base64 when the app hits a configurable number of managed resources
  2. Offload more resource info to Redis, like we did with health data
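
A minimal sketch of what idea 1 could look like, assuming a trimmed-down `ResourceStatus` type and hypothetical helpers `compressResources` / `decompressResources`; none of these names come from the Argo CD codebase. It marshals the resource list to JSON, gzips it, and base64-encodes the result so it can be stored as a string field:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
)

// ResourceStatus is a hypothetical, trimmed-down stand-in for one entry
// of status.resources. The real type has more fields.
type ResourceStatus struct {
	Kind      string `json:"kind"`
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
	Status    string `json:"status"`
}

// compressResources marshals the list to JSON, gzips it, and returns the
// result as a base64 string.
func compressResources(resources []ResourceStatus) (string, error) {
	raw, err := json.Marshal(resources)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return "", err
	}
	// Close flushes the remaining compressed data into buf.
	if err := zw.Close(); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// decompressResources reverses compressResources.
func decompressResources(encoded string) ([]ResourceStatus, error) {
	zipped, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, err
	}
	zr, err := gzip.NewReader(bytes.NewReader(zipped))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return nil, err
	}
	var resources []ResourceStatus
	if err := json.Unmarshal(raw, &resources); err != nil {
		return nil, err
	}
	return resources, nil
}

func main() {
	// Build 3000 synthetic entries to compare plain and compressed sizes.
	resources := make([]ResourceStatus, 3000)
	for i := range resources {
		resources[i] = ResourceStatus{
			Kind:      "ConfigMap",
			Namespace: "default",
			Name:      fmt.Sprintf("cm-%d", i),
			Status:    "Synced",
		}
	}
	raw, err := json.Marshal(resources)
	if err != nil {
		panic(err)
	}
	encoded, err := compressResources(resources)
	if err != nil {
		panic(err)
	}
	fmt.Printf("plain JSON: %d bytes, gzip+base64: %d bytes\n", len(raw), len(encoded))
}
```

Real resource lists are less repetitive than this synthetic data, so the compression ratio seen here should not be taken as representative.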
@leoluz
Collaborator

leoluz commented Jul 17, 2023

To avoid API breaking changes another suggestion could be:

  1. Add new fields in the status section dedicated to compressed data: status.resourcesGzip and status.operationState.operation.sync.resourcesGzip
  2. Check if the new CRD state will exceed the 1.5 MB etcd limit and, if so, use the compressed fields instead of the existing ones.
  3. Change the logic in functions that read the status to check whether the data was persisted in the gzip field and, if so, decompress it and add it back to the original fields.

With this approach the great majority of users wouldn't be impacted, as the new fields would only be used when the CRD limit is exceeded.
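
A rough sketch of the dual-field idea, assuming a hypothetical `AppStatus` struct with a `ResourcesGzip` field and hypothetical helpers `packResources` / `unpackResources` (not the actual Argo CD types). Declaring the compressed field as `[]byte` means `encoding/json` base64-encodes it on its own, so no explicit base64 step is needed:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
)

// ResourceStatus is a hypothetical, trimmed-down resource entry.
type ResourceStatus struct {
	Kind string `json:"kind"`
	Name string `json:"name"`
}

// AppStatus is a hypothetical status shape: Resources mirrors
// status.resources and ResourcesGzip is the proposed compressed field.
type AppStatus struct {
	Resources     []ResourceStatus `json:"resources,omitempty"`
	ResourcesGzip []byte           `json:"resourcesGzip,omitempty"`
}

// packResources moves Resources into ResourcesGzip before persisting.
func packResources(s *AppStatus) error {
	raw, err := json.Marshal(s.Resources)
	if err != nil {
		return err
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return err
	}
	if err := zw.Close(); err != nil {
		return err
	}
	s.ResourcesGzip = buf.Bytes()
	s.Resources = nil
	return nil
}

// unpackResources restores Resources when the compressed field was used.
// It is a no-op for statuses persisted in the original field.
func unpackResources(s *AppStatus) error {
	if len(s.ResourcesGzip) == 0 {
		return nil
	}
	zr, err := gzip.NewReader(bytes.NewReader(s.ResourcesGzip))
	if err != nil {
		return err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return err
	}
	if err := json.Unmarshal(raw, &s.Resources); err != nil {
		return err
	}
	s.ResourcesGzip = nil
	return nil
}

func main() {
	status := AppStatus{Resources: []ResourceStatus{{Kind: "Deployment", Name: "web"}}}
	if err := packResources(&status); err != nil {
		panic(err)
	}
	persisted, err := json.Marshal(status)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(persisted)) // only resourcesGzip is present

	var loaded AppStatus
	if err := json.Unmarshal(persisted, &loaded); err != nil {
		panic(err)
	}
	if err := unpackResources(&loaded); err != nil {
		panic(err)
	}
	fmt.Println(loaded.Resources)
}
```

Readers would call something like `unpackResources` right after loading the object, so the rest of the code keeps working against the original field.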

@crenshaw-dev
Member Author

Yep, I like this. One question:

Check if the new CRD state will exceed the 1.5 MB etcd limit

How would you propose to perform that check?

@leoluz
Collaborator

leoluz commented Jul 17, 2023

How would you propose to perform that check?

You have the computed status field state that is about to be persisted, don't you? I was thinking of just checking its size to drive the persisting logic.
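
For illustration, a size check of this kind might look like the sketch below. The `exceedsLimit` helper and the `sizeLimitBytes` constant are hypothetical, and the constant simply reuses the ~1.5 MB figure mentioned in this thread. Note that it measures only the value passed in (here, the status) and costs one JSON marshal per call:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sizeLimitBytes is an assumption taken from the ~1.5 MB figure in this
// thread; the effective limit depends on the cluster configuration.
const sizeLimitBytes = 1_500_000

// exceedsLimit marshals v to JSON and reports whether the encoded size
// is larger than limit, along with the size itself.
func exceedsLimit(v interface{}, limit int) (bool, int, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return false, 0, err
	}
	return len(raw) > limit, len(raw), nil
}

func main() {
	// Hypothetical status payload with 3000 synthetic resource entries.
	resources := make([]map[string]string, 3000)
	for i := range resources {
		resources[i] = map[string]string{
			"kind":   "ConfigMap",
			"name":   fmt.Sprintf("cm-%d", i),
			"status": "Synced",
		}
	}
	status := map[string]interface{}{"resources": resources}

	over, size, err := exceedsLimit(status, sizeLimitBytes)
	if err != nil {
		panic(err)
	}
	fmt.Printf("status is %d bytes, over limit: %v\n", size, over)
}
```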

@crenshaw-dev
Member Author

crenshaw-dev commented Jul 17, 2023

You have the computed status field state that is about to be persisted, don't you?

Not necessarily. Some places, e.g. persisting operation state, calculate only a patch:

func (ctrl *ApplicationController) setOperationState(app *appv1.Application, state *appv1.OperationState) {

Even if we know the full status field contents, I see a few potential problems:

  1. you're missing the sizes of top-level keys metadata, spec, and operation
  2. marshaling the status before every write operation could be a performance drag

I'd suggest a lightweight, configurable heuristic like "if it manages > N resources, compress."
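
A sketch of that heuristic, assuming a hypothetical `shouldCompress` helper and a hypothetical convention that a threshold of zero or less disables compression. It only compares counts, so it needs no marshaling before a write:

```go
package main

import "fmt"

// shouldCompress reports whether an app managing managedResources
// resources should have its status compressed. A threshold <= 0 is
// treated as "never compress" (an assumption made for this sketch).
func shouldCompress(managedResources, threshold int) bool {
	return threshold > 0 && managedResources > threshold
}

func main() {
	const threshold = 1000 // hypothetical configured value of N
	for _, n := range []int{50, 1000, 3000} {
		fmt.Printf("%d resources -> compress: %v\n", n, shouldCompress(n, threshold))
	}
}
```

The trade-off is that resource count is only a proxy for object size: an app with few resources but a very large spec would not trigger compression.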

@leoluz
Collaborator

leoluz commented Jul 17, 2023

I'd suggest a lightweight, configurable heuristic like "if it manages > N resources, compress."

Yes.. I like that too.

@zswanson

zswanson commented Aug 28, 2023

Related, we are looking to enable argo with a large scale of applications soon (5k+) and we're concerned about hitting GKE limits where any single resource type in etcd must be < 800MB. An option to always compress statuses, regardless of number of resources, would be nice.

Google documentation for reference; I assume other cloud vendors have similar limits. https://cloud.google.com/kubernetes-engine/docs/concepts/planning-large-clusters

@cjin62

cjin62 commented May 18, 2024

Any further updates on when ArgoCD will be able to implement the improvements?

@thesuperzapper

I just want to highlight how important it is for ArgoCD to handle Apps with large numbers of resources.

I like the idea of compressing status.resources, as this tends to be the largest part of the Application resource and there is a hard ~1.5 MB limit on any resource.

@Reversaidx

We faced the same issue. Do we have any idea what to do? :(


6 participants