Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: avoid resources lock contention (#8172) #20329

Conversation

mpelekh
Copy link
Contributor

@mpelekh mpelekh commented Oct 10, 2024

Closes #8172
Improve reconciliation performance for large clusters by avoiding resource lock contention - argoproj/gitops-engine#629.

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

Copy link

bunnyshell bot commented Oct 10, 2024

❌ Preview Environment deleted from Bunnyshell

Available commands (reply to this comment):

  • 🚀 /bns:deploy to deploy the environment

@mpelekh mpelekh force-pushed the 8172-improve-reconciliation-performance-for-large-clusters branch 3 times, most recently from 83c216a to bb09323 Compare October 10, 2024 14:18
@mpelekh mpelekh marked this pull request as ready for review October 10, 2024 15:59
@mpelekh mpelekh requested a review from a team as a code owner October 10, 2024 15:59
Copy link
Member

@nitishfy nitishfy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Integration tests are failing. Can you fix the CI Failures?

@mpelekh mpelekh force-pushed the 8172-improve-reconciliation-performance-for-large-clusters branch from bb09323 to 07c2774 Compare October 11, 2024 15:31
@mpelekh
Copy link
Contributor Author

mpelekh commented Oct 11, 2024

The Integration tests are failing. Can you fix the CI Failures?

Sure, I’ll examine the logs to determine the root cause of the failing integration tests.

@crenshaw-dev
Copy link
Member

@mpelekh lmk if you need any help running e2e tests locally to speed up iteration.

@mpelekh mpelekh force-pushed the 8172-improve-reconciliation-performance-for-large-clusters branch 2 times, most recently from 25d04ff to e9f59e9 Compare October 23, 2024 14:32
@crenshaw-dev
Copy link
Member

From discussion at SIG Scalability:

  1. Looking great
  2. Tests are probably failing because operation status and sync status updates used to be effectively atomic; now there's a 1s delay
  3. Try increasing ticker to 10s to see what fails, decrease to 0.1s to see if that resolves some flakes
  4. Assess whether sync and operation statuses are updated actually atomic today; if we break that guarantee, that's probably an Actual Problem
  5. Instead of shipping behind a flag, YOLO it and plan to revert quickly if it causes trouble

@mpelekh mpelekh force-pushed the 8172-improve-reconciliation-performance-for-large-clusters branch 2 times, most recently from 584c9af to 5b6ec16 Compare October 24, 2024 09:19
@mpelekh
Copy link
Contributor Author

mpelekh commented Oct 24, 2024

  1. Try increasing ticker to 10s to see what fails, decrease to 0.1s to see if that resolves some flakes

@crenshaw-dev Here are the test results when the ticker was increased to 10s:

Test results when the ticker was decreased to 0.1s:

@mpelekh mpelekh force-pushed the 8172-improve-reconciliation-performance-for-large-clusters branch from 5b6ec16 to f0781f5 Compare October 25, 2024 12:18
@mpelekh mpelekh force-pushed the 8172-improve-reconciliation-performance-for-large-clusters branch 2 times, most recently from 9f7b680 to 3302385 Compare October 25, 2024 20:12
@mpelekh
Copy link
Contributor Author

mpelekh commented Nov 6, 2024

Assess whether sync and operation statuses are updated actually atomic today; if we break that guarantee, that's probably an Actual Problem

This is what I see in the code:

  • SyncAppState is called when the application needs to be synced (argocd sync has been called)
    https://github.com/argoproj/argo-cd/blob/master/controller/sync.go#L95
  • It applies all resources and sets the operation status to Succeed in case of success. It doesn't mean that all resources are synced already, as you can see from the output (these examples are from the built with updates from pull request, where events are processed every 1 second. Application has been created but not synced yet)
root@017f9686563c:/go/src/github.com/argoproj/argo-cd# dist/argocd app get argocd-e2e-external/test-namespaced-config-map --plaintext --server localhost:8080 --auth-token TOKEN --insecure
Name:               argocd-e2e-external/test-namespaced-config-map
Project:            default
Server:             https://kubernetes.default.svc/
Namespace:          argocd-e2e--test-namespaced-config-map-dwnet
URL:                http://localhost:8080/applications/test-namespaced-config-map
Source:
- Repo:             file:///tmp/argo-e2e/testdata.git
  Target:           
  Path:             config-map
SyncWindow:         Sync Allowed
Sync Policy:        Manual
Sync Status:        OutOfSync from  (bf5ea9e)
Health Status:      Healthy

GROUP  KIND       NAMESPACE                                     NAME    STATUS     HEALTH   HOOK  MESSAGE
       ConfigMap  argocd-e2e--test-namespaced-config-map-dwnet  my-map  OutOfSync  Missing        



root@017f9686563c:/go/src/github.com/argoproj/argo-cd# dist/argocd app sync argocd-e2e-external/test-namespaced-config-map  --prune --plaintext --server localhost:8080 --auth-token TOKEN --insecure
TIMESTAMP                  GROUP        KIND   NAMESPACE                                                    NAME    STATUS    HEALTH        HOOK  MESSAGE
2024-10-30T21:55:03+00:00          ConfigMap  argocd-e2e--test-namespaced-config-map-dwnet                my-map  OutOfSync  Missing              
2024-10-30T21:55:03+00:00          ConfigMap  argocd-e2e--test-namespaced-config-map-dwnet                my-map  OutOfSync  Missing              configmap/my-map created

Name:               argocd-e2e-external/test-namespaced-config-map
Project:            default
Server:             https://kubernetes.default.svc/
Namespace:          argocd-e2e--test-namespaced-config-map-dwnet
URL:                http://localhost:8080/applications/argocd-e2e-external/test-namespaced-config-map
Source:
- Repo:             file:///tmp/argo-e2e/testdata.git
  Target:           
  Path:             config-map
SyncWindow:         Sync Allowed
Sync Policy:        Manual
Sync Status:        OutOfSync from  (bf5ea9e)
Health Status:      Healthy

Operation:          Sync
Sync Revision:      bf5ea9e6c4184bb3ee436b957f54c1bc8adb3ce3
Phase:              Succeeded
Start:              2024-10-30 21:55:03 +0000 UTC
Finished:           2024-10-30 21:55:03 +0000 UTC
Duration:           0s
Message:            successfully synced (all tasks run)

GROUP  KIND       NAMESPACE                                     NAME    STATUS     HEALTH   HOOK  MESSAGE
       ConfigMap  argocd-e2e--test-namespaced-config-map-dwnet  my-map  OutOfSync  Missing        configmap/my-map created


root@017f9686563c:/go/src/github.com/argoproj/argo-cd# dist/argocd app get argocd-e2e-external/test-namespaced-config-map --plaintext --server localhost:8080 --auth-token TOKEN --insecure
Name:               argocd-e2e-external/test-namespaced-config-map
Project:            default
Server:             https://kubernetes.default.svc/
Namespace:          argocd-e2e--test-namespaced-config-map-dwnet
URL:                http://localhost:8080/applications/test-namespaced-config-map
Source:
- Repo:             file:///tmp/argo-e2e/testdata.git
  Target:           
  Path:             config-map
SyncWindow:         Sync Allowed
Sync Policy:        Manual
Sync Status:        Synced to  (bf5ea9e)
Health Status:      Healthy

GROUP  KIND       NAMESPACE                                     NAME    STATUS  HEALTH  HOOK  MESSAGE
       ConfigMap  argocd-e2e--test-namespaced-config-map-dwnet  my-map  Synced                configmap/my-map created

Resources get synced once the related events are received in gitops-engine. Then for each updated resource the handleObjectUpdated callback is called - https://github.com/argoproj/argo-cd/blob/master/controller/appcontroller.go#L395

This callback requests the app refresh, which in turn puts item in the queue:

Then, this item is received from the queue in a loop:

And processed:

And only after the Sync status gets Synced, as we can see from the output.

So, it means that the sync and operation status are not atomically updated. They just very quickly updated.

@mpelekh mpelekh force-pushed the 8172-improve-reconciliation-performance-for-large-clusters branch 3 times, most recently from d017226 to 6c52c9a Compare November 7, 2024 13:54
Copy link
Contributor

@andrii-korotkov-verkada andrii-korotkov-verkada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, though I'd convert this to draft until the Gitops engine reference is updated back.

)

resourceEventsNumberGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Name: "argocd_resource_events_number",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name can be a bit misleading, I'd use something like argocd_resource_events_processed_in_batch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely agreed. Updated.

@mpelekh mpelekh force-pushed the 8172-improve-reconciliation-performance-for-large-clusters branch 3 times, most recently from 1dbf738 to 8555d33 Compare November 14, 2024 16:57
Copy link

codecov bot commented Nov 14, 2024

Codecov Report

Attention: Patch coverage is 76.00000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 55.23%. Comparing base (87c853e) to head (4bed42d).
Report is 22 commits behind head on master.

Files with missing lines Patch % Lines
controller/cache/cache.go 40.00% 2 Missing and 1 partial ⚠️
controller/metrics/metrics.go 84.21% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #20329      +/-   ##
==========================================
+ Coverage   53.80%   55.23%   +1.42%     
==========================================
  Files         324      324              
  Lines       55603    55676      +73     
==========================================
+ Hits        29918    30753     +835     
+ Misses      23082    22294     -788     
- Partials     2603     2629      +26     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mpelekh mpelekh force-pushed the 8172-improve-reconciliation-performance-for-large-clusters branch from 8555d33 to b9bfe67 Compare November 15, 2024 11:51
Copy link
Contributor

@andrii-korotkov-verkada andrii-korotkov-verkada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve with expectation that gitops engine override would be removed.

@mpelekh mpelekh force-pushed the 8172-improve-reconciliation-performance-for-large-clusters branch 4 times, most recently from 74885c1 to c08615b Compare December 2, 2024 17:43
@@ -114,6 +120,7 @@ func init() {
clusterCacheListSemaphoreSize = env.ParseInt64FromEnv(EnvClusterCacheListSemaphore, clusterCacheListSemaphoreSize, 0, math.MaxInt64)
clusterCacheAttemptLimit = int32(env.ParseNumFromEnv(EnvClusterCacheAttemptLimit, int(clusterCacheAttemptLimit), 1, math.MaxInt32))
clusterCacheRetryUseBackoff = env.ParseBoolFromEnv(EnvClusterCacheRetryUseBackoff, false)
clusterCacheBatchEventsProcessing = env.ParseBoolFromEnv(EnvClusterCacheBatchEventsProcessing, false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two broader questions:

  1. Should we expose this as part of the argocd-cmd-params config map?
  2. Should we default it to true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for review @rumstead

  1. Should we expose this as part of the argocd-cmd-params config map?

ARGOCD_CLUSTER_CACHE_BATCH_EVENTS_PROCESSING option was exposed as env variable the same way as all other clustercache (part of gitops-engine) options are exposed.

  1. Should we default it to true?

Currently, the default value should be false; that doesn't change the event's processing mode, so no one can be affected by this change. To change the event's processing mode, you must explicitly set the ARGOCD_CLUSTER_CACHE_BATCH_EVENTS_PROCESSING env variable to true.

@mpelekh mpelekh force-pushed the 8172-improve-reconciliation-performance-for-large-clusters branch from c08615b to b71bf4d Compare December 11, 2024 09:26
Comment on lines +309 to +310
m.resourceEventsProcessingHistogram.WithLabelValues(server).Observe(duration.Seconds())
m.resourceEventsNumberGauge.WithLabelValues(server).Set(float64(processedEventsNumber))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I should have done it before. Done.

Signed-off-by: Mykola Pelekh <mpelekh@demonware.net>
Signed-off-by: Mykola Pelekh <mpelekh@demonware.net>
Signed-off-by: Mykola Pelekh <mpelekh@demonware.net>
Signed-off-by: Mykola Pelekh <mpelekh@demonware.net>
@mpelekh mpelekh force-pushed the 8172-improve-reconciliation-performance-for-large-clusters branch from b71bf4d to 8678e12 Compare December 12, 2024 10:36
@mpelekh mpelekh changed the title chore: avoid resources lock contention (#8172) fix: avoid resources lock contention (#8172) Dec 12, 2024
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
@crenshaw-dev crenshaw-dev marked this pull request as ready for review December 16, 2024 17:00
@crenshaw-dev crenshaw-dev requested a review from a team as a code owner December 16, 2024 17:00
@crenshaw-dev crenshaw-dev merged commit dc3f40c into argoproj:master Dec 16, 2024
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Resource tree slow refresh
5 participants