chore: Optimize usage of locking in the cluster [#602] #603
Conversation
Not 100% sure it's safe, since resources and the ns index can be updated by other threads while handlers are running. Maybe I'll add additional read locking.
Seems like watching of events is done for each API resource type separately. The only place where the returned namespace resources seem to be used in practice in argo-cd is the resource updated handler. Even with a read lock it's theoretically possible there's some race happening between releasing the write lock and acquiring the read lock. Not sure how much of a problem it would be though, thinking this through.
At least there shouldn't be a race for the same key, so the worst case is a mix of fresher and out-of-date namespace resources being used while they're being updated. Out-of-date is not a problem, since that'd essentially happen anyways with a write lock held for the whole duration of the update and the handler calls, because new updates wouldn't be processed yet. Fresher is probably okay too, since out-of-date objects that got removed shouldn't be referenced by fresher objects. So I think it's ok for these purposes.
Actually, nvm, the namespace resources view would always be the same or fresher, since removal of objects happens on the same referenced map. So yeah, acquiring the read lock before calling the resource updated handlers seems good.
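To make the "same or fresher" point concrete: Go maps are reference types, so if the handlers are handed the cache's own namespace map rather than a copy, deletions made later by the cache goroutine are visible through it. Below is a minimal standalone sketch of that behavior; the names are illustrative, not the actual clusterCache fields.

package main

import "fmt"

// Resource stands in for the cached resource type (the real one lives in pkg/cache).
type Resource struct{ Name string }

func main() {
	// nsResources plays the role of the per-namespace resource index.
	nsResources := map[string]*Resource{
		"cm-1": {Name: "cm-1"},
		"cm-2": {Name: "cm-2"},
	}

	// A handler that was handed the map holds a reference, not a copy.
	view := nsResources

	// The cache goroutine later removes an object from the same map...
	delete(nsResources, "cm-2")

	// ...and the handler's view reflects the removal: it is always the same
	// as or fresher than the state at the time of the event.
	fmt.Println(len(view)) // prints 1
}

This is also exactly why a read lock is still needed around the handler calls: reading a map while another goroutine writes to it without synchronization is a data race in Go.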
Testing with ArgoCD argoproj/argo-cd#18972
Seems like this works in conjunction with argoproj/argo-cd#18694. I got an almost instant app refresh request caused by a config map update.
Don't merge this yet, I'm investigating a potential bug where some updates may not be picked up by a refresh; only a hard refresh works sometimes.
I've checked the code, but can't find where the mentioned issue may come from. Namespace resources are only used to determine the app, and only for resource updates through the watch, not for the app refresh itself.
The issue has apparently been in the incorrect usage of
In addition, seems like app of apps in particular relies on managed apps having an explicit
Also, sounds like you can't ignore
pkg/cache/cluster.go
func (c *clusterCache) updateResource(key kube.ResourceKey, event watch.EventType, un *unstructured.Unstructured) (bool, *Resource, *Resource, map[kube.ResourceKey]*Resource) {
	c.lock.Lock()
	defer c.lock.Unlock()
There seems to be a lack of consistency in the new methods that makes it quite hard to understand which operation should hold a lock or not. For instance, the updateResource method holds a write lock within the method. However, deleteResource does not hold one.
I think moving the lock to the caller method exposes the problem described in the comment. You end up with the following code. This clearly means you can operate on stale data.
c.lock.Lock()
ok, newRes, oldRes, ns := c.updateResource(key, event, un)
c.lock.Unlock()

c.lock.RLock()
defer c.lock.RUnlock()

if ok {
	for _, h := range c.getResourceUpdatedHandlers() {
		h(newRes, oldRes, ns)
	}
}
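For contrast, here is a sketch of the alternative that keeps the data perfectly consistent: hold the single write lock across both the update and the handler calls. It assumes the surrounding pkg/cache/cluster.go context and that updateResource no longer takes the lock internally (sync.RWMutex is not reentrant, so it must not).

// Single-write-lock variant: handlers run while the write lock is held, so
// ns can never be newer than the event, but no other updates are processed
// and no readers make progress until every handler returns.
c.lock.Lock()
ok, newRes, oldRes, ns := c.updateResource(key, event, un)
if ok {
	for _, h := range c.getResourceUpdatedHandlers() {
		h(newRes, oldRes, ns)
	}
}
c.lock.Unlock()

That lock-hold time across handler calls is the contention this PR is trying to reduce.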
I'm updating the method names.
pkg/cache/cluster.go
// Still requesting read lock since namespace resources can have a race of being used and being updated.
// Some race is still possible between releasing the lock and acquiring this, but since each resource type
// is processed by one goroutine, at most we'd get fresher namespace resources, which should be fine,
// at least for the existing resource update handler.
I think we should either decide: yes, we can call the onResourceUpdated handlers with stale ns data and the handlers must handle it, or no, we need to retain the lock. We cannot really say that it "should be fine". It is very likely that other goroutines will obtain the lock when it is released and before it is acquired again. As far as I know, there is no way to predict what the next resource events being processed will be, and they could very well be on the same resource.
For instance, a finalizer is completed and removed from the finalizer array (update event), then the resource is deleted (delete event). What happens if the updated handler is called and the resource is not in the namespace map anymore because it just got deleted?
I am pretty sure it will be ok, but it needs to be 100% sure, and the comment should reflect that ns is the last known namespace cache, and not the cache at the time of the event.
they could very well be on the same resource.
Watches are per object group/kind IIRC, so updates to the same resource would be handled by the same goroutine, so at least we don't have that part of the problem.
I agree that the comment has to be better. We can definitely have OnResourceUpdated operating with a fresher view of namespace resources. Namespace resources are used to derive the application, so if that's updated, we'd get a new event anyways.
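To illustrate the "one watch goroutine per group/kind" model being relied on here, a small standalone sketch (this is not the actual gitops-engine watch code; the event type and channel wiring are assumptions): events for each API resource type flow through their own channel and are consumed by a single goroutine, so two updates to the same object are always handled serially.

package main

import (
	"fmt"
	"sync"
)

// event is a simplified stand-in for a watch event.
type event struct {
	gvk string // group/version/kind
	key string // namespace/name
}

func main() {
	var wg sync.WaitGroup
	// One channel (and one consumer goroutine) per API resource type.
	perGVK := map[string]chan event{
		"v1/ConfigMap":       make(chan event, 10),
		"apps/v1/Deployment": make(chan event, 10),
	}
	for gvk, ch := range perGVK {
		wg.Add(1)
		go func(gvk string, ch chan event) {
			defer wg.Done()
			// Events for a given group/kind are processed serially here, so
			// two updates to the same object can't race with each other.
			for e := range ch {
				fmt.Printf("[%s] processing %s\n", gvk, e.key)
			}
		}(gvk, ch)
	}

	perGVK["v1/ConfigMap"] <- event{gvk: "v1/ConfigMap", key: "default/cm-1"}
	perGVK["v1/ConfigMap"] <- event{gvk: "v1/ConfigMap", key: "default/cm-1"}
	for _, ch := range perGVK {
		close(ch)
	}
	wg.Wait()
}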
I've also tested on a live environment for a few days and don't see related issues so far. I'm quite confident it's okay.
I'm not 100% sure what happens when the cluster gets refreshed, but I also think it'll be okay, since the whole namespace resources slice should still be held by the goroutine.
Updated the PR.
Here are some perf views of the system, collected following argoproj/argo-cd#13534 (comment). The build is from master on 2024/07/07 and includes argoproj/argo-cd#18972, argoproj/argo-cd#18694, #601, and #603.
@andrii-korotkov-verkada I won't have time soon to review/test this. We can try to get help from other maintainers.
I've added a benchmark with some goroutines doing processing of updates. Seems like the new locking doesn't benefit us and even shows some regression when using fewer updates or having cheaper updates. I wonder if the wins I saw were from holding the lock for less time while iterating the hierarchy more efficiently.
One thing it doesn't measure is the worst-case latency of waiting to process some updates; it only shows throughput.
Testing without the locking PR shows nearly no waiting time for the config map, at least in staging. I might just close the PR, since I haven't really found a benchmark configuration where acquiring a read lock after the write lock is better than acquiring the write lock for the whole duration, but there are configurations that show a consistent regression.
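For reference, this is roughly how such a comparison can be structured with testing.B; it is not the benchmark from this PR, and the cache, handler, and lock shapes here are simplified assumptions. Many goroutines apply updates, comparing "write lock held across update and handlers" against "write lock for the update, then read lock for the handlers".

package lockbench

import (
	"sync"
	"testing"
)

// benchCache is a toy stand-in for clusterCache used only by this benchmark.
type benchCache struct {
	lock      sync.RWMutex
	resources map[int]int
}

func (c *benchCache) handler() {
	// Simulate a cheap handler reading the shared state.
	_ = len(c.resources)
}

func BenchmarkSingleWriteLock(b *testing.B) {
	c := &benchCache{resources: map[int]int{}}
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			c.lock.Lock()
			c.resources[i%1024] = i
			c.handler() // handler runs under the write lock
			c.lock.Unlock()
			i++
		}
	})
}

func BenchmarkWriteThenReadLock(b *testing.B) {
	c := &benchCache{resources: map[int]int{}}
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			c.lock.Lock()
			c.resources[i%1024] = i
			c.lock.Unlock()

			c.lock.RLock()
			c.handler() // handler runs under a read lock only
			c.lock.RUnlock()
			i++
		}
	})
}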
Closes #602
Use a read lock for getting handlers. Refactor updating resources in the cluster cache to only acquire the lock for the update itself, not for calling handlers.
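A hedged sketch of the "read lock for getting handlers" part; handlersLock being a sync.RWMutex and the resourceUpdatedHandlers field are assumptions about cluster.go rather than confirmed names. The getter copies the registered handlers under a read lock so callers can iterate the snapshot without holding any lock.

// getResourceUpdatedHandlers returns a snapshot of the registered handlers.
// Taking only a read lock lets many event-processing goroutines fetch the
// handlers concurrently; copying them means the caller can iterate the
// snapshot after the lock is released.
func (c *clusterCache) getResourceUpdatedHandlers() []OnResourceUpdatedHandler {
	c.handlersLock.RLock()
	defer c.handlersLock.RUnlock()
	handlers := make([]OnResourceUpdatedHandler, 0, len(c.resourceUpdatedHandlers))
	for _, h := range c.resourceUpdatedHandlers {
		handlers = append(handlers, h)
	}
	return handlers
}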