chore: Optimize usage of locking in the cluster [#602] #603
Conversation
Not 100% sure it's safe, since resources and the ns index can be updated by other threads while handlers are running. Maybe I'll add additional read locking.
Seems like watching of events is done for each API resource type separately. The only place where the returned namespace resources seem to be used in practice in argo-cd is the resource updated handler. Even with a read lock it's theoretically possible there's some race happening between releasing the write lock and acquiring the read lock. Not sure how much of a problem it would be though, thinking this through.
At least there shouldn't be a race for the same key, so the worst case is a mix of fresher and out-of-date namespace resources being used while they're being updated. Out-of-date is not a problem, since that'd essentially happen anyways with a write lock held for the whole duration of the update and the handler calls, because new updates wouldn't be processed yet. Fresher is probably okay too, since out-of-date objects that got removed shouldn't be referenced by fresher objects. So I think it's ok for these purposes.
Actually, nvm, the namespace resources view would always be the same or fresher, since removal of objects happens on the same referenced map. So yeah, acquiring the read lock before calling the resource updated handlers seems good.
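To make the "same or fresher" point concrete: Go maps are reference types, so if the handlers are handed the cache's own namespace map rather than a copy, deletions made later by the cache goroutine are visible through it. Below is a minimal standalone sketch of that behavior; the names are illustrative, not the actual clusterCache fields.

package main

import "fmt"

// Resource stands in for the cached resource type (the real one lives in pkg/cache).
type Resource struct{ Name string }

func main() {
	// nsResources plays the role of the per-namespace resource index.
	nsResources := map[string]*Resource{
		"cm-1": {Name: "cm-1"},
		"cm-2": {Name: "cm-2"},
	}

	// A handler that was handed the map holds a reference, not a copy.
	view := nsResources

	// The cache goroutine later removes an object from the same map...
	delete(nsResources, "cm-2")

	// ...and the handler's view reflects the removal: it is always the same
	// as or fresher than the state at the time of the event.
	fmt.Println(len(view)) // prints 1
}

This is also exactly why a read lock is still needed around the handler calls: reading a map while another goroutine writes to it without synchronization is a data race in Go.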
Testing with ArgoCD argoproj/argo-cd#18972
Seems like this works in conjunction with argoproj/argo-cd#18694. I got an almost instant app refresh request caused by a config map update.
Don't merge this yet, I'm investigating a potential bug where some updates may not be picked up by a refresh; only a hard refresh works sometimes.
I've checked the code, but can't find where the mentioned issue may come from. Namespace resources are only used to determine the app, and only for resource updates through the watch, not for the app refresh itself.
The issue has apparently been in the incorrect usage of
In addition, seems like app of apps in particular relies on managed apps having an explicit
Also, sounds like you can't ignore
pkg/cache/cluster.go
func (c *clusterCache) updateResource(key kube.ResourceKey, event watch.EventType, un *unstructured.Unstructured) (bool, *Resource, *Resource, map[kube.ResourceKey]*Resource) {
	c.lock.Lock()
	defer c.lock.Unlock()
There seems to be a lack of consistency in the new methods that makes it quite hard to understand which operation should hold a lock or not. For instance, the updateResource method holds a write lock within the method. However, deleteResource does not hold one.
I think moving the lock to the caller method exposes the problem described in the comment. You end up with the following code. This clearly means you can operate on stale data.
c.lock.Lock()
ok, newRes, oldRes, ns := c.updateResource(key, event, un)
c.lock.Unlock()

c.lock.RLock()
defer c.lock.RUnlock()

if ok {
	for _, h := range c.getResourceUpdatedHandlers() {
		h(newRes, oldRes, ns)
	}
}
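For contrast, here is a sketch of the alternative that keeps the data perfectly consistent: hold the single write lock across both the update and the handler calls. It assumes the surrounding pkg/cache/cluster.go context and that updateResource no longer takes the lock internally (sync.RWMutex is not reentrant, so it must not).

// Single-write-lock variant: handlers run while the write lock is held, so
// ns can never be newer than the event, but no other updates are processed
// and no readers make progress until every handler returns.
c.lock.Lock()
ok, newRes, oldRes, ns := c.updateResource(key, event, un)
if ok {
	for _, h := range c.getResourceUpdatedHandlers() {
		h(newRes, oldRes, ns)
	}
}
c.lock.Unlock()

That lock-hold time across handler calls is the contention this PR is trying to reduce.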
I'm updating the method names.
pkg/cache/cluster.go
// Still requesting read lock since namespace resources can have a race of being used and being updated.
// Some race is still possible between releasing the lock and acquiring this, but since each resource type
// is processed by one goroutine, at most we'd get fresher namespace resources, which should be fine,
// at least for the existing resource update handler.
I think we should either decide: yes, we can call the onResourceUpdated handlers with stale ns data and the handlers must handle it, or no, we need to retain the lock. We cannot really say that it "should be fine". It is very likely that other goroutines will obtain the lock when it is released and before it is acquired again. As far as I know, there is no way to predict what the next resource events being processed will be, and they could very well be on the same resource.
For instance, a finalizer is completed and removed from the finalizer array (update event), then the resource is deleted (delete event). What happens if the updated handler is called and the resource is not in the namespace map anymore because it just got deleted?
I am pretty sure it will be ok, but it needs to be 100% sure, and the comment should reflect that ns is the last known namespace cache, and not the cache at the time of the event.
they could very well be on the same resource.
Watches are per object group/kind IIRC, so updates to the same resource would be handled by the same goroutine, so at least we don't have that part of the problem.
I agree that the comment has to be better. We can definitely have OnResourceUpdated operating with a fresher view of namespace resources. Namespace resources are used to derive the application, so if that's updated, we'd get a new event anyways.
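To illustrate the "one watch goroutine per group/kind" model being relied on here, a small standalone sketch (this is not the actual gitops-engine watch code; the event type and channel wiring are assumptions): events for each API resource type flow through their own channel and are consumed by a single goroutine, so two updates to the same object are always handled serially.

package main

import (
	"fmt"
	"sync"
)

// event is a simplified stand-in for a watch event.
type event struct {
	gvk string // group/version/kind
	key string // namespace/name
}

func main() {
	var wg sync.WaitGroup
	// One channel (and one consumer goroutine) per API resource type.
	perGVK := map[string]chan event{
		"v1/ConfigMap":       make(chan event, 10),
		"apps/v1/Deployment": make(chan event, 10),
	}
	for gvk, ch := range perGVK {
		wg.Add(1)
		go func(gvk string, ch chan event) {
			defer wg.Done()
			// Events for a given group/kind are processed serially here, so
			// two updates to the same object can't race with each other.
			for e := range ch {
				fmt.Printf("[%s] processing %s\n", gvk, e.key)
			}
		}(gvk, ch)
	}

	perGVK["v1/ConfigMap"] <- event{gvk: "v1/ConfigMap", key: "default/cm-1"}
	perGVK["v1/ConfigMap"] <- event{gvk: "v1/ConfigMap", key: "default/cm-1"}
	for _, ch := range perGVK {
		close(ch)
	}
	wg.Wait()
}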
I've also tested on a live environment for a few days and don't see related issues so far. I'm quite confident it's okay.
I'm not 100% sure what happens when the cluster gets refreshed, but I also think it'll be okay, since the whole namespace resources slice should still be held by the goroutine.
Updated the PR.
Here are some perf views of the system, collected following argoproj/argo-cd#13534 (comment). The build is from master on 2024/07/07 and includes argoproj/argo-cd#18972, argoproj/argo-cd#18694, #601, and #603.
@andrii-korotkov-verkada I won't have time soon to review/test this. We can try to get help from other maintainers.
I've added a benchmark with some goroutines doing processing of updates. Seems like the new locking doesn't benefit us and even shows some regression when using fewer updates or having cheaper updates. I wonder if the wins I saw were from holding the lock for less time while iterating the hierarchy more efficiently.
One thing it doesn't measure is the worst-case latency of waiting to process some updates; it only shows throughput.
Testing without the locking PR shows nearly no waiting time for the config map, at least in staging. I might just close the PR, since I haven't really found a benchmark configuration where acquiring a read lock after the write lock is better than acquiring the write lock for the whole duration, but there are configurations that show a consistent regression.
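For reference, this is roughly how such a comparison can be structured with testing.B; it is not the benchmark from this PR, and the cache, handler, and lock shapes here are simplified assumptions. Many goroutines apply updates, comparing "write lock held across update and handlers" against "write lock for the update, then read lock for the handlers".

package lockbench

import (
	"sync"
	"testing"
)

// benchCache is a toy stand-in for clusterCache used only by this benchmark.
type benchCache struct {
	lock      sync.RWMutex
	resources map[int]int
}

func (c *benchCache) handler() {
	// Simulate a cheap handler reading the shared state.
	_ = len(c.resources)
}

func BenchmarkSingleWriteLock(b *testing.B) {
	c := &benchCache{resources: map[int]int{}}
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			c.lock.Lock()
			c.resources[i%1024] = i
			c.handler() // handler runs under the write lock
			c.lock.Unlock()
			i++
		}
	})
}

func BenchmarkWriteThenReadLock(b *testing.B) {
	c := &benchCache{resources: map[int]int{}}
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			c.lock.Lock()
			c.resources[i%1024] = i
			c.lock.Unlock()

			c.lock.RLock()
			c.handler() // handler runs under a read lock only
			c.lock.RUnlock()
			i++
		}
	})
}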
Closes #602
Use a read lock for getting handlers. Refactor updating resources in the cluster cache to only acquire the lock for the update itself, not for calling handlers.
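A hedged sketch of the "read lock for getting handlers" part; handlersLock being a sync.RWMutex and the resourceUpdatedHandlers field are assumptions about cluster.go rather than confirmed names. The getter copies the registered handlers under a read lock so callers can iterate the snapshot without holding any lock.

// getResourceUpdatedHandlers returns a snapshot of the registered handlers.
// Taking only a read lock lets many event-processing goroutines fetch the
// handlers concurrently; copying them means the caller can iterate the
// snapshot after the lock is released.
func (c *clusterCache) getResourceUpdatedHandlers() []OnResourceUpdatedHandler {
	c.handlersLock.RLock()
	defer c.handlersLock.RUnlock()
	handlers := make([]OnResourceUpdatedHandler, 0, len(c.resourceUpdatedHandlers))
	for _, h := range c.resourceUpdatedHandlers {
		handlers = append(handlers, h)
	}
	return handlers
}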