Crier github reporter: Avoid lock contention issues #19053
In 0056287 we introduced PR-level locking for presubmits to the crier github reporter in order to allow running it with multiple workers without having the workers race with each other when creating the GitHub comment. The implementation naively used a map of sync.Mutex, blocking workers that are waiting for their lock. This can result in all workers being blocked, which is worse than the initial problem of only having one worker. This commit updates that to instead use a semaphore.Weighted and TryAcquire, returning a RequeueAfter if the lock is occupied, in order to minimize lock contention.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alvaroaleman The full list of commands accepted by this bot can be found here. The pull request process is described here
Cool! Is there any concern about a worker getting starved out just because, by happenstance, the locks are taken every time it retries after 5 seconds? I don't know much about Crier, so it might not be a big deal if a lock is not acquired for a while.
If the locks are taken, the Reconciliation will return with a |
Thanks @alvaroaleman, I was more curious whether the same job could get put back in the queue multiple times in a row due to happenstance, where the resources are locked each time it is pulled off the queue. So the same job keeps getting pushed back after 5 seconds, say, 10 times in a row. I am not sure if that ~50 second wait will impact anything.
Ah. So yeah, this is possible, but that would mean that either the reporting of a different job for the same PR takes very long, or that in those 50 seconds we don't get our turn but other jobs for the same PR do. The five seconds are intended to approximate the duration of a single successful report. Admittedly, it's a somewhat pessimistic assumption (and we seem to not have metrics for this, we only have them for non-mutating requests :/ ). Maybe reduce it to two seconds, wdyt?
I defer to your expertise. It sounds like the likelihood of a major delay is low, and even if there is a delay, it isn't high impact. I really was more curious for my own sake, to understand our systems a bit better. Quite a lot of moving parts to ramp up with! Thanks! /lgtm
This PR:
Extends crier so reporters can defer their reporting (same commit as in K8SGCSReporter: Fix handling of aborted jobs #19048)
Fixes a lock contention issue in the github reporter:
Crier github reporter: Don't block workers waiting for lock
In 0056287 we introduced PR-level
locking for presubmits to the crier github reporter in order to allow
running it with multiple workers without having the workers race with
each other when creating the GitHub comment.
The implementation naively used a map of sync.Mutex, blocking workers
that are waiting for their lock. This can result in all workers being
blocked, which is worse than the initial problem of only having one
worker.
This commit updates that to instead use a semaphore.Weighted and
TryAcquire, returning a RequeueAfter if the lock is occupied, in order to
minimize lock contention.
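
The non-blocking pattern described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: to stay within the standard library it uses a capacity-1 channel as a stand-in for semaphore.Weighted and TryAcquire, and the names prLocks, report, result and requeueDelay are hypothetical. The five second delay is the value mentioned in the conversation above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// prLocks hands out one non-blocking lock per pull request.
// Hypothetical type for illustration, not crier's actual implementation.
type prLocks struct {
	mu    sync.Mutex
	locks map[string]chan struct{}
}

func newPRLocks() *prLocks {
	return &prLocks{locks: map[string]chan struct{}{}}
}

// slot returns the capacity-1 channel that acts as the semaphore for pr,
// creating it on first use. The mutex only guards the map lookup.
func (l *prLocks) slot(pr string) chan struct{} {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, ok := l.locks[pr]; !ok {
		l.locks[pr] = make(chan struct{}, 1)
	}
	return l.locks[pr]
}

// tryAcquire reports whether the lock for pr was taken. It never blocks.
func (l *prLocks) tryAcquire(pr string) bool {
	select {
	case l.slot(pr) <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees the lock for pr. Call it only after a successful tryAcquire.
func (l *prLocks) release(pr string) {
	<-l.slot(pr)
}

// result mimics a reconcile result that carries a requeue delay.
type result struct{ RequeueAfter time.Duration }

// requeueDelay approximates the duration of one successful report (assumption).
const requeueDelay = 5 * time.Second

// report runs do while holding the lock for pr. If the lock is occupied,
// it returns immediately and asks to be requeued instead of blocking the worker.
func report(l *prLocks, pr string, do func()) result {
	if !l.tryAcquire(pr) {
		return result{RequeueAfter: requeueDelay}
	}
	defer l.release(pr)
	do()
	return result{}
}

func main() {
	locks := newPRLocks()
	res := report(locks, "org/repo#1", func() {
		// A second report for the same PR while the lock is held is requeued.
		inner := report(locks, "org/repo#1", func() {})
		fmt.Println("contended requeue delay:", inner.RequeueAfter)
	})
	fmt.Println("uncontended requeue delay:", res.RequeueAfter)
}
```

With a blocking sync.Mutex, the inner call in main would wait on the lock forever; with the try-lock it returns right away, so the worker is free to pick up a job for a different PR.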