Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crier github reporter: Avoid lock contention issues #19053

Merged
merged 2 commits into from
Aug 31, 2020

Conversation

alvaroaleman
Copy link
Member

This PR:

  • Extends crier so reporters can defer their reporting ( Same commit as in K8SGCSReporter: Fix handling of aborted jobs #19048)

  • Fixes a lock contention issue in the github reporter:

    Crier github reporter: Don't block workers waiting for lock

    In 0056287 we introduced PR-Level
    locking for presubmits to the crier github reporter in order to allow
    running it with multiple workers without having the workers race with
    each other when creating the GitHub comment.

    The implementation naively used a map of sync.Mutex, blocking workers
    that are waiting for their lock. This can result in all workers being
    blocked, which is worse than the initial problem of only having one
    worker.

    This commit updates that to instead use a semaphore.Weightened and
    TryAcquire, returning a RequeueAfter if the lock is occupied in order to
    minimize lock contention.

In 0056287 we introduced PR-Level
locking for presubmits to the crier github reporter in order to allow
running it with multiple workers without having the workers race with
each other when creating the GitHub comment.

The implementation naively used a map of sync.Mutex, blocking workers
that are waiting for their lock. This can result in all workers being
blocked, which is worse than the initial problem of only having one
worker.

This commit updates that to instead use a semaphore.Weightened and
TryAcquire, returning a RequeueAfter if the lock is occupied in order to
minimize lock contention.
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 29, 2020
@k8s-ci-robot k8s-ci-robot added area/prow Issues or PRs related to prow area/prow/crier Issues or PRs related to prow's crier component area/prow/pubsub Issues or PRs related to prow's pubsub reporter component sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Aug 29, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 29, 2020
@MushuEE
Copy link
Contributor

MushuEE commented Aug 30, 2020

Cool! Is there any concern over a worker getting starved out just by happen chance that every 5 seconds the locks are taken? I don't know much about Crier so it might not be a big deal if a lock is not taken for a while.

@alvaroaleman
Copy link
Member Author

Cool! Is there any concern over a worker getting starved out just by happen chance that every 5 seconds the locks are taken? I don't know much about Crier so it might not be a big deal if a lock is not taken for a while.

If the locks are taken, the Reconciliation will return with a RequeueAfter: 5 * time.Seconds. This puts the given work item (a prowjob) back into the workqueue and doesn't hand it out to workers in this timeframe. If there is any other item in the workqueue, the workers will take it. Does that answer your question (I am not sure if I understood it correctly)?

@MushuEE
Copy link
Contributor

MushuEE commented Aug 31, 2020

Thanks @alvaroaleman, I was more curious if the same job could get put back in the queue multiple times in a row due to happenstance where resources are locked each time it is pulled off the queue. So the same job keeps getting pushed back after 5 seconds say, 10 times in a row. I am not sure if that ~50 second wait will impact anything.

@alvaroaleman
Copy link
Member Author

Thanks @alvaroaleman, I was more curious if the same job could get put back in the queue multiple times in a row due to happenstance where resources are locked each time it is pulled off the queue. So the same job keeps getting pushed back after 5 seconds say, 10 times in a row. I am not sure if that ~50 second wait will impact anything.

Ah. So yeah, this is possible but that would mean that either the reporting of a different job for the same PR takes very long or that in those 50 seconds we don't get our turn but other jobs for the same PR do. The five seconds are intended to approximate the duration of a single successful report. Admittedly, its a somewhat pessimistic assumption (and we seem to not have metrics for this, we only have them for non-mutating requests :/ ). Maybe reduce it to two seconds, wdyt?

@MushuEE
Copy link
Contributor

MushuEE commented Aug 31, 2020

I default to your expertise. It sounds like the likelihood of major delay is low and even if there is delay, it isn't high impact. I really was more curious for me own sake of understanding our systems a bit better. Quite a lot of moving parts to ramp up with! Thanks!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 31, 2020
@k8s-ci-robot k8s-ci-robot merged commit e25094f into kubernetes:master Aug 31, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Aug 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/prow/crier Issues or PRs related to prow's crier component area/prow/pubsub Issues or PRs related to prow's pubsub reporter component area/prow Issues or PRs related to prow cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants