
Improve reliability of the check-pr-has-kind-label job #888

Closed
Tracked by #912
ghost opened this issue Sep 2, 2021 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ghost

ghost commented Sep 2, 2021

Expected Behaviour

The check-pr-has-kind-label job should consistently report the correct state of kind labels on any given PR.

Actual Behaviour

This check can sometimes get confused: it will run on one update but not others, or it will check an older "version" of the PR and incorrectly determine that no kind label has been applied.

More Information

The check-pr-has-kind-label job is run when a PR is created or updated to confirm that it is labelled with its "kind" - bug, feature, etc.

Related resources:

@ghost ghost added the kind/bug Categorizes issue or PR as related to a bug. label Sep 2, 2021
@bobcatfish
Contributor

I wonder if this is possible without having some kind of mechanism to coordinate execution of the TaskRuns. For example, if multiple runs start within a short period of time and point at different commits (e.g. someone pushes rapidly a few times), there may be a race, with no guarantee that the right run updates the check status.

I think this is a problem that might be universal to all trigger based Tekton execution 🤔 I wonder if Workflows might be interested in taking this on as a requirement ( + @dibyom ). I've also wanted to put together at least a problem statement TEP about this for a while now.
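The race described above comes down to ensuring that only the run testing the PR's current head commit is allowed to report the check status. A minimal sketch of that idea (all names here are illustrative, not real Tekton or Triggers APIs):

```python
# Illustrative sketch: before reporting a check result, a run verifies that
# the commit it tested is still the PR's head; among concurrent runs, only
# the newest one should win.

def should_report(run_sha: str, current_head_sha: str) -> bool:
    """Only the run that tested the current head commit may update the check."""
    return run_sha == current_head_sha

def latest_run(runs):
    """Given (sha, start_time) pairs for concurrent runs, pick the newest."""
    return max(runs, key=lambda r: r[1])
```

Any real implementation would need the trigger to record the head SHA at event time and re-check it against the PR before posting status, which is exactly the coordination mechanism missing today.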

@dibyom
Member

dibyom commented Oct 14, 2021

Yes, I think this is a problem we should solve (we can decide whether this should be in workflows/triggers/pipelines etc.)

Thinking out loud here:

  • This almost reminds me of the pattern for some deployment pipelines where you can only have one execution in flight. Or if a new execution starts up, the first thing it does is cancel the old one.

  • I wonder if there is some place to keep track of the state (commit SHA, label combo) - maybe a custom task?
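The "cancel the old execution when a new one starts" pattern from the first bullet can be sketched like this (purely illustrative; a real implementation would cancel PipelineRuns through the Kubernetes API rather than track IDs in memory):

```python
# Hypothetical sketch of single-execution-in-flight per PR: starting a new
# run cancels whatever run was previously active for that PR.

class InFlightRuns:
    def __init__(self):
        self._active = {}    # pr_number -> run_id of the current run
        self.cancelled = []  # run_ids we have asked to cancel

    def start(self, pr_number: int, run_id: str) -> None:
        previous = self._active.get(pr_number)
        if previous is not None:
            self.cancelled.append(previous)  # cancel the stale run
        self._active[pr_number] = run_id     # the newest run wins
```

The state to track per PR is small, essentially the (commit SHA, label combo) pair dibyom mentions, which is why a custom task keeping that record could work.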

@jerop
Member

jerop commented Oct 15, 2021

there's a related issue raised in tektoncd/pipeline#3989 (comment):

I have a git repository with many users creating feature branches and pushing into master. Each time a commit is pushed it triggers a new TaskRun and too many of jobs eat the system resources and the system hangs. They also go into a race condition to deploy to the server.

@afrittoli
Member

It looks like something has changed related to Conditions and CloudEvents.

When requesting /test <some-ci-job> for a CI job that is not the label check, the PipelineRun for the label check is started, the Condition on the name is evaluated, and once it fails, the PipelineRun completes.
In this process the TaskRun should not be started at all, because the Condition needs to be evaluated first, and indeed there is no TaskRun to be found on the cluster. However, events about the TaskRun are sent:

(screenshot: CloudEvents emitted for the TaskRun)

A moment later, events for the TaskRun associated with the Condition are sent; however, the start event seems to be missing:

(screenshot: CloudEvents for the Condition's TaskRun, with the start event missing)

The event about the Condition includes the conditionName and conditionCheck labels, which can be used to filter out the event (as is done today), but events about the label TaskRun cause the CI job to be marked as pending on the GitHub side. Since the TaskRun is never executed, it stays pending indefinitely until a retest of the label check is requested.

(screenshot: the label check stuck as pending on GitHub)

Looking into the event, the taskSpec embedded in the spec shows that this event actually comes from the Condition:

(screenshot: event payload with the Condition's taskSpec embedded)

Rewriting this check in the ci-workspace folder without Conditions would help, but we may still need to fix the issue in Conditions? Maybe not, since they are deprecated?
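The label-based filtering described above can be sketched as a small predicate: drop any CloudEvent that carries the conditionName/conditionCheck labels so condition-check TaskRuns never reach the GitHub status reporter. The event shape here is illustrative, not the actual Tekton CloudEvents payload:

```python
# Sketch of filtering out Condition-related events before reporting status.
# A dict with a "labels" mapping stands in for a parsed CloudEvent.

def is_condition_event(event: dict) -> bool:
    """True if the event was produced by a Condition check, per its labels."""
    labels = event.get("labels", {})
    return "conditionCheck" in labels or "conditionName" in labels

def reportable(events):
    """Keep only events that should produce a GitHub commit status."""
    return [e for e in events if not is_condition_event(e)]
```

The bug described above is precisely that some Condition-originated events arrive without these labels (only the embedded taskSpec reveals their origin), so a filter like this misses them and the check is left pending.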

@afrittoli
Member

#948

@tekton-robot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 7, 2022
@tekton-robot
Contributor

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 6, 2022
@tekton-robot
Contributor

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Contributor

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Projects
Status: Done

5 participants