Update taskrun/pipelinerun timeout logic to not rely on resync behavior #621

Merged: 1 commit, Mar 21, 2019

Conversation

@shashwathi (Contributor) commented Mar 14, 2019

Changes

In this PR, each new taskrun/pipelinerun starts a goroutine that waits for
a stop signal, completion, or timeout, whichever occurs first. Once a run
times out, the handler adds the object to the respective controller's queue.
When the run controllers are restarted, new goroutines are created to
track existing timeouts. Mutexes were added to safely update statuses.
The same timeout handler is used for pipelineruns and taskruns, so keys are
prefixed with "TaskRun" or "PipelineRun" to differentiate them.

why: As the number of taskruns and pipelineruns increases, the controllers
cannot handle the number of reconciliations triggered. One of the
solutions to tackle this problem is to increase the resync period to 10h
instead of 30s. That change breaks taskrun/pipelinerun timeouts, because
the previous implementation relied on the resync behavior to update the
run status to "Timeout".

I drew inspiration from @tzununbekov's PR in knative/build. Credit to
@pivotal-nader-ziada and @dprotaso for suggesting level-based reconciliation.

Fixes: #456

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you review them:

See the contribution guide
for more details.

Additional info

  • In a previous PR I ran into an e2e test failure that I could not resolve. I spent a couple of days debugging and got nowhere, so I started fresh in this PR. @vdemeester, I have taken your feedback from the previous PR and addressed it. Please review again and let me know your thoughts.

I apologize for multiple PRs
3rd times the charm 🎆

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 14, 2019
@googlebot

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

ℹ️ Googlers: Go here for more info.

@tekton-robot tekton-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 14, 2019
@andrew-su (Contributor)

I signed it!

@andrew-su (Contributor)

I'm good with having these commits contributed to this project. 👍

@shashwathi shashwathi force-pushed the new-run branch 2 times, most recently from 2b9ba16 to 12c6d92 Compare March 14, 2019 19:58
@bobcatfish bobcatfish added cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit and removed cla: no labels Mar 14, 2019
@googlebot

A Googler has manually verified that the CLAs look good.

(Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)

ℹ️ Googlers: Go here for more info.

@bobcatfish (Collaborator)

I've manually indicated that both contributors are good to go. I think if there are more commits we'll need to re-add this, but happy to do that :)

@vdemeester (Member) left a comment

Looks good to me 👍

@nader-ziada (Member)

looks good to me too, @bobcatfish do you have any comments?

@imjasonh (Member) left a comment

Basic direction LGTM, some style/readability nits but nothing serious.

kubeconfig string
masterURL string
kubeconfig string
resyncPeriod = 10 * time.Hour
Member

Could we make this a const instead? I don't see its value changed in tests or anything.

case <-finished:
case <-time.After(timeout):
t.StatusLock(tr)
if t.taskRuncallbackFunc == nil {
Member

if t.taskRuncallbackFunc != nil {
  t.taskRuncallbackFunc(tr, t.logger)
}

Instead of calling a known-noop func

}
}

func (t *TimeoutSet) AddTrCallBackFunc(f func(interface{})) {
Member

This name sounds like it will append a func, when in fact it replaces the func. How about SetTaskRunCallbackFunc?

(and below for PipelineRun)

logger *zap.SugaredLogger
kubeclientset kubernetes.Interface
pipelineclientset clientset.Interface
taskRuncallbackFunc func(interface{})
Member

capitalization nit: taskRunCallbackFunc and pipelineRunCallbackFunc

}
}

func (t *TimeoutSet) AddTrCallBackFunc(f func(interface{})) {
Member

These exported methods should have comments describing what they do.


select {
case <-t.stopCh:
case <-finished:
Member

readability nit: empty case statements can make it hard to distinguish between "match any of these cases" and "match this case and do nothing"

I find it helpful to add a comment describing no-op cases, and/or explicitly return if you can, e.g.:

select {
  case <-t.stopCh:
    // we're stopping, give up
    return
  case <-finished:
    // it finished, we can stop watching
    return
  case <-time.After(timeout):
    ...
}


defer t.Release(pr)

select {
Member

ditto

@googlebot posted the same CLA co-author notice again.

@googlebot googlebot added cla: no and removed cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit labels Mar 19, 2019
@bobcatfish (Collaborator) left a comment

Sorry for the delay in getting this reviewed! I left a bit of minor feedback - looks great!! Very nice work :D :D :D ❤️

@@ -52,7 +52,7 @@ func TestPipelineRun_TaskRunref(t *testing.T) {
}

expectTaskRunRef := corev1.ObjectReference{
-APIVersion: "build-tekton.dev/v1alpha1",
+APIVersion: "tekton.dev/v1alpha1",
Collaborator

😅

finished = make(chan bool)
}
t.done[key] = finished
t.doneMut.Unlock()
Collaborator

so minor: curious why not defer t.doneMut.Unlock() in this function?

@shashwathi (Contributor, Author)

Behaviour remains the same either way.

@bobcatfish (Collaborator)

ah kk, just a bit inconsistent since we're doing defer above, no big deal tho!

@@ -350,6 +352,7 @@ func (c *Reconciler) reconcile(ctx context.Context, pr *v1alpha1.PipelineRun) er

for _, rprt := range rprts {
if rprt != nil {
go c.timeoutHandler.WaitPipelineRun(pr)
Collaborator

i'm a bit confused, why do we need to call go WaitPipelineRun for every new taskrun created for the PipelineRun? I would assume we only needed to call go WaitPipelineRun once when the PipelineRun started. (I'm probably totally wrong, if so then maybe a comment here would help me understand!)

Contributor (Author)

I would assume we only needed to call go WaitPipelineRun once when the PipelineRun started.

Yes. There is no specific place in the reconciler code where it checks if it's a new or existing pipelinerun.

Collaborator

hm, doesn't this mean we'll end up with more goroutines tracking pipelineruns than we need?

Member

oh indeed, it should be just like task run below

go c.timeoutHandler.WaitPipelineRun(pr)
for _, rprt := range rprts {
// […]

Contributor (Author)

I updated this logic to use the "hasstarted" function. WDYT @vdemeester?

Collaborator

thanks @shashwathi !! ❤️

@bobcatfish bobcatfish added cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit and removed cla: no labels Mar 19, 2019
@googlebot again confirmed that a Googler manually verified the CLAs.

@imjasonh (Member)

LGTM, I'll let @bobcatfish give final sign-off. 🎉

@bobcatfish (Collaborator)

I'm happy to LGTM once that conflict is resolved 😅

I think that we might end up with more goroutines tracking pipelineruns than we need since we create one every time we create a taskrun for the pipelinerun (e.g. if a pipelinerun had 30 tasks we'd have 30 goroutines watching it by the end), but we can iterate on it if you want @shashwathi (or i might be misunderstanding the effect!)

@googlebot posted the same CLA co-author notice again.

@googlebot googlebot added cla: no and removed cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit labels Mar 21, 2019
@shashwathi (Contributor, Author)

@bobcatfish @vdemeester: Neither of you is misunderstanding the effect. You are absolutely right that it was generating more goroutines to track the pipelinerun, and that would be inefficient and resource-heavy when a lot of tasks are involved.

I have updated the code to check whether the pipelinerun has started before creating the goroutine. Resolved conflicts as well. Ready for another review.

@dlorenc dlorenc added cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit and removed cla: no labels Mar 21, 2019
@googlebot again confirmed that a Googler manually verified the CLAs.

@googlebot posted the same CLA co-author notice again.

@googlebot googlebot added cla: no and removed cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit labels Mar 21, 2019
@shashwathi shashwathi force-pushed the new-run branch 2 times, most recently from 028d07f to 8f210ba Compare March 21, 2019 15:17
@bobcatfish bobcatfish added cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit and removed cla: no labels Mar 21, 2019
@googlebot again confirmed that a Googler manually verified the CLAs.

what:
In this PR, each new taskrun/pipelinerun starts a goroutine that waits for
a stop signal, completion, or timeout, whichever occurs first. Once a run
times out, the handler adds the object to the respective controller's queue.
When the run controllers are restarted, new goroutines are created to
track existing timeouts. Mutexes were added to safely update statuses.
The same timeout handler is used for pipelineruns and taskruns, so keys are
prefixed with "TaskRun" or "PipelineRun" to differentiate them.

why: As the number of taskruns and pipelineruns increases, the controllers
cannot handle the number of reconciliations triggered. One of the
solutions to tackle this problem is to increase the resync period to 10h
instead of 30s. That change breaks taskrun/pipelinerun timeouts, because
the previous implementation relied on the resync behavior to update the
run status to "Timeout".

I drew inspiration from @tzununbekov's PR in knative/build. Credit to
@pivotal-nader-ziada and @dprotaso for suggesting level-based
reconciliation.

Signed-off-by: Andrew Su <asu@pivotal.io>
Co-authored-by: Andrew Su <asu@pivotal.io>
@googlebot posted the same CLA co-author notice again.

@googlebot googlebot added cla: no and removed cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit labels Mar 21, 2019
@bobcatfish bobcatfish added cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit and removed cla: no labels Mar 21, 2019
@googlebot again confirmed that a Googler manually verified the CLAs.

@vdemeester (Member) left a comment

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 21, 2019
@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shashwathi, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot merged commit c3910c3 into tektoncd:master Mar 21, 2019
pradeepitm12 pushed a commit to pradeepitm12/pipeline that referenced this pull request Mar 15, 2021
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
cla: yes: Trying to make the CLA bot happy with ppl from different companies work on one commit
lgtm: Indicates that a PR is ready to be merged.
size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Handle build/taskrun timeout outside of resync
9 participants