Update taskrun/pipelinerun timeout logic to not rely on resync behavior #621
Conversation
So there's good news and bad news. 👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there. 😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.
I signed it!
I'm good with having these commits contributed to this project. 👍
Force-pushed from 2b9ba16 to 12c6d92 (compare)
A Googler has manually verified that the CLAs look good. (Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)
I've manually indicated that both contributors are good to go. I think if there are more commits we'll need to re-add this, but happy to do that :)
Looks good to me 👍
looks good to me too, @bobcatfish do you have any comments?
Basic direction LGTM, some style/readability nits but nothing serious.
cmd/controller/main.go (outdated)

```go
masterURL    string
kubeconfig   string
resyncPeriod = 10 * time.Hour
```
Could we make this a const instead? I don't see its value changed in tests or anything.
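A minimal sketch of that suggestion, assuming nothing needs to mutate the value at runtime:

```go
// resyncPeriod controls how often the informers do a full re-list.
// Declaring it const documents that it is fixed for the process lifetime.
const resyncPeriod = 10 * time.Hour
```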
pkg/reconciler/timeout_handler.go (outdated)

```go
case <-finished:
case <-time.After(timeout):
	t.StatusLock(tr)
	if t.taskRuncallbackFunc == nil {
```
```go
if t.taskRuncallbackFunc != nil {
	t.taskRuncallbackFunc(tr, t.logger)
}
```
Instead of calling a known-noop func
pkg/reconciler/timeout_handler.go (outdated)

```go
func (t *TimeoutSet) AddTrCallBackFunc(f func(interface{})) {
```
This name sounds like it will append a func, when in fact it replaces the func. How about SetTaskRunCallbackFunc? (and below for PipelineRun)
pkg/reconciler/timeout_handler.go (outdated)

```go
logger              *zap.SugaredLogger
kubeclientset       kubernetes.Interface
pipelineclientset   clientset.Interface
taskRuncallbackFunc func(interface{})
```
capitalization nit: taskRunCallbackFunc and pipelineRunCallbackFunc
pkg/reconciler/timeout_handler.go (outdated)

```go
func (t *TimeoutSet) AddTrCallBackFunc(f func(interface{})) {
```
These exported methods should have comments describing what they do.
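For illustration, hypothetical doc comments on the setters, using the rename suggested above (bodies are assumed, not taken from the diff):

```go
// SetTaskRunCallbackFunc sets the callback invoked when a TaskRun times out.
// It replaces any previously set callback rather than appending to a list.
func (t *TimeoutSet) SetTaskRunCallbackFunc(f func(interface{})) {
	t.taskRunCallbackFunc = f
}

// SetPipelineRunCallbackFunc sets the callback invoked when a PipelineRun
// times out. It likewise replaces any previously set callback.
func (t *TimeoutSet) SetPipelineRunCallbackFunc(f func(interface{})) {
	t.pipelineRunCallbackFunc = f
}
```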
```go
select {
case <-t.stopCh:
case <-finished:
```
readability nit: empty case statements can make it hard to distinguish between "match any of these cases" and "match this case and do nothing"
I find it helpful to add a comment describing no-op cases, and/or explicitly return if you can, e.g.:
```go
select {
case <-t.stopCh:
	// we're stopping, give up
	return
case <-finished:
	// it finished, we can stop watching
	return
case <-time.After(timeout):
	// ...
}
```
```go
defer t.Release(pr)

select {
```
ditto
Sorry for the delay in getting this reviewed! I left a bit of minor feedback - looks great!! Very nice work :D :D :D ❤️
```diff
@@ -52,7 +52,7 @@ func TestPipelineRun_TaskRunref(t *testing.T) {
 	expectTaskRunRef := corev1.ObjectReference{
-		APIVersion: "build-tekton.dev/v1alpha1",
+		APIVersion: "tekton.dev/v1alpha1",
```
😅
pkg/reconciler/timeout_handler.go (outdated)

```go
	finished = make(chan bool)
}
t.done[key] = finished
t.doneMut.Unlock()
```
so minor: curious why not defer t.doneMut.Unlock() in this function?
Behaviour remains the same either way.
ah kk, just a bit inconsistent since we're doing defer above, no big deal tho!
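For reference, a sketch of what the defer variant might look like (the helper name and surrounding logic are assumed, not taken from the actual diff):

```go
// getOrCreateFinishedChan returns the "finished" channel for a run key,
// creating it if one does not exist yet. Hypothetical shape; only the
// locking style differs from the code under review.
func (t *TimeoutSet) getOrCreateFinishedChan(key string) chan bool {
	t.doneMut.Lock()
	defer t.doneMut.Unlock() // released on every return path
	finished, ok := t.done[key]
	if !ok {
		finished = make(chan bool)
		t.done[key] = finished
	}
	return finished
}
```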
```diff
@@ -350,6 +352,7 @@ func (c *Reconciler) reconcile(ctx context.Context, pr *v1alpha1.PipelineRun) error {
 	for _, rprt := range rprts {
 		if rprt != nil {
+			go c.timeoutHandler.WaitPipelineRun(pr)
```
i'm a bit confused, why do we need to call go WaitPipelineRun for every new taskrun created for the PipelineRun? I would assume we only needed to call go WaitPipelineRun once when the PipelineRun started. (I'm probably totally wrong, if so then maybe a comment here would help me understand!)
> I would assume we only needed to call go WaitPipelineRun once when the PipelineRun started.

Yes. There is no specific place in the reconciler code where it checks if it's a new or existing pipelinerun.
hm, doesn't this mean we'll end up with more goroutines tracking pipelineruns than we need?
oh indeed, it should be just like taskrun below:

```go
go c.timeoutHandler.WaitPipelineRun(pr)
for _, rprt := range rprts {
	// […]
}
```
I updated this logic to use the "hasstarted" function. WDYT @vdemeester?
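Roughly, the guard that change describes (assuming a HasStarted helper on the run, as mentioned above; the exact call site may differ):

```go
// Spawn the timeout-tracking goroutine only once, when the PipelineRun is
// first seen before it has started, instead of once per TaskRun it creates.
if !pr.HasStarted() {
	go c.timeoutHandler.WaitPipelineRun(pr)
}
```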
thanks @shashwathi !! ❤️
LGTM, I'll let @bobcatfish give final sign-off. 🎉
I'm happy to LGTM once that conflict is resolved 😅 I think that we might end up with more goroutines tracking pipelineruns than we need, since we create one every time we create a taskrun for the pipelinerun (e.g. if a pipelinerun had 30 tasks we'd have 30 goroutines watching it by the end), but we can iterate on it if you want @shashwathi (or i might be misunderstanding the effect!)
@bobcatfish @vdemeester: Neither of you is misunderstanding the effect. You are absolutely right that it was generating more goroutines to track the pipelinerun than needed, which would be inefficient and resource-heavy when there are a lot of tasks involved. I have updated the code to check whether the pipelinerun has started before creating the goroutine. Resolved conflicts as well. Ready for another review.
Force-pushed from 028d07f to 8f210ba (compare)
what: In this PR, each new taskrun/pipelinerun starts a goroutine that waits for a stop signal, completion, or timeout, whichever occurs first. Once a run times out, the handler adds the object to the respective controller's queue. When the run controllers are restarted, new goroutines are created to track existing timeouts. Mutexes were added to safely update statuses; since the same timeout handler is used for both pipelineruns and taskruns, the mutex keys are prefixed with "TaskRun" or "PipelineRun" to differentiate them.

why: As the number of taskruns and pipelineruns increases, the controllers cannot handle the number of reconciliations triggered. One of the solutions to tackle this problem is to increase the resync period to 10h instead of 30sec. That change exposes a problem for taskrun/pipelinerun timeouts, because the previous implementation relied on the resync behavior to update the run status to "Timeout".

I drew inspiration from @tzununbekov's PR in knative/build. Credit to @pivotal-nader-ziada and @dprotaso for suggesting level-based reconciliation.

Signed-off-by: Andrew Su <asu@pivotal.io>
Co-authored-by: Andrew Su <asu@pivotal.io>
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: shashwathi, vdemeester.
Fix resolve yaml script
Changes
In this PR, each new taskrun/pipelinerun starts a goroutine that waits for a stop signal, completion, or timeout, whichever occurs first. Once a run times out, the handler adds the object to the respective controller's queue.

When the run controllers are restarted, new goroutines are created to track existing timeouts. Mutexes were added to safely update statuses. Since the same timeout handler is used for both pipelineruns and taskruns, the mutex keys are prefixed with "TaskRun" or "PipelineRun" to differentiate them.
why: As the number of taskruns and pipelineruns increases, the controllers cannot handle the number of reconciliations triggered. One of the solutions to tackle this problem is to increase the resync period to 10h instead of 30sec. That change exposes a problem for taskrun/pipelinerun timeouts, because the previous implementation relied on the resync behavior to update the run status to "Timeout".

I drew inspiration from @tzununbekov's PR in knative/build. Credit to @pivotal-nader-ziada and @dprotaso for suggesting level-based reconciliation.
Fixes: #456
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you
review them:
See the contribution guide
for more details.
Additional info
I apologize for the multiple PRs.
3rd time's the charm 🎆