[TEP-0092] TaskRun scheduling timeout ⏱️ #540
Conversation
Force-pushed from f1dd69b to 52d70f8
/assign
Force-pushed from 52d70f8 to cf8de22
/assign @imjasonh
Configuration Granularity

The difference in granularity stood out to me, that is, we can set:
- the execution timeout for all tasks using `tasks`
- the scheduling timeout for each task using `taskrun-scheduling`

This made me wonder about a `taskrun-execution` option, which would set the execution timeout for each task, but we don't have use cases for this and it can be explored later if needed (see the sketch below).
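A minimal sketch of how these two granularities would sit side by side, assuming the syntax proposed in this TEP; the `taskrun-execution` option is hypothetical and only illustrates the gap:

```yaml
spec:
  timeouts:
    tasks: "10m"               # execution timeout shared across all tasks
    taskrun-scheduling: "2m"   # scheduling timeout applied to each task individually
    # taskrun-execution: "5m"  # hypothetical: execution timeout for each individual task
```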
Scheduling
Noting here that we discussed in the API WG on 10/18 that scheduling concerns have come up in other areas and we could explore solving them together:
- Controlling max parallel TaskRuns per PipelineRun: Controlling max parallel jobs per pipeline pipeline#2591
- Coordinate scheduling and execution of TaskRuns: Improve reliability of the check-pr-has-kind-label job plumbing#888 (comment)
- Run TaskRuns in non-parallel mode: How to run TaskRuns in non-parallel mode pipeline#3989 (comment)
- Remove max parallel TaskRuns in Task Loops: [task-loops] Remove max parallel TaskRuns in Task Loops experimental#804
Pending TaskRuns

The Pending `TaskRun` mentioned in the notes/caveats in this proposal was also suggested as a solution in tektoncd/pipeline#3989. Could providing Pending `TaskRuns`, where the existing timeouts don't apply while they're pending, be sufficient for the use cases discussed here? Maybe we could add this as an alternative approach to solve for the use cases in this TEP? Then we could add timeouts for both Pending `PipelineRuns` and Pending `TaskRuns` later (existing timeouts don't affect Pending `PipelineRuns` - see TEP-0015: Support Pending PipelineRuns).
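For reference, a minimal sketch of the existing Pending `PipelineRun` feature from TEP-0015; `spec.status: PipelineRunPending` is the documented field, while a `TaskRun` equivalent does not exist yet (the names here are illustrative):

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: my-pipelinerun
spec:
  status: "PipelineRunPending"  # the run does not start until this field is cleared
  pipelineRef:
    name: my-pipeline
```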
teps/0092-scheduling-timeout.md
Outdated
```yaml
timeouts:
  pipeline: "0h0m60s"
  tasks: "0h0m40s"
  taskrun-scheduling: "5m" # this is the new option
  finally: "0h0m20s"
```
With the current implementation we have the following constraint on timeouts: `tasks + finally <= pipeline`. All three fields are set using this equation (with the `pipeline` timeout default value when needed). How does the new `scheduling` timeout fit into this? If the `scheduling` timeout is constrained by the overall `pipeline` timeout, I don't really get the value of adding it 🤔
great catch @souleb!
maybe `taskrun-scheduling * number of tasks + tasks + finally <= pipeline`?
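A worked check of that constraint, assuming the example values in the snippet above and a hypothetical pipeline with 3 tasks:

```yaml
# taskrun-scheduling * number of tasks + tasks + finally <= pipeline
# 5m * 3 + 40s + 20s = 16m, which is far greater than the 60s pipeline timeout,
# so validation would have to reject (or adjust) the example values above
```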
I think this is accurate - a few lines below this is a description of how the timeout logic would apply, and like you said, the scheduling timeout would be constrained by the pipeline timeout.

The example times don't make any sense b/c the scheduling timeout is 5 minutes and the pipeline only has 60 seconds total to execute; I'll fix that 😆

I can see why you might want the scheduling timeout to be completely separate - if you want to allow your pipeline to run for 5 min total regardless of how long it takes to schedule each individual task - but I can't really think of a better solution, and I think it's probably more common to want to constrain the overall time for a pipeline than to want to say "overall time except for the scheduling time". @badamowicz let me know if you disagree - i.e. if it's important for your use case to be able to say "I want the pipeline execution to take only 5 minutes, even if it takes 30 minutes per pod to schedule".
Another thing to consider: if you only cared about the scheduling timeout and you didn't care about the overall execution time, you could always let those top level timeouts be infinite (I'm pretty sure 0 means no timeout, right? I'm a bit confused by our docs on this, will follow up with @jerop), e.g. something like:

```yaml
spec:
  timeouts:
    pipeline: "0s"
    taskruns:
      scheduling: "1m" # this is the new option
```
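And as a sketch of what corrected example times could look like (these numbers are mine, not from the TEP), with the per-task scheduling budget comfortably inside the overall pipeline timeout:

```yaml
spec:
  timeouts:
    pipeline: "1h0m0s"
    tasks: "0h40m0s"
    finally: "0h10m0s"
    taskruns:
      scheduling: "2m"  # e.g. with 3 tasks: 2m * 3 + 40m + 10m = 56m <= 1h
```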
* We don't want to include init container execution time in the timeout
* We don't want to include image pull time in the timeout
I think we should include init and pull time in the pre-start timeout.

If I have a task where step N specifies an image that I don't have access to (or I've typoed and the image doesn't exist), then that container won't start and will go into `ImagePullBackOff`. Apparently (I didn't know this until I tested it), step 0 will not proceed during this state. I don't know if this was intentional:

```yaml
$ k get tr test-6j3k2
...
status:
  steps:
  - container: step-unnamed-0
    imageID: docker.io/library/ubuntu@sha256:626ffe58f6e7566e00254b638eb7e0f3b11d4da9675088f4781a50ae288f3322
    name: unnamed-0
    running:
      startedAt: "2021-10-25T14:26:02Z"
  - container: step-unnamed-1
    name: unnamed-1
    waiting:
      message: Back-off pulling image "non-existent"
      reason: ImagePullBackOff
```

(Step 0 has started, but hasn't produced any logs indicating execution has actually started.)

If that's the case, I'd rather have this task time out under the (presumably shorter) pre-start timeout regime than have it wait and get timed out by the longer task timeout.
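A minimal `TaskRun` that should reproduce the status above (the names and sleep duration are illustrative):

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  generateName: test-
spec:
  taskSpec:
    steps:
    - image: ubuntu         # step 0: pulls fine and starts running
      script: sleep 600
    - image: non-existent   # step 1: cannot be pulled, goes into ImagePullBackOff
      script: echo done
```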
@imjasonh would you be okay with proceeding without the image pull time or the init container execution time being included in the scheduling timeout, and then potentially adding these timeouts as an additional feature if needed later?

Overall my reason for preferring this is that image pull time and time to execute init containers feel like they address separate use cases from what we're trying to address here.

The situations a user might want to address here (i.e. image pull backoff, long image pulls, and long init container execution) seem like they're all fairly distinct, and the kinds of timeouts users would want to provide would be very different, so bundling them into one option doesn't totally make sense to me.

At the very least I'm adding a section about this in the alternatives and adding an example of how we could still address it in the future (see the sketch below).
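As a hypothetical sketch (not part of this TEP) of how those distinct concerns could be expressed as separate options later, alongside the scheduling timeout proposed here:

```yaml
spec:
  timeouts:
    taskruns:
      scheduling: "1m"  # proposed in this TEP
      image-pull: "5m"  # hypothetical future option for slow pulls / ImagePullBackOff
      init: "30s"       # hypothetical future option for init container execution
```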
This would mean the timeout would also include the time for init containers to execute (which Pipelines still uses
for some features) as well as the time to pull any images needed by the containers. `Ready` would also include a
positive indication from any readiness probes that might have been added to the pod.
* We could stop the timeout once [the Pod phase is `Running`](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#Pod-phase), |
I'd vote for this, personally.
(discussing in comment above #540 (comment))
API WG - @bobcatfish responding to comments
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing.
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing.
/remove-lifecycle rotten I am going to update this 🙏
As the original author of this issue, may I ask for which Tekton version this TEP is scheduled? It would help us in our planning. Thanks!
Now that I've updated the syntax to allow for more task level timeouts in future, I think we could account for this if we had use cases in the future, e.g.:

```yaml
spec:
  timeouts:
    pipeline: "0h15m0s"
    tasks: "0h10m0s"
    finally: "0h5m0s"
    taskruns: # new section
      scheduling: "1m" # new option
      execution: "1m" # could be added in future if needed
```
I can see the overlap in these concerns all being scheduling related (maybe a scheduling "problem space"? :D) but addressing parallel execution and coordination between the scheduling of tasks feels to me like something we'd address separately - my reasoning is that in this TEP we're not changing the scheduling behavior, we're just saying we want to stop if things are happening too slowly, vs. those issues seem to be about actually changing what is scheduled and when? (not sure how convincing that is!)
I'll add this as an alternative ("Leverage pending status") - let me know if I didn't understand the suggestion; my assumption is that in that solution users would need to build their own controller to manage pending TaskRuns and apply the timeout logic they want. (There's also another alternative listed in the bullet list of alternatives that involves Tekton Pipelines moving all timeout logic into a separate, more customizable controller.)
Force-pushed from cf8de22 to e0910e8
@imjasonh @jerop @afrittoli @souleb after a significant delay I've finally responded to your feedback and updated this TEP - I've also ambitiously updated it to "implementable" 🙏 PTAL! I've made a few updates:
@badamowicz my apologies for this extreme delay 😬 !! To answer your question: at the moment this isn't explicitly scheduled into a particular Tekton version. Implementing this particular feature is unfortunately not high priority for the Google team (so if anyone reading this is interested in volunteering to do the implementation, that is very welcome!!) BUT we have some new folks starting on the team soon, so my plan is to get this proposal marked as implementable so we can have one of our new team members take it on as part of their ramp up.
hahaha I said PTAL @afrittoli b/c I was convinced you left feedback on this but apparently I imagined it... XD
@bobcatfish Thanks for the update! Meanwhile I've left the project in which this issue occurred. However, I will inform my colleagues so they can continue to track this issue.
This seems right to me. 👍 /lgtm
/hold
thanks @bobcatfish!
teps/0092-scheduling-timeout.md
Outdated
increase all timeouts to avoid those pending Pods being killed due to timeout.
2. **Supporting an environment where users may "fire and forget"** a number of PipelineRuns or TaskRuns, more than the
environment can handle, causing the Runs to be queued such that eventually the user no longer cares about these
older queued runs and wants them to naturally expire to make way for newer, more relvant runs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested change:
- older queued runs and wants them to naturally expire to make way for newer, more relvant runs
+ older queued runs and wants them to naturally expire to make way for newer, more relevant runs
teps/0092-scheduling-timeout.md
Outdated
### Notes/Caveats

* `PipelineRuns` can be created in [pending](https://github.com/tektoncd/pipeline/blob/main/docs/pipelineruns.md#pending-pipelineruns)
state. This is not a feature of `TaskRuns` _yet_ but [the intial PR description](https://github.com/tektoncd/community/pull/203)
Suggested change:
- state. This is not a feature of `TaskRuns` _yet_ but [the intial PR description](https://github.com/tektoncd/community/pull/203)
+ state. This is not a feature of `TaskRuns` _yet_ but [the initial PR description](https://github.com/tektoncd/community/pull/203)
Hi @bobcatfish,
Welcome @dfuessl !! :D 🎉 🎉 🎉 and thanks again for your team's patience around this |
This TEP addresses tektoncd/pipeline#4078 by proposing that at runtime `TaskRuns` can be configured with a new timeout that applies before they actually start executing - and only after that are the existing timeouts applied. This is meant to make it possible to control how our existing timeout feature works in resource-constrained environments and also to get parity with [Google Cloud Build's existing queueTtl feature](https://cloud.google.com/build/docs/build-config-file-schema#queuettl).
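For comparison, a minimal sketch of the Cloud Build feature referenced above; `queueTtl` is a top-level field in `cloudbuild.yaml` (the build step shown is illustrative):

```yaml
steps:
- name: gcr.io/cloud-builders/docker
  args: ['build', '-t', 'gcr.io/my-project/my-image', '.']
queueTtl: "3600s"  # builds still queued after this duration expire instead of starting
```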
Force-pushed from e0910e8 to 5571a39
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jerop, vdemeester. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm