[TEP-0092] TaskRun scheduling timeout ⏱️ #540

Merged 1 commit into tektoncd:main on Apr 18, 2022

Conversation

bobcatfish
Contributor

This TEP addresses tektoncd/pipeline#4078 by proposing that at runtime TaskRuns can be configured with a new timeout that applies before they actually start executing - and only after that are the existing timeouts applied.

This is meant to make it possible to control when our existing timeout feature applies in resource-constrained environments and also to get parity with Google Cloud Build's existing queuettl feature (https://cloud.google.com/build/docs/build-config-file-schema#queuettl).

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2021
@bobcatfish bobcatfish added kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 14, 2021
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 14, 2021
@jerop
Member

jerop commented Oct 14, 2021

/assign

@bobcatfish bobcatfish changed the title TEP for runtime scheduling only TaskRun timeouts ⏱️ [TEP-0092] TaskRun scheduling timeout ⏱️ Oct 14, 2021
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 15, 2021
@dibyom
Member

dibyom commented Oct 18, 2021

/assign @imjasonh

Member

@jerop left a comment


Configuration Granularity

The difference in granularity stood out to me; that is, we can set:

  • the execution timeout for all tasks using tasks
  • the scheduling timeout for each task using taskrun-scheduling

This made me wonder about a taskrun-execution option which would set the execution timeout for each task, but we don't have use cases for this and it can be explored later if needed.

Scheduling

Noting here that we discussed in the API WG on 10/18 that scheduling concerns have come up in other areas and we could explore solving them together:

Pending TaskRuns

The Pending TaskRun mentioned in the notes/caveats in this proposal was also suggested as a solution in tektoncd/pipeline#3989. Could providing Pending TaskRuns, where the existing timeouts don't apply while they're pending, be sufficient for the use cases discussed here? Maybe we could add this as an alternative approach to solve for the use cases in this TEP? Then we can add timeouts for both Pending PipelineRuns and Pending TaskRuns later (existing timeouts don't affect Pending PipelineRuns - TEP-0015: Support Pending PipelineRuns)

Comment on lines 161 to 165
timeouts:
pipeline: "0h0m60s"
tasks: "0h0m40s"
taskrun-scheduling: "5m" # this is the new option
finally: "0h0m20s"
Contributor

With the current implementation we have the following constraint on timeouts: tasks + finally <= pipeline. All three fields are set using this equation (with the pipeline timeout default value when needed). How does this change with the new scheduling timeout? If the scheduling timeout is constrained by the pipeline's overall timeout, I don't really get the value of adding it 🤔

Member

great catch @souleb!

maybe taskrun-scheduling * number of tasks + tasks + finally <= pipeline?
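
(Purely to make the arithmetic concrete - a hypothetical set of values, assuming a Pipeline with 5 TaskRuns, that would satisfy that constraint; these numbers are illustrative, not from the TEP:)

timeouts:
  pipeline: "1h0m0s"
  tasks: "0h40m0s"
  taskrun-scheduling: "2m" # the new option
  finally: "0h10m0s"
# 2m scheduling * 5 TaskRuns + 40m tasks + 10m finally = 60m <= 1h pipeline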

Contributor Author

I think this is accurate - a few lines below this is a description of how the timeout logic would apply, and like you said, the scheduling timeout would be constrained by the pipeline timeout.

The example times don't make any sense b/c the scheduling timeout is 5 minutes and the pipeline only has 60 seconds total to execute; I'll fix that 😆

I can see why you might want the scheduling timeout to be completely separate - e.g. if you want to allow your pipeline to run for 5 min total regardless of how long it takes to schedule each individual task - but I can't really think of a better solution, and I think it's probably more common to want to constrain the overall time for a pipeline vs. to want to say "overall time except for the scheduling time". @badamowicz let me know if you disagree - i.e. if it's important for your use case to be able to say "I want the pipeline execution to take only 5 minutes, even if it takes 30 minutes per pod to schedule".

Another thing to consider: if you only cared about the scheduling timeout and you didn't care about the overall execution time, you could always let those top level timeouts be infinite (I'm pretty sure 0 means no timeout, right? I'm a bit confused by our docs on this, will follow up with @jerop), e.g. something like:

spec:
  timeouts:
    pipeline: "0s"
    taskruns:
      scheduling: "1m" # this is the new option

Comment on lines +88 to +89
* We don't want to include init container execution time in the timeout
* We don't want to include image pull time in the timeout
Member

I think we should include init and pull time in the pre-start timeout.

If I have a task where step N specifies an image that I don't have access to (or I've typoed and the image doesn't exist), then that container won't start and will go into ImagePullBackOff. Apparently (I didn't know this until I tested it), step 0 will not proceed during this state. I don't know if this was intentional:

$ k get tr test-6j3k2
...
status:
  steps:
  - container: step-unnamed-0
    imageID: docker.io/library/ubuntu@sha256:626ffe58f6e7566e00254b638eb7e0f3b11d4da9675088f4781a50ae288f3322
    name: unnamed-0
    running:
      startedAt: "2021-10-25T14:26:02Z"
  - container: step-unnamed-1
    name: unnamed-1
    waiting:
      message: Back-off pulling image "non-existent"
      reason: ImagePullBackOff

(step 0 has started, but hasn't produced any logs indicating execution has actually started).

If that's the case, I'd rather have this task time out under the (presumably shorter) pre-start timeout regime than have it wait and get timed out by the longer task timeout.

Contributor Author

@imjasonh would you be okay with proceeding without the image pull timeout or the init container execution being included in the scheduling timeout, and then potentially adding these timeouts as an additional feature if needed later?

Overall my reason for preferring this is that image pull time and time to execute init containers feel like they address separate use cases from the ones we're trying to address here.

The situations a user might want to address here (i.e. image pull backoff, long image pulls and long init container execution) seem like they're all fairly distinct and the kinds of timeouts users would want to provide would be very different, so bundling them into one option doesn't totally make sense to me.

At the very least I'm adding a section about this in the alternatives and adding an example of how we could still address it in the future.
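
For example (just a hypothetical sketch of syntax we could add later - not something this TEP proposes), an image-pull-specific timeout could sit alongside the scheduling option:

spec:
  timeouts:
    taskruns:
      scheduling: "1m" # proposed in this TEP
      image-pull: "2m" # hypothetical future option, not part of this proposal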

This would mean the timeout would also include the time for init containers to execute (which Pipelines still uses
for some features) as well as the time to pull any images needed by the containers. `Ready` would also include a
positive indication from any readiness probes that might have been added to the pod.
* We could stop the timeout once [the Pod phase is `Running`](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#Pod-phase),
Member

I'd vote for this, personally.
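
(For reference, a minimal illustrative snippet of the pod status fields being discussed, assuming a typical TaskRun pod: the pod phase becomes `Running` once the pod is scheduled and its containers have started, while the `Ready` condition additionally requires any readiness probes to pass.)

status:
  phase: Running            # pod is scheduled and at least one container is running
  conditions:
  - type: PodScheduled
    status: "True"
  - type: Initialized       # init containers have completed
    status: "True"
  - type: ContainersReady   # all containers started and readiness probes passing
    status: "True"
  - type: Ready
    status: "True"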

Contributor Author

(discussing in comment above #540 (comment))

@pritidesai
Member

API WG - @bobcatfish responding to comments

@tekton-robot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 30, 2022
@tekton-robot
Contributor

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 1, 2022
@bobcatfish
Contributor Author

/remove-lifecycle rotten

I am going to update this 🙏

@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 7, 2022
@badamowicz

As the original author of this issue, may I ask for which Tekton version this TEP is scheduled? It would help us in our planning. Thanks!

@bobcatfish
Contributor Author

@jerop:

Made me wonder about taskrun-execution which sets the execution timeout for each task, but we don't have use cases for this and can be explored later if needed

Now that I've updated the syntax to allow for more task-level timeouts in the future, I think we could account for this if we had use cases for it, e.g.:

spec:
  timeouts:
    pipeline: "0h15m0s"
    tasks: "0h10m0s"
    finally: "0h5m0s"
    taskruns: # new section
      scheduling: "1m" # new option
      execution: "1m" # could be added in future if needed

Noting here that we discussed in the API WG on 10/18 that scheduling concerns have come up in other areas and we could explore solving them together

I can see the overlap in these concerns all being scheduling related (maybe a scheduling "problem space"? :D) but addressing parallel execution and coordination between the scheduling of tasks feels to me like something we'd address separately - my reasoning is that in this TEP we're not changing the scheduling behavior, we're just saying we want to stop if things are happening too slowly, vs. those issues seem to be about actually changing what is scheduled and when? (not sure how convincing that is!)

Could providing Pending TaskRuns, where the existing timeouts don't apply while they're pending, be sufficient for the use cases discussed here? Maybe we could add this as an alternative approach to solve for the use cases in this TEP?

I'll add this as an alternative ("Leverage pending status") - let me know if I didn't understand the suggestion; my assumption is that in that solution users would need to build their own controller to manage pending TaskRuns and apply the timeout logic they want.
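
(For context, a rough sketch of what the existing pending mechanism looks like for PipelineRuns today, per TEP-0015 - existing timeouts don't apply while the run is pending; the alternative would need something analogous for TaskRuns plus that user-supplied controller. The resource names here are made up:)

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: my-pipelinerun # illustrative name
spec:
  status: "PipelineRunPending" # run is held until this field is cleared
  pipelineRef:
    name: my-pipeline # illustrative name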

(There's also another alternative listed in the bullet list of alternatives that involves Tekton Pipelines moving all timeout logic into a separate more customizable controller.)

@bobcatfish
Contributor Author

@imjasonh @jerop @afrittoli @souleb after a significant delay I've finally responded to your feedback and updated this TEP - I've also ambitiously updated it to "implementable" 🙏 PTAL!

I've made a few updates:

  • implementable (instead of just proposed)
  • updated the syntax to create a new "taskruns" section within the pipelinerun-level configuration to:
    • a. make it more clear that this option applies per task(run)
    • b. make it possible to add more task(run) level timeouts in future if needed
  • tried to make the example of how the timeouts interact with each other more clear (values didn't make sense before)
  • Added alternative: including image pull time in scheduling timeout; also highlighting that we could add image pull time specific timeouts later (let me know what you think @imjasonh )
  • Added alternative: leveraging "pending" status

@badamowicz my apologies for this extreme delay 😬 !! To answer your question:

As the original author of this tektoncd/pipeline#4078 may I ask for which Tekton version this TEP is scheduled? Would help us in our planning. Thanks!

At the moment this isn't explicitly scheduled into a particular Tekton version. Implementing this particular feature is unfortunately not high priority for the Google team (so if anyone reading this is interested in volunteering to do the implementation, that is very welcome!!) BUT we have some new folks starting on the team soon, so my plan is to get this proposal marked as implementable so we can have one of our new team members take it on as part of their ramp-up.

@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 11, 2022
@bobcatfish
Contributor Author

Hahaha, I said PTAL @afrittoli b/c I was convinced you left feedback on this but apparently I imagined it... XD

@badamowicz

@bobcatfish Thanks for the update! Meanwhile I've left the project where this issue occurred. However, I will inform my colleagues so they can continue to track this issue.

@imjasonh
Member

Added alternative: including image pull time in scheduling timeout; also highlighting that we could add image pull time specific timeouts later (let me know what you think @imjasonh )

This seems right to me. 👍

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 13, 2022
@bobcatfish
Contributor Author

/hold

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 13, 2022
Member

@jerop left a comment


thanks @bobcatfish!

increase all timeouts to avoid those pending Pods being killed due to timeout.
2. **Supporting an environment where users may "fire and forget"** a number of PipelineRuns or TaskRuns, more than the
environment can handle, causing the Runs to be queued such that eventually the user no longer cares about these
older queued runs and wants them to naturally expire to make way for newer, more relvant runs
Member

Suggested change
older queued runs and wants them to naturally expire to make way for newer, more relvant runs
older queued runs and wants them to naturally expire to make way for newer, more relevant runs

### Notes/Caveats

* `PipelineRuns` can be created in [pending](https://github.com/tektoncd/pipeline/blob/main/docs/pipelineruns.md#pending-pipelineruns)
state. This is not a feature of `TaskRuns` _yet_ but [the intial PR description](https://github.com/tektoncd/community/pull/203)
Member

Suggested change
state. This is not a feature of `TaskRuns` _yet_ but [the intial PR description](https://github.com/tektoncd/community/pull/203)
state. This is not a feature of `TaskRuns` _yet_ but [the initial PR description](https://github.com/tektoncd/community/pull/203)

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 13, 2022
@dfuessl

dfuessl commented Apr 13, 2022

Hi @bobcatfish,
thank you for all of your activities in this ticket so far. I am taking over from @badamowicz because he has left the project. I was involved in our internal discussion and I know all the details.

@bobcatfish
Contributor Author

Welcome @dfuessl !! :D 🎉 🎉 🎉 and thanks again for your team's patience around this

@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2022
@bobcatfish
Contributor Author

/hold cancel

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2022
@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jerop, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dibyom
Member

dibyom commented Apr 18, 2022

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 18, 2022
@tekton-robot tekton-robot merged commit b42b420 into tektoncd:main Apr 18, 2022
Labels
  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP).
  • lgtm Indicates that a PR is ready to be merged.
  • size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: UnAssigned
Status: Implementable