[TEP-0092] TaskRun scheduling timeout ⏱️ #540

Merged 1 commit into tektoncd:main on Apr 18, 2022

Conversation

bobcatfish
Contributor

This TEP addresses tektoncd/pipeline#4078 by proposing that at runtime TaskRuns can be configured with a new timeout that applies before they actually start executing - and only after that are the existing timeouts applied.

This is meant to make it possible to control when our existing timeout feature applies in resource-constrained environments and also to get parity with Google Cloud Build's existing queuettl feature (https://cloud.google.com/build/docs/build-config-file-schema#queuettl).

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2021
@bobcatfish bobcatfish added kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 14, 2021
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 14, 2021
@jerop
Member

jerop commented Oct 14, 2021

/assign

@bobcatfish bobcatfish changed the title TEP for runtime scheduling only TaskRun timeouts ⏱️ [TEP-0092] TaskRun scheduling timeout ⏱️ Oct 14, 2021
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 15, 2021
@dibyom
Member

dibyom commented Oct 18, 2021

/assign @imjasonh

Member

@jerop left a comment


Configuration Granularity

The difference in granularity stood out to me; that is, we can set:

  • the execution timeout for all tasks using tasks
  • the scheduling timeout for each task using taskrun-scheduling

This made me wonder about a taskrun-execution option which would set the execution timeout for each task, but we don't have use cases for this and it can be explored later if needed.

Scheduling

Noting here that we discussed in the API WG on 10/18 that scheduling concerns have come up in other areas and we could explore solving them together:

Pending TaskRuns

The Pending TaskRun mentioned in the notes/caveats in this proposal was also suggested as a solution in tektoncd/pipeline#3989. Could providing Pending TaskRuns, where the existing timeouts don't apply while they're pending, be sufficient for the use cases discussed here? Maybe we could add this as an alternative approach to solve for the use cases in this TEP? Then we can add timeouts for both Pending PipelineRuns and Pending TaskRuns later (existing timeouts don't affect Pending PipelineRuns - TEP-0015: Support Pending PipelineRuns)

Comment on lines 161 to 165
timeouts:
pipeline: "0h0m60s"
tasks: "0h0m40s"
taskrun-scheduling: "5m" # this is the new option
finally: "0h0m20s"
Contributor

With the current implementation we have the following constraint on timeouts: tasks + finally <= pipeline. All three fields are set using this equation (with the pipeline timeout default value when needed). How does this change with the new scheduling timeout? If the scheduling timeout is constrained by the pipeline's overall timeout, I don't really get the value of adding it 🤔

Member

great catch @souleb!

maybe taskrun-scheduling * number of tasks + tasks + finally <= pipeline?
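
(Purely to make the arithmetic concrete - a hypothetical set of values, assuming a Pipeline with 5 TaskRuns, that would satisfy that constraint; these numbers are illustrative, not from the TEP:)

timeouts:
  pipeline: "1h0m0s"
  tasks: "0h40m0s"
  taskrun-scheduling: "2m" # the new option
  finally: "0h10m0s"
# 2m scheduling * 5 TaskRuns + 40m tasks + 10m finally = 60m <= 1h pipeline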

Contributor Author

I think this is accurate - a few lines below this is a description of how the timeout logic would apply, and like you said, the scheduling timeout would be constrained by the pipeline timeout.

The example times don't make any sense b/c the scheduling timeout is 5 minutes and the pipeline only has 60 seconds total to execute; I'll fix that 😆

I can see why you might want the scheduling timeout to be completely separate - e.g. if you want to allow your pipeline to run for 5 min total regardless of how long it takes to schedule each individual task - but I can't really think of a better solution, and I think it's probably more common to want to constrain the overall time for a pipeline vs. to want to say "overall time except for the scheduling time". @badamowicz let me know if you disagree - i.e. if it's important for your use case to be able to say "I want the pipeline execution to take only 5 minutes, even if it takes 30 minutes per pod to schedule".

Another thing to consider: if you only cared about the scheduling timeout and you didn't care about the overall execution time, you could always let those top level timeouts be infinite (I'm pretty sure 0 means no timeout, right? I'm a bit confused by our docs on this, will follow up with @jerop), e.g. something like:

spec:
  timeouts:
    pipeline: "0s"
    taskruns:
      scheduling: "1m" # this is the new option

Comment on lines +88 to +89
* We don't want to include init container execution time in the timeout
* We don't want to include image pull time in the timeout
Member

I think we should include init and pull time in the pre-start timeout.

If I have a task where step N specifies an image that I don't have access to (or I've typoed and the image doesn't exist), then that container won't start and will go into ImagePullBackOff. Apparently (I didn't know this until I tested it), step 0 will not proceed during this state. I don't know if this was intentional:

$ k get tr test-6j3k2
...
status:
  steps:
  - container: step-unnamed-0
    imageID: docker.io/library/ubuntu@sha256:626ffe58f6e7566e00254b638eb7e0f3b11d4da9675088f4781a50ae288f3322
    name: unnamed-0
    running:
      startedAt: "2021-10-25T14:26:02Z"
  - container: step-unnamed-1
    name: unnamed-1
    waiting:
      message: Back-off pulling image "non-existent"
      reason: ImagePullBackOff

(step 0 has started, but hasn't produced any logs indicating execution has actually started).

If that's the case, I'd rather have this task time out under the (presumably shorter) pre-start timeout regime than have it wait and get timed out by the longer task timeout.

Contributor Author

@imjasonh would you be okay with proceeding without the image pull timeout or the init container execution being included in the scheduling timeout, and then potentially adding these timeouts as an additional feature if needed later?

Overall my reason for preferring this is that image pull time and time to execute init containers feel like they address separate use cases from the ones we're trying to address here.

The situations a user might want to address here (i.e. image pull backoff, long image pulls and long init container execution) seem like they're all fairly distinct and the kinds of timeouts users would want to provide would be very different, so bundling them into one option doesn't totally make sense to me.

At the very least I'm adding a section about this in the alternatives and adding an example of how we could still address it in the future.
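
For example (just a hypothetical sketch of syntax we could add later - not something this TEP proposes), an image-pull-specific timeout could sit alongside the scheduling option:

spec:
  timeouts:
    taskruns:
      scheduling: "1m" # proposed in this TEP
      image-pull: "2m" # hypothetical future option, not part of this proposal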

This would mean the timeout would also include the time for init containers to execute (which Pipelines still uses
for some features) as well as the time to pull any images needed by the containers. `Ready` would also include a
positive indication from any readiness probes that might have been added to the pod.
* We could stop the timeout once [the Pod phase is `Running`](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#Pod-phase),
Member

I'd vote for this, personally.
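
(For reference, a minimal illustrative snippet of the pod status fields being discussed, assuming a typical TaskRun pod: the pod phase becomes `Running` once the pod is scheduled and its containers have started, while the `Ready` condition additionally requires any readiness probes to pass.)

status:
  phase: Running            # pod is scheduled and at least one container is running
  conditions:
  - type: PodScheduled
    status: "True"
  - type: Initialized       # init containers have completed
    status: "True"
  - type: ContainersReady   # all containers started and readiness probes passing
    status: "True"
  - type: Ready
    status: "True"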

Contributor Author

(discussing in comment above #540 (comment))

@pritidesai
Member

API WG - @bobcatfish responding to comments

@tekton-robot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 30, 2022
@tekton-robot
Contributor

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 1, 2022
@bobcatfish
Contributor Author

/remove-lifecycle rotten

I am going to update this 🙏

@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 7, 2022
@badamowicz

As the original author of this issue, may I ask for which Tekton version this TEP is scheduled? It would help us in our planning. Thanks!

@bobcatfish
Contributor Author

@jerop:

Made me wonder about taskrun-execution which sets the execution timeout for each task, but we don't have use cases for this and can be explored later if needed

Now that I've updated the syntax to allow for more task-level timeouts in the future, I think we could account for this if we had use cases for it, e.g.:

spec:
  timeouts:
    pipeline: "0h15m0s"
    tasks: "0h10m0s"
    finally: "0h5m0s"
    taskruns: # new section
      scheduling: "1m" # new option
      execution: "1m" # could be added in future if needed

Noting here that we discussed in the API WG on 10/18 that scheduling concerns have come up in other areas and we could explore solving them together

I can see the overlap in these concerns all being scheduling related (maybe a scheduling "problem space"? :D) but addressing parallel execution and coordination between the scheduling of tasks feels to me like something we'd address separately - my reasoning is that in this TEP we're not changing the scheduling behavior, we're just saying we want to stop if things are happening too slowly, vs. those issues seem to be about actually changing what is scheduled and when? (not sure how convincing that is!)

Could providing Pending TaskRuns, where the existing timeouts don't apply while they're pending, be sufficient for the use cases discussed here? Maybe we could add this as an alternative approach to solve for the use cases in this TEP?

I'll add this as an alternative ("Leverage pending status") - let me know if I didn't understand the suggestion; my assumption is that in that solution users would need to build their own controller to manage pending TaskRuns and apply the timeout logic they want.
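
(For context, a rough sketch of what the existing pending mechanism looks like for PipelineRuns today, per TEP-0015 - existing timeouts don't apply while the run is pending; the alternative would need something analogous for TaskRuns plus that user-supplied controller. The resource names here are made up:)

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: my-pipelinerun # illustrative name
spec:
  status: "PipelineRunPending" # run is held until this field is cleared
  pipelineRef:
    name: my-pipeline # illustrative name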

(There's also another alternative listed in the bullet list of alternatives that involves Tekton Pipelines moving all timeout logic into a separate more customizable controller.)

@bobcatfish
Contributor Author

@imjasonh @jerop @afrittoli @souleb after a significant delay I've finally responded to your feedback and updated this TEP - I've also ambitiously updated it to "implementable" 🙏 PTAL!

I've made a few updates:

  • implementable (instead of just proposed)
  • updated the syntax to create a new "taskruns" section within the pipelinerun-level configuration to:
    • a. make it more clear that this option applies per task(run)
    • b. make it possible to add more task(run) level timeouts in future if needed
  • tried to make the example of how the timeouts interact with each other more clear (values didn't make sense before)
  • Added alternative: including image pull time in scheduling timeout; also highlighting that we could add image pull time specific timeouts later (let me know what you think @imjasonh )
  • Added alternative: leveraging "pending" status

@badamowicz my apologies for this extreme delay 😬 !! To answer your question:

As the original author of this tektoncd/pipeline#4078 may I ask for which Tekton version this TEP is scheduled? Would help us in our planning. Thanks!

At the moment this isn't explicitly scheduled into a particular Tekton version. Implementing this particular feature is unfortunately not high priority for the Google team (so if anyone reading this is interested in volunteering to do the implementation, that is very welcome!!) BUT we have some new folks starting on the team soon, so my plan is to get this proposal marked as implementable so we can have one of our new team members take it on as part of their ramp-up.

@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 11, 2022
@bobcatfish
Contributor Author

Hahaha, I said PTAL @afrittoli b/c I was convinced you left feedback on this but apparently I imagined it... XD

@badamowicz

@bobcatfish Thanks for the update! Meanwhile I've left the project where this issue occurred. However, I will inform my colleagues so they can continue to track this issue.

@imjasonh
Member

Added alternative: including image pull time in scheduling timeout; also highlighting that we could add image pull time specific timeouts later (let me know what you think @imjasonh )

This seems right to me. 👍

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 13, 2022
@bobcatfish
Contributor Author

/hold

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 13, 2022
Member

@jerop left a comment


thanks @bobcatfish!

increase all timeouts to avoid those pending Pods being killed due to timeout.
2. **Supporting an environment where users may "fire and forget"** a number of PipelineRuns or TaskRuns, more than the
environment can handle, causing the Runs to be queued such that eventually the user no longer cares about these
older queued runs and wants them to naturally expire to make way for newer, more relvant runs
Member

Suggested change
older queued runs and wants them to naturally expire to make way for newer, more relvant runs
older queued runs and wants them to naturally expire to make way for newer, more relevant runs

### Notes/Caveats

* `PipelineRuns` can be created in [pending](https://github.com/tektoncd/pipeline/blob/main/docs/pipelineruns.md#pending-pipelineruns)
state. This is not a feature of `TaskRuns` _yet_ but [the intial PR description](https://github.com/tektoncd/community/pull/203)
Member

Suggested change
state. This is not a feature of `TaskRuns` _yet_ but [the intial PR description](https://github.com/tektoncd/community/pull/203)
state. This is not a feature of `TaskRuns` _yet_ but [the initial PR description](https://github.com/tektoncd/community/pull/203)

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 13, 2022
@dfuessl

dfuessl commented Apr 13, 2022

Hi @bobcatfish,
thank you for all of your activities in this ticket so far. I am taking over from @badamowicz because he has left the project. I was involved in our internal discussion and I know all the details.

@bobcatfish
Contributor Author

Welcome @dfuessl !! :D 🎉 🎉 🎉 and thanks again for your team's patience around this

@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2022
@bobcatfish
Contributor Author

/hold cancel

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2022
@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jerop, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dibyom
Member

dibyom commented Apr 18, 2022

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 18, 2022
@tekton-robot tekton-robot merged commit b42b420 into tektoncd:main Apr 18, 2022
Labels
  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP).
  • lgtm Indicates that a PR is ready to be merged.
  • size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: UnAssigned
Status: Implementable