Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEP-0114: Clarify the Behavior of Custom Task Retries #5393

Closed
wants to merge 1 commit into from

Conversation

XinruZhang
Copy link
Member

Changes

Prior to this PR, it is not clear enough how retries works when
the custom task run is timeout. This PR refines the developer
guide for custom controllers supporting retries to clarify it.

/kind doc

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

NONE

@tekton-robot
Copy link
Collaborator

@XinruZhang: The label(s) kind/doc cannot be applied, because the repository doesn't have them.

In response to this:

Changes

Prior to this PR, it is not clear enough how retries works when
the custom task run is timeout. This PR refines the developer
guide for custom controllers supporting retries to clarify it.

/kind doc

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

NONE

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 30, 2022
@XinruZhang XinruZhang changed the title TEP-0114: Clarify the Behavior of Retries TEP-0114: Clarify the Behavior of Custom Task Retries Aug 30, 2022
@XinruZhang
Copy link
Member Author

/assign @jerop

@XinruZhang
Copy link
Member Author

/release-note-none

@tekton-robot tekton-robot added release-note-none Denotes a PR that doesnt merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Aug 30, 2022
@tekton-robot
Copy link
Collaborator

@XinruZhang: The label(s) kind/docs cannot be applied, because the repository doesn't have them.

In response to this:

/kind docs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@XinruZhang
Copy link
Member Author

/kind documentation

@tekton-robot tekton-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Aug 30, 2022
@lbernick lbernick self-assigned this Aug 30, 2022
docs/runs.md Outdated Show resolved Hide resolved
@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from jerop after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lbernick
Copy link
Member

I don't think it's true that timeout should be considered for the sum of retries. The timeout value passed to runs comes from the Pipeline Task timeout. For TaskRuns, we've decided that the pipeline task timeout should apply to each retry independently, and the tasks and pipeline timeouts on the PipelineRun take into account time spent during retries. We should be consistent between TaskRuns and Runs here.

@lbernick
Copy link
Member

Although I guess if we've been documenting that it applies to all retries people have already been implementing it in this way? FYI @jerop there's some inconsistency here between taskruns and runs

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 30, 2022
@XinruZhang
Copy link
Member Author

/unhold Thanks @lbernick for pointing it out! Updated the doc. To reflect what we discussed, basically, the behavior of Run should be aligned with PipelineTask -- timeout is for each retries not the sum of all retries.

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 31, 2022
@lbernick
Copy link
Member

Thanks @XinruZhang for the updates! my only concern is that even though this is a documentation change, it's sort of backwards incompatible because it reflects the contract we expect run controllers to implement (defined in TEP-0069). I think it's worth getting some more opinions on whether we'd prefer to keep the run reconciler contract as is, or whether we'd prefer consistency between TaskRuns and Runs. In the meantime, it's worth documenting that the Pipeline Task timeout is applied independently to each retry of a TaskRun.

@dibyom
Copy link
Member

dibyom commented Aug 31, 2022

I think it's worth getting some more opinions on whether we'd prefer to keep the run reconciler contract as is, or whether we'd prefer consistency between TaskRuns and Runs. In the meantime, it's worth documenting that the Pipeline Task timeout is applied independently to each retry of a TaskRun.

I think we should at least mark this as a breaking change with an action-required - probably worth sending to the mailing lists as well if we decide to go ahead

@XinruZhang
Copy link
Member Author

Thanks @lbernick, that's a great concern. Thanks @dibyom for your input. What I can think of we need to do is: creating an issue to

  • Clarify the retry behavior gap between Runs and TaskRuns
  • Collect people's opinion on this issue
  • Reach consensus about next steps

Do you think this is a good idea?

BTW, @dibyom I'm wondering what needs to be done for breaking change? Adding action-required notice in release notes? or anything else?

@lbernick
Copy link
Member

we could probably just discuss here-- @tektoncd/core-maintainers

@XinruZhang
Copy link
Member Author

Though I do think it would be better if we keep the behaviors of Runs & TaskRuns aligned, then the action item would be what's the process of making breaking changes to Custom Task Run.

@XinruZhang
Copy link
Member Author

Opened a TEP (TEP-0121 tektoncd/community#816) to fully discuss the behavior of retries, we can continue our discussions there.

@XinruZhang XinruZhang closed this Sep 8, 2022
@XinruZhang
Copy link
Member Author

As discussed in the Tekton API Working Group today, we'd like to modify the development guidance of Custom Task, suggesting users to set timeout for each retry attempt.

@XinruZhang XinruZhang reopened this Sep 19, 2022
Prior to this commit, it is not clear enough how retries works when
the custom task run is timeout. This commit refines the developer
guide for custom controllers supporting `retries` to clarify it.
@XinruZhang
Copy link
Member Author

/hold this PR until TEP-0121 is implemented.

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 20, 2022
@XinruZhang
Copy link
Member Author

I'm closing this PR because we've made changes to v1beta1.CustomRun. We can leave it as what it is for v1alpha1.Run. Please feel free to comment if there's any other concerns.

@XinruZhang XinruZhang closed this Nov 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/documentation Categorizes issue or PR as related to documentation. release-note-none Denotes a PR that doesnt merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants