Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEP-0069: Support retries for custom task in a pipeline. #441

Merged
merged 1 commit into from
Jul 26, 2021

Conversation

ScrapCodes
Copy link
Contributor

@ScrapCodes ScrapCodes commented May 31, 2021

I need feedback on the proposal, if it looks good then I can work on demo and other design aspects.

Summary

A pipeline task can be configured with a retries count, this is
currently only supported for TaskRuns and not Runs (i.e. custom tasks).

This TEP is about, a pipeline task can be configured with a retries count
for Custom tasks.

Also, a PipelineRun already manages a retry for regular task
by updating it's status. However, for custom task, a tekton owned controller
can signal a custom task controller, to retry. A custom task controller may
optionally support it.

@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 31, 2021
@Tomcli
Copy link
Contributor

Tomcli commented Jun 1, 2021

This TEP can be part of the TEP 02 proposal if needed.

/cc @afrittoli

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 1, 2021
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 7, 2021
@jerop
Copy link
Member

jerop commented Jun 7, 2021

/kind tep

@tekton-robot tekton-robot added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Jun 7, 2021
@afrittoli
Copy link
Member

/assign @afrittoli
/assign @vdemeester

@ScrapCodes
Copy link
Contributor Author

ScrapCodes commented Jun 14, 2021

Thank you @afrittoli and @vdemeester for volunteering to review this TEP. Can you please take a look !! 🙏

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 22, 2021
@ScrapCodes ScrapCodes force-pushed the tep-69 branch 2 times, most recently from 9a0a8a5 to 57c02d5 Compare June 28, 2021 13:40
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 28, 2021
@ScrapCodes ScrapCodes force-pushed the tep-69 branch 8 times, most recently from 3c1552a to 71f5f0e Compare July 6, 2021 11:49
Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @imjasonh

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 8, 2021
@ScrapCodes ScrapCodes force-pushed the tep-69 branch 2 times, most recently from 62d19a4 to 99bdb82 Compare July 12, 2021 15:13
@jerop
Copy link
Member

jerop commented Jul 12, 2021

/assign

Consider both the user's role (are they a Task author? Catalog Task user?
Cluster Admin? etc...) and experience (what workflows or actions are enhanced
if this problem is solved?).
-->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add some use cases here describing why this is needed?

It almost makes sense to me without use cases but the part I am missing is that the way we do this with Pipeline -> TaskRun is that hte Pipeline controls the retries; I'd like to understand why we want the Pipeline to pass responsibility to the Run in this case, and why having a "retry" parameter for cases where it's important that the Custom Task do it's own retrying isn't enough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can add.

A pipelineRun will have the information of retry history of a Run, just as it has for a TaskRun. So Pipeline run is responsible for controlling the retry, e.g. the passing of Retries is FYI only. Actual, retry is triggered by pipelinerun controller, by setting /spec/status for a Run. However, how exactly the retry is done, is taken care of by the custom task controller. It may not respond at all, and we should be able to handle that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the extra info - another alternative approach could be to have the Pipeline create a new Run every time a retry is needed, do you think there's any possibility that might work? I'm guessing that would have some downsides in that the custom task wouldn't have any control over how the retries work? 🤔

(This reminds me of our other conversations in #422 in that I only recently realized we are re-using the same TaskRun when we retry; part of the motivation of decoupling TaskRuns from Pipelines was that something like a retry of a TaskRun could be accomplished by simply making as many TaskRuns as we wanted, but apparently thats not the route we went - so I'm hoping we can decide conclusively which approach we want to embrace)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A custom task is different from a regular task, retrying can be very different from just retrying a Pod. So, a custom task may need to examine the state between two retries and optimise.

e.g. PipelineLoop controller can retry only failed pipelines based on certain criteria, instead of retrying everything incase we created a fresh Run.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain a bit more about what kind of criteria the PipelineLoop controller might be looking at?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A PipelineLoop controller maintains the list of failed pipeline runs for each Run.
e.g. for a particular run - 2 out of 5 loops were not successful and as a result, it updates its status as Failed.

  1. We give it a signal to retry, it can simply retry the two failed PipelineRun loops and not everything.

  2. We create a fresh Run to retry, it has no information of what happened before. So, it starts as if fresh.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh okay that makes a lot of sense - if you're willing to add this to the use cases description i think that would be great context to record.

Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, updated the alternatives section too!

@bobcatfish
Copy link
Contributor

/assign

@vdemeester
Copy link
Member

/assign

@bobcatfish
Copy link
Contributor

/approve

@tekton-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@afrittoli
Copy link
Member

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 26, 2021
@tekton-robot tekton-robot merged commit 285ff99 into tektoncd:main Jul 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Implemented
Development

Successfully merging this pull request may close these issues.

7 participants