
TEP-0069: Support retries for custom task in a pipeline - design. #491

Merged
2 commits merged into tektoncd:main on Sep 21, 2021

Conversation

ScrapCodes
Contributor

@ScrapCodes commented Aug 4, 2021

Added the design and marked it as implementable.

/kind tep

@tekton-robot added the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) and kind/tep (Categorizes issue or PR as related to a TEP (or needs a TEP).) labels Aug 4, 2021
@@ -126,7 +126,7 @@ Proposed algorithm for performing a retry for custom task.
 to request a custom task to cancel.

 - Step 3. In addition to patching the `pipelinerun` controller also enqueue a timer
-`EnqueueAfter(30*time.Second)` (configurable). On completion of timeout
+`time.After(30*time.Second)` (configurable). On completion of timeout
Contributor Author

using time.After for now.

Once the knative pkg is updated, I will be able to switch to NewRequeueAfter.

Contributor

can you explain a bit more about this change? (maybe some links to docs, e.g. about NewRequeueAfter) Is there a reason we couldn't continue to use EnqueueKeyAfter in the meantime (i.e. is there something wrong with it)? Also, looking at tektoncd/pipeline#4131 it looks like NewRequeueAfter is available now

@pritidesai
Member

@bobcatfish @afrittoli @imjasonh @vdemeester @jerop

This is a follow-up to the proposal PR; is anyone interested in reviewing these changes?

@vdemeester
Member

/assign

@bobcatfish
Contributor

/assign

Contributor

@bobcatfish left a comment

I've left some initial feedback @ScrapCodes

two requests:

  1. can this proposal include an example of what the pipeline configuration looks like when the new field(s) are used? (and the Run, and any other relevant CRDs)
  2. if possible I think it would really help to maybe include an example timeline of events around the retries - which controllers are doing what

i think it's something like:

  1. pipelines controller creates Run object with Retries set in RunSpec
  2. pipelines controller waits for either timeout, success, or failure of Run
  3. custom task controller sees Run object, updates status to indicate it is executing
  4. on failure of whatever the custom task controller is doing, it sees the value in retries; if it needs to retry, it updates something in the Run's status to indicate it is retrying, and retries
  5. eventually the custom task controller will either indicate the Run succeeded, or if it runs out of retries, that it failed
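
For illustration, a rough sketch of what request 1 might look like in a Pipeline (the retries placement and the example.dev custom task are hypothetical, not something the TEP has settled):

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: pipeline-with-custom-task-retries
spec:
  tasks:
    - name: wait-for-approval
      retries: 2 # hypothetical: reusing the field regular PipelineTasks already have
      taskRef:
        apiVersion: example.dev/v1alpha1 # placeholder custom task API group
        kind: Approval
        name: wait-gate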

# seconds to wait for custom task to respond, on timeout it is assumed
# custom task does not support the feature. Currently, it is used to
# quickly timeout a retry in a custom-task.
default-short-timeout-seconds: "30" # 30 seconds
Contributor

i'm surprised to see this config option and i have a couple thoughts:

  • Is this addressing something already described in the existing proposal, and if so, what? If it isn't, is there a requirement or some other detail we can add to explain why this is needed (maybe with an example)?
  • Is this a custom task specific timeout? If so, maybe we can include "custom task" in the name to make that clear - or maybe this applies to other timeout operations as well?

Contributor Author

Actually, at the moment it is custom task retry specific.

But it is pretty generic too. For example, certain tasks require a short timeout, e.g. waiting for a response or an update.

If you think we should make it specific to custom tasks, then I am happy to do that!

Contributor

hmm, I'm hoping we can avoid adding this timeout if at all possible - but adding timeouts like this has been discussed in the past; for example, in tektoncd/pipeline#3385 we added a currently hardcoded value minimumResourceAge. If we do end up needing this timeout here, maybe we can merge these into one configurable value - or maybe you could even use minimumResourceAge directly (with a new name to reflect its expanded role, something like expectedControllerResponseTime).

there is some discussion about this in the custom task TEP as well: https://github.com/tektoncd/community/blob/main/teps/0002-custom-tasks.md#initial-update-timeout & https://github.com/tektoncd/community/blob/main/teps/0002-custom-tasks.md#open-questions

- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?
Introduce a new status `RunSpecStatusRetry RunSpecStatus = "RunRetry"` for
`/spec/status` of a `Run`.
Contributor

can you explain a bit more about what using this value would do? I was under the impression the custom task controller would be in charge of the retry logic; it's not clear to me what controller (or user) would want to set the Run's spec.status to RunRetry (is it maybe a new value for the condition that we'd want instead?)

Contributor Author

The exact retry logic is the custom task controller's responsibility, but when to retry is communicated by the Tekton controller by setting the spec.status to RunRetry. A custom task controller indicates that it has begun to retry by clearing that flag. Do you think this makes sense?
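
For illustration, under the design as described at this point, a Run that the pipelines controller has just asked to retry might look roughly like this (a sketch only; the referenced custom task is a placeholder):

apiVersion: tekton.dev/v1alpha1
kind: Run
metadata:
  name: wait-for-approval-run
spec:
  status: "RunRetry" # set by the Tekton pipelines controller to request a retry;
                     # cleared by the custom task controller once it begins retrying
  ref:
    apiVersion: example.dev/v1alpha1 # placeholder
    kind: Approval
    name: wait-gate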

Contributor

ah okay, looking at the algorithm you described I think I see where my confusion is, which is basically: why does the tekton pipelines controller kick off the initial retry?

in the proposal it looks like the tekton pipelines controller has to notice an initial failure and then tell the custom task to start retrying - at which point I think the custom task controller would take over and do multiple retries? or does it fail each time and then the tekton pipelines controller has to again tell the custom task controller to retry?

for example say 2 retries are configured, is the behavior like this (tekton pipelines controller is in charge of kicking off each retry):

  1. custom task fails
  2. tekton pipelines controller sees this, updates spec.status to tell the custom task controller to run again
  3. custom task controller runs the task again, updates the status
  4. custom task fails (retry 1)
  5. tekton pipelines controller sees this, updates spec.status to tell the custom task controller to run again
  6. custom task controller runs the task again, updates the status
  7. custom task fails (retry 2)
  8. tekton pipelines controller sees this, sees no more retries are available, then it treats the custom task as failed

or is the custom task controller fully in charge of retries:

  1. custom task fails
  2. tekton pipelines controller sees this, updates spec.status to tell the custom task controller to run again
  3. custom task controller runs the task again, updates the status
  4. custom task fails (retry 1)
  5. custom task controller sees this, and runs the task again, updates the status
  6. custom task fails (retry 2)
  7. custom task controller sees this, and runs the task again, updates the status
  8. tekton pipelines controller sees this, sees no more retries are available, then it treats the custom task as failed

It seems like in both of the above versions, the retry logic is shared by both the tekton pipelines controller and the custom task controller - I'm wondering if it can be one or the other, either:

  1. The tekton pipelines controller is fully in charge of retries, in which case there is no need to configure Retries in the RunSpec b/c the custom task does not need to know about retries
  2. The custom task controller is fully in charge of retries; when it sees a Run with Retries configured, if whatever it is doing fails, it immediately retries, and there is no need for the new RunRetry option in spec.status

What do you think? Maybe there are some use cases I'm not aware of that make it so we want both controllers to be involved?

Contributor Author

Thank you @bobcatfish; to be precise, example 1 is the closest to what is being proposed here.

Q. Why do we need retries in RunSpec?
It is FYI only (already stated in the spec), in case the custom task controller also wishes to track the retries done and the current retry.

Q. Why does the custom task controller need to clear the spec.status flag?
This is how it can respond that it has begun to retry. If we watch conditions, then it is possible we may miss an update, for example if a task fails immediately after it starts.

Contributor

> It is FYI only (already stated in the spec), in case the custom task controller also wishes to track the retries done and the current retry.

ahhh sorry, I totally missed that it was FYI only 🤔

I don't like the idea of making this FYI only and I'm trying to think of how to explain why - one reason is that you can create Runs without a Pipeline being involved. Here's an example of a Run that uses the task loop custom controller. If we put Retries into the RunSpec, as a user I might think I can update that example to have retries, e.g.:

...
kind: Run
...
spec:
  params:
    - name: word
      value:
        - jump
        - land
        - roll
  retries: 5 # adding retries
  ref:
    apiVersion: custom.tekton.dev/v1alpha1
    kind: TaskLoop
    name: simpletaskloop

The spec makes it seem like I can make this retry, but in reality this wouldn't do anything - or possibly might cause the controller to mistakenly think retries are going to happen when they don't.

I would really prefer a solution where either the Pipelines controller is entirely in charge of retries, or the custom task controller is - what do you think @ScrapCodes, is there an approach where we could make that work?

A couple potential ideas:

  • Try to use the same approach used for retrying TaskRuns (using retriesStatus) - it would be a bit weird b/c it would mean the Pipelines controller would be changing the status of the Run, which I think up until this point we've avoided (i.e. it's the custom task controller's responsibility to deal with status)
  • Create new Runs for each retry - instead of trying to re-use the same Run, the Pipelines controller could create a new Run for each retry. It sounds like it's important that the custom task controller understands this is a retry, so we could encode that somewhere - I'm also tempted to suggest the status again - it seems weird for both controllers to be updating the status but maybe that's the answer here
  • Make the custom task controller entirely responsible for the retry logic - more work for custom task controller implementers, but also removes the need for the short timeout

@vdemeester curious if you have any other thoughts/ideas

Contributor

Thanks for working through this @ScrapCodes !

> Custom controller cannot track retries and optimize.

I'm wondering if there is a way to have the pipelines controller entirely responsible AND make it possible for the custom controller to retry - it seems like the missing piece is that the custom task controller has to be able to tell that an invocation of a Run is a retry, and maybe be able to look at information about previous attempts as well.

Could it be that adding the failed run's status to the RetriesStatus is enough to communicate that? It looks like in your proposal you're suggesting using RetriesStatus - if the Pipelines controller is using that, maybe that's enough information for the custom task controller to see a retry is happening?

In that case the total number of retries would still be missing - if it's important we pass that along to the Custom Task controller as well, I'm wondering if we can find a way to include that in the RetriesStatus object as well (even for TaskRuns)

(biggest downside here is that it has the Pipelines controller writing the Run's status, but as we weigh the options this is seeming more and more ok to me?)
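
For illustration, a sketch of how a retry could surface in the Run's status, assuming the field mirrors the retriesStatus that TaskRun uses today (the exact shape for Run is not settled in this discussion):

status:
  conditions:
    - type: Succeeded
      status: "Unknown"
      reason: Running # current (retry) attempt in progress
  retriesStatus: # assumed to mirror TaskRun's retriesStatus
    - conditions:
        - type: Succeeded
          status: "False"
          reason: Failed # archived status of the first, failed attempt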

Approach (2): custom task controller entirely responsible

One more potential downside for approach (2): d) every custom task controller that wants to support retries has to implement the retry logic; if the pipelines controller is in charge, every custom task gets retry logic for free

> a) Pipeline controller has no clue about run retry. How does timeout behave?

great point - if we went that route, I'm thinking the best option might be to assume the timeout provided by the Pipeline author includes all of the retries

this might be a reasonable assumption? I'm guessing most of the time when you put something into your Pipeline, e.g. say we had a Custom Task that calls out to Jenkins, you probably want to say something like "the execution of this entire unit of work shouldn't take longer than X all together, including retries", e.g. we allow 10 minutes for the total execution of the Jenkins Custom Task

there could still be timeouts on the individual attempts, but we'd pass those down to the custom controller to decide (e.g. the Pipeline could say "10 minutes total allowed for the jenkins custom task" and the custom task would say "we will retry 3 times and each try gets 2 minutes to run"?)
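
In other words, something like the following sketch, where the per-try budget is a purely hypothetical param that a given custom task could choose to define (nothing here is prescribed by the TEP):

tasks:
  - name: trigger-jenkins-job
    timeout: "10m" # total budget for this task, including all retries
    retries: 3
    taskRef:
      apiVersion: example.dev/v1alpha1 # placeholder custom task
      kind: JenkinsJob
      name: nightly-build
    params:
      - name: per-try-timeout # hypothetical param defined by the custom task itself
        value: "2m"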

> b) Tracking history of each Run's retries becomes difficult, i.e. one has to drill down into each Run's status.

From my perspective the status is the appropriate place to put information about what has happened for the Run - AND this has parity with the existing TaskRun retry approach (where the retry info is in the TaskRun status)

Contributor Author

> ahhh sorry, I totally missed that it was FYI only 🤔
>
> I don't like the idea of making this FYI only and I'm trying to think of how to explain why - one reason is that you can create Runs without a Pipeline being involved. Here's an example of a Run that uses the task loop custom controller. If we put Retries into the RunSpec, as a user I might think I can update that example to have retries, e.g.: ...

Is this not true of params as well? E.g. someone may not implement them and yet they can be specified. My understanding so far is that, ultimately, people have to consult the specific custom task's documentation. If you feel strongly against the API update in RunSpec, it can be dropped.

Contributor

I kinda see what you're saying about params but I feel it's a bit different, in my opinion:

  • it's hard to imagine a Run that won't want to use params
  • the presence of the params doesn't directly imply any particular functionality. At most it implies the Run may have some logic (completely specific to how it works) that might use the params, somehow (what a "retry" is seems less ambiguous to me)
  • while consuming params is optional for a custom task controller, retries in fact should not be acted on in the proposed design - i.e. the pipelines controller is going to be retrying; if the custom task author didn't realize this and they also implemented retry logic, you'd end up with retries * retries total retries @_@

My preference is that, if we feel it's important to tell the custom task controller what the total number of retries is, we find some other way to communicate this information (e.g. in retry information in the status, maybe even adding this into TaskRun status as well)

Contributor Author

@ScrapCodes Sep 16, 2021

You are right; I am leaning towards approach 2, i.e. let the custom task controller do the hard work, and the timeout will apply to all the retries. This way, we solve the problems stated above: the coupling, and the pipeline controller updating the status of the Run.

Actually, this is how things are currently, i.e. we do not know the cause of a timeout. It could have happened while performing retries, or the task might simply have been hanging. For example, how does one retry a task that is stuck? The tektoncd controller will use the full timeout for it. I can document this limitation.

Thanks a lot @bobcatfish for the brainstorm, and forgive me for changing the entire approach.

P.S. I have updated the design document to reflect this change.
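
For reference, with approach 2 the pipelines controller would simply pass the retry budget along on the Run it creates, roughly like the sketch below (field names are whatever the updated TEP settles on), and the custom task controller alone decides how to act on it within the overall timeout:

apiVersion: tekton.dev/v1alpha1
kind: Run
metadata:
  name: pipelinerun-wait-for-approval
spec:
  retries: 2 # copied from the PipelineTask; acted on only by the custom task controller
  ref:
    apiVersion: example.dev/v1alpha1 # placeholder
    kind: Approval
    name: wait-gate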

Contributor

thanks for all the back and forth discussion @ScrapCodes !! :D

@tekton-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) label and removed the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) label Sep 16, 2021
@bobcatfish
Contributor

I like it! The only thing still on my mind is I wonder if eventually we'll want to add yet another level of retry and have the pipeline controller also try to retry - but if we do have some use cases that require it, maybe it won't be the end of the world to have the retries at both levels 🤔

Very curious what @vdemeester will think as well!! Apologies in advance if he points you in a totally different direction @ScrapCodes 🙇‍♀️ I'm putting this on the API WG agenda next week as well to try to make sure it's getting visibility

/approve

@tekton-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) label Sep 16, 2021
@jerop
Member

jerop commented Sep 20, 2021

/assign

@pritidesai
Member

Based on the discussion in the API WG on 9/20 and after reviewing the changes, I am ready to lgtm this PR.

@jerop please let us know your thoughts/comments on this.

This is ready to merge since we have approval from both assignees (thank you @bobcatfish and @vdemeester).

Member

@jerop left a comment

sounds great to me - thanks @ScrapCodes!

my only concern is how a user would know that the custom task controller doesn't support retries -- it may be surprising or confusing to users; maybe documentation or tooling can address this

@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish, jerop, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pritidesai
Member

/lgtm

thank you everyone 🙏

@tekton-robot added the lgtm (Indicates that a PR is ready to be merged.) label Sep 21, 2021
@tekton-robot merged commit be60a4b into tektoncd:main Sep 21, 2021