
Retries for Custom Tasks #5218

Closed · lbernick opened this issue Jul 26, 2022 · 4 comments

Labels: kind/feature (Categorizes issue or PR as related to a new feature.)

Comments

@lbernick (Member)

Spin-off from #4313.

Right now, some custom tasks support retries implemented in the custom task controller logic. Pipeline tasks pass their desired number of retries to the custom run.

I'm wondering if it would make sense to make the PipelineRun controller responsible for retries of custom runs. This would make custom tasks easier to author and more consistent with how we handle TaskRuns. It would involve removing spec.retries from CustomRun; the PipelineRun controller would be fully responsible for updating the retries status when a run has failed. This solution was discussed in tektoncd/community#491 (comment), and it looks like it was rejected to avoid having the PipelineRun controller be responsible for updating Run statuses, but that's a pattern we already use with TaskRuns here. In this solution, a run would not be able to retry on its own.
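To make the proposal concrete, here is a minimal sketch of the retry loop the PipelineRun controller could own. The `Run`, `RunStatus`, and `maybeRetry` names below are simplified stand-ins, not Tekton's actual API: on failure, the controller would archive the failed status into a retries history and reset the live status so the custom task controller re-reconciles a fresh attempt.

```go
// Hypothetical sketch of PipelineRun-controller-managed retries for a custom run.
package main

import "fmt"

// RunStatus is a simplified, stand-in version of a custom run's status.
type RunStatus struct {
	Succeeded     *bool       // nil = still running, false = failed, true = succeeded
	RetriesStatus []RunStatus // history of prior failed attempts
}

// Run is a simplified, stand-in custom run object.
type Run struct {
	Name   string
	Status RunStatus
}

// maybeRetry is the logic the PipelineRun controller would own under this
// proposal: the custom task controller never retries on its own.
func maybeRetry(run *Run, maxRetries int) bool {
	failed := run.Status.Succeeded != nil && !*run.Status.Succeeded
	if !failed || len(run.Status.RetriesStatus) >= maxRetries {
		return false
	}
	// Record the failed attempt, then reset the live status so the
	// custom task controller treats the run as a fresh attempt.
	prev := run.Status
	prev.RetriesStatus = nil
	run.Status.RetriesStatus = append(run.Status.RetriesStatus, prev)
	run.Status.Succeeded = nil
	return true
}

func main() {
	failed := false
	run := &Run{Name: "example-run", Status: RunStatus{Succeeded: &failed}}
	fmt.Println("retried:", maybeRetry(run, 2))                   // retried: true
	fmt.Println("attempts so far:", len(run.Status.RetriesStatus)) // attempts so far: 1
}
```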

An alternative option is to provide an SDK for retries that custom task controllers can use. This would make authoring custom tasks easier, but not as easy as having retries implemented in the PipelineRun controller.
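For comparison, here is a sketch of what an SDK-style helper for custom task controller authors might look like. None of these names (`RunState`, `RetryOrFail`, etc.) exist in Tekton today; they are purely illustrative. The controller would call the helper whenever an attempt fails and let it decide between retrying and failing permanently.

```go
// Hypothetical sketch of a retries "SDK" helper for custom task controllers.
package main

import "fmt"

// AttemptStatus is a hypothetical record of one completed attempt.
type AttemptStatus struct {
	Failed  bool
	Message string
}

// RunState is a hypothetical view of a custom run that the helper mutates.
type RunState struct {
	Retries       int             // copied from the pipeline task's retries field
	Attempts      []AttemptStatus // completed attempts so far
	PermanentFail bool
}

// RetryOrFail records a failed attempt and reports whether the controller
// should start another attempt (true) or give up (false).
func RetryOrFail(rs *RunState, msg string) bool {
	rs.Attempts = append(rs.Attempts, AttemptStatus{Failed: true, Message: msg})
	if len(rs.Attempts) > rs.Retries {
		rs.PermanentFail = true
		return false
	}
	return true
}

func main() {
	rs := &RunState{Retries: 1}
	fmt.Println(RetryOrFail(rs, "first attempt failed"))  // true: retry once
	fmt.Println(RetryOrFail(rs, "second attempt failed")) // false: out of retries
}
```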

lbernick added the kind/feature label on Jul 26, 2022
@ScrapCodes (Contributor) commented Jul 27, 2022

The reason I stepped back on having the PipelineRun controller manage retries is that it gives custom tasks more freedom in how exactly they want to approach retries. In reality, custom tasks are very different from each other: for some, retries do not mean anything at all, and for others it is similar to how TaskRuns handle retries.

Also, so far we have not force-updated the status of a Run; once we get into managing retries, we will have to update it.

In addition, in my implementation of retries managed by the PipelineRun here, it was rather complex and needed to synchronize with the custom task controller on each retry (this might impact performance when scaling, not just cause timing issues).

Lastly, this approach requires updating existing custom task controllers.

Side question: can custom task controllers have replicas?

@lbernick (Member, Author)

Thanks @ScrapCodes for the discussion!

> The reason I stepped back on having the PipelineRun controller manage retries is that it gives custom tasks more freedom in how exactly they want to approach retries. In reality, custom tasks are very different from each other: for some, retries do not mean anything at all, and for others it is similar to how TaskRuns handle retries.

I think it would be helpful to list out a few examples of custom tasks and how they would work if the PipelineRun controller handled retries. For example, it doesn't make sense to have retries for the wait task, so I doubt anyone would write a pipeline with retries for it, but even if they did, the wait task doesn't fail, so it wouldn't get retried. Are there custom tasks where having the PipelineRun retry the task would result in undesired behavior?

> Also, so far we have not force-updated the status of a Run; once we get into managing retries, we will have to update it.

One other option is to have the PipelineRun controller create a new Run for each retry once the initial attempt has failed. Also, it's worth noting that we already use this pattern for TaskRuns (but perhaps we should change that).
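As an illustration of the "new Run for each retry" option, here is a tiny hedged sketch of how per-attempt run names could be derived. The naming scheme and the `retryRunName` helper are assumptions for illustration, not anything Tekton implements.

```go
// Hypothetical sketch: derive a distinct run name per retry attempt so each
// attempt becomes a separate Run object instead of a reset of the same one.
package main

import "fmt"

// retryRunName derives a name for the Nth retry of a pipeline task's run.
func retryRunName(pipelineRun, pipelineTask string, attempt int) string {
	if attempt == 0 {
		return fmt.Sprintf("%s-%s", pipelineRun, pipelineTask)
	}
	return fmt.Sprintf("%s-%s-retry-%d", pipelineRun, pipelineTask, attempt)
}

func main() {
	for attempt := 0; attempt < 3; attempt++ {
		fmt.Println(retryRunName("my-pipelinerun", "my-custom-task", attempt))
	}
}
```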

> In addition, in my implementation of retries managed by the PipelineRun here, it was rather complex and needed to synchronize with the custom task controller on each retry (this might impact performance when scaling, not just cause timing issues).

Could you elaborate a bit? What do you mean by the PipelineRun controller having to synchronize with the custom task controller on each retry?

> Lastly, this approach requires updating existing custom task controllers.

I agree this is an unfortunate downside. However, Runs follow our alpha compatibility policy, meaning they can change in breaking ways. To mitigate this, I propose that if we go ahead with this change, we do it only in the beta version of runs (v1beta1.CustomRun). I think it's more important to get the API right going forward than to prioritize backwards compatibility beyond what's required by our compatibility policy.

> Side question: can custom task controllers have replicas?

Sure, I don't see why not! You could run them via a k8s deployment with multiple replicas. Curious if you see some alternate solution here related to replicas?

@lbernick (Member, Author) commented Aug 1, 2022

Plan is to first revisit how we do TaskRun retries (tracked in #5248). After this, it may be easier to explore implementing CustomRun retries in a similar way.

@lbernick (Member, Author)

I think we can close this, since we're going to make TaskRun retries more like Run retries instead of the other way around. The only remaining action item here would be to make Run retries easier to implement via an SDK or similar, but that is tracked in #5157.
