
Retries for Custom Tasks #5218

Closed · lbernick opened this issue Jul 26, 2022 · 4 comments

Labels: kind/feature (Categorizes issue or PR as related to a new feature.)

Comments

@lbernick (Member)

Spin-off from #4313.

Right now, some custom tasks support retries implemented in the custom task controller logic. Pipeline tasks pass their desired number of retries to the custom run.

I'm wondering if it would make sense to make the PipelineRun controller responsible for retries of custom runs. This would make custom tasks easier to author and more consistent with how we handle TaskRuns. It would involve removing spec.retries from CustomRun; the PipelineRun controller would be fully responsible for updating the retries status when a run has failed. This solution was discussed in tektoncd/community#491 (comment), and it looks like it was rejected to avoid having the PipelineRun controller be responsible for updating Run statuses, but that's a pattern we already use with TaskRuns here. In this solution, a run would not be able to retry on its own.
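To make the proposal concrete, here is a minimal sketch of the retry loop the PipelineRun controller could own. The `Run`, `RunStatus`, and `maybeRetry` names below are simplified stand-ins, not Tekton's actual API: on failure, the controller would archive the failed status into a retries history and reset the live status so the custom task controller re-reconciles a fresh attempt.

```go
// Hypothetical sketch of PipelineRun-controller-managed retries for a custom run.
package main

import "fmt"

// RunStatus is a simplified, stand-in version of a custom run's status.
type RunStatus struct {
	Succeeded     *bool       // nil = still running, false = failed, true = succeeded
	RetriesStatus []RunStatus // history of prior failed attempts
}

// Run is a simplified, stand-in custom run object.
type Run struct {
	Name   string
	Status RunStatus
}

// maybeRetry is the logic the PipelineRun controller would own under this
// proposal: the custom task controller never retries on its own.
func maybeRetry(run *Run, maxRetries int) bool {
	failed := run.Status.Succeeded != nil && !*run.Status.Succeeded
	if !failed || len(run.Status.RetriesStatus) >= maxRetries {
		return false
	}
	// Record the failed attempt, then reset the live status so the
	// custom task controller treats the run as a fresh attempt.
	prev := run.Status
	prev.RetriesStatus = nil
	run.Status.RetriesStatus = append(run.Status.RetriesStatus, prev)
	run.Status.Succeeded = nil
	return true
}

func main() {
	failed := false
	run := &Run{Name: "example-run", Status: RunStatus{Succeeded: &failed}}
	fmt.Println("retried:", maybeRetry(run, 2))                   // retried: true
	fmt.Println("attempts so far:", len(run.Status.RetriesStatus)) // attempts so far: 1
}
```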

An alternative option is to provide an SDK for retries that custom task controllers can use. This would make authoring custom tasks easier, but not as easy as having retries implemented in the PipelineRun controller.
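For comparison, here is a sketch of what an SDK-style helper for custom task controller authors might look like. None of these names (`RunState`, `RetryOrFail`, etc.) exist in Tekton today; they are purely illustrative. The controller would call the helper whenever an attempt fails and let it decide between retrying and failing permanently.

```go
// Hypothetical sketch of a retries "SDK" helper for custom task controllers.
package main

import "fmt"

// AttemptStatus is a hypothetical record of one completed attempt.
type AttemptStatus struct {
	Failed  bool
	Message string
}

// RunState is a hypothetical view of a custom run that the helper mutates.
type RunState struct {
	Retries       int             // copied from the pipeline task's retries field
	Attempts      []AttemptStatus // completed attempts so far
	PermanentFail bool
}

// RetryOrFail records a failed attempt and reports whether the controller
// should start another attempt (true) or give up (false).
func RetryOrFail(rs *RunState, msg string) bool {
	rs.Attempts = append(rs.Attempts, AttemptStatus{Failed: true, Message: msg})
	if len(rs.Attempts) > rs.Retries {
		rs.PermanentFail = true
		return false
	}
	return true
}

func main() {
	rs := &RunState{Retries: 1}
	fmt.Println(RetryOrFail(rs, "first attempt failed"))  // true: retry once
	fmt.Println(RetryOrFail(rs, "second attempt failed")) // false: out of retries
}
```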

lbernick added the kind/feature label on Jul 26, 2022
@ScrapCodes (Contributor) commented Jul 27, 2022

The reason I stepped back on having the PipelineRun controller manage retries is that it gives custom tasks more freedom in how exactly they want to approach retries. In reality, custom tasks are very different from each other: for some, retries do not mean anything at all, and for others it is similar to how TaskRuns handle retries.

Also, so far we have not force-updated the status of a Run; once we get into managing retries, we will have to update it.

In addition, in my implementation of retries managed by the PipelineRun here, it was rather complex and needed to synchronize with the custom task controller on each retry (this might impact performance when scaling, not just cause timing issues).

Lastly, this approach requires updating existing custom task controllers.

Side question: can custom task controllers have replicas?

@lbernick (Member, Author)

Thanks @ScrapCodes for the discussion!

> The reason I stepped back on having the PipelineRun controller manage retries is that it gives custom tasks more freedom in how exactly they want to approach retries. In reality, custom tasks are very different from each other: for some, retries do not mean anything at all, and for others it is similar to how TaskRuns handle retries.

I think it would be helpful to list out a few examples of custom tasks and how they would work if the PipelineRun controller handled retries. For example, it doesn't make sense to have retries for the wait task, so I doubt anyone would write a pipeline with retries for it, but even if they did, the wait task doesn't fail, so it wouldn't get retried. Are there custom tasks where having the PipelineRun retry the task would result in undesired behavior?

> Also, so far we have not force-updated the status of a Run; once we get into managing retries, we will have to update it.

One other option is to have the PipelineRun controller create a new Run for each retry once the initial attempt has failed. Also, it's worth noting that we already use this pattern for TaskRuns (but perhaps we should change that).
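As an illustration of the "new Run for each retry" option, here is a tiny hedged sketch of how per-attempt run names could be derived. The naming scheme and the `retryRunName` helper are assumptions for illustration, not anything Tekton implements.

```go
// Hypothetical sketch: derive a distinct run name per retry attempt so each
// attempt becomes a separate Run object instead of a reset of the same one.
package main

import "fmt"

// retryRunName derives a name for the Nth retry of a pipeline task's run.
func retryRunName(pipelineRun, pipelineTask string, attempt int) string {
	if attempt == 0 {
		return fmt.Sprintf("%s-%s", pipelineRun, pipelineTask)
	}
	return fmt.Sprintf("%s-%s-retry-%d", pipelineRun, pipelineTask, attempt)
}

func main() {
	for attempt := 0; attempt < 3; attempt++ {
		fmt.Println(retryRunName("my-pipelinerun", "my-custom-task", attempt))
	}
}
```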

> In addition, in my implementation of retries managed by the PipelineRun here, it was rather complex and needed to synchronize with the custom task controller on each retry (this might impact performance when scaling, not just cause timing issues).

Could you elaborate a bit? What do you mean by the PipelineRun controller having to synchronize with the custom task controller on each retry?

> Lastly, this approach requires updating existing custom task controllers.

I agree this is an unfortunate downside. However, Runs follow our alpha compatibility policy, meaning they can change in breaking ways. To mitigate this, I propose that if we go ahead with this change, we do it only in the beta version of runs (v1beta1.CustomRun). I think it's more important to get the API right going forward than to prioritize backwards compatibility beyond what's required by our compatibility policy.

> Side question: can custom task controllers have replicas?

Sure, I don't see why not! You could run them via a k8s deployment with multiple replicas. Curious if you see some alternate solution here related to replicas?

@lbernick (Member, Author) commented Aug 1, 2022

Plan is to first revisit how we do TaskRun retries (tracked in #5248). After this, it may be easier to explore implementing CustomRun retries in a similar way.

@lbernick (Member, Author)

I think we can close this, since we're going to make TaskRun retries more like Run retries instead of the other way around. The only remaining action item here would be to make Run retries easier to implement via an SDK or similar, but that is tracked in #5157.
