Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TEP-0050] Add proposed design details #785

Merged

Conversation

QuanZhang-William
Copy link
Member

This PR proposes adding a new field OnErrorto PipelineTaskdefinition to allow users define task failure strategy

@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 16, 2022
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 16, 2022
@QuanZhang-William QuanZhang-William force-pushed the tep-0050-add-proposed-design branch 2 times, most recently from 5b4be2c to 5412084 Compare August 16, 2022 17:21
@QuanZhang-William QuanZhang-William marked this pull request as ready for review August 16, 2022 17:22
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 16, 2022
@QuanZhang-William
Copy link
Member Author

/cc @jerop @pritidesai

@dibyom
Copy link
Member

dibyom commented Aug 17, 2022

/kind tep

@tekton-robot tekton-robot added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Aug 17, 2022
@QuanZhang-William QuanZhang-William marked this pull request as draft August 18, 2022 13:08
@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 18, 2022
@QuanZhang-William QuanZhang-William force-pushed the tep-0050-add-proposed-design branch 2 times, most recently from 3c772aa to cfe56e8 Compare August 18, 2022 15:26
@QuanZhang-William QuanZhang-William marked this pull request as ready for review August 18, 2022 15:26
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 18, 2022
@jerop
Copy link
Member

jerop commented Aug 18, 2022

/assign

Copy link
Member

@jerop jerop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @QuanZhang-William!

teps/0050-ignore-task-failures.md Outdated Show resolved Hide resolved
teps/0050-ignore-task-failures.md Show resolved Hide resolved
teps/0050-ignore-task-failures.md Outdated Show resolved Hide resolved
teps/0050-ignore-task-failures.md Outdated Show resolved Hide resolved
@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 19, 2022
@jerop
Copy link
Member

jerop commented Aug 22, 2022

/assign @pritidesai

@tekton-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jerop, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Setting ```Retry``` and ```OnFail``` to ```continue``` at the same time is not allowed in this iteration of the TEP, as there is no point to retry a task that allows to fail. Pipeline validation will be added accordingly. We can support retries with ignored failed task in the future if needed.

### Tasks with Missing Resource Dependency
In the following example, the first task fails to produce a result (due to step error [details](https://github.com/tektoncd/pipeline/issues/3749)) that is going to be consumed by the second task. The second task will be skipped with reason ```Results were missing``` irrelevant of its OnFail type. This behaviour is consistent with [Guarding a Task only](https://github.com/tektoncd/pipeline/blob/main/docs/pipelines.md#guarding-a-task-only). When we add support for [default Results](https://github.com/tektoncd/community/blob/main/teps/0048-task-results-without-results.md), then the resource-dependent Tasks may be executed if the default Results from the skipped parent Task are specified.
Copy link
Member

@pritidesai pritidesai Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph is little confusing and inconsistent with step.onError.

The linked issue has been partially fixed in which pipeline results are emitted even when a task fails. Pipeline results are initialized with the task results from all successful tasks until a task has failed.

step.onError does support emitting results and consuming them - https://github.com/tektoncd/pipeline/blob/main/docs/tasks.md#produce-a-task-result-with-onerror

As long as a task has initialized results before failing, they can be made available for the consumption. This only should be supported with task.onFail: continue.

When a pipeline author has chosen a strategy where a task failure is expected, the author must have complete understanding of what the result could mean and how to consume that result if it's initialized before failing.

For example, git-clone produces commit as a task result. If pipeline author chooses to ignore failure from git-clone, the author is aware that the commit might/might not be initialized depending on the failure. Now, as a pipeline author, when a git-clone failed for some reason but commit is available, its guaranteed that my repo was cloned as expected. If the commit is not initialized and git-clone failed, the repo was not cloned for some reason.

Another example is boskos-acquire. As a pipeline author, I can choose to ignore failure from boskos-acquire. If boskos-acquire failed for some reason to create heartbeat pod but was able to lease resources, leased-resource will be initialized and can be made available to the cleanup task if its initialized.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as a task has initialized results before failing, they can be made available for the consumption. This only should be supported with task.onFail: continue.

When a pipeline author has chosen a strategy where a task failure is expected, the author must have complete understanding of what the result could mean and how to consume that result if it's initialized before failing.

Thanks for your input, @pritidesai! This is a good point and the question is essentially: Should we emit results for failed tasks with onFail:continue? And this may need some discussion.

Currently we don't emit results for failed tasks. There are some discussions around it and I'm not sure why we ended up not implement it. Looking at the requirement 3 of this TEP: the task would be considered "successful" for the purposes of determining the status of the pipeline, it seems to me we should respect the existing behaviour about emitting result, and change behaviour of emitting task is another discussion/issue/TEP? Is it a good idea that we respect the existing behaviour as default, and create another TEP/Issue specifically deal with the emitting result behaviour if we can such request from users?

In terms of consistency, I think step.OnError and task.OnFail are different things. step.onError emits the result only when the overall task run status is success, but in task.OnFail we keep the task run status as failed even when it is set to continue

Thoughts? @jerop @vdemeester ?

Copy link
Member

@pritidesai pritidesai Aug 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have an opportunity here to design this right for many different use cases we have seen. I would recommend attempting to solve here rather than delegating it to a separate TEP.

task.onError is a failure strategy and no different from step.onError. Pipeline is not permitted to continue running after a first failure is encountered. With task.onError, even though the task is failed (container exit code stays non-zero), the pipeline is allowed to continue executing the rest of the tasks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. If we want to consolidate the behaviour about emitting results from failed task results in this TEP, I think it makes more sense to make the emitting result behaviour consistent (i.e. we emit or not emit results from failed tasks regardless of the OnErrorType), otherwise such inconsistency would be confusing to users, since OnErrorType simply describes "should we stop executing the pipeline on task failure or not", and not about "should we emit results from such failed tasks or not".

Looking at the discussion and some feedback from this PR, it looks we do have such requests, why don't we emit result for failed tasks across the board?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pritidesai @lbernick @jerop

I have added a new section Emit Results from Ignored Failed Tasks and updated Task with Missing Resource Dependency section based on our discussion. Please take a look

@pritidesai
Copy link
Member

Thanks a bunch @QuanZhang-William for taking this forward, appreciate all the efforts you have put in 🤗.

@jerop @vdemeester I have left some comments on the design, please feel free to ignore them if they are not aligned, thanks 🙏

@pritidesai
Copy link
Member

API WG - @QuanZhang-William working on comments

@QuanZhang-William QuanZhang-William force-pushed the tep-0050-add-proposed-design branch 3 times, most recently from da51a1f to 2f7c61d Compare August 30, 2022 14:05
Copy link
Member

@lbernick lbernick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious if we have any desire for this to be consistent with step onError, or if they're addressing different use cases? If so, might be worth including a short section discussing similarities/differences between the two features. Not blocking

teps/0050-ignore-task-failures.md Outdated Show resolved Hide resolved
teps/0050-ignore-task-failures.md Outdated Show resolved Hide resolved
teps/0050-ignore-task-failures.md Outdated Show resolved Hide resolved
This commit proposes adding a new field "OnError" to "PipelineTask" definition to allow users define task failure strategy
@pritidesai
Copy link
Member

Thank you @QuanZhang-William once again 🤗I have one last NIT but will not block this PR.

“continueAndFail” sounds very confusing to me 🙏 “continue” and “fail” are two opposites states and meant for two different runs here I.e. continue pipelineRun but fail taskRun. The simpler options are “continue” (continue execution of PipelineRun) or “ignoreFailure” (ignore taskRun failure).

Please feel free to address this NIT in a separate PR or we implement as is.

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 21, 2022
@pritidesai
Copy link
Member

/retest

@pritidesai
Copy link
Member

/test pull-community-teps-lin

@tekton-robot
Copy link
Contributor

@pritidesai: No presubmit jobs available for tektoncd/community@main

In response to this:

/test pull-community-teps-lin

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot merged commit f495d48 into tektoncd:main Sep 22, 2022
QuanZhang-William added a commit to QuanZhang-William/community that referenced this pull request Sep 28, 2023
Prior to this commit, the proposal 2 valid values for `PipelineTask` `OnError` field: `continueAndFail` and `StopAndFail`.
`continueAndFail` indicates to fail the `PipelineTask`, continue to execute the DAG, and DO NOT fail the `PipelineRun`.
However, the name `continueAndFail` is very confusing since

- 'continue' and 'Fail' sounds like a conflict (see [comment](tektoncd#785 (comment)))
- 'Fail' could be intepreted as failing the whole `pipelinerun`, which is not the expected behavior

This commit simplifies the value to `continue`. This is more straightforward and more consistent with `step.OnError`.

/kind tep
tekton-robot pushed a commit that referenced this pull request Sep 29, 2023
Prior to this commit, the proposal 2 valid values for `PipelineTask` `OnError` field: `continueAndFail` and `StopAndFail`.
`continueAndFail` indicates to fail the `PipelineTask`, continue to execute the DAG, and DO NOT fail the `PipelineRun`.
However, the name `continueAndFail` is very confusing since

- 'continue' and 'Fail' sounds like a conflict (see [comment](#785 (comment)))
- 'Fail' could be intepreted as failing the whole `pipelinerun`, which is not the expected behavior

This commit simplifies the value to `continue`. This is more straightforward and more consistent with `step.OnError`.

/kind tep
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: UnAssigned
Development

Successfully merging this pull request may close these issues.

7 participants