From dd8699383cc796a050ad11bfa70199bd3ec09c0b Mon Sep 17 00:00:00 2001 From: Prashant Sharma Date: Mon, 26 Jul 2021 17:06:03 +0530 Subject: [PATCH 1/2] TEP-0069: Support retries for custom task in a pipeline - design. --- ...t-retries-for-custom-task-in-a-pipeline.md | 64 +++++++++++++------ teps/README.md | 2 +- 2 files changed, 44 insertions(+), 22 deletions(-) diff --git a/teps/0069-support-retries-for-custom-task-in-a-pipeline.md b/teps/0069-support-retries-for-custom-task-in-a-pipeline.md index e9888913b..45273a2a5 100644 --- a/teps/0069-support-retries-for-custom-task-in-a-pipeline.md +++ b/teps/0069-support-retries-for-custom-task-in-a-pipeline.md @@ -1,8 +1,8 @@ --- -status: proposed +status: implementable title: Support retries for custom task in a pipeline. creation-date: '2021-05-31' -last-updated: '2021-05-31' +last-updated: '2021-07-26' authors: - '@Tomcli' - '@ScrapCodes' @@ -41,7 +41,7 @@ This TEP is about, a pipeline task can be configured with a `retries` count for Custom tasks. Also, a `PipelineRun` already manages a retry for regular task -by updating it's status. However, for custom task, a tekton owned controller +by updating its status. However, for custom task, a tekton owned controller can signal a custom task controller, to retry. A custom task controller may optionally support it. @@ -126,7 +126,7 @@ Proposed algorithm for performing a retry for custom task. to request a custom task to cancel. - Step 3. In addition to patching the `pipelinerun` controller also enqueue a timer - `EnqueueAfter(30*time.Second)` (configurable). On completion of timeout + `time.After(30*time.Second)` (configurable). On completion of timeout (i.e. 30s), it checks if `/spec/status` is `RunRetry`, then it assumes that custom task does not support retry. - a) if custom task does not supports retry as above, It sets no. of `retry done` @@ -216,23 +216,36 @@ performance requirements. ## Design Details +Add an optional `Retries` field of type `int` to `RunSpec`. -## Test Plan +Add optional `RetriesStatus` field to `RunStatusFields` of type `[]RunStatus`. - +## Test Plan + +Add unit tests and e2e integration tests for following two cases. + +1. If the custom task, does not support a retry, we wait until the configured shorter timeout + (30 seconds by default) and exhaust all retries. + +2. If the custom task *does support* a retry i.e., it does clear its 'spec.status', + on each retry. Verify it performs the correct number of retries. ## Design Evaluation +An upgrade strategy for existing custom controllers, + +1. Custom controller already supports a retry field. + - It can deprecate the existing retry field and refer to `Run.spec.retries`. + - Watch the status field i.e. `/spec/status` of `Run` if it is `RunRetry` + then start executing retry and clear its status to let `tektoncd` + controller that the custom task has started retrying. +2. If custom-task does not already support retry and does not wish to support + it, then they can ignore it and tektoncd controller will be able detect + that. +3. If custom-task does not already support retry and wish to support it, + then it should start watching the `/spec/status` of `Run`. If it is + `RunRetry` then start executing retry and clear its status to let + `tektoncd` controller know that the custom task has started retrying. ## References (optional) diff --git a/teps/README.md b/teps/README.md index 440e4effb..199f84d21 100644 --- a/teps/README.md +++ b/teps/README.md @@ -216,7 +216,7 @@ This is the complete list of Tekton teps: |[TEP-0063](0063-workspace-dependencies.md) | Workspace Dependencies | proposed | 2021-04-23 | |[TEP-0066](0066-dogfooding-tekton.md) | Dogfooding Tekton | proposed | 2021-05-16 | |[TEP-0067](0067-tekton-catalog-pipeline-organization.md) | Tekton Catalog Pipeline Organization | implementable | 2021-02-22 | -|[TEP-0069](0069-support-retries-for-custom-task-in-a-pipeline.md) | Support retries for custom task in a pipeline. | proposed | 2021-05-31 | +|[TEP-0069](0069-support-retries-for-custom-task-in-a-pipeline.md) | Support retries for custom task in a pipeline. | implementable | 2021-07-26 | |[TEP-0070](0070-tekton-catalog-task-platform-support.md) | Platform support in Tekton catalog | proposed | 2021-06-02 | |[TEP-0071](0071-custom-task-sdk.md) | Custom Task SDK | proposed | 2021-06-15 | |[TEP-0072](0072-results-json-serialized-records.md) | Results: JSON Serialized Records | implementable | 2021-07-26 | From c72cb19b9e2e51eca8e509c6040dde72c11ad37f Mon Sep 17 00:00:00 2001 From: Prashant Sharma Date: Thu, 16 Sep 2021 14:32:25 +0530 Subject: [PATCH 2/2] TEP-0069: Support retries for custom task in a pipeline - design - approach 2. --- ...t-retries-for-custom-task-in-a-pipeline.md | 139 +++++------------- 1 file changed, 34 insertions(+), 105 deletions(-) diff --git a/teps/0069-support-retries-for-custom-task-in-a-pipeline.md b/teps/0069-support-retries-for-custom-task-in-a-pipeline.md index 45273a2a5..3664cadc7 100644 --- a/teps/0069-support-retries-for-custom-task-in-a-pipeline.md +++ b/teps/0069-support-retries-for-custom-task-in-a-pipeline.md @@ -29,6 +29,7 @@ authors: - [Alternatives](#alternatives) - [Infrastructure Needed (optional)](#infrastructure-needed-optional) - [Upgrade & Migration Strategy (optional)](#upgrade--migration-strategy-optional) +- [Implementation Pull request(s)](#implementation-pull-request-s) - [References (optional)](#references-optional) @@ -67,8 +68,8 @@ stub code to make it easy how to support it. ### Goals * Support propagating `pipelineSpec.task.retries` count information to custom-task controllers. -* Support signalling a `retry` to custom task controller, for a specific run. A - custom controller may optionally support it. +* Support updating the status of retry history to `tektoncd pipeline` + controller. * Gracefully handle the case where, custom controller does not support retry and yet the `PipelineRun` happens to be configured with retry. This also implies, an existing controller should not mis-behave if it is _not_ upgraded @@ -94,11 +95,6 @@ stub code to make it easy how to support it. and retrying. An end user, cannot know the status of a `PipelineRun` unless they drill down the status of each custom task e.g. if they are viewing their Pipeline progress on UI. -5. Make it easier for custom task controller to implement retry, i.e. they do - not need to track status. - Custom task SDK point of view, it will be easier to provide support for - retries. e.g. we could even provide a callback, where custom task controller - can just reset the status and certain parameters for it to begin retrying. ## Requirements @@ -108,74 +104,17 @@ None. Requesting API changes: -1. Add field `Retries` to `RunSpec`, an integer count which acts as a FYI to +1. Add field `Retries` to `RunSpec`, an integer count which is communicated to custom task controller. -2. Add a new `RunRetry`, in addition to `RunCancelled` status to `RunSpecStatus` - i.e. `v1alpha1.RunSpecStatusRetry` -3. Add a field `RetriesStatus` to `RunStatusFields`, to maintain the retry +2. Add a field `RetriesStatus` to `RunStatusFields`, to maintain the retry history for a `Run`, similar to `v1beta1.TaskRunStatusFields.RetriesStatus` + This field is updated by the custom task controller. -Proposed algorithm for performing a retry for custom task. - -- Step 1. A `pipelineTask` consisting of a custom task X, is configured with - `retries` count. - -- Step 2. On failure of task X, `pipelinerun` controller sees a request for a - retry. It then communicates the same to custom task `Run` by patching - `/spec/status` with a `v1alpha1.RunSpecStatusRetry` i.e. `RunRetry`. Similar - to request a custom task to cancel. - -- Step 3. In addition to patching the `pipelinerun` controller also enqueue a timer - `time.After(30*time.Second)` (configurable). On completion of timeout - (i.e. 30s), it checks if `/spec/status` is `RunRetry`, then it assumes that - custom task does not support retry. - - a) if custom task does not supports retry as above, It sets no. of `retry done` - to the `retries` count configured - i.e. exhaust all retries. - - b) if custom task does support retry, update retry history. - -- Step 4. The custom task that wants to support the retry, has to update - - a) `status.conditions` to indicate it is `Running`. - - b) clear `/spec/status` if it is `RunRetry`. - -_A task may retry and immediately fail, so controller cannot fully rely on -`status.conditions`._ +A pipeline task may be configured with a timeout, and the timeout includes time +required to perform all the retries. ### Notes/Caveats (optional) - -Q. A Custom task does not support retry, and is configured to run with retry. - How to gracefully handle this case? - - _Approach proposed:_ The `pipelineRun` waits for a configurable shorter timeout - (say 30s), and if the custom task controller does not signal that it has begun - to retry, assume it does not support `retry`. - -Other options: - -* Option 1: The `pipelineRun` should wait till the timeout and fail. The - downside of this approach is, it may wait for a very long period of time. -* Option 2: It should have a way of knowing custom task does not support a - retry. e.g. Custom controllers declaring that they support retry, somehow - (not sure how this can be done). - -Q. In a rare scenario, what if there is race between "RunRetry" and - "RunCancelled", i.e. tektoncd controller asks the custom controller to retry - and soon after decides to cancel (or user invoked cancel). Meanwhile, custom - controller detects, that it has been asked to retry and begins by clearing - its status. This may cause, custom controller to miss the cancel update. - -```go -patches := []jsonpatch.JsonPatchOperation{{ - Operation: "test", - Path: "/spec/status", - Value: v1alpha1.RunSpecStatusRetry, - }, { - Operation: "remove", - Path: "/spec/status", - }} -``` - -A patch as above can be used i.e. test if `/spec/status` -has `v1alpha1.RunSpecStatusRetry` then clear it, else fail. +None. ### Risks and Mitigations @@ -220,32 +159,18 @@ Add an optional `Retries` field of type `int` to `RunSpec`. Add optional `RetriesStatus` field to `RunStatusFields` of type `[]RunStatus`. -Add a config map entry (default-short-timeout-seconds) to `config-defaults` in -order to make short timeout configurable. - -```yaml -# default-short-timeout-seconds contains the default number of -# seconds to wait for custom task to respond, on timeout it is assumed -# custom task does not support the feature. Currently, it is used to -# quickly timeout a retry in a custom-task. -default-short-timeout-seconds: "30" # 30 seconds -``` - -Introduce a new status `RunSpecStatusRetry RunSpecStatus = "RunRetry"` for -`/spec/status` of a `Run`. - -Algorithm for performing a retry for a custom task is same as proposal section -[Proposal](#proposal). +A custom task controller can optionally support retry, and can honor the retries +count, and update the `RetriesStatus` on each retry. ## Test Plan Add unit tests and e2e integration tests for following two cases. -1. If the custom task, does not support a retry, we wait until the configured shorter timeout - (30 seconds by default) and exhaust all retries. +1. If the custom task, does not support a retry, we wait until the configured + pipeline task timeout and fail. -2. If the custom task *does support* a retry i.e., it does clear its 'spec.status', - on each retry. Verify it performs the correct number of retries. +2. If the custom task *does support* a retry i.e., it does update the `RetriesStatus`. + Verify it performs the correct number of retries. ## Design Evaluation