
feat: add retry API for archived workflows. Fixes #7906 (#7988)

Merged
30 commits merged into argoproj:master on Mar 17, 2022

Conversation

dpadhiar (Member) commented Feb 23, 2022:

Signed-off-by: Dillen Padhiar <dpadhiar99@gmail.com>

Fixes #7906

Previously, workflows could only be retried if they still existed in the cluster. If a workflow is GC-ed immediately after completion, you had to save it manually as plain YAML (or have it in the workflow archive), then kubectl create it before running argo retry. This API allows an archived workflow to be re-created and retried, removing the need for that otherwise hacky workaround.
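
For illustration, here is a minimal sketch of invoking the new API from Go through the generated archived-workflow service client. The apiclient import path, the NewClientFromOpts options, and the UID/namespace values are assumptions for this example; only the RetryArchivedWorkflowRequest fields are taken from this PR.

package main

import (
	"fmt"
	"os"

	"github.com/argoproj/argo-workflows/v3/pkg/apiclient"
	workflowarchivepkg "github.com/argoproj/argo-workflows/v3/pkg/apiclient/workflowarchive"
)

func main() {
	// Connect to an Argo Server; the URL and token handling are assumptions for this sketch.
	ctx, apiClient, err := apiclient.NewClientFromOpts(apiclient.Opts{
		ArgoServerOpts: apiclient.ArgoServerOpts{URL: "localhost:2746"},
		AuthSupplier:   func() string { return os.Getenv("ARGO_TOKEN") },
	})
	if err != nil {
		panic(err)
	}

	archiveClient, err := apiClient.NewArchivedWorkflowServiceClient()
	if err != nil {
		panic(err)
	}

	// Retry an archived workflow by its archive UID; the server re-creates the
	// Workflow resource if it no longer exists in the cluster.
	wf, err := archiveClient.RetryArchivedWorkflow(ctx, &workflowarchivepkg.RetryArchivedWorkflowRequest{
		Uid:       "00000000-0000-0000-0000-000000000000", // placeholder archive UID
		Namespace: "argo",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("retried workflow %s/%s\n", wf.Namespace, wf.Name)
}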

dpadhiar (Member, Author): @sarabala1979

dpadhiar and others added 4 commits February 24, 2022 13:47
@dpadhiar dpadhiar marked this pull request as ready for review February 26, 2022 00:38
persist/sqldb/workflow_archive.go (outdated, resolved)
server/workflowarchive/archived_workflow_server.go (outdated, resolved)

// retryWorkflow takes a wf in method signature instead and has boolean to determine if this is from archive or not - archive means we must create the workflow before update
func retryWorkflow(ctx context.Context, kubeClient kubernetes.Interface, hydrator hydrator.Interface, wfClient v1alpha1.WorkflowInterface, wf *wfv1.Workflow, restartSuccessful bool, nodeFieldSelector string, retryArchive bool) (*wfv1.Workflow, error) {

switch wf.Status.Phase {
Contributor:

I think we should have a new func prepareWorkflowForRetry. Specifically, it should not take any kubeClient, hydrator, or wfClient - all it should do is take a workflow and make it ready for retrying.

the calling function can do any hydration or whatever

dpadhiar (Member, Author):

func prepareWorkflowForRetry(wf *wfv1.Workflow) (*wfv1.Workflow, error) {

	switch wf.Status.Phase {
	case wfv1.WorkflowFailed, wfv1.WorkflowError:
	default:
		return nil, errors.Errorf(errors.CodeBadRequest, "workflow must be Failed/Error to retry")
	}

	newWF := wf.DeepCopy()

	// Delete/reset fields which indicate workflow completed
	delete(newWF.Labels, common.LabelKeyCompleted)
	delete(newWF.Labels, common.LabelKeyWorkflowArchivingStatus)
	newWF.Status.Conditions.UpsertCondition(wfv1.Condition{Status: metav1.ConditionFalse, Type: wfv1.ConditionTypeCompleted})
	newWF.ObjectMeta.Labels[common.LabelKeyPhase] = string(wfv1.NodeRunning)
	newWF.Status.Phase = wfv1.WorkflowRunning
	newWF.Status.Nodes = make(wfv1.Nodes)
	newWF.Status.Message = ""
	newWF.Status.StartedAt = metav1.Time{Time: time.Now().UTC()}
	newWF.Status.FinishedAt = metav1.Time{}
	newWF.Spec.Shutdown = ""
	if newWF.Spec.ActiveDeadlineSeconds != nil && *newWF.Spec.ActiveDeadlineSeconds == 0 {
		// if it was terminated, unset the deadline
		newWF.Spec.ActiveDeadlineSeconds = nil
	}

	newWF.Status.StoredTemplates = make(map[string]wfv1.Template)
	for id, tmpl := range wf.Status.StoredTemplates {
		newWF.Status.StoredTemplates[id] = tmpl
	}

	return newWF, nil
}

dpadhiar (Member, Author):

Should this function replace retryWorkflow? In that case, we would move all the node logic into the calling functions RetryWorkflow and RetryArchiveWorkflow

Contributor:

prepareWorkflowForRetry is great

workflow/util/util.go (outdated, resolved)
alexec (Contributor) left a comment:

Great work identifying that you cannot just resubmit an archived workflow.

alexec (Contributor) left a comment:

changes requested

@dpadhiar dpadhiar requested a review from alexec March 1, 2022 21:42
alexec (Contributor) left a comment:

Some quick comments.

What happens if you move the Update/Create to the service level?

server/workflowarchive/archived_workflow_server.go (outdated, resolved)
workflow/util/util.go (two threads; outdated, resolved)

workflow/util/util_test.go (outdated, resolved)
dpadhiar (Member, Author) commented Mar 2, 2022:

What happens if you move the Update/Create to the service level?

Functionality is the same if we move the update/create calls to the service level. We can remove the RetryArchiveWorkflow function altogether, since RetryWorkflow will only prepare a workflow for retry and the server will either update or create it.

alexec (Contributor) commented Mar 2, 2022:

I think that's the right thing to do. The current util func combines two responsibilities:

  • Preparing the workflow for retry.
  • Persisting the workflow.

If it only does the first, the second responsibility can be moved to the service layer.

https://en.wikipedia.org/wiki/Single-responsibility_principle

… to service level

@dpadhiar dpadhiar requested a review from alexec March 2, 2022 19:24
@dpadhiar dpadhiar requested a review from alexec March 2, 2022 22:33
@dpadhiar dpadhiar requested a review from alexec March 2, 2022 23:47
alexec (Contributor) left a comment:

can you please add an e2e (API) test?

@terrytangyuan terrytangyuan self-requested a review March 3, 2022 01:33
@dpadhiar dpadhiar requested a review from alexec March 8, 2022 21:24
terrytangyuan (Member) left a comment:

LGTM.

})

s.Run("Retry", func() {
s.Need(fixtures.BaseLayerArtifacts)
Contributor:

not needed

dpadhiar (Member, Author):

Currently, retrying an archived workflow will fail and return a 500 internal server error if the original workflow isn't deleted, so for testing purposes this is required.

Contributor:

s.Need(fixtures.BaseLayerArtifacts) should not be needed?

updated, err = retryWorkflow(ctx, kubeClient, hydrator, wfClient, name, restartSuccessful, nodeFieldSelector)
return !errorsutil.IsTransientErr(err), err
})
func RetryWorkflow(ctx context.Context, podIf v1.PodInterface, wfClient v1alpha1.WorkflowInterface, wf *wfv1.Workflow, restartSuccessful bool, nodeFieldSelector string) (*wfv1.Workflow, error) {
Contributor:

can we follow the pattern we found in Resubmit, i.e. create a FormulateResubmitWorkflow and move all the client-go calls into the service

dpadhiar (Member, Author):

We can remove wfClient entirely from the RetryWorkflow and RetryArchivedWorkflow methods, since it is no longer used at all in the helper prepareWorkflowForRetry.

Contributor:

yes please

@dpadhiar dpadhiar requested a review from alexec March 10, 2022 22:17
@alexec alexec self-assigned this Mar 15, 2022
func retryWorkflow(ctx context.Context, kubeClient kubernetes.Interface, hydrator hydrator.Interface, wfClient v1alpha1.WorkflowInterface, name string, restartSuccessful bool, nodeFieldSelector string) (*wfv1.Workflow, error) {
wf, err := wfClient.Get(ctx, name, metav1.GetOptions{})
// RetryWorkflow creates a workflow from the workflow archive
func RetryArchivedWorkflow(ctx context.Context, podIf v1.PodInterface, wf *wfv1.Workflow, restartSuccessful bool, nodeFieldSelector string) (*wfv1.Workflow, error) {
Contributor:

This is a weird one. We want to retry an archived workflow. But the workflow itself may or may not have been deleted. So I think the logic can be:

  1. Hydrate workflow
  2. Formulate the workflow for retry.
  3. Delete any pods for any failed node.
  4. De-hydrate workflow.
  5. Update the workflow. If 404 error, zero fields like UID and resourceVersion, then create it.

I want to see these changes:

  • Rename prepareWorkflowForRetry to FormulateRetryWorkflow.
  • Rather than passing in podIf, return a list of pod names that need to be deleted.
  • Wrap that in RetryWorkflow. I think you can use that in both cases.
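
A minimal sketch of that flow, assuming the FormulateRetryWorkflow signature that eventually lands in this PR plus the hydrator and client-go interfaces already used elsewhere in the thread (the function name and wiring here are illustrative, not the final implementation):

func retryArchivedWorkflow(ctx context.Context, hydrator hydrator.Interface, kubeClient kubernetes.Interface,
	wfClient v1alpha1.WorkflowInterface, wf *wfv1.Workflow, restartSuccessful bool, nodeFieldSelector string) (*wfv1.Workflow, error) {
	// 1. Hydrate: pull any offloaded node status back onto the object.
	if err := hydrator.Hydrate(wf); err != nil {
		return nil, err
	}
	// 2. Formulate the workflow for retry (pure, in-memory) and collect the pods to delete.
	newWF, podsToDelete, err := util.FormulateRetryWorkflow(ctx, wf, restartSuccessful, nodeFieldSelector)
	if err != nil {
		return nil, err
	}
	// 3. Delete the pods of any failed nodes.
	for _, podName := range podsToDelete {
		if err := kubeClient.CoreV1().Pods(wf.Namespace).Delete(ctx, podName, metav1.DeleteOptions{}); err != nil && !apierr.IsNotFound(err) {
			return nil, err
		}
	}
	// 4. De-hydrate: offload large node status again before persisting.
	if err := hydrator.Dehydrate(newWF); err != nil {
		return nil, err
	}
	// 5. Update the workflow; if it was deleted from the cluster, zero the
	//    server-populated fields and create it instead.
	updated, err := wfClient.Update(ctx, newWF, metav1.UpdateOptions{})
	if apierr.IsNotFound(err) {
		newWF.ObjectMeta.ResourceVersion = ""
		newWF.ObjectMeta.UID = ""
		return wfClient.Create(ctx, newWF, metav1.CreateOptions{})
	}
	return updated, err
}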

dpadhiar (Member, Author):

Should we move deleting of the pods all to the server side then?

// Iterate the previous nodes. If it was successful Pod carry it forward
	deletedNodes := make(map[string]bool)
	for _, node := range wf.Status.Nodes {
		doForceResetNode := false
		if _, present := nodeIDsToReset[node.ID]; present {
			// if we are resetting this node then don't carry it across regardless of its phase
			doForceResetNode = true
		}
		switch node.Phase {
		case wfv1.NodeSucceeded, wfv1.NodeSkipped:
			if !strings.HasPrefix(node.Name, onExitNodeName) && !doForceResetNode {
				newWF.Status.Nodes[node.ID] = node
				continue
			}
		case wfv1.NodeError, wfv1.NodeFailed, wfv1.NodeOmitted:
			if !strings.HasPrefix(node.Name, onExitNodeName) && (node.Type == wfv1.NodeTypeDAG || node.Type == wfv1.NodeTypeTaskGroup || node.Type == wfv1.NodeTypeStepGroup) {
				newNode := node.DeepCopy()
				newNode.Phase = wfv1.NodeRunning
				newNode.Message = ""
				newNode.StartedAt = metav1.Time{Time: time.Now().UTC()}
				newNode.FinishedAt = metav1.Time{}
				newWF.Status.Nodes[newNode.ID] = *newNode
				continue
			} else {
				deletedNodes[node.ID] = true
			}
			// do not add this status to the node. pretend as if this node never existed.
		default:
			// Do not allow retry of workflows with pods in Running/Pending phase
			return nil, errors.InternalErrorf("Workflow cannot be retried with node %s in %s phase", node.Name, node.Phase)
		}
		if node.Type == wfv1.NodeTypePod {
			templateName := getTemplateFromNode(node)
			version := GetWorkflowPodNameVersion(wf)
			podName := PodName(wf.Name, node.Name, templateName, node.ID, version)
			log.Infof("Deleting pod: %s", podName)
			err := podIf.Delete(ctx, podName, metav1.DeleteOptions{})
			if err != nil && !apierr.IsNotFound(err) {
				return nil, errors.InternalWrapError(err)
			}
		} else if node.Name == wf.ObjectMeta.Name {
			newNode := node.DeepCopy()
			newNode.Phase = wfv1.NodeRunning
			newNode.Message = ""
			newNode.StartedAt = metav1.Time{Time: time.Now().UTC()}
			newNode.FinishedAt = metav1.Time{}
			newWF.Status.Nodes[newNode.ID] = *newNode
			continue
		}
	}

	if len(deletedNodes) > 0 {
		for _, node := range newWF.Status.Nodes {
			var newChildren []string
			for _, child := range node.Children {
				if !deletedNodes[child] {
					newChildren = append(newChildren, child)
				}
			}
			node.Children = newChildren

			var outboundNodes []string
			for _, outboundNode := range node.OutboundNodes {
				if !deletedNodes[outboundNode] {
					outboundNodes = append(outboundNodes, outboundNode)
				}
			}
			node.OutboundNodes = outboundNodes

			newWF.Status.Nodes[node.ID] = node
		}
	}

Update the workflow. If 404 error, zero fields like UID and resourceVersion, then create it.

This can also occur on the server side so we can remove RetryArchivedWorkflow and use RetryWorkflow for both.

Contributor:

correct - to the service layer

dpadhiar (Member, Author):

Moved deletion of pods to the server side. RetryWorkflow now also returns a list of pod names to delete, which removes the need to pass podIf at all.

func RetryWorkflow(ctx context.Context, wf *wfv1.Workflow, restartSuccessful bool, nodeFieldSelector string) (*wfv1.Workflow, []string, error) {
	updatedWf, podsToDelete, err := FormulateRetryWorkflow(ctx, wf, restartSuccessful, nodeFieldSelector)
	if err != nil {
		return nil, nil, err
	}

	return updatedWf, podsToDelete, err
}
func FormulateRetryWorkflow(ctx context.Context, wf *wfv1.Workflow, restartSuccessful bool, nodeFieldSelector string) (*wfv1.Workflow, []string, error) {
...
		if node.Type == wfv1.NodeTypePod {
			templateName := getTemplateFromNode(node)
			version := GetWorkflowPodNameVersion(wf)
			podName := PodName(wf.Name, node.Name, templateName, node.ID, version)
			podsToDelete = append(podsToDelete, podName)
func (w *archivedWorkflowServer) RetryArchivedWorkflow(ctx context.Context, req *workflowarchivepkg.RetryArchivedWorkflowRequest) (*wfv1.Workflow, error) {
	wfClient := auth.GetWfClient(ctx)
	kubeClient := auth.GetKubeClient(ctx)

	wf, err := w.GetArchivedWorkflow(ctx, &workflowarchivepkg.GetArchivedWorkflowRequest{Uid: req.Uid})
	if err != nil {
		return nil, err
	}

	wf, podsToDelete, err := util.RetryWorkflow(ctx, wf, req.RestartSuccessful, req.NodeFieldSelector)
	if err != nil {
		return nil, err
	}

	for _, podName := range podsToDelete {
		log.Infof("Deleting pod: %s", podName)
		err := kubeClient.CoreV1().Pods(wf.Namespace).Delete(ctx, podName, metav1.DeleteOptions{})
		if err != nil && !apierr.IsNotFound(err) {
			return nil, errors.InternalWrapError(err)
		}
	}

	wf, err = wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Update(ctx, wf, metav1.UpdateOptions{})
	if apierr.IsBadRequest(err) {
		wf.ObjectMeta.ResourceVersion = ""
		wf, err = wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Create(ctx, wf, metav1.CreateOptions{})
		if err != nil {
			return nil, err
		}
	}

	return wf, nil
}

…ForRetry to FormulateRetryWorkflow

@dpadhiar dpadhiar requested a review from alexec March 16, 2022 23:15
}
})
}
// func TestRetryArchivedWorkflow(t *testing.T) {
Contributor:

clean-up?

dpadhiar (Member, Author):

Missed on a quick scan, removed

@dpadhiar dpadhiar requested a review from alexec March 17, 2022 16:46
}

for _, podName := range podsToDelete {
log.Infof("Deleting pod: %s", podName)
Contributor:

please use structured logging
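
For reference, a structured-logging version of the line under review, assuming log is the sirupsen/logrus logger used elsewhere in the codebase (the field name is just an example):

// Attach the pod name as a structured field instead of formatting it into the message.
log.WithField("podName", podName).Info("Deleting pod")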

log.Infof("Deleting pod: %s", podName)
err := kubeClient.CoreV1().Pods(wf.Namespace).Delete(ctx, podName, metav1.DeleteOptions{})
if err != nil && !apierr.IsNotFound(err) {
return nil, errors.InternalWrapError(err)
Contributor:

remove InternalWrapError, not needed
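
Concretely, the suggestion appears to be to return the client-go error unchanged, e.g.:

err := kubeClient.CoreV1().Pods(wf.Namespace).Delete(ctx, podName, metav1.DeleteOptions{})
if err != nil && !apierr.IsNotFound(err) {
	return nil, err // no errors.InternalWrapError needed
}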

}

for _, podName := range podsToDelete {
log.Infof("Deleting pod: %s", podName)
Contributor:

structured logging please

log.Infof("Deleting pod: %s", podName)
err := kubeClient.CoreV1().Pods(wf.Namespace).Delete(ctx, podName, metav1.DeleteOptions{})
if err != nil && !apierr.IsNotFound(err) {
return nil, errors.InternalWrapError(err)
Contributor:

remove InternalWrapError

}

wf, err = wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Update(ctx, wf, metav1.UpdateOptions{})
if apierr.IsAlreadyExists(err) {
Contributor:

is this condition correct?

Contributor:

IsNotFound? how did you test this?

dpadhiar (Member, Author):

I used the wrong error check: IsAlreadyExists is meant for going from Create to Update, but here we're going from Update to Create. Will fix.
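
A sketch of the corrected fallback, assuming the same variables as the server code above: call Update first, and on a NotFound error zero the server-populated fields and Create:

wf, err = wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Update(ctx, wf, metav1.UpdateOptions{})
if apierr.IsNotFound(err) {
	// the original workflow no longer exists in the cluster: clear the fields
	// owned by the API server, then create it fresh
	wf.ObjectMeta.ResourceVersion = ""
	wf.ObjectMeta.UID = ""
	wf, err = wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Create(ctx, wf, metav1.CreateOptions{})
}
if err != nil {
	return nil, err
}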

updated, err = retryWorkflow(ctx, kubeClient, hydrator, wfClient, name, restartSuccessful, nodeFieldSelector)
return !errorsutil.IsTransientErr(err), err
})
func RetryWorkflow(ctx context.Context, wf *wfv1.Workflow, restartSuccessful bool, nodeFieldSelector string) (*wfv1.Workflow, []string, error) {
Contributor:

I think you should delete this func

@alexec alexec enabled auto-merge (squash) March 17, 2022 20:46
@alexec alexec merged commit d4b1afe into argoproj:master Mar 17, 2022
dpadhiar added a commit to dpadhiar/argo-workflows that referenced this pull request Mar 23, 2022
…proj#7988)

@sarabala1979 sarabala1979 mentioned this pull request Apr 14, 2022