
feat: Allow step restart on workflow retry. Closes #2334 #2431

Merged. 4 commits merged into argoproj:master on Apr 12, 2020.

Conversation

@markterm (Contributor)

This allows retryStrategy to contain restartOnWorkflowRetry, in which case the entire node (and therefore all descendant nodes) will be restarted when the workflow is retried, unless they have all already succeeded.

See: #2334 for more info

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • Optional. My organization is added to USERS.md.
  • I've signed the CLA and required builds are green.

workflow/controller/operator.go (review thread, outdated/resolved)
test/e2e/cli_test.go (review thread, outdated/resolved)
workflow/util/util.go (review thread, outdated/resolved)
@codecov bot commented Mar 19, 2020

Codecov Report

Attention: Patch coverage is 21.53846% with 51 lines in your changes missing coverage. Please review.

Project coverage is 11.62%. Comparing base (694664c) to head (0b194c4).

Files with missing lines            Patch %   Lines
workflow/util/util.go               21.62%    27 missing, 2 partials ⚠️
workflow/controller/operator.go     35.29%    8 missing, 3 partials ⚠️
cmd/argo/commands/retry.go          0.00%     10 missing ⚠️
server/workflow/workflow_server.go  0.00%     1 missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2431      +/-   ##
==========================================
+ Coverage   11.22%   11.62%   +0.39%     
==========================================
  Files          83       84       +1     
  Lines       32696    32871     +175     
==========================================
+ Hits         3671     3820     +149     
  Misses      28525    28525              
- Partials      500      526      +26     


@alexec (Contributor) commented Mar 19, 2020

@sarabala1979 I think this is your area of expertise?

@markterm (Contributor Author)

@sarabala1979 hi, any feedback on this?

@simster7 self-assigned this Mar 30, 2020

@simster7 (Member) left a comment

My initial reaction is that this shouldn't be implemented in the workflow-controller. "Retry" is not a feature that the controller knows about. When a Workflow is retried, the CLI/Server manually edits the Workflow object and sets "Failed" steps to "Pending" so that they are re-run. The controller is unaware that this has happened and treats the Workflow as if it was running for the first time.

For this feature to be implemented without breaking abstraction barriers, it should be implemented fully in the RetryWorkflow function in workflow/util/util.go. This could perhaps be done by specifying, in a UI/CLI, which steps should be fully restarted; these would then be passed to the function. The function can then restart the appropriate nodes from that input.
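
For illustration, the shape of that suggestion might look roughly like the sketch below: the caller (CLI/Server) decides which nodes to restart and RetryWorkflow simply applies it, for example by dropping those nodes from the status so the controller re-runs them. The function name and signature here are hypothetical, not Argo's actual code.

import (
	wfv1 "github.com/argoproj/argo/pkg/apis/workflow/v1alpha1"
)

// resetSelectedNodes removes the chosen nodes from the retried Workflow's status
// so the controller treats them as never having run. Illustrative sketch only.
func resetSelectedNodes(newWF *wfv1.Workflow, nodesToRestart map[string]bool) {
	for id := range newWF.Status.Nodes {
		if nodesToRestart[id] {
			delete(newWF.Status.Nodes, id)
		}
	}
}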

@@ -629,6 +629,17 @@ func (woc *wfOperationCtx) processNodeRetries(node *wfv1.NodeStatus, retryStrate
return woc.markNodePhase(node.Name, wfv1.NodeSucceeded), true, nil
}

if woc.workflowDeadline != nil && time.Now().UTC().After(*woc.workflowDeadline) {
var message string
if woc.workflowDeadline.IsZero() {
Member:

JSYK: The way we detect termination has changed since you opened this PR. This info is now found in Workflow.Spec.Shutdown

Contributor Author:

Thanks I'm updating this.


- name: steps-inner
retryStrategy:
restartOnWorkflowRetry: true
Member:

My initial reaction is that this is not the correct place for this.

retryStrategy deals with retrying this node during a single workflow execution. Your proposed flag deals with retrying the node across different executions. Under this implementation retryStrategy is overloaded.

Contributor Author:

Ah, you are right - I'm moving restartOnWorkflowRetry up to the template level. Does that work?

for _, node := range woc.wf.Status.Nodes {
if node.IsActiveSuspendNode() {
woc.markNodePhase(node.Name, wfv1.NodeFailed, fmt.Sprintf("step exceeded workflow deadline %s", *woc.workflowDeadline))
var message = ""


ineffectual assignment to message (from ineffassign)

@markterm (Contributor Author) left a comment

Thanks for taking a look.

My use case for this is an 'asynchronous' group of steps: the first starts a pod that kicks off a job outside Argo, and a suspend step is then either resumed or failed depending on that job. Just retrying the resume step on failure therefore wouldn't be very useful.

As you saw, I did implement all the actual logic in the RetryWorkflow function, but a user triggering a retry wouldn't know what to pass in to be fully restarted. I do think this information best belongs with the workflow, which means storing at least that in the workflow template. But I don't think that's very invasive ...


@simster7 (Member)

but a user triggering a retry wouldn't know what to pass in to be fully restarted

If a user tagged a certain template as restartOnWorkflowRetry, wouldn't they be able to specify that same template when using argo retry? After all, we're not implementing automatic Workflow retries in this PR (#1578), so some user intervention/third-party scripting is still required to restart the Workflow. Why can't the user/script supply which steps should be retried with argo retry?

I do think this information best belongs with the workflow, which means storing at least that in the workflow template. But I don't think that's very invasive.

Let me gather some more opinions with the team and get back to you. Could be that you're right and I'm a bit too stringent 🙂

@simster7 (Member)

Let me gather some more opinions with the team and get back to you.

Hi @mark9white. The team agreed that we don't want to support this sort of labeling on the Workflow spec. This feature is still very much desired, but we'll have to find a way to specify which nodes to restart fully on the client side.

@markterm (Contributor Author) commented Apr 1, 2020

Thanks for following up with the team. I could make this work by providing the ability to specify nodes to restart by templateName - would that be ok?

Doing this is not ideal for Argo users. Has the team given any thought to providing first-class support for triggering asynchronous jobs from Argo without using polling? An example would be triggering a Spark job (e.g. directly or via something like Amazon EMR), where one step triggers it and a suspend step then waits for the job to complete.

@markterm (Contributor Author) commented Apr 1, 2020

I have just modified the PR so the retry command takes in a --reset-nodes-field-selector parameter.

@simster7 (Member) commented Apr 6, 2020

I have just modified the PR so the retry command takes in a --reset-nodes-field-selector parameter.

Thanks @mark9white! I'll take another look.

Doing this is not ideal for Argo users

Do you mean using the --reset-nodes-field-selector approach? Would you mind explaining a bit why?

@simster7 (Member) left a comment

Looks pretty good, just some minor comments.

@@ -37,5 +51,6 @@ func NewRetryCommand() *cobra.Command {
command.Flags().StringVarP(&cliSubmitOpts.output, "output", "o", "", "Output format. One of: name|json|yaml|wide")
command.Flags().BoolVarP(&cliSubmitOpts.wait, "wait", "w", false, "wait for the workflow to complete")
command.Flags().BoolVar(&cliSubmitOpts.watch, "watch", false, "watch the workflow until it completes")
command.Flags().StringVar(&retryOps.nodesToResetFieldSelector, "reset-nodes-field-selector", "", "selector of nodes to reset, eg: --node-field-selector inputs.paramaters.myparam.value=abc")
Member:

I think the name of this flag should be --node-field-selector as all this does is provide a selector. This way it would be analogous to #1904. To specify that we want said nodes restarted we can pass a flag in conjunction:

$ argo retry --restart-successful --node-field-selector inputs.paramaters.myparam.value=abc

Or something like this. What do you think?

Contributor Author:

Good idea, am applying.
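
For illustration, the flag pair suggested above could be registered on the cobra retry command roughly like this (the struct, variable names, and help text are assumptions, not the merged code, and the snippet assumes the NewRetryCommand context shown in the diff above):

// Sketch only: the two flags that together select and restart nodes.
type retryOptions struct {
	restartSuccessful bool   // also reset successful nodes matched by the selector
	nodeFieldSelector string // field selector choosing which nodes to reset
}

var retryOpts retryOptions
command.Flags().BoolVar(&retryOpts.restartSuccessful, "restart-successful", false,
	"indicates to restart successful nodes matching the --node-field-selector")
command.Flags().StringVar(&retryOpts.nodeFieldSelector, "node-field-selector", "",
	"selector of nodes to reset, e.g.: --node-field-selector inputs.parameters.myparam.value=abc")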

Comment on lines +638 to +648
if woc.wf.Spec.Shutdown != "" || (woc.workflowDeadline != nil && time.Now().UTC().After(*woc.workflowDeadline)) {
var message string
if woc.wf.Spec.Shutdown != "" {
message = fmt.Sprintf("Stopped with strategy '%s'", woc.wf.Spec.Shutdown)
} else {
message = fmt.Sprintf("retry exceeded workflow deadline %s", *woc.workflowDeadline)
}
woc.log.Infoln(message)
return woc.markNodePhase(node.Name, lastChildNode.Phase, message), true, nil
}
Member:

Why do we need this here? Isn't this covered by failSuspendedNodesAfterDeadlineOrShutdown()?

Contributor Author:

Without this, a retry parent node just keeps retrying pods that continually fail because they are executed after the deadline. The integration test didn't work without it.

Comment on lines +820 to +823
func (woc *wfOperationCtx) failSuspendedNodesAfterDeadlineOrShutdown() error {
if woc.wf.Spec.Shutdown != "" || (woc.workflowDeadline != nil && time.Now().UTC().After(*woc.workflowDeadline)) {
Member:

Thanks for this!

}

if selector.Matches(nodeFields) {
if selectorMatchesNode(selector, node) {
Member:

Nice!
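
A minimal sketch of what a helper like selectorMatchesNode could look like, assuming k8s.io/apimachinery field selectors; the exact fields exposed for matching are an assumption, not necessarily what the PR implements:

import (
	"k8s.io/apimachinery/pkg/fields"

	wfv1 "github.com/argoproj/argo/pkg/apis/workflow/v1alpha1"
)

// selectorMatchesNode flattens a node into a field set and applies the selector to it.
func selectorMatchesNode(selector fields.Selector, node wfv1.NodeStatus) bool {
	nodeFields := fields.Set{
		"displayName":  node.DisplayName,
		"templateName": node.TemplateName,
		"phase":        string(node.Phase),
	}
	// Input parameters could also be exposed, e.g. "inputs.parameters.<name>.value".
	return selector.Matches(nodeFields)
}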

@markterm (Contributor Author) commented Apr 7, 2020

I have just modified the PR so the retry command takes in a --reset-nodes-field-selector parameter.

Thanks @mark9white! I'll take another look.

Doing this is not ideal for Argo users

Do you mean using the --reset-nodes-field-selector approach? Would you mind explaining a bit why?

Because the person running 'retry' needs to know what selector to pass in to effectively retry the given workflow.

@markterm (Contributor Author) commented Apr 7, 2020

@simster7 I've applied the feedback.

@simster7 (Member) left a comment

LGTM! Thanks for this great PR @mark9white

// Delete/reset fields which indicate workflow completed
delete(newWF.Labels, common.LabelKeyCompleted)
newWF.Status.Conditions.UpsertCondition(wfv1.WorkflowCondition{Status: metav1.ConditionFalse, Type: wfv1.WorkflowConditionCompleted})
newWF.ObjectMeta.Labels[common.LabelKeyPhase] = string(wfv1.NodeRunning)
newWF.Status.Phase = wfv1.NodeRunning
newWF.Status.Message = ""
newWF.Status.FinishedAt = metav1.Time{}
newWF.Spec.Shutdown = ""
Member:

Nice catch!

@simster7 (Member) commented Apr 9, 2020

Hey @mark9white. Could you resolve the conflicts here please?

@markterm (Contributor Author) commented Apr 9, 2020 via email

@markterm (Contributor Author) commented Apr 9, 2020

Actually there is an issue in the e2e tests, which I'm looking into.

@@ -357,8 +364,10 @@ func (s *E2ESuite) printWorkflowDiagnostics(name string) {
s.CheckError(err)
wf.Status.Nodes = offloaded
}
logCtx.Debug("Workflow metadata:")
logCtx.Debug("Workflow metadata at %s:", time.Now().String())


printf: Debug call has possible formatting directive %s (from govet)
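
A minimal fix for that govet finding, assuming logCtx is a logrus entry, is to use the formatting variant of the call:

// Debugf applies the %s directive; plain Debug would print the format string verbatim.
logCtx.Debugf("Workflow metadata at %s:", time.Now().String())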

@@ -715,7 +753,7 @@ func RetryWorkflow(kubeClient kubernetes.Interface, repo sqldb.OffloadNodeStatus
if err != nil {
return nil, fmt.Errorf("unable to compress workflow: %s", err)
}


File is not goimports-ed with -local github.com/argoproj/argo (from goimports)

@simster7 (Member) left a comment

Sorry to renege on the approval, but because of #2645 new changes are needed.

if err != nil {
return nil, err
} else {
for _, node := range wf.Status.Nodes {
Member:

Hey @mark9white, because of #2645 this actually needs to be moved further down the code. Workflows with offloaded nodes are only retrieved starting in line 678, so if a Workflow has offloaded nodes, wf.Status.Nodes will be nil at this point in the code and the node field selector will have no effect.

While you're at this, would you mind extracting this block out to a helper function? RetryWorkflow is already a bit cluttered 🙂

Contributor Author:

Done
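
The extracted helper might look roughly like this (the name and signature are hypothetical and build on the selectorMatchesNode sketch above, not the merged code):

// getNodeIDsToReset collects the IDs of every node matched by the selector so the
// main RetryWorkflow loop can force-reset them.
func getNodeIDsToReset(selector fields.Selector, nodes wfv1.Nodes) map[string]bool {
	nodeIDsToReset := make(map[string]bool)
	for _, node := range nodes {
		if selectorMatchesNode(selector, node) {
			nodeIDsToReset[node.ID] = true
		}
	}
	return nodeIDsToReset
}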

return nil, err
} else {
for _, node := range wf.Status.Nodes {
if selectorMatchesNode(selector, node) {
Member:

Seems like this code could be included in the large for loop starting at line 690. What do you think? Adding it there could save us an iteration through all the nodes. If you do decide to add it, please make sure it's added via a helper function.

Contributor Author:

We need to determine the list of nodes including child nodes first.
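
A sketch of what collecting child nodes first could look like, assuming NodeStatus.Children holds child node IDs (hypothetical helper, not the PR's code):

// addDescendantIDs recursively marks a node and all of its descendants for reset,
// so the main loop over nodes can then handle each one in a single pass.
func addDescendantIDs(nodes wfv1.Nodes, nodeID string, ids map[string]bool) {
	ids[nodeID] = true
	for _, childID := range nodes[nodeID].Children {
		addDescendantIDs(nodes, childID, ids)
	}
}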

newNodes[node.ID] = node
continue
}
case wfv1.NodeError, wfv1.NodeFailed:
if !strings.HasPrefix(node.Name, onExitNodeName) && (node.Type == wfv1.NodeTypeDAG || node.Type == wfv1.NodeTypeStepGroup) {
if !strings.HasPrefix(node.Name, onExitNodeName) && (node.Type == wfv1.NodeTypeDAG || node.Type == wfv1.NodeTypeStepGroup) && !doForceResetNode {
Member:

Why is this && !doForceResetNode necessary here? What's the difference between "pretend as if this node never existed" and resetting it manually?

Contributor Author:

You're right, fixed

@@ -655,14 +688,19 @@ func RetryWorkflow(kubeClient kubernetes.Interface, repo sqldb.OffloadNodeStatus
}

for _, node := range nodes {
var doForceResetNode = false
Member:

Minor suggested change:

- var doForceResetNode = false
+ forceResetNode := false

Contributor Author:

Done

@markterm (Contributor Author)

I've retriggered the build as it failed on TestLogProblems (which is unrelated and looks to be flaky).

@sonarcloud bot commented Apr 12, 2020

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (A)
Vulnerabilities: 0 (A), Security Hotspots to review: 0
Code Smells: 4 (A)
Coverage: no information
Duplication: 4.3%

@markterm (Contributor Author)

I'm still seeing TestLogProblems failing, but I don't think it's related to this PR.

@simster7 merged commit 9c6351f into argoproj:master on Apr 12, 2020
@agilgur5 added the area/retry-manual label (Manual workflow "Retry" Action (API/CLI/UI); see retryStrategy for template-level retries) on Oct 10, 2024