
fix: retry large archived wf. Fixes #12740 #12741

Merged
merged 17 commits into argoproj:main from heidongxianhua:fix_retry_large_archived_wf on Apr 28, 2024

Conversation

heidongxianhua
Contributor

@heidongxianhua heidongxianhua commented Mar 5, 2024

Fixes #12740

Motivation

Modifications

Verification

@heidongxianhua heidongxianhua changed the title [WIP] Fix retry large archived wf [WIP] fix: retry large archived wf Mar 5, 2024
@shuangkun shuangkun self-assigned this Mar 5, 2024
@shuangkun shuangkun added area/workflow-archive area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries labels Mar 5, 2024
@agilgur5 agilgur5 marked this pull request as draft March 6, 2024 01:35
@heidongxianhua heidongxianhua changed the title [WIP] fix: retry large archived wf fix: retry large archived wf Mar 11, 2024
@heidongxianhua heidongxianhua marked this pull request as ready for review March 12, 2024 01:52

wf.ObjectMeta.ResourceVersion = ""
wf.ObjectMeta.UID = ""
result, err := wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Create(ctx, wf, metav1.CreateOptions{})
if err != nil {
return nil, sutils.ToStatusError(err, codes.Internal)
}
if large {
offloadedNodes, err := w.offloadNodeStatusRepo.Get(string(oriUid), wf.GetOffloadNodeStatusVersion())
Member

what if wf.GetOffloadNodeStatusVersion() is "" ?

Contributor Author

Retrying an archived workflow creates a new workflow, which is different from retrying a non-archived one. When retrying a large archived workflow, we first need to Dehydrate the wf (offload the node status); the offloaded node status is keyed by the workflow's UID. We then create a new workflow from the dehydrated one. The controller reads the offloaded node status by the UID of the newly created workflow, so we copy the node status of the archived workflow and save it under the new workflow's UID. Since we can only get that UID after creating the new workflow, the controller may process the new workflow before we have updated the node status with the new UID, so we catch the DB error and make the controller retry.
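The flow described above can be sketched as a small standalone Go program. The types and names here (`offloadRepo`, `retryArchived`) are illustrative stand-ins for the real Argo Workflows APIs, not the actual implementation:

```go
package main

import "fmt"

// offloadRepo stands in for the offloaded node status table: it maps a
// workflow UID to its serialized node status. Hypothetical type for
// illustration; the real repo is offloadNodeStatusRepo.
type offloadRepo map[string]string

// retryArchived sketches the sequence from the comment above:
// 1. Dehydrate: offload node status, keyed by the archived workflow's UID.
// 2. Create the new workflow; its UID is known only after creation.
// 3. Re-save the offloaded node status under the new UID so the
//    controller, which looks it up by the new workflow's UID, finds it.
func retryArchived(repo offloadRepo, oldUID, nodeStatus string, create func() string) string {
	repo[oldUID] = nodeStatus   // step 1: offload under the old UID
	newUID := create()          // step 2: API server assigns a fresh UID
	repo[newUID] = repo[oldUID] // step 3: copy the node status to the new key
	return newUID
}

func main() {
	repo := offloadRepo{}
	uid := retryArchived(repo, "old-uid", "node-status-json", func() string { return "new-uid" })
	fmt.Println(uid, repo[uid]) // new-uid node-status-json
}
```

The race the author mentions lives between steps 2 and 3: the controller may look up the new UID before step 3 has committed, which is why that DB error is treated as transient.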

Member

I know we should get offloadedNodes. But if the version is "", we get this error:

time="2024-03-12T14:01:23.864Z" level=info msg="Workflow to be dehydrated" Workflow Size=3361331
time="2024-03-12T14:01:24.370Z" level=debug msg="Getting offloaded nodes" uid=06ab13ed-7e5a-4a0c-aeae-40c19f196bc5 version=
time="2024-03-12T14:01:24.371Z" level=error msg="finished unary call with code Internal" error="rpc error: code = Internal desc = upper: no more rows in this result set" grpc.code=Internal grpc.method=RetryArchivedWorkflow grpc.service=workflowarchive.ArchivedWorkflowService grpc.start_time="2024-03-12T14:01:23Z" grpc.time_ms=1242.014 span.kind=server system=grpc

You can test it in the Argo Server UI.

Member

Maybe we don't need to check "large", because Dehydrate -> CompressWorkflowIfNeeded -> IsLarge already does.

Contributor Author
@heidongxianhua heidongxianhua Mar 15, 2024

Maybe we don't need to check "large", because Dehydrate -> CompressWorkflowIfNeeded -> IsLarge already does.

Yeah, I have removed the 'large' check, and use w.hydrator.IsHydrated(wf) to check whether it has actually been dehydrated.

@@ -37,7 +37,8 @@ func IsTransientErr(err error) bool {
 	apierr.IsServiceUnavailable(err) ||
 	isTransientEtcdErr(err) ||
 	matchTransientErrPattern(err) ||
-	errors.Is(err, NewErrTransient(""))
+	errors.Is(err, NewErrTransient("")) ||
+	strings.Contains(err.Error(), "upper: no more rows in ")
Member

When will this error occur? It's better to build a function like "matchTransientErrPattern" or "isTransientEtcdErr"; maybe we can add one like "isTransientSqbErr".

Contributor Author

Yeah, the root cause is as mentioned above, and I will add the function isTransientSqbErr along with a test for it.

Member

Where did this error occur? Is there an example to reproduce?

Contributor Author

Yeah, I ran it locally with make start UI=true PROFILE=mysql ALWAYS_OFFLOAD_NODE_STATUS=true, submitted a workflow and stopped it after one step succeeded; then, after it was archived, I retried it. (If you remove this error catch, the retry fails here: https://github.com/argoproj/argo-workflows/blob/main/workflow/controller/controller.go#L854)

Contributor Author
@heidongxianhua heidongxianhua Mar 13, 2024

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-

spec:
  entrypoint: whalesay
  templates:
  - name: whalesay
    steps:
    - - name: run-pod1
        template: run-pod
      - name: run-pod2
        template: run-pod
      - name: run-pod3
        template: run-pod

    - - name: run-pod2-sleep
        template: run-pod-sleep
  - name: run-pod
    container:
      image: docker/whalesay:latest
      command: [python3, -c]
      args: ["print('gfn')"]
  - name: run-pod-sleep
    container:
      image: docker/whalesay:latest
      command: [python3, -c]
      args: ["print('gfn')\nimport time\ntime.sleep(100)"]
  ttlStrategy:
    secondsAfterCompletion: 3
  podGC:
    strategy: OnPodCompletion

Member

It may also occur in older versions: #2333

Contributor Author
@heidongxianhua heidongxianhua Apr 1, 2024

Yeah, I have checked this issue before. I got the error with version v3.5.0, and the latest commit also has this error. I tested it locally following https://argo-workflows.readthedocs.io/en/latest/running-locally/. Did you add the env ALWAYS_OFFLOAD_NODE_STATUS=true? @shuangkun

Member

I used make start UI=true PROFILE=mysql ALWAYS_OFFLOAD_NODE_STATUS=true to run the workflow-controller and server, and I'm on the latest commit.

Contributor Author
@heidongxianhua heidongxianhua Apr 1, 2024

And I just pulled my branch heidongxianhua:fix_retry_large_archived_wf and rebased it on [argoproj:main](https://github.com/argoproj/argo-workflows/tree/main). If I add the error check https://github.com/argoproj/argo-workflows/pull/12741/files#diff-c2fc14b8209201e80dbf52f4293639d125ce3b6df398ebc93f5ee4e1abf05842R41, it runs OK and server.log contains this, while there is no error in controller.log:
[screenshot]

But if I delete this check (and run make clean and make start UI=true PROFILE=mysql ALWAYS_OFFLOAD_NODE_STATUS=true), we get the error in controller.log and it loses the node:
[screenshot]
[screenshot]
@shuangkun

Member

OK, thanks! I reproduced it. LGTM

Signed-off-by: heidongxianhua <18207133434@163.com>
@shuangkun
Member

Please change the title to add "Fixes #12740" at the end, and push an empty commit to re-trigger CI.

@heidongxianhua heidongxianhua changed the title fix: retry large archived wf fix: retry large archived wf. Fixes #12740 Apr 1, 2024
@heidongxianhua
Contributor Author

Yeah, I have changed the title and re-triggered the CI; could you have a look? @shuangkun

@shuangkun shuangkun added the prioritized-review For members of the Sustainability Effort label Apr 2, 2024
@shuangkun shuangkun self-requested a review April 2, 2024 03:29
@heidongxianhua
Contributor Author

Thanks, but it has not been merged yet; could you help merge it? @shuangkun

@shuangkun
Member

Thanks, but it has not been merged yet; could you help merge it? @shuangkun

Sorry, I don't have permission. I just gave it a high priority for review. @juliev0 @isubasinghe Can you help take a look? Thanks!

@juliev0 juliev0 assigned juliev0 and unassigned juliev0 Apr 2, 2024
@juliev0 juliev0 self-assigned this Apr 6, 2024
@juliev0
Contributor

juliev0 commented Apr 6, 2024

okay, I'll try to take a look at this one later

@juliev0 juliev0 removed their assignment Apr 9, 2024
@juliev0
Contributor

juliev0 commented Apr 9, 2024

@sarabala1979 Is it okay if I assign you to take a look at this? I see you worked on offloadNodeStatusRepo for workflow_server.go before.

@heidongxianhua
Contributor Author

Sorry to bother you, but could you have a look at this PR? @sarabala1979

@juliev0 juliev0 assigned juliev0 and unassigned sarabala1979 Apr 26, 2024
wf.ObjectMeta.ResourceVersion = ""
wf.ObjectMeta.UID = ""
result, err := wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Create(ctx, wf, metav1.CreateOptions{})
if err != nil {
return nil, sutils.ToStatusError(err, codes.Internal)
}
if !w.hydrator.IsHydrated(wf) {
Contributor

I assume this if statement is just in case? there isn't any reason why it shouldn't be dehydrated, right?

Contributor

(nothing wrong with that of course...just trying to make sure I understand)

Contributor Author
@heidongxianhua heidongxianhua Apr 27, 2024

In line 307 (https://github.com/argoproj/argo-workflows/pull/12741/files#diff-f854ed23b7da7ee7a4f8bd0dc4e2114f284610f55f9604b1c09b37c0fe35d933R307) we call w.hydrator.Dehydrate(wf) to dehydrate the wf, but within that function it automatically determines whether dehydration really needs to be performed (according to the size of the wf, https://github.com/argoproj/argo-workflows/blob/main/workflow/hydrator/hydrator.go#L100), so we should add this if statement. @juliev0
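The decision described here can be illustrated with a small sketch. `shouldOffload` and the 1 MiB threshold are assumptions for illustration only; the real check lives in the hydrator code linked above:

```go
package main

import "fmt"

// shouldOffload sketches the decision Dehydrate makes internally: offload
// node status only when the workflow is too large, or when
// ALWAYS_OFFLOAD_NODE_STATUS forces it. Threshold is an assumed value,
// not the real limit.
func shouldOffload(wfSizeBytes int, alwaysOffload bool) bool {
	const maxWorkflowSize = 1024 * 1024 // assumed 1 MiB threshold for illustration
	return alwaysOffload || wfSizeBytes > maxWorkflowSize
}

func main() {
	// A small workflow stays hydrated, so the post-create re-keying of
	// offloaded nodes must be guarded, hence the !w.hydrator.IsHydrated(wf)
	// check in the diff under review.
	fmt.Println(shouldOffload(10_000, false))    // false: small, stays hydrated
	fmt.Println(shouldOffload(3_361_331, false)) // true: size from the log earlier in the thread
	fmt.Println(shouldOffload(10_000, true))     // true: forced by env var
}
```

This is why the guard cannot be a simple "large" flag computed up front: only the hydrator knows whether offloading actually happened.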

Contributor

got it. Okay, please add some comments around all of this code, because I think it will be hard otherwise for people to understand the purpose.

@juliev0
Contributor

juliev0 commented Apr 26, 2024

Can you add comments to the code?

Signed-off-by: heidongxianhua <18207133434@163.com>
@@ -299,12 +303,30 @@ func (w *archivedWorkflowServer) RetryArchivedWorkflow(ctx context.Context, req
}
}

log.WithFields(log.Fields{"Dehydrate workflow uid=": wf.UID}).Info("RetryArchivedWorkflow")
// It will dehydrate the wf actually only if needed (when the size of workflow is too large or set the env `ALWAYS_OFFLOAD_NODE_STATUS`=true)
Contributor

Suggested change
// It will dehydrate the wf actually only if needed (when the size of workflow is too large or set the env `ALWAYS_OFFLOAD_NODE_STATUS`=true)
// If the Workflow needs to be dehydrated in order to capture and retain all of the previous state for the subsequent workflow, then do so

Contributor

do you think this is accurate?

wf.ObjectMeta.ResourceVersion = ""
wf.ObjectMeta.UID = ""
result, err := wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Create(ctx, wf, metav1.CreateOptions{})
if err != nil {
return nil, sutils.ToStatusError(err, codes.Internal)
}
// determine whether th wf is `Dehydrated` actually before
Contributor

Suggested change
// determine whether th wf is `Dehydrated` actually before
// if the Workflow was dehydrated before, we need to capture and maintain its previous state for the new Workflow

Contributor

is this a correct statement?

Signed-off-by: heidongxianhua <18207133434@163.com>
@juliev0 juliev0 merged commit 6182386 into argoproj:main Apr 28, 2024
27 checks passed
yyzxw added a commit to yyzxw/argo-workflows that referenced this pull request Apr 28, 2024
Signed-off-by: heidongxianhua <18207133434@163.com>
Signed-off-by: xiaowu.zhu <xiaowu.zhu@daocloud.io>
agilgur5 pushed a commit that referenced this pull request May 4, 2024
Signed-off-by: heidongxianhua <18207133434@163.com>
(cherry picked from commit 6182386)
Successfully merging this pull request may close these issues.

cannot retry large archived workflow that needs offloading