cannot retry large archived workflow that needs offloading #12740

Closed
3 of 4 tasks
heidongxianhua opened this issue Mar 5, 2024 · 5 comments · Fixed by #12741
Labels
area/retry-manual (Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries), area/workflow-archive, P3 (Low priority), type/bug

Comments

@heidongxianhua
Contributor

heidongxianhua commented Mar 5, 2024

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

When I retry a large archived workflow that has failed, the retry itself fails with a server error.
[image]
I have set offloadNodeStatus=true in the ConfigMap. According to the related code, offloadNodeStatus only takes effect for workflows that are not archived. For an archived workflow it has no effect: when the workflow is archived, it fetches all node information from nodeOffloadRepo if needed and saves it to the argo_archived_workflows table. When this workflow is then retried, a new workflow is created with the complete node information instead of using nodeOffloadRepo, so the request fails.
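
The failure mode described above can be sketched roughly as follows. This is only an illustration, not Argo's code: the `Workflow` struct, `offloadRepo`, `submit`, `largeWorkflow`, and the 1 MiB `maxObjectBytes` limit are all hypothetical stand-ins. It shows that the same oversized object is rejected when created with its nodes inline, and accepted when the nodes are moved to a separate store first.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// maxObjectBytes is an assumed size limit, used for illustration only.
const maxObjectBytes = 1 << 20

var errTooLarge = errors.New("request is too large")

// Workflow is a hypothetical, heavily simplified workflow object.
type Workflow struct {
	Name               string            `json:"name"`
	Nodes              map[string]string `json:"nodes,omitempty"`
	OffloadNodeVersion string            `json:"offloadNodeStatusVersion,omitempty"`
}

// offloadRepo is a hypothetical in-memory stand-in for nodeOffloadRepo.
type offloadRepo map[string]map[string]string

// submit rejects an oversized workflow unless an offload repo is available,
// in which case the nodes are stored there and dropped from the object.
func submit(wf *Workflow, repo offloadRepo) error {
	data, err := json.Marshal(wf)
	if err != nil {
		return err
	}
	if len(data) <= maxObjectBytes {
		return nil
	}
	if repo == nil {
		return errTooLarge
	}
	repo[wf.Name] = wf.Nodes
	wf.Nodes = nil
	wf.OffloadNodeVersion = wf.Name
	return nil
}

// largeWorkflow builds a workflow whose node status is roughly 2 MB.
func largeWorkflow() *Workflow {
	nodes := make(map[string]string)
	for i := 0; i < 2000; i++ {
		nodes[fmt.Sprintf("node-%d", i)] = strings.Repeat("x", 1024)
	}
	return &Workflow{Name: "hello-world-retry", Nodes: nodes}
}

func main() {
	// Retry of an archived workflow as described: nodes inline, no offloading.
	fmt.Println(submit(largeWorkflow(), nil)) // request is too large
	// Same workflow with offloading applied before creation.
	fmt.Println(submit(largeWorkflow(), offloadRepo{})) // <nil>
}
```

In this sketch the fix amounts to running the archived-retry path through the same offload step that a non-archived workflow already gets.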

Version

V3.5.0

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: whalesay
  templates:
  - name: whalesay
    steps:
    - - name: run-pod1
        template: run-pod
      - name: run-pod2
        template: run-pod
      - name: run-pod3
        template: run-pod
    - - name: run-pod4
        template: run-pod
      - name: run-pod5
        template: run-pod
      - name: run-pod6
        template: run-pod
    - - name: run-pod7
        template: run-pod
      - name: run-pod8
        template: run-pod
      - name: run-pod9
        template: run-pod
    - - name: run-pod10
        template: run-pod
      - name: run-pod11
        template: run-pod
      - name: run-pod12
        template: run-pod
    - - name: run-pod2-sleep
        template: run-pod-sleep

  - name: run-pod
    steps:
    - - name: run-pod1
        template: run-pod-1
      - name: run-pod2
        template: run-pod-1
      - name: run-pod3
        template: run-pod-1
    - - name: run-pod4
        template: run-pod-1
      - name: run-pod5
        template: run-pod-1
      - name: run-pod6
        template: run-pod-1
  - name: run-pod-1
    steps:
    - - name: run-pod1
        template: run-pod-2
      - name: run-pod2
        template: run-pod-2
      - name: run-pod3
        template: run-pod-2
    - - name: run-pod4
        template: run-pod-2
      - name: run-pod5
        template: run-pod-2
      - name: run-pod6
        template: run-pod-2
  - name: run-pod-2
    steps:
    - - name: run-pod1
        template: run-pod-3
      - name: run-pod2
        template: run-pod-3
      - name: run-pod3
        template: run-pod-3
    - - name: run-pod4
        template: run-pod-3
      - name: run-pod5
        template: run-pod-3
      - name: run-pod6
        template: run-pod-3
  - name: run-pod-3
    steps:
    - - name: run-pod1
        template: run-pod-4
      - name: run-pod2
        template: run-pod-4
      - name: run-pod3
        template: run-pod-4
    - - name: run-pod4
        template: run-pod-4
      - name: run-pod5
        template: run-pod-4
      - name: run-pod6
        template: run-pod-4
  - name: run-pod-4
    steps:
    - - name: run-pod1
        template: run-pod-5
      - name: run-pod2
        template: run-pod-5
      - name: run-pod3
        template: run-pod-5
    - - name: run-pod4
        template: run-pod-5
      - name: run-pod5
        template: run-pod-5
      - name: run-pod6
        template: run-pod-5
  - name: run-pod-5
    container:
      image: docker/whalesay:latest
      command: [python3, -c]
      args: ["print('gfn')"]
  - name: run-pod-sleep # while this step is running, the workflow needs to be stopped
    container:
      image: docker/whalesay:latest
      command: [python3, -c]
      args: ["print('gfn')\nimport time\ntime.sleep(100)"]
  ttlStrategy:
    secondsAfterCompletion: 3
  podGC:
    strategy: OnPodCompletion
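
To see why this workflow is large, the fan-out of the spec above can be counted. The sketch below is an illustration only: the `children` table is transcribed by hand from the templates above, and `countPods` is a hypothetical helper that counts container (pod) nodes, ignoring the extra step and step-group nodes a real workflow status would also contain.

```go
package main

import "fmt"

// children maps each steps template to the templates it invokes and how many
// times, transcribed from the workflow spec above.
var children = map[string]map[string]int{
	"whalesay":  {"run-pod": 12, "run-pod-sleep": 1},
	"run-pod":   {"run-pod-1": 6},
	"run-pod-1": {"run-pod-2": 6},
	"run-pod-2": {"run-pod-3": 6},
	"run-pod-3": {"run-pod-4": 6},
	"run-pod-4": {"run-pod-5": 6},
}

// countPods returns the number of pod nodes created by one invocation of the
// named template. Templates absent from children are container templates and
// count as one pod.
func countPods(name string) int {
	steps, ok := children[name]
	if !ok {
		return 1
	}
	total := 0
	for child, n := range steps {
		total += n * countPods(child)
	}
	return total
}

func main() {
	fmt.Println(countPods("run-pod"))  // 7776 pods under a single run-pod step
	fmt.Println(countPods("whalesay")) // 93313 pods in the whole workflow
}
```

Each run-pod step expands to 6^5 = 7776 pods, so the node status of the full workflow is far larger than a single Kubernetes object can hold.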

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

argo server:

level=error msg="finished unary call with code Internal" error="rpc error: code = Internal desc = etcdserver: request is too large" grpc.code=Internal grpc.method=RetryArchivedWorkflow grpc.service=workflowarchive.ArchivedWorkflowService grpc.start_time="2024-03-04T11:54:36Z" grpc.time_ms=282.849 span.kind=server system=grpc

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@shuangkun shuangkun self-assigned this Mar 5, 2024
@shuangkun
Member

This means your object is larger than 1MB; see https://github.com/kubernetes/kubernetes/blob/db1990f48b92d603f469c1c89e2ad36da1b74846/test/integration/master/synthetic_master_test.go#L315. We have encountered similar problems before, such as "Request entity too large: limit is 3145728".

@heidongxianhua
Contributor Author

heidongxianhua commented Mar 5, 2024

@shuangkun
Yes. When I cut some nodes, the error is the one in expectedMsgFor1MB := 'etcdserver: request is too large'. Retry is supported when offloadNodeStatus=true is set and the workflow has not been archived, but once the workflow is archived, retry is not supported. Maybe we should also compress the node status and store it in /status/compressedNodes for archived workflows, as is done for non-archived workflows?

@shuangkun
Member

shuangkun commented Mar 5, 2024

> Maybe we should also compress the node status and store it in /status/compressedNodes for archived workflows, as is done for non-archived workflows?

Yes, we can make some related improvements to see if it can work in a large workflow.

@heidongxianhua
Contributor Author

heidongxianhua commented Mar 5, 2024

> Yes, we can make some related improvements to see if it can work in a large workflow.

@shuangkun Thanks. I have just opened a draft PR, #12741. It may be too hacky, but it works well for retrying large archived workflows.

@shuangkun shuangkun added area/workflow-archive area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries labels Mar 5, 2024
@shuangkun
Member

OK, I will take a look. Thanks!

@agilgur5 agilgur5 changed the title from "support large workflow: can not retry large archived workflow" to "cannot retry large archived workflow that needs offloading" Mar 6, 2024
@agilgur5 agilgur5 added the P3 Low priority label Mar 6, 2024
juliev0 pushed a commit that referenced this issue Apr 28, 2024
Signed-off-by: heidongxianhua <18207133434@163.com>
yyzxw added a commit to yyzxw/argo-workflows that referenced this issue Apr 28, 2024
Signed-off-by: heidongxianhua <18207133434@163.com>
Signed-off-by: xiaowu.zhu <xiaowu.zhu@daocloud.io>
agilgur5 pushed a commit that referenced this issue May 4, 2024
Signed-off-by: heidongxianhua <18207133434@163.com>
(cherry picked from commit 6182386)
tczhao pushed a commit to tczhao/argo that referenced this issue Nov 24, 2024
Signed-off-by: heidongxianhua <18207133434@163.com>