Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support retryPolicy combination #13119

Open
tczhao opened this issue May 30, 2024 · 8 comments
Open

support retryPolicy combination #13119

tczhao opened this issue May 30, 2024 · 8 comments
Labels
area/retryStrategy Template-level retryStrategy solution/workaround There's a workaround, might not be great, but exists type/feature Feature request

Comments

@tczhao
Copy link
Member

tczhao commented May 30, 2024

Summary

What change needs making?

We want to use both OnError and OnTransientError
https://argo-workflows.readthedocs.io/en/stable/retries/#retry-policies
But only one type is supported at the moment.

We have workarounds

workaround 1 OnError + expression

  retryStrategy:
    retryPolicy: OnError
    expression: >-
      lastRetry.message contains 'Pod was terminated in response to imminent node shutdown' 

but we are facing issue #13058

workaround 2 OnTransientError + patterns from OnError

  retryStrategy:
    retryPolicy: OnTransientError
# ...
    env:
      - name: TRANSIENT_ERROR_PATTERN
        value: "(pod_deleted|post workflowtaskresults)"

issue: takes time to identify all argo internal error, and every error is a step back in SLA

Use Cases

When would you use this?
To retry on both argo internal error OnError and transient error OnTransientError


Message from the maintainers:

Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.

@tczhao tczhao added the type/feature Feature request label May 30, 2024
@tczhao
Copy link
Member Author

tczhao commented May 30, 2024

code that determines how retryPolicy is used
https://github.com/argoproj/argo-workflows/blob/v3.5.7/workflow/controller/operator.go#L1060-L1077

thoughts:
to make this backward compatible, the retryPolicy field has to be a string.
having OnErrorTransientError OnFailureTransientError seems to be an easy fix, looking for opinions

@tczhao tczhao added the area/retryStrategy Template-level retryStrategy label May 30, 2024
@tczhao tczhao changed the title support a combination of retryPolicy support retryPolicy combination May 30, 2024
@agilgur5 agilgur5 added the solution/workaround There's a workaround, might not be great, but exists label May 30, 2024
@agilgur5 agilgur5 changed the title support retryPolicy combination support retryPolicy combination May 30, 2024
@agilgur5
Copy link

agilgur5 commented May 30, 2024

I wonder if we should make the retryPolicy variants available as keywords in the expression instead? That would support any combination then. Could also potentially make retryPolicy itself no longer needed.

Also cc @Joibel who's worked on the two before (#11005) and may have an opinion

@Joibel
Copy link
Member

Joibel commented Jun 3, 2024

OnError is a strict superset of OnTransientError. So if you specify OnError you will get all the retries you'd get with OnTransientError and more, so there is never a need to use both strategies at the same time.

If you needed OnFailure + OnTransientError then perhaps an isTransient() helper to put in the expression would be useful.

Fixing #13058 seems like it would help the most though.

@tczhao
Copy link
Member Author

tczhao commented Jun 11, 2024

I think what we want is
OnError + some patterns (could come from node Error or Failure)


The current expression is not enough since its applied as a filter before evaluating retryPolicy.

fixing #13058 is not enough for this usecase
e.g.

  • retryAlways + expression (need to list out all possible onError related failure)
  • retryOnError + expression (can not support node Failure situation)
  • retryOnFailure + expression (can not support node Error situation)
  • retryOnTransientError + expression (need to list out all possible onError related failure)

@tczhao
Copy link
Member Author

tczhao commented Jun 11, 2024

I think we could have keywords/parameters,

  • retryPolicy.OnFailure
  • retryPolicy.OnError
  • retryPolicy.OnTransientError

in the expression, this should be sufficient for all usecase

@agilgur5
Copy link

agilgur5 commented Jun 11, 2024

I think we could have keywords/parameters,

Yea that's more or less what I was suggesting earlier

@tczhao
Copy link
Member Author

tczhao commented Jun 12, 2024

I think what we want is OnError + some patterns (could come from node Error or Failure)

The current expression is not enough since its applied as a filter before evaluating retryPolicy.

fixing #13058 is not enough for this usecase e.g.

  • retryAlways + expression (need to list out all possible onError related failure)
  • retryOnError + expression (can not support node Failure situation)
  • retryOnFailure + expression (can not support node Error situation)
  • retryOnTransientError + expression (need to list out all possible onError related failure)

Some corrections
fixing #13058 is enough for this use case
e.g.
retryAlways + expression(lastRetry.status=="Errored" or custom-pattern)

If you needed OnFailure + OnTransientError then perhaps an isTransient() helper to put in the expression would be useful.

This is correct.
This is the only edge case where the current implementation is not able to support

@tczhao
Copy link
Member Author

tczhao commented Jun 18, 2024

Hi @Joibel
since

OnError is a strict superset of OnTransientError

Could we include TRANSIENT_ERROR_PATTERN as part of OnError as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/retryStrategy Template-level retryStrategy solution/workaround There's a workaround, might not be great, but exists type/feature Feature request
Projects
None yet
Development

No branches or pull requests

3 participants