Retry failed workflow with ttl deleted after initial secondsAfterFailure while still running #12636
So using archived workflows is an important piece of this. Updates to the live and archived workflows naturally have a race condition -- the archived version may lag behind the live one. Does the archived workflow's old status persist after, say, 10 minutes? Or does it then match the new status?
Not sure if you have a typo here or not -- it sounds like it hit the TTL during the suspend step. The TTL should take the retried, running state into account.
Once a workflow fails and has the secondsAfterFailure value configured, the TTL applies. You can see an example in the attached traces:
And the workflow is deleted 5 minutes later.
But if we launch a retry command and the workflow goes to the "Running" state, the workflow is still deleted even while it is running, as you can see in the logs. This behaviour occurs whether archiving is enabled or disabled. I would expect that if a retry is performed on a failed workflow and the workflow goes to the "Running" state, the workflow should not be deleted, because the retry operation has fixed the error and the workflow is running.
Thanks for investigating this behavior more. Per your analysis, it sounds like the suspend step is unrelated. Could you also answer the question I had regarding the Workflow Archive? Does it eventually update to the retried Workflow? If not, that might be another bug, an unhandled race condition.
Correct, the suspend step is unrelated.
It is not updated to the retried workflow; the failed workflow is displayed in the Archived tab. I attached two related images in my initial comment.
Yes, I saw those, but I was wondering if the Archived Workflow might update after, say, 10 more minutes -- i.e. whether the last screenshot stays the same or changes to match the retried Workflow. Per your response, it stays the same, so it seems like there's a secondary unhandled race condition here. Although fixing the TTL issue might resolve that race as well; the TTL GC is not anticipating an incomplete Workflow, so in this case GC is happening before archiving (since only completed Workflows get archived).
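To illustrate the ordering being described, here is a toy sketch (an assumption-laden illustration only: plain maps stand in for the live cluster and the Workflow Archive, and none of these names come from the actual controller or archiver code):

```go
package main

import "fmt"

// workflow is a toy stand-in for a live Workflow object.
type workflow struct {
	name       string
	phase      string // "Running", "Failed", "Succeeded", ...
	ttlExpired bool
}

// archive stores only completed Workflows, mirroring "only completed
// Workflows get archived".
func archive(archived map[string]string, wf workflow) {
	if wf.phase == "Failed" || wf.phase == "Error" || wf.phase == "Succeeded" {
		archived[wf.name] = wf.phase
	}
}

// ttlGC deletes the live object once its TTL timer fires, without re-checking
// the current phase -- the behavior under discussion.
func ttlGC(live map[string]workflow, wf workflow) {
	if wf.ttlExpired {
		delete(live, wf.name)
	}
}

func main() {
	live := map[string]workflow{}
	archived := map[string]string{}

	// 1. The Workflow fails: it is archived as Failed and its TTL starts.
	wf := workflow{name: "wf-1", phase: "Failed"}
	live[wf.name] = wf
	archive(archived, wf)

	// 2. A retry brings it back to Running; the archiver skips incomplete Workflows.
	wf.phase = "Running"
	live[wf.name] = wf
	archive(archived, wf)

	// 3. The TTL fires anyway and deletes the live, running Workflow.
	wf.ttlExpired = true
	ttlGC(live, wf)

	fmt.Println("live:", live)         // live: map[]
	fmt.Println("archived:", archived) // archived: map[wf-1:Failed]
}
```

The archive keeps the last Failed snapshot while the live, retried Workflow gets garbage-collected, which matches the screenshots above.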
@agilgur5 @manuelbmar We are also facing the same issue. We have set a TTL of 7 days for the workflow. Say the workflow failed 7 days ago and was retried 2 days ago, and it is in the Running state; the workflow was still deleted. Is there any resolution for this issue? It seems like a critical issue.
@agilgur5 Any updates? Are there any open issues or pull requests being worked on to resolve the problem?
This only occurs as a race condition with a combination of several features. Retries in particular should not be used frequently (as that would suggest there is an issue with the tasks themselves that should be fixed) and are also one of the most complex areas of the codebase. I.e. this is a low frequency + high complexity issue.
If there were updates, they would already be in the thread. Please follow proper open source etiquette.
You are also more than welcome to contribute, as you checked that you'd like to.
Clarification here: archiving only occurs for a completed Workflow, so this is a single bug. The solution is still likely to be to remove a Workflow from the TTL queue when it is retried.
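For illustration, a minimal sketch of what removing a retried Workflow from TTL consideration could look like. Since a pending item cannot simply be pulled back out of a delaying queue, this hypothetical version records retried keys in a side set that the delete worker consults when a TTL expires; the types and method names are invented for the example, not taken from the controller:

```go
package main

import (
	"fmt"
	"sync"
)

// ttlController is a hypothetical stand-in for the part of the controller
// that garbage-collects Workflows whose TTL has expired.
type ttlController struct {
	mu      sync.Mutex
	retried map[string]bool // "namespace/name" keys retried after being enqueued
}

func newTTLController() *ttlController {
	return &ttlController{retried: make(map[string]bool)}
}

// onRetry records that a failed Workflow was retried, so its pending TTL
// deletion should no longer apply.
func (c *ttlController) onRetry(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.retried[key] = true
}

// processKey is called when the delayed queue hands back a key whose TTL has
// expired; retried keys are skipped instead of deleted.
func (c *ttlController) processKey(key string) {
	c.mu.Lock()
	skip := c.retried[key]
	delete(c.retried, key)
	c.mu.Unlock()

	if skip {
		fmt.Println("skipping TTL delete, workflow was retried:", key)
		return
	}
	fmt.Println("deleting workflow:", key)
}

func main() {
	c := newTTLController()
	c.onRetry("argo/retried-wf")
	c.processKey("argo/retried-wf") // skipped: was retried
	c.processKey("argo/failed-wf")  // deleted: TTL expired, never retried
}
```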
Hi @agilgur5 @manuelbmar, we encountered a similar issue. We couldn't find a good way to remove retried workflows from the TTL queue, as there seems to be no built-in method for deleting elements from the delayed queue. As a temporary workaround, we have implemented a somewhat inelegant bypass, as in #12905, which incurs a query cost before the deletion operation, but it effectively addresses our current issue. Regarding the idea of removing elements from the queue: we would like to ask whether this direction is feasible.
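To make that trade-off concrete, here is a minimal, self-contained sketch of such a "query before delete" guard, assuming the TTL worker can re-fetch the live Workflow's phase right before acting. The types and function names below are illustrative placeholders, not the actual argo-workflows client API or the exact change in #12905:

```go
package main

import "fmt"

// WorkflowPhase mirrors the terminal/non-terminal phases a Workflow reports.
type WorkflowPhase string

const (
	PhaseRunning   WorkflowPhase = "Running"
	PhaseSucceeded WorkflowPhase = "Succeeded"
	PhaseFailed    WorkflowPhase = "Failed"
	PhaseError     WorkflowPhase = "Error"
)

// completed reports whether a phase is terminal.
func completed(p WorkflowPhase) bool {
	return p == PhaseSucceeded || p == PhaseFailed || p == PhaseError
}

// liveGetter abstracts the extra API query: fetch the current phase of the
// live Workflow just before the TTL deletion fires.
type liveGetter func(namespace, name string) (WorkflowPhase, error)

// shouldDelete re-checks the live object: if the Workflow was retried and is
// running again, the expired TTL entry is ignored.
func shouldDelete(get liveGetter, namespace, name string) (bool, error) {
	phase, err := get(namespace, name)
	if err != nil {
		return false, err // e.g. already gone; nothing to do
	}
	return completed(phase), nil
}

func main() {
	// Simulated lookup: the Workflow was retried and is Running again.
	get := func(ns, name string) (WorkflowPhase, error) { return PhaseRunning, nil }

	del, err := shouldDelete(get, "argo", "retried-wf")
	fmt.Println(del, err) // false <nil> -- the retried, running Workflow is kept
}
```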
Yea that could potentially add quite a lot of queries, since it's one more for every deletion 😕
We actually are implementing something similar in #12734 (see also #12538).
Pre-requisites
What happened/what did you expect to happen?
I am retrying a failed workflow that has steps set up with manual approval and ttl secondsAfterFailure.
Step-0 failed in the first execution. I made a call to the /retry method; the failed step has now completed successfully and I managed to advance our workflow correctly.
The workflow is now waiting for the next step with manual approval, as you can see in the screenshot.
But once the TTL secondsAfterFailure period has elapsed, a "workflow gone" message is displayed and the workflow status shown in the UI is the one from before the /retry action.
"workflow gone" image :
"archived workflows" image :
I think ttl secondsAfterFailure is not taking into account that the workflow is running, as it has a "suspend" step waiting for approval.
[edited by agilgur5: suspend is unrelated / a red herring in this case, see below]
Version
v3.4.11
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container