[backend] persistence agent don't update pipeline run status after workflow deleted #5722

Open
algs opened this issue May 23, 2021 · 7 comments
Labels
area/backend help wanted The community is welcome to contribute. kind/bug lifecycle/frozen

Comments

@algs
Contributor

algs commented May 23, 2021

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?
  • KFP version: 1.5.0
  • KFP SDK version:

Steps to reproduce

  1. Launch a pipeline run and confirm in the UI that the node statuses update as expected.
  2. Run kubectl delete workflow xxx to delete the workflow corresponding to the pipeline run. The unfinished nodes in the pipeline run stay in the running state and are never updated.
  3. The workflow can also be deleted indirectly: when we delete the scheduledworkflow, the workflow left without its parent scheduledworkflow gets garbage collected.

Expected result

The running nodes in the pipeline run are marked as terminated.

Materials and Reference


Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

cc @Jeffwan

@zijianjoy zijianjoy added the help wanted The community is welcome to contribute. label May 26, 2021
@Bobgy
Contributor

Bobgy commented May 26, 2021

Thank you @algs for reporting this bug!
This is a real problem with the persistence agent design.

My rough idea for an improvement: the persistence agent should also periodically list all runs from the KFP API that are still in the running state. If any of those runs does not have a corresponding Argo workflow, we should update the DB record, perhaps to a special state such as "WORKFLOW_DELETED".
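
The periodic check described above could look roughly like the following. This is only a sketch: the Run class, the reconcile_orphaned_runs function, and the "Running" status string are hypothetical stand-ins, the two inputs stand in for a KFP API list call and an Argo workflow list call, and "WORKFLOW_DELETED" is the proposed state name, not an existing KFP status.

```python
from dataclasses import dataclass

# Proposed state name from the comment above; not an existing KFP status.
WORKFLOW_DELETED = "WORKFLOW_DELETED"


@dataclass
class Run:
    """Hypothetical, simplified view of a run record in the DB."""
    run_id: str
    workflow_name: str
    status: str


def reconcile_orphaned_runs(running_runs, existing_workflow_names):
    """Mark running runs whose workflow no longer exists.

    running_runs: runs the DB still reports as running
        (stand-in for listing runs through the KFP API).
    existing_workflow_names: names of workflows currently in the cluster
        (stand-in for listing Argo workflows).
    Returns the list of runs whose status was changed.
    """
    existing = set(existing_workflow_names)
    updated = []
    for run in running_runs:
        if run.status == "Running" and run.workflow_name not in existing:
            run.status = WORKFLOW_DELETED
            updated.append(run)
    return updated


if __name__ == "__main__":
    runs = [
        Run("run-1", "wf-a", "Running"),
        Run("run-2", "wf-b", "Running"),
    ]
    # Only wf-a still exists, so run-2 is treated as orphaned.
    orphaned = reconcile_orphaned_runs(runs, ["wf-a"])
    print([r.run_id for r in orphaned])
```

A real agent would run this on a timer, and each pass costs one list call against the KFP API and one against the Kubernetes apiserver.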

What do you think?

@DaleSin

DaleSin commented May 27, 2021

Maybe just list the runs that have not been updated/reported for a certain time; that should be only a few runs.
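
As a rough illustration of that filter, the sketch below selects runs by the age of their last report. The function name stale_run_ids, the last_reported mapping, and the one-hour threshold are assumptions made for the example and do not come from the KFP codebase. Staleness alone does not show that the workflow was deleted, so the result is only a candidate list that still needs to be checked against the cluster.

```python
from datetime import datetime, timedelta, timezone


def stale_run_ids(last_reported, now, max_age):
    """Return the IDs of runs whose last report is older than max_age.

    last_reported: dict mapping run ID -> datetime of the last status report
        (hypothetical; stands in for a timestamp column in the run table).
    now: the current time, passed in so the function is easy to test.
    max_age: timedelta after which a run counts as stale.
    """
    return sorted(
        run_id for run_id, ts in last_reported.items() if now - ts > max_age
    )


if __name__ == "__main__":
    now = datetime(2021, 5, 27, 12, 0, tzinfo=timezone.utc)
    reports = {
        "run-1": now - timedelta(minutes=5),  # recently reported
        "run-2": now - timedelta(hours=3),    # silent for three hours
    }
    print(stale_run_ids(reports, now, timedelta(hours=1)))
```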

@algs
Contributor Author

algs commented May 27, 2021

Maybe just list the runs that have not been updated/reported for a certain time; that should be only a few runs.

Couldn't a missing update be caused by a service error rather than by the corresponding workflow being deleted?

@Jeffwan
Member

Jeffwan commented May 27, 2021

My rough idea for an improvement: the persistence agent should also periodically list all runs from the KFP API that are still in the running state. If any of those runs does not have a corresponding Argo workflow, we should update the DB record, perhaps to a special state such as "WORKFLOW_DELETED".

An alternative: the persistence agent watches workflow changes and, when it sees a workflow delete event, checks the state of the associated run and updates the DB. I prefer whichever approach puts less pressure on the apiserver. :D
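
A minimal sketch of the event-driven variant follows. The event type strings "ADDED", "MODIFIED", and "DELETED" are the ones Kubernetes watch events use. Everything else is an assumption for illustration: on_workflow_event is a hypothetical handler, run_store is a plain dict standing in for the DB, and TERMINAL_STATES is an assumed set of finished states.

```python
# Assumed set of states that mean a run has already finished.
TERMINAL_STATES = {"Succeeded", "Failed", "Error"}

# Proposed state name from the discussion; not an existing KFP status.
WORKFLOW_DELETED = "WORKFLOW_DELETED"


def on_workflow_event(event_type, workflow_name, run_store):
    """React to one workflow watch event.

    event_type: "ADDED", "MODIFIED", or "DELETED".
    workflow_name: name of the workflow the event refers to.
    run_store: dict mapping workflow name -> run status (stand-in for the DB).
    Returns True if a run record was updated.
    """
    if event_type != "DELETED":
        return False
    status = run_store.get(workflow_name)
    # Nothing to do for unknown workflows or runs that already finished.
    if status is None or status in TERMINAL_STATES:
        return False
    run_store[workflow_name] = WORKFLOW_DELETED
    return True


if __name__ == "__main__":
    store = {"wf-a": "Running", "wf-b": "Succeeded"}
    print(on_workflow_event("DELETED", "wf-a", store))  # unfinished run: updated
    print(on_workflow_event("DELETED", "wf-b", store))  # finished run: untouched
    print(store)
```

Because the handler only does work when a delete event arrives, it avoids the repeated list calls of the polling approach. Delete events that occur while the agent is down would be missed unless some other mechanism catches up on them.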

@Bobgy
Contributor

Bobgy commented May 28, 2021

@Jeffwan great idea! That feels like the most efficient solution.

@stale

stale bot commented Aug 28, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Aug 28, 2021
@Bobgy
Contributor

Bobgy commented Sep 12, 2021

/lifecycle frozen
Anyone is welcome to contribute a fix for this!

@google-oss-robot google-oss-robot added lifecycle/frozen and removed lifecycle/stale The issue / pull request is stale, any activities remove this label. labels Sep 12, 2021