[bug] Kubeflow Pipeline experiment run is always running #5851
Comments
Have you set TTL via https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.dsl.html?highlight=node#kfp.dsl.PipelineConf.set_ttl_seconds_after_finished?
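For reference, a minimal sketch of setting that TTL inside a pipeline definition (assuming the KFP v1 SDK; the pipeline name, container op, and 600-second value are illustrative, not taken from this issue):

```python
import kfp
from kfp import dsl

@dsl.pipeline(name="ttl-example", description="Workflow is garbage-collected after it finishes")
def ttl_example_pipeline():
    # Ask Argo to delete the Workflow object 600 seconds after it reaches a
    # terminal state (Succeeded/Failed), via the linked PipelineConf method.
    dsl.get_pipeline_conf().set_ttl_seconds_after_finished(600)
    # Illustrative step; real pipeline ops would go here.
    dsl.ContainerOp(name="echo", image="alpine:3.14", command=["echo", "hello"])

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(ttl_example_pipeline, "ttl_example.yaml")
```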
@capri-xiyue
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Why does this pipelineconf function exist if it is unusable and broken?
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
What steps did you take:
Ran an experiment, defined in a workflow.py file and launched from JupyterLab.
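(For context, a minimal sketch of how such a run is typically launched from a notebook with the KFP v1 SDK; the module, function, and experiment names below are illustrative, not taken from the actual workflow.py.)

```python
import kfp
from workflow import my_pipeline  # illustrative: the @dsl.pipeline function defined in workflow.py

client = kfp.Client()  # connects to the in-cluster KFP API from the notebook
client.create_run_from_pipeline_func(
    my_pipeline,
    arguments={},
    experiment_name="my-experiment",  # illustrative experiment name
)
```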
What happened:
I have multiple experiment runs (workflows), each of which runs a set of Kubernetes pods (children inside the workflow), but for some reason some of the runs stay in a running state for days even though all their child pods have completed (Succeeded/Failed).
I have been looking at the logs of the workflow controller but couldn't find the issue. I also went through the workflow.py file to see if something was wrong there, but could not find a reason why some of the runs complete while others keep running. To be clear, I am talking about different runs of the same experiment.
When reviewing the logs of a succeeded workflow versus a still-running one, I noticed some differences:
From the successful workflow:
time="2021-06-11T17:38:37Z" level=info msg="Found Workflow default-tenant/txyz set expire at 2021-06-11 17:50:37 +0000 UTC (11m59.461359494s from now)"
time="2021-06-11T17:38:37Z" level=info msg="Queueing workflow default-tenant/xyz for delete in 11m59.461359494s"
time="2021-06-11T17:50:37Z" level=info msg="Deleting TTL expired workflow default-tenant/xyz"
time="2021-06-11T17:50:37Z" level=info msg="Successfully deleted 'default-tenant/xyz'"
From the still running workflow:
time="2021-06-11T09:13:10Z" level=info msg="Node not set to be retried after status: Error" namespace=default-tenant workflow=xyz
and the following message has kept repeating ever since:
time="2021-06-11T09:38:42Z" level=info msg="Processing workflow" namespace=default-tenant workflow=xyz
What did you expect to happen:
I was expecting all runs to reach a completed state (Succeeded/Failed).
Environment:
1.0.1
1.0.1
Anything else you would like to add:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.