[bug] Kubeflow Pipeline experiment run is always running #5851
Comments
Have you set TTL via https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.dsl.html?highlight=node#kfp.dsl.PipelineConf.set_ttl_seconds_after_finished?
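For reference, a minimal sketch of setting that TTL inside a pipeline definition (assuming the KFP v1 SDK; the pipeline name, container op, and 600-second value are illustrative, not taken from this issue):

```python
import kfp
from kfp import dsl

@dsl.pipeline(name="ttl-example", description="Workflow is garbage-collected after it finishes")
def ttl_example_pipeline():
    # Ask Argo to delete the Workflow object 600 seconds after it reaches a
    # terminal state (Succeeded/Failed), via the linked PipelineConf method.
    dsl.get_pipeline_conf().set_ttl_seconds_after_finished(600)
    # Illustrative step; real pipeline ops would go here.
    dsl.ContainerOp(name="echo", image="alpine:3.14", command=["echo", "hello"])

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(ttl_example_pipeline, "ttl_example.yaml")
```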
@capri-xiyue
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Why does this pipelineconf function exist if it is unusable and broken?
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
What steps did you take:
Ran an experiment, defined in a workflow.py file and launched from JupyterLab.
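(For context, a minimal sketch of how such a run is typically launched from a notebook with the KFP v1 SDK; the module, function, and experiment names below are illustrative, not taken from the actual workflow.py.)

```python
import kfp
from workflow import my_pipeline  # illustrative: the @dsl.pipeline function defined in workflow.py

client = kfp.Client()  # connects to the in-cluster KFP API from the notebook
client.create_run_from_pipeline_func(
    my_pipeline,
    arguments={},
    experiment_name="my-experiment",  # illustrative experiment name
)
```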
What happened:
I have multiple experiment runs (workflows), each of which runs a set of Kubernetes pods (children inside the workflow), but for some reason some of the runs stay in a running state for days even though all their child pods have completed (Succeeded/Failed).
I have been looking at the logs of the workflow controller but couldn't find the issue. I also went through the workflow.py file to see if something was wrong there, but could not find a reason why some of the runs complete while others keep running. To be clear, I am talking about different runs of the same experiment.
When reviewing the logs of a succeeded workflow versus a still-running one, I noticed some differences:
From the successful workflow:
time="2021-06-11T17:38:37Z" level=info msg="Found Workflow default-tenant/txyz set expire at 2021-06-11 17:50:37 +0000 UTC (11m59.461359494s from now)"
time="2021-06-11T17:38:37Z" level=info msg="Queueing workflow default-tenant/xyz for delete in 11m59.461359494s"
time="2021-06-11T17:50:37Z" level=info msg="Deleting TTL expired workflow default-tenant/xyz"
time="2021-06-11T17:50:37Z" level=info msg="Successfully deleted 'default-tenant/xyz'"
From the still running workflow:
time="2021-06-11T09:13:10Z" level=info msg="Node not set to be retried after status: Error" namespace=default-tenant workflow=xyz
and the following message has kept repeating ever since:
time="2021-06-11T09:38:42Z" level=info msg="Processing workflow" namespace=default-tenant workflow=xyz
What did you expect to happen:
I was expecting all runs to reach a completed state (Succeeded/Failed).
Environment:
1.0.1
1.0.1
Anything else you would like to add:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.