[bug] Kubeflow Pipeline experiment run is always running #5851

Closed
achikars opened this issue Jun 13, 2021 · 6 comments
Labels
kind/bug, lifecycle/stale

Comments

@achikars

What steps did you take

Ran an experiment that is defined in a workflow.py file and launched from JupyterLab.

What happened:

I have multiple experiment runs (workflows), each of which runs a set of Kubernetes pods (children of the workflow). For some reason, some of the runs have been in the Running state for days even though all of their child pods have completed (Succeeded/Failed).

I have looked at the workflow controller logs but couldn't find the issue. I have also gone through the workflow.py file to see if something is wrong there, but could not find the reason why some of the runs complete and some keep running. To be clear, I am talking about different runs of the same experiment.

When comparing the logs of a succeeded workflow with those of a workflow that keeps running, I see some differences:

From the successful workflow:

time="2021-06-11T17:38:37Z" level=info msg="Found Workflow default-tenant/txyz set expire at 2021-06-11 17:50:37 +0000 UTC (11m59.461359494s from now)"

time="2021-06-11T17:38:37Z" level=info msg="Queueing workflow default-tenant/xyz for delete in 11m59.461359494s"

time="2021-06-11T17:50:37Z" level=info msg="Deleting TTL expired workflow default-tenant/xyz"

time="2021-06-11T17:50:37Z" level=info msg="Successfully deleted 'default-tenant/xyz'"

. .

From the still running workflow:

time="2021-06-11T09:13:10Z" level=info msg="Node not set to be retried after status: Error" namespace=default-tenant workflow=xyz

and I have been getting the following message ever since:

time="2021-06-11T09:38:42Z" level=info msg="Processing workflow" namespace=default-tenant workflow=xyz
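The difference between the two excerpts can be checked mechanically: the successful workflow is queued for TTL deletion, while the stuck one only ever shows up in lines carrying `namespace=` and `workflow=` fields. Below is a minimal sketch, assuming the controller log uses the `key=value` layout shown above. The helper names and the workflow names `run-a` and `run-b` are hypothetical and only serve the illustration.

```python
import re

# Matches key="quoted value" or key=bare_value pairs in a log line.
FIELD_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')
# Message emitted when a finished workflow is scheduled for TTL deletion.
QUEUED_RE = re.compile(r"Queueing workflow (\S+) for delete")


def parse_fields(line):
    """Split one log line into a dict of its key=value fields."""
    return {key: quoted or bare for key, quoted, bare in FIELD_RE.findall(line)}


def workflows_never_queued_for_ttl(lines):
    """Return workflows seen in the log that were never queued for deletion."""
    seen, queued = set(), set()
    for line in lines:
        fields = parse_fields(line)
        match = QUEUED_RE.search(fields.get("msg", ""))
        if match:
            queued.add(match.group(1))
            seen.add(match.group(1))
        elif "namespace" in fields and "workflow" in fields:
            seen.add(f"{fields['namespace']}/{fields['workflow']}")
    return sorted(seen - queued)


# Hypothetical sample modelled on the excerpts above.
sample = [
    'time="2021-06-11T17:38:37Z" level=info msg="Queueing workflow default-tenant/run-a for delete in 11m59.461359494s"',
    'time="2021-06-11T09:38:42Z" level=info msg="Processing workflow" namespace=default-tenant workflow=run-b',
]
print(workflows_never_queued_for_ttl(sample))
```

A workflow that appears in this list is only a candidate for being stuck. A run that is still legitimately in progress would look the same, so the result needs to be compared against the pod states.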

What did you expect to happen:

I expected to see all runs in a completed state (Succeeded/Failed).

Environment:

  • How do you deploy Kubeflow Pipelines (KFP)?
  • KFP version:
    1.0.1
  • KFP SDK version:
    1.0.1

Anything else you would like to add:

Labels


Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@capri-xiyue
Contributor

capri-xiyue commented Jun 18, 2021

Have you set TTL via https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.dsl.html?highlight=node#kfp.dsl.PipelineConf.set_ttl_seconds_after_finished?
That is not the correct way to set the TTL.
The correct way to set the TTL in KFP is described in #3938.

@Hedingber

@capri-xiyue
Can you give some more details?
I am experiencing the same issue, and I am using set_ttl_seconds_after_finished to set the TTL on the workflow.
Why is it not the correct way to set the TTL? What is the purpose of this method?
The approach you linked to sets a TTL globally. What if I want a different TTL per pipeline?
It seems like this method is the way to go.
Above all, how and why does the way I set the TTL affect whether my pipeline moves between states correctly?
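
For reference, the per-pipeline call under discussion looks roughly like this in the KFP v1 SDK. This is a sketch only: the pipeline name, the container step, and the 720-second value are assumptions made for illustration, and the thread does not settle whether this call is what leaves runs stuck in Running.

```python
# Assumption: 12 minutes, chosen only to resemble the delay in the log excerpt.
DEFAULT_TTL_SECONDS = 720


def make_pipeline(ttl_seconds: int = DEFAULT_TTL_SECONDS):
    """Return a KFP v1 pipeline function that requests a per-pipeline TTL."""
    # Imported here so this module can be loaded without the SDK installed.
    from kfp import dsl

    @dsl.pipeline(
        name="ttl-demo",  # hypothetical name
        description="Pipeline whose finished workflow is removed after a TTL",
    )
    def ttl_demo_pipeline():
        # Hypothetical single step; any real pipeline steps would go here.
        dsl.ContainerOp(
            name="echo",
            image="alpine:3.12",
            command=["sh", "-c", "echo hello"],
        )
        # Ask for the finished workflow to be cleaned up after ttl_seconds.
        dsl.get_pipeline_conf().set_ttl_seconds_after_finished(ttl_seconds)

    return ttl_demo_pipeline
```

The returned function can be passed to the SDK compiler or client like any other pipeline function. The global alternative referenced in #3938 is configured on the deployment side rather than in pipeline code.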

@stale

stale bot commented Oct 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the lifecycle/stale label on Oct 2, 2021
@juliusvonkohout
Member

I fixed the first part, and the default from persistenceagent now works with Kubernetes resources (#6622), but I still think that the PipelineConf TTL_SECONDS_AFTER_WORKFLOW_FINISH per workflow is broken, as mentioned in #6432.

@juliusvonkohout
Member

@capri-xiyue Can you give some more details? I am experiencing the same issue, and I am using set_ttl_seconds_after_finished to set the TTL on the workflow. Why is it not the correct way to set the TTL? What is the purpose of this method? The approach you linked to sets a TTL globally. What if I want a different TTL per pipeline? It seems like this method is the way to go. Above all, how and why does the way I set the TTL affect whether my pipeline moves between states correctly?

Why does this PipelineConf function exist if it is unusable and broken?

@stale

stale bot commented Mar 3, 2022

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

stale bot closed this as completed on Mar 3, 2022
4 participants