
Frontend - When retry enabled, UI says "Execution was skipped" while it's not #4814

Closed · radcheb opened this issue Nov 23, 2020 · 3 comments · Fixed by #4819

Comments

radcheb (Contributor) commented Nov 23, 2020

What steps did you take:

When retry is enabled on a container_op, the UI always says its execution was skipped and its outputs were taken from cache, even on the very first execution. In fact, caching was never activated on our server.

To reproduce, I ran the following pipeline:

from kfp import dsl, Client, components, compiler

def add_fn(a: int, b: int) -> int:
    from datetime import datetime
    print("Current Time =", datetime.now().strftime("%H:%M:%S"))
    return a + b

ADD_OP = components.func_to_container_op(add_fn, base_image="library/python:3.6")

@dsl.pipeline(name="test_cache")
def test_cache_pipeline(a: int, b: int):
    # This op has retry enabled: the UI wrongly reports it as skipped/cached.
    add_op_with_retry = ADD_OP(a=a, b=b)
    add_op_with_retry.set_retry(1)

    # This op has no retry and is reported as executed, as expected.
    add_op_without_retry = ADD_OP(a=a, b=b)

client = Client()
compiler_instance = compiler.Compiler()
compiler_instance.compile(pipeline_func=test_cache_pipeline, package_path="test_pipeline_caching.yaml")
client.upload_pipeline(pipeline_package_path="test_pipeline_caching.yaml", pipeline_name="test_pipeline_caching")
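
A run can then be submitted against the pipeline function, for example (a minimal sketch; the experiment name and argument values below are placeholders, not part of the original report):

# Sketch only: submit a run so that both ops execute once.
run_result = client.create_run_from_pipeline_func(
    test_cache_pipeline,
    arguments={"a": 1, "b": 2},   # placeholder values
    experiment_name="Default",    # placeholder experiment
)
print("Started run:", run_result.run_id)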

What happened:

We got a misleading message (screenshot omitted): the step with retry enabled is reported as "Execution was skipped" with its outputs taken from cache.

Meanwhile, the UI shows the other step, the one without retry, as executed normally (screenshot omitted).

What did you expect to happen:

Expected the UI to report the step as actually executed, not skipped, since caching was never enabled on our cluster.

Environment:

Kubeflow 1.0.0.

How did you deploy Kubeflow Pipelines (KFP)?
Deployed on EKS 1.15

KFP version: 1.0.0

KFP SDK version: 1.0.0

Anything else you would like to add:

Caching was disabled by default.

/kind bug
/area frontend

Ark-kun (Contributor) commented Nov 23, 2020

@Bobgy The actually skipped pods have the pipelines.kubeflow.org/reused_from_cache label set to true. We might want to use that for the reuse detection.
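
As an illustration (not part of the original comment), pods that were actually skipped due to caching can be listed by that label with the Kubernetes Python client; the namespace below is an assumption:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
# List pods carrying the cache-reuse label mentioned above.
cached_pods = v1.list_namespaced_pod(
    namespace="kubeflow",  # assumption: adjust to your KFP namespace
    label_selector="pipelines.kubeflow.org/reused_from_cache=true",
)
for pod in cached_pods.items:
    print(pod.metadata.name)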

elikatsis (Member) commented

The problem here is that the function which decides whether a node was cached is wrong for non-pod nodes. A proper fix would be to change this function:

function wasNodeCached(node: NodeStatus): boolean {
  const artifacts = node.outputs?.artifacts;
  if (!artifacts || !node.id) {
    return false;
  }
  // HACK: There is a way to detect the skipped pods based on the WorkflowStatus alone.
  // All output artifacts have the pod name (same as node ID) in the URI. But for skipped
  // pods, the pod name does not match the URIs.
  // (And now there are always some output artifacts since we've enabled log archiving).
  return artifacts.some(artifact => artifact.s3 && !artifact.s3.key.includes(node.id));
}

Lines 109-111 (the if block above) should be:

if (!artifacts || !node.id || node.type !== 'Pod') {
  return false;
}

I can submit a PR for it.

elikatsis added a commit to arrikto/kubeflow-pipelines that referenced this issue Nov 23, 2020
…w#4814)

The UI was deciding whether a node was cached based on whether it was
pointing to some other node's output artifacts.

This works for nodes that correspond to pods, but leads to a wrong
decision for the rest of the nodes. For example, if the node is a Retry
one, it points to another node's outputs because it is that other node
that actually produced them.

We now apply this logic on pod-type nodes only.

Fixes kubeflow#4814

Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>

elikatsis (Member) commented

Actually, I created the PR 😄 #4819

k8s-ci-robot pushed a commit that referenced this issue Nov 24, 2020
…4819)

Fixes #4814

Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>