
Extend the DSL to implement the design of #801 #926

Merged · 12 commits · Apr 25, 2019

Conversation

@elikatsis (Member) commented Mar 6, 2019

Hello,

This is a PR to implement all the features described in #801.

  • We define two new pipeline components: PipelineVolumes and PipelineVolumeSnapshots. These are an abstraction that lets the user manipulate PVCs and VolumeSnapshots for data management purposes, and they also expose dependencies according to their usage
  • We extend ContainerOp's constructor to accept one more argument: volumes. It is a dict with filesystem paths as keys and PipelineVolumes or PipelineVolumeSnapshots as values (see the sketch after this list). We rename the pre-existing internal attribute volumes to k8s_volumes, to avoid misinterpretation.
  • The compiler manipulates all objects (ContainerOp, PipelineVolume, PipelineVolumeSnapshot) using the existing ops dictionary. However, we also refer to objects separately based on their type, grouping them for type-specific processing: cops, vols, snaps. Previously, cops was the only type of object in ops.
  • We extend the UI to show PipelineVolumes and PipelineVolumeSnapshots tasks differently:
    • PipelineVolume tasks which create new PVCs appear in a different color (mistyrose)
    • PipelineVolumeSnapshot tasks which create new VolumeSnapshots appear in a different color (honeydew)
    • We add a Volume Mounts section in the UI which lists all mount points in a ContainerOp along with their mounted volumes.
    • For all resource templates we show the full manifest, which is to be submitted to the cluster
  • We add sample code showing the usage of this new feature in the samples/volumes/ directory. These examples correspond to the design of #801 (Extend the DSL with support for Persistent Volumes and Snapshots)
  • We add a full test suite in the sdk/python/tests/dsl/ and sdk/python/tests/compiler/ directories
  • We add full testing for the frontend
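
For illustration, here is the sketch referenced above: a minimal pipeline using the extended ContainerOp constructor. It follows this PR's description; the image, sizes, and PipelineVolume arguments are assumptions for the example, not the definitive API.

import kfp.dsl as dsl

@dsl.pipeline(
    name="volumes-example",
    description="Sketch of the new volumes argument"
)
def volumes_example():
    # A PipelineVolume backed by a new, dynamically provisioned PVC
    # (argument names are assumptions for this sketch)
    vol = dsl.PipelineVolume(name="workdir", size="1Gi")

    # Mount it by passing a {mount_path: volume} dict via the new argument
    step = dsl.ContainerOp(
        name="producer",
        image="library/bash:4.4.23",
        command=["sh", "-c"],
        arguments=["echo data > /data/file1"],
        volumes={"/data": vol}
    )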

I only recently started learning about the frontend, so looking forward to your feedback on my tests 🙂

Related issues

Closes #801

Make sure you have applied the fixes of argoproj/argo-workflows#1238 and kubeflow/kubeflow#2556. This PR also depends on a version of Argo that includes the fix merged in argoproj/argo-workflows#1232. To help with the review process, I have uploaded updated Argo images at gcr.io/arrikto/argoexec:latest-elikatsis and gcr.io/arrikto/workflow-controller:latest-elikatsis.



@k8s-ci-robot (Contributor)

Hi @elikatsis. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


@elikatsis (Member, Author)

I force-pushed to ensure the commits are in proper chronological order.

@vkoukis (Member) commented Mar 6, 2019

/ok-to-test

@hongye-sun self-assigned this on Mar 6, 2019
@elikatsis force-pushed the pr-feature-dsl-volumes branch 4 times, most recently from 0b03f0c to 5fef914 on March 6, 2019 at 22:40
@hongye-sun (Contributor)

Thanks a lot for submitting the PR. I will review it shortly.

@hongye-sun (Contributor) left a comment

I left some comments on an alternative implementation that does not specialize volumes and snapshots in the DSL, and instead just relies on pipeline params to pass data and dependencies. I feel this pattern can be applied to more generic k8s resources, not just volumes.

self.data_source_name = None
self.mode = None
self.deps = set()
self.k8s_resource = None
@hongye-sun (Contributor):

I really think there should be a base class named K8sResourceOp, in parallel with ContainerOp, to create general k8s resources. Volume and snapshot would be two specialized ops deriving from it. A lot of code here could be reused.

I think it is also reasonable to do this refactoring in a future PR.

@vicaire (Contributor):

+1 on having a K8sResourceOp

@vkoukis (Member):

@hongye-sun

I really think there should be a base class named K8sResourceOp, in parallel with ContainerOp, to create general k8s resources. Volume and snapshot would be two specialized ops deriving from it. A lot of code here could be reused.

@vicaire

+1 on having a K8sResourceOp

Hello @hongye-sun, @vicaire, thanks for the review!

This is a great suggestion. I am amending the design doc to include your comment and a suggestion on how we can do it. We have tested it with an initial implementation of the code and it works beautifully. Essentially, it gives the user a way to create a custom K8s resource from a resource spec, and have attributes of the created K8s resource be PipelineParams for the rest of the pipeline.

Let's continue the discussion in the design doc; I will submit an extension shortly.
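
To make the suggestion concrete, here is a rough sketch of what such a generic resource op could look like. All names (the op spelled here as dsl.ResourceOp, attribute_outputs, the manifest) are assumptions for illustration, pending the design doc extension.

import kfp.dsl as dsl

pvc_manifest = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"generateName": "my-pvc-"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "1Gi"}},
    },
}

# Hypothetical: create an arbitrary K8s resource from its spec
rop = dsl.ResourceOp(
    name="create-pvc",
    k8s_resource=pvc_manifest,
    attribute_outputs={"name": "{.metadata.name}"}
)

# Attributes of the created resource become PipelineParams for later steps
step = dsl.ContainerOp(
    name="consumer",
    image="library/bash:4.4.23",
    command=["sh", "-c"],
    arguments=["echo PVC name: %s" % rop.outputs["name"]]
)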

@hongye-sun (Contributor):

@vkoukis

The new design looks awesome and I really like it. It aligns with our vision of DSL and provides an elegant way to orchestrate arbitrary k8s resources. Let's continue our discussion on the design doc.

Contributor:

By the way, it's worth taking a look at #879. In that PR, @eterna2 followed the same pattern to make ContainerOp and Container (which inherits from V1Container) two different things, so that:

  1. We are able to expose the full k8s container spec from ContainerOp
  2. SideCar can inherit from Container
  3. (in the future) We can create a TFJob which contains a Container. Then, all the convenient APIs that we support in Container, like param parsing, volume mounting, etc., can simply be reused.

That PR also implements a more generic way to extract PipelineParams from any object, as long as it follows the conventions of swagger-generated code. This effort should make the ResourceOp implementation a lot easier.
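
For readers unfamiliar with that pattern, here is a rough sketch of the split described in #879 (class names come from that PR's description; the bodies are placeholders, not the actual implementation):

from kubernetes.client.models import V1Container

class Container(V1Container):
    # Carries the full k8s container spec plus the DSL conveniences
    # (param parsing, volume mounts, ...), so they can be reused.
    def add_volume_mount(self, volume_mount):
        ...

class SideCar(Container):
    # A sidecar is just another Container
    pass

class ContainerOp:
    # The pipeline step owns a Container instead of being one,
    # so a future TFJob op could own one (or several) as well.
    def __init__(self, name, container):
        self.name = name
        self.container = container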

@vkoukis (Member) commented Mar 28, 2019:

@hongye-sun

The new design looks awesome and I really like it. It aligns with our vision of DSL and provides an elegant way to orchestrate arbitrary k8s resources. Let's continue our discussion on the design doc.

I think our visions for the DSL are aligned, it's just a matter of finding the most expressive/succinct way of describing our end goal. Your comments are very helpful in this regard. Looking forward to your comments on the design doc.

storage_class: The storage class to use for the dynamically created
PVC (requires size or data_source)
data_source: The name of an existing VolumeSnapshot or a
PipelineVolumeSnapshot object
@hongye-sun (Contributor):

Is it possible to make PipelineVolumeSnapshot derive from VolumeSnapshot? Why do you need to differentiate them?

@vicaire (Contributor):

+1. Is it possible to not have a concept of volume that is specific to Pipeline?

Member:

@hongye-sun

Is it possible to make PipelineVolumeSnapshot derive from VolumeSnapshot? Why do you need to differentiate them?

@vicaire

+1. Is it possible to not have a concept of volume that is specific to Pipeline?

I will be amending the design doc. Yes, we can have a volume be a descendant of a generic K8s volume resource, V1Volume(). Similarly for the snapshot. The thing this approach does not cover is the ability to deduce dependencies between steps. I will continue in the design doc on #801.

size: The size of the PVC which will be created
storage_class: The storage class to use for the dynamically created
PVC (requires size or data_source)
data_source: The name of an existing VolumeSnapshot or a
Contributor:

Could you add a note that this is an alpha feature and needs to be enabled in the k8s cluster?

Member:

Yup! Will add a comment in the final PR.

description="The fifth example of the design doc."
)
def example5(rok_url):
vol1 = dsl.PipelineVolume(
@hongye-sun (Contributor):

This class mingles the volume spec and the volume-creation operation together, which is not ideal to me. Can we make them separate classes? For example:

PipelineVolume serves as a data object which contains the volume spec.
PersistentVolumeClaimOp is similar to ContainerOp and serves to create a PVC in the pipeline.
PipelineVolume can be an output of PersistentVolumeClaimOp.

Later, we might refactor the code so that PipelineVolume derives from the k8s volume spec, and PersistentVolumeClaimOp derives from K8sResourceOp, which can be used to create general k8s resources.

WDYT?

Member:

@hongye-sun

This class mingles the volume spec and the volume-creation operation together, which is not ideal to me. Can we make them separate classes? For example:

PipelineVolume serves as a data object which contains the volume spec.
PersistentVolumeClaimOp is similar to ContainerOp and serves to create a PVC in the pipeline.
PipelineVolume can be an output of PersistentVolumeClaimOp.

Yes! This was a great comment.
We went through this with @elikatsis and tried to iterate on this approach.
We have verified with actual code that this indeed improves the design even further.

Essentially what you are saying is:
let's separate the data object, which refers to the data being passed from step to step, from the operation, the DAG step that creates the object.

This will work. First have a generic way to create a K8s resource, similar to ContainerOp; then inherit from it so we can have an Op that creates PVCs; then have this Op produce a data object, the PipelineVolume, the volume itself, ready to be consumed by next steps.

Later, we might refactor the code so that PipelineVolume derives from the k8s volume spec, and PersistentVolumeClaimOp derives from K8sResourceOp, which can be used to create general k8s resources.

WDYT?

This was a great idea. I'll be amending the design doc; we have verified with real code that this works.
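
A minimal sketch of the resulting split, consistent with the VolumeOp/volume usage that appears in the samples later in this thread (exact argument names are assumptions):

import kfp.dsl as dsl

# The DAG step: an op that creates the PVC
vop = dsl.VolumeOp(
    name="create-volume",
    resource_name="mydata",
    size="1Gi"
)

# The data object it emits: a PipelineVolume, ready to be mounted downstream
step = dsl.ContainerOp(
    name="reader",
    image="library/bash:4.4.23",
    command=["sh", "-c"],
    arguments=["ls /data"],
    volumes={"/data": vop.volume}
)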

command=["sh", "-c"],
arguments=["cat /data/file*| gzip -c >/data/full.gz"],
volumes={"/data": vol1}
)
@hongye-sun (Contributor):

If we change vol1 to be an output of a PvcOp, we can embed the output PipelineParam of PvcOp in the volume spec. In the ContainerOp, we can parse the volumes for input params the same way we do args and command. In this way, PipelineVolume can serve as a pure data object, and we can just rely on PipelineParam to manage data passing and dependencies.

Member:

@hongye-sun

If we change vol1 to be an output of a PvcOp, we can embed the output PipelineParam of PvcOp in the volume spec. In the ContainerOp, we can parse the volumes for input params the same way we do args and command. In this way, PipelineVolume can serve as a pure data object, and we can just rely on PipelineParam to manage data passing and dependencies.

Yes, this works, but it has two disadvantages; let's continue the discussion in #801:

  1. It requires modifications to Argo to allow container templates to consume volumes that contain input parameters, which must be evaluated before each container task in the DAG
  2. It cannot capture the case where multiple container tasks use the same volume and another task must wait for all of them to finish before it can consume [mount] the volume.

image="library/bash:4.4.23",
command=["sh", "-c"],
arguments=["cat /data/file*| gzip -c >/data/full.gz"],
volumes={"/data": vol1}
@hongye-sun (Contributor):

It makes sense to me to support volumes in ContainerOp's constructor, but it's better to make it support a generic k8s volume spec here. PipelineVolume still has value: it hides the complexity of k8s SDK code, and it may keep additional data beyond the k8s Volume, like subPath, to facilitate mounting input and output folders from a single volume.

Member:

@hongye-sun

It makes sense to me to support volumes in ContainerOp's constructor, but it's better to make it support a generic k8s volume spec here. PipelineVolume still has value: it hides the complexity of k8s SDK code, and it may keep additional data beyond the k8s Volume, like subPath, to facilitate mounting input and output folders from a single volume.

ACK. We must simplify the way ContainerOp instances can consume arbitrary V1Volume instances. If we make PipelineVolume a descendant of V1Volume, then things will work automagically, whether we create k8s.V1Volume instances manually or we depend on a previous task in the pipeline emitting a PipelineVolume instance.

image="library/bash:4.4.23",
command=["sh", "-c"],
arguments=["cat /mnt/file1 /mnt/file2"],
volumes={"/mnt": vol_common.after(step1, step2)}
@hongye-sun (Contributor):

I am not sure if this code is really needed. Users can achieve the same effect with step3.after(step1, step2).

Contributor:

+1

Member:

@hongye-sun

I am not sure if this code is really needed. Users can achieve the same effect with step3.after(step1, step2).

This is the only comment we don't really agree with.
Apart from being a bit unfortunately named -- I'll open a separate issue for this -- I understand step3.after() is meant to be a last resort: a way to declare dependencies when there is no other way for the compiler to deduce them.

But what is great about PipelineParam instances is that the compiler can detect dependencies based on how we use them, i.e., when we format strings based on them.

Similarly, this PR aims to make volumes a first-class way of passing GBs of objects between steps, thus creating dependencies implicitly.

We argue as follows: PipelineParam instances are the only way to deduce dependencies now; we will soon have full artifacts [and we are happy to contribute to their implementation!]; volumes should similarly be such a way. So, yes, a PipelineVolume is a V1Volume, agreed, but it also brings in dependencies when used.

Let's talk more about this after I amend #801.

volumes={"/data": vol2}
)

step2_snap = dsl.PipelineVolumeSnapshot(
@hongye-sun (Contributor):

Since this feature is still in Alpha, I am worried that not every user can use it. Can we avoid specializing it in the DSL for now, use the k8s spec to represent it, and provide a K8sResourceOp to create it in the DSL?

Member:

@hongye-sun

Since this feature is still in Alpha, I am worried that not every user can use it. Can we avoid specializing it in the DSL for now, use the k8s spec to represent it, and provide a K8sResourceOp to create it in the DSL?

We agree with creating a special Op to create resources.
But using the k8s resource manually does not work; it leads to lots of boilerplate.
Using a simple type in the DSL that inherits from this resource simplifies the pipelines significantly.

This is also what has gathered very positive end-user feedback, both in the design doc issue #801 and in discussions we've had with potential users in the wild. So, I think it is best to keep it. We will add a warning that this will not work unless supported by the cluster.

@@ -1,5 +1,5 @@
/*
* Copyright 2018 Google LLC
* Copyright 2018-2019 Google LLC
@vicaire (Contributor):

One more overall comment: Is it possible to implement something that would work even if a pipeline step is used multiple times? In this case, one execution should not overwrite the data written by another.

For example, Ning is planning to add loop support: https://docs.google.com/document/d/12KHoEGe3o-i2WyzaU2JPXp3GQL3_BavGuh-KYBMUdQ8/edit?ts=5c5b2627#heading=h.bhvs46afvzxg

Member:

@vicaire

One more overall comment: Is it possible to implement something that would work even if a pipeline step is used multiple times? In this case, one execution should not overwrite the data written by another.

For example, Ning is planning to add loop support: https://docs.google.com/document/d/12KHoEGe3o-i2WyzaU2JPXp3GQL3_BavGuh-KYBMUdQ8/edit?ts=5c5b2627#heading=h.bhvs46afvzxg

I will amend the design doc in #801 so we use PipelineParam instances in the K8s resource specs referred to by PipelineVolume and PipelineVolumeSnapshot instances. This should make them usable in loops.

We don't have access to an implementation of loops yet, so we cannot verify our code, but we are confident this will work, based on our use of PipelineParam instances. Let's talk more about it in #801, if you feel there is something that can cause problems with your implementation.

@elikatsis (Member, Author)

Hello!
I force-pushed the changes we discussed in the design doc.
The following PRs are still pending and are needed for the whole feature to be usable: kubeflow/kubeflow#2556, argoproj/argo-workflows#1238.
I believe they will get reviewed soon.

@vicaire (Contributor) left a comment

Thanks a lot for your contributions @vkoukis. Here are a couple of initial comments.

Since this is such a big change, I am asking @rileyjbauer to review the frontend part, and @hongye-sun to approve the DSL part (@hongye-sun will be back next week).

step2_snap = dsl.VolumeSnapshotOp(
name="create_snapshot_2",
resource_name="snap2",
volume=step2.volume
@vicaire (Contributor):

Does this sample focus on having one volume per output? Could we cover the case where the same volume is reused throughout the workflow, but a different directory within the volume is used for each output?

Member:

@vicaire

Does this sample focus on having one volume per output? Could we cover the case where the same volume is reused throughout the workflow, but a different directory within the volume is used for each output?

We have modified the resourceops/volume_snapshotop_sequential.py sample, which covers exactly your desired case: one volume, used by all steps, with each step storing its output in a different directory.

image="library/bash:4.4.23",
command=["sh", "-c"],
arguments=["cat /mnt/file1 /mnt/file2"],
volumes={"/mnt": vop.volume.after(step1, step2)}
Contributor:

This "vop.volume.after" statement should not be needed if we use Argo's parameter passing functionality. Is there a way we could avoid exposing this feature and focus instead on using Argo's parameter passing functionality?

Member:

This "vop.volume.after" statement should not be needed if we use Argo's parameter passing functionality. Is there a way we could avoid exposing this feature and focus instead on using Argo's parameter passing functionality?

We already use Argo's parameter passing functionality, for resource names in general, and volume and snapshot names in particular.
We already use the existing mechanism that the compiler has to deduce dependencies based on PipelineParams: Because the names of resources created by ResourceOps are output PipelineParams, any ContainerOp that needs to mount a volume will already take a dependency on the ResourceOp that is supposed to create it.

This mechanism, however, does not suffice. The problem is how to explain to the compiler that three ContainerOps have to use the same volume sequentially, because one step needs to access files that the previous step must have placed in the volume. In this case, there may be no PipelineParam passed from step to step.

This is our rationale: a dsl.PipelineVolume is a K8s Volume that brings dependencies, exactly as a PipelineParam is a string that brings dependencies. Similarly, vop.volume.after(step1, step2, ...) is not a statement, it is an expression that returns a new dsl.PipelineVolume that has taken a dependency on step1, step2.

KFP only supports PipelineParams for deducing dependencies now. But in a while, we will also have artifacts. We think the platform should afford volumes the same first-class citizen treatment.

This avoids cumbersome workarounds using dummy PipelineParams like the following:

vop = dsl.VolumeOp(...)

step1 = dsl.ContainerOp(
    name="step_1",
    image="some_image_1",
    command=["sh", "-c"],
    arguments=["echo 1 > /mnt/file1 && "
               "some_processing_that_writes_to_the_volume"],
    # Dummy output, created only so step_2 can take a dependency on it
    file_outputs={"dummy": "/mnt/file1"},
    volumes={"/mnt": vop.volume}
)

step2 = dsl.ContainerOp(
    name="step_2",
    image="some_image_2",
    command=["sh", "-c"],
    # Consuming step1.output forces step_2 to run after step_1
    arguments=["echo %s && some_processing_that_reads_from_volume"
               % step1.output],
    volumes={"/mnt": vop.volume}
)

In that case, these Argo parameters (KFP's PipelineParams) would have to be dummy data.
That, however, slows down the pipeline's execution, because Argo's executor has to retrieve and save these parameters.

It also requires the user to write additional, unneeded code, just for that parameter passing, i.e.:

  1. The user has to create a file, let's say /tmp/dummy
  2. Point to it in file_outputs
  3. Use it in another step's command/arguments (e.g. "echo %s" % previous_step.output; this will also be logged, but it will have no value).

All of the above led us to provide this feature: driving dependencies from PipelineVolumes (wrappers of V1Volume).
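
For contrast, the same dependency expressed with the proposed volume-driven mechanism, with no dummy outputs (vop and step1 as in the snippet above):

step2 = dsl.ContainerOp(
    name="step_2",
    image="some_image_2",
    command=["sh", "-c"],
    arguments=["some_processing_that_reads_from_volume"],
    # after() returns a new PipelineVolume carrying a dependency on step_1
    volumes={"/mnt": vop.volume.after(step1)}
)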

Contributor:

Thank you @vkoukis. I am a bit hesitant about having multiple ways to specify dependencies between pipeline steps depending on how the data is stored (volume or not).

Would it be possible to define dependencies between steps in only one way? What about a way to enforce that step_2 should execute after step_1 that does not involve volumes?

Member:

Actually, there are already two ways of specifying dependencies between pipeline steps. One is using the after() method explicitly, while the other is to use a PipelineParam in any field of an Op.
Similarly, I assume that when KFP has support for artifacts, the compiler will derive dependencies based on their usage.

This is what makes PipelineParams powerful. I fail to see why the platform cannot afford volumes the same treatment. By not doing so, we are essentially limiting the user experience severely: if a user works with PipelineParams, they can write nice, succinct code; the proposed changes carry this functionality over, so it also works when using volumes for data exchange. We have shown in the examples that the same kind of code can be written with volumes being the medium for passing information from step to step.

Essentially, PipelineVolumes should be of the same object "family" in the DSL: objects that the compiler uses to deduce dependencies.

name="step2_gunzip",
image="library/bash:4.4.23",
command=["gunzip", "-k", "/data/full.gz"],
volumes={"/data": vop2.volume}
@vicaire (Contributor):

It seems that these samples won't work with loops/recursion or whenever a step is reused multiple times. The same volume will be reused and the data overwritten.

Here are some examples/documentations for loop/recursion on which to validate the volume implementation:

https://www.kubeflow.org/docs/pipelines/sdk/dsl-recursion/
https://github.com/kubeflow/pipelines/blob/master/samples/basic/recursion.py
https://github.com/kubeflow/pipelines/blob/master/sdk/python/tests/compiler/testdata/recursive_while.py

Member:

@vicaire

It seems that these samples won't work with loops/recursion or whenever a step is reused multiple times. The same volume will be reused and the data overwritten.

Actually, what is going to happen depends on the user. If the user wishes to use a single volume throughout the loop, for all of its iterations, then they should create it before the loop. Similarly, if they have explicitly placed a ResourceOp inside a loop, then they expect multiple resources to be created, one for each iteration, so they need to set the name of the resource accordingly, e.g., based on a PipelineParam value. This will ensure that each loop iteration creates a differently-named resource. Our mechanism supports all of this; it's up to the user to decide their policy.

Here is another example: if the resource has meaning specific to an iteration, e.g. for debugging, then the user should specify a PipelineParam as part of the resource_name argument.
Similarly, the user can choose which volume to mount on a Container based on the value of flip_result in your recursive_while.py example.

Another way is for the user to provide a k8s_resource directly and fill in metadata.generateName, to allow K8s to determine the final name. No matter what they choose, they will be able to use the final name as an output PipelineParam of this ResourceOp.
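
A sketch of the naming approach described above (a hypothetical pipeline; iteration stands in for whatever PipelineParam drives the loop):

import kfp.dsl as dsl

@dsl.pipeline(name="loop-naming-sketch")
def loop_naming(iteration="0"):
    # Embedding the PipelineParam in resource_name makes each loop
    # iteration create a differently-named PVC
    vop = dsl.VolumeOp(
        name="create-volume",
        resource_name="vol-%s" % iteration,
        size="1Gi"
    )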

@vicaire (Contributor) commented Apr 18, 2019:

Thank you @vkoukis. The current thinking was that we store data in a directory as follows:

/"subdirectory"/"argo-workflow-UUID"/"POD-UUID"/...

This avoids collisions if the same step is executed multiple times (loop, recursion, reuse of a template).

To make it easy to get the workflow UUID and the pod UUID, we use special Argo strings that get substituted by Argo as the workflow executes: {{workflow.uid}}, {{pod.name}}

https://github.com/argoproj/argo/blob/master/docs/variables.md

Would it be possible for the DSL to automatically use these strings substituted by Argo to avoid the possibility of a collision?

Member:

I'm not sure I follow here.
Accessing the volume happens at the filesystem layer.
The task that runs inside the ContainerOp can [and should!] be able to write anywhere in the volume.

Exactly how to keep the results of different executions separate is actually user-defined policy.
Some users may create a new volume for each run; this is actually what we suggest our users do, because it is super easy and efficient to create new volumes/clones. Others may choose to work under a specific subdirectory, in which case they can always refer to {{workflow.uid}} when accessing the volume. A subPath mount may come in useful in this case.

Essentially, where the data lives in the volume is user-defined. And users can always use PipelineParams to pass extra information from step to step, e.g., the specific directories where they have placed results.

This is also important so we can seed a pipeline from notebook data. It doesn't make sense to impose a specific /{{workflow.uid}} prefix for every single access, since the data may also have been created outside a pipeline context, from within a notebook.
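
For example, a step that keeps its results under a run-specific subdirectory could look like the following sketch (hypothetical image and paths; Argo substitutes {{workflow.uid}} at runtime):

step = dsl.ContainerOp(
    name="writer",
    image="library/bash:4.4.23",
    command=["sh", "-c"],
    # Each run writes under its own {{workflow.uid}} directory
    arguments=["mkdir -p /data/{{workflow.uid}} && "
               "echo results > /data/{{workflow.uid}}/out.txt"],
    volumes={"/data": vop.volume}
)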

@rileyjbauer (Contributor) left a comment

(Just reviewing the changes under /frontend)

This is really great! Thank you for being so thorough!

const child = templates.get(task.template);
if (child) {
if (child.nodeType === 'dag') {
buildDag(graph, task.template, templates, alreadyVisited, nodeId);
} else if (child.nodeType === 'container' ) {
} else if (child.nodeType === 'container' || child.nodeType === 'resource') {
@rileyjbauer (Contributor):

nit: remove extra space at beginning of line

@elikatsis (Member, Author):

Hi @rileyjbauer, thank you for taking the time to review the PR!

Sorry for that, I fixed it 🙂

nodeColor = 'cornsilk';
} else if (child && child.nodeType === 'resource') {
const isSnapshotIf = /apiVersion: snapshot.storage.k8s.io/;
const isPVCIf = /apiVersion: v1\nkind: PersistentVolumeClaim/;
@rileyjbauer (Contributor):

(most likely moot due to above comment)

Is there any danger in changing these two regex to the following?

const isSnapshotIf = /kind: VolumeSnapshot/;
const isPVCIf = /kind: PersistentVolumeClaim/;

If they do need to remain as they are, and assuming the manifest will be like the example provided in StaticGraphParserTest, this one should be:
/apiVersion: v1\\nkind: PersistentVolumeClaim/

@elikatsis (Member, Author):

Actually, /kind: <kind>/ may also appear in other fields of a manifest.

E.g., a PVC created from a VolumeSnapshot has a field which contains /kind: VolumeSnapshot/, and vice versa.

So we might need such regexes when we refactor the UI, as you mentioned in another comment.

For now, I have removed them.

@rileyjbauer (Contributor):

Ah, I see! Thanks for explaining

let nodeColor;
const template = templates.get(node.templateName);
if (template && template.resource) {
const isSnapshotIf = /apiVersion: snapshot.storage.k8s.io/;
@rileyjbauer (Contributor):

(Also likely moot)
Same comment as in StaticWorkflowParser: any problem with changing it to just use kind?

const isSnapshotIf = /kind: VolumeSnapshot/;
const isPVCIf = /kind: PersistentVolumeClaim/;

@elikatsis (Member, Author):

See here.

}

const node = workflow.status.nodes[nodeId];
let tmpl;
@rileyjbauer (Contributor):

this for-loop can be replaced with

tmpl = workflow.spec.templates.find(t => !!t && !!t.name && t.name === node.templateName);

Same comment below in getNodeManifest

@elikatsis (Member, Author):

Nice one! Thank you!

// Makes sure the workflow object contains the node and returns its
// volume mounts if any.
public static getNodeVolumeMounts(workflow: Workflow, nodeId: string): string[][] {
type volumeInfo = string[][];
@rileyjbauer (Contributor):

nit:
This type isn't really needed. The same is true for the paramList above, but I thought it would just make the line

const inputsOutputs: [paramList, paramList] = [[], []];

a little nicer compared to

const inputsOutputs: [string[][], string[][]] = [[], []];

though I am open to disagreement about that, and to removing the paramList type as well.

(same goes for the manifestInfo type below)

@elikatsis (Member, Author):

You are right. I had it for uniformity, but I have simplified it.

let nodeColor;
if (task.when) {
nodeColor = 'cornsilk';
} else if (child && child.nodeType === 'resource') {
@rileyjbauer (Contributor):

Sorry to ask, but do you mind reverting the changes around coloring in the StaticGraphParser and WorkflowParser?

I appreciate the work here, but the topic of colors in these graphs is still a bit contentious (we may end up undoing the conditional/cornsilk coloring as well), and in the runtime graph especially, this can lead to confusing messaging in extreme cases, e.g.:
[screenshot]

I still think having some sort of visual cue that these nodes interact with the volumes would be beneficial, perhaps some icons + tooltips? But that discussion can/should happen after this PR gets merged, and you certainly needn't be the ones to do that later work.

@elikatsis (Member, Author):

I removed all the coloring.

We would be happy to join the future discussion on the UI enrichment, and of course contribute to it!

@paveldournov (Contributor)

Thanks for the screenshots!

@Ark-kun (Contributor)

Ark-kun commented Apr 25, 2019

@Ark-kun wants to suggest one last change, but no work is required on your part, as he is going to submit a PR to your repo/branch for your review (ETA: end of day).

Here is the PR to Arrikto's branch arrikto#2. It just moves some code back where it was so that the diff is ~1200+ lines smaller.

@elikatsis or @vkoukis only need one click on GitHub to merge it.

@Ark-kun (Contributor)

Ark-kun commented Apr 25, 2019

We'd like to merge this PR by the end of tomorrow.

I want to mention that I'm not blocking this PR. I'd like arrikto#2 to be incorporated, but we'll merge this PR tomorrow regardless.

This way the diff is much smaller and more understandable. We can always split or refactor the file later. Refactorings should not be mixed with genuine changes.
@elikatsis (Member, Author)

I see that all the tests have succeeded.

@Ark-kun thank you once again for your proposal. It is totally reasonable and I merged it.
And thank you @hongye-sun @vicaire @rileyjbauer for your reviews.

@vicaire (Contributor)

vicaire commented Apr 25, 2019

/lgtm
/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vicaire

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@vicaire (Contributor)

vicaire commented Apr 25, 2019

Thanks a lot for your contribution @elikatsis and @vkoukis, and apologies for the delays.

@k8s-ci-robot merged commit 07cb50e into kubeflow:master on Apr 25, 2019
@vkoukis (Member)

vkoukis commented Apr 25, 2019

Thanks a lot for your contribution @elikatsis and @vkoukis, and apologies for the delays.

Thank you, @vicaire , @hongye-sun , @rileyjbauer , @Ark-kun for the insightful comments, reviews, and help! We're really happy to see this PR being merged, and are looking forward to making follow-up PRs.

@hongye-sun (Contributor)

Thanks @vkoukis and @elikatsis for the high-quality PR. Really glad that it finally went through.
