Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SDK - Components refactoring #2865

Merged
merged 5 commits into from
Jan 25, 2020

Conversation

Ark-kun
Copy link
Contributor

@Ark-kun Ark-kun commented Jan 17, 2020

This change is a pure refactoring of the implementation of component task creation.
For pipelines compiled using the DSL compiler (the compile() function or the command-line program) nothing should change.

The main goal of the refactoring is to change the way the component instantiation can be customized.
Previously, the flow was like this:

ComponentSpec + arguments --> TaskSpec --resolving+transform--> ContainerOp

This PR changes it to more direct path:

ComponentSpec + arguments --constructor--> ContainerOp
or
ComponentSpec + arguments --constructor--> TaskSpec
or
ComponentSpec + arguments --constructor--> SomeCustomTask

The original approach where the flow always passes through TaskSpec had some issues since TaskSpec only accepts string arguments (and two
other reference classes). This made it harder to handle custom types of arguments like PipelineParam or Channel.

Low-level refactoring changes:

  • Resolving of command-line argument placeholders has been extracted into a function usable by different task constructors.

  • Changed _components._created_task_transformation_handler to _components._container_task_constructor. Previously, the handler was receiving a TaskSpec instance. Now it receives ComponentSpec + arguments [+ ComponentReference].

  • Moved the ContainerOp construction handler setup to the kfp.dsl.Pipeline context class as planned.

  • Extracted TaskSpec creation to _components._create_task_spec_from_component_and_arguments.

  • Refactored _dsl_bridge.create_container_op_from_task to _components._resolve_command_line_and_paths which returns _ResolvedCommandLineAndPaths.

  • Renamed _dsl_bridge._create_container_op_from_resolved_task to _dsl_bridge._create_container_op_from_component_and_arguments.

  • The signature of _components._resolve_graph_task was changed and it now returns _ResolvedGraphTask instead of modified TaskSpec.

Some of the component tests still expect ContainerOp and its attributes.
These tests will be changed later.


This change is Reviewable

This change is a pure refactoring of the implementation of component task creation.
For pipelines compiled using the DSL compiler (the compile() function or the command-line program) nothing should change.

The main goal of the refactoring is to change the way the component instantiation can be customized.
Previously, the flow was like this:

`ComponentSpec` + arguments --> `TaskSpec` --resolving+transform--> `ContainerOp`

This PR changes it to more direct path:

`ComponentSpec` + arguments --constructor--> `ContainerOp`
or
`ComponentSpec` + arguments --constructor--> `TaskSpec`
or
`ComponentSpec` + arguments --constructor--> `SomeCustomTask`

The original approach where the flow always passes through `TaskSpec` had some issues since TaskSpec only accepts string arguments (and two
other reference classes). This made it harder to handle custom types of arguments like PipelineParam or Channel.

Low-level refactoring changes:

Resolving of command-line argument placeholders has been extracted into a function usable by different task constructors.

Changed `_components._created_task_transformation_handler` to `_components._container_task_constructor`. Previously, the handler was receiving a `TaskSpec` instance. Now it receives `ComponentSpec` + arguments [+ `ComponentReference`].
Moved the `ContainerOp` construction handler setup to the `kfp.dsl.Pipeline` context class as planned.
Extracted `TaskSpec` creation to `_components._create_task_spec_from_component_and_arguments`.
Refactored `_dsl_bridge.create_container_op_from_task` to `_components._resolve_command_line_and_paths` which returns `_ResolvedCommandLineAndPaths`.
Renamed `_dsl_bridge._create_container_op_from_resolved_task` to `_dsl_bridge._create_container_op_from_component_and_arguments`.
The signature of `_components._resolve_graph_task` was changed and it now returns `_ResolvedGraphTask` instead of modified `TaskSpec`.

Some of the component tests still expect ContainerOp and its attributes.
These tests will be changed later.
@numerology
Copy link

Thanks for the very informative and detailed PR description.

Before jumping into the code review, I wish to ask some questions for my own learning purpose.

  1. After the change, what will be the recommended usage/best practice centered around TaskSpec? Previously it often serves as a IR layer IIUC. I'm wondering after the change, what will be consumer/downstream node of the following code path:

ComponentSpec + arguments --constructor--> TaskSpec

  1. What's the purpose of CustomTask? Can you please provide some examples where the target CUJ cannot achieved by TaskSpec?

Thanks!

I do not want to add any top-level kfp imports in this file to prevent circular references.
@Ark-kun
Copy link
Contributor Author

Ark-kun commented Jan 18, 2020

Great questions!

Sorry for the long text ahead >_<

TL/DR version:
The TaskSpec is alive and well.
I'm just restoring the system-independence of the Components library and laying the foundation for integrating with TFX.

Long version:

TaskSpec was supposed to play an important role in Pipelines SDK with multiple runners. It's a portable declarative structure that describes a task and can be used in graph components to represent a pipeline task.

Here are the original scenario flows:

  • ComponentSpec --+ arguments--> TaskSpec --+ more tasks--> Graph ComponentSpec (~= pipeline)
  • [Graph] ComponentSpec --> Argo Workflow (I've created a PoC compiler last year: )
  • [Graph] ComponentSpec --> Airflow pipeline
  • TaskSpec --> run on Argo
  • TaskSpec --> run on Airflow
  • TaskSpec --> run on local Docker with custom orchestrator
  • TaskSpec --> run on Kubernetes with custom orchestrator
  • TaskSpec --> run on CMLE with custom orchestrator
    ...

The component subsystem was kind of stand-alone. The user passes data to components creating tasks, the tasks form a graph component that can be published or converted to Argo/Airflow or submitted for execution on Argo/Docker/Kubernetes/etc. I wanted to add the backend entrypoint that would accept TaskSpec, then do compilation to Argo (or whatever orchestrator was used) and submit the workflow for execution.

These scenarios are still valid and I hope more of them see the light some day.

However at that time I had to integrate this portable component library with the existing KFP SDK.
The KFP SDK already had user-facing classes describing tasks and graphs - ContainerOp and OpsGroup.
The DSL compiler had the following flow:
ContainerOp --+ more tasks--> Argo workflow

Mind you, there was a great opposition to the very idea of the components and the ContainerOp and DSL compiler were mostly untouchable. I had to integrate components in the flow without changing anything in the system.
So, I surgically integrated the components into the SDK the following way:

  • ComponentSpec --+ arguments --> TaskSpec --> ContainerOp --+ more tasks--> Argo workflow

I had plans to simplify and rewrite the compiler on top of TaskSpec and ComponentSpec, but there was some FUD and the new compiler was never released.

TaskSpec is intended to use by graph components and compiler/orchestrators that are based on the TaskSpec.
Use of TaskSpec in systems that use different kind of task-like class (e.g. KFP's ContainerOp or TFX' BaseComponent) leads to some friction. Those systems also have different output reference classes - TaskSpec supports GraphInputReference and TaskOutputReference, while KFP uses PipelineParam and TFX uses Channel.

There was one problem with the TaskSpec --> ContainerOp transformation. Due to the direction of the flow, it only handled the output conversion well: TaskSpec with TaskOutputReference outputs was getting converted to ContainerOp with PipelineParam outputs. But how to handle the input arguments that are references? TaskSpec only accepts strings, GraphInputArgument or TaskOutputArgument as arguments. If you try to pass in PipelineParam or Channel it will fail the type check (any schema integrity is important for TaskSpec, because it's supposed to be serialized and shared). In case of PipelineParam I was able to do a workaround since you could convert it to string. But when type checking support was added to PipelineParam, this broke the components library DSL-independence, introducing a dependency on the KFP DSL. Although that issue can be solved and the portability can be restored, the same problem exists with TFX DSL and other DSLs that are not based on TaskSpec.

Think of some other framework like TFX:
The framework's task-like object (e.g. TFX' BaseComponent instance) has outputs with some output reference-like object (e.g. TFX' Channel instance), the framework's users expect being able to pass those objects to component task constructors. The invariant is
that the following should work:
component2(input1=component2().outputs[''output1])

So, if component2 is TFX-native, but component1 is loaded by load_component, then component1(...) should be able to accept the Channel arguments that are passed to it (and output its own Channels). Those Channel arguments cannot be put inside TaskSpec (since TaskSpec does not know about them), but those Channel arguments still need to somehow be passed to the task transformation code. So, Channels need to somehow go around TaskSpec (or go through it in string form, but this is not always possible.)

                                 /--Channels----->\
ComponentSpec ----+ arguments---/---> TaskSpec --->\-tfx_bridge---> tfx.BaseComponent

There is also an issue that we might want to perform type check - the type of Channel argument vs. the type of ComponentSpec's input. This code will be specific to Channel and need to hooked somehow.

I thought a lot about this issue and I contemplated different variants (e.g. installing custom system-specific argument converters that check the type and convert Channel to string, so that it can later be deserialized).
Ultimately I decided that those options are too complicated and over-engineered.
In the ComponentSpec --...--> tfx.BaseComponent flow, we're not using any of the capabilities of TaskSpec. We're not putting it into a graph inside a graph component. It's just an intermediate object that became a bottleneck.
So, I've decided to skip the creation of TaskSpec when the TaskSpec is not the final result of the component instantiation:

Pure environment (or when creating graph components, etc):
ComponentSpec --+ arguments-->-- _components code--> TaskSpec

KFP system:
ComponentSpec --+ arguments-->-- _components code-->--dsl_bridge--> kfp.ContainerOp

TFX system:
ComponentSpec --+ arguments-->-- _components code-->--tfx_bridge--> tfx.BaseComponent

So, the TaskSpec is alive and well and I hope there would be more systems using it in future.
I'm just restoring the system-independence of the Components library and laying the foundation for integrating with TFX.

@Ark-kun
Copy link
Contributor Author

Ark-kun commented Jan 18, 2020

/retest

2 similar comments
@Ark-kun
Copy link
Contributor Author

Ark-kun commented Jan 18, 2020

/retest

@Ark-kun
Copy link
Contributor Author

Ark-kun commented Jan 21, 2020

/retest

@Ark-kun
Copy link
Contributor Author

Ark-kun commented Jan 21, 2020

@numerology The tests are now green.

@numerology
Copy link

@numerology The tests are now green.

Thanks. Will take a look later today.

@Ark-kun
Copy link
Contributor Author

Ark-kun commented Jan 21, 2020

Thanks. Will take a look later today.

Thanks. I can walk you through the code changes if you want, so that it's easier to review (big chunk of the changes is just moving one big function to another file and extracting one function).

@numerology
Copy link

/lgtm

@Ark-kun
Copy link
Contributor Author

Ark-kun commented Jan 24, 2020

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ark-kun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Ark-kun
Copy link
Contributor Author

Ark-kun commented Jan 25, 2020

/retest

1 similar comment
@Ark-kun
Copy link
Contributor Author

Ark-kun commented Jan 25, 2020

/retest

@numerology
Copy link

/retest

@k8s-ci-robot k8s-ci-robot merged commit 2d9f252 into kubeflow:master Jan 25, 2020
Jeffwan pushed a commit to Jeffwan/pipelines that referenced this pull request Dec 9, 2020
* SDK - Components refactoring

This change is a pure refactoring of the implementation of component task creation.
For pipelines compiled using the DSL compiler (the compile() function or the command-line program) nothing should change.

The main goal of the refactoring is to change the way the component instantiation can be customized.
Previously, the flow was like this:

`ComponentSpec` + arguments --> `TaskSpec` --resolving+transform--> `ContainerOp`

This PR changes it to more direct path:

`ComponentSpec` + arguments --constructor--> `ContainerOp`
or
`ComponentSpec` + arguments --constructor--> `TaskSpec`
or
`ComponentSpec` + arguments --constructor--> `SomeCustomTask`

The original approach where the flow always passes through `TaskSpec` had some issues since TaskSpec only accepts string arguments (and two
other reference classes). This made it harder to handle custom types of arguments like PipelineParam or Channel.

Low-level refactoring changes:

Resolving of command-line argument placeholders has been extracted into a function usable by different task constructors.

Changed `_components._created_task_transformation_handler` to `_components._container_task_constructor`. Previously, the handler was receiving a `TaskSpec` instance. Now it receives `ComponentSpec` + arguments [+ `ComponentReference`].
Moved the `ContainerOp` construction handler setup to the `kfp.dsl.Pipeline` context class as planned.
Extracted `TaskSpec` creation to `_components._create_task_spec_from_component_and_arguments`.
Refactored `_dsl_bridge.create_container_op_from_task` to `_components._resolve_command_line_and_paths` which returns `_ResolvedCommandLineAndPaths`.
Renamed `_dsl_bridge._create_container_op_from_resolved_task` to `_dsl_bridge._create_container_op_from_component_and_arguments`.
The signature of `_components._resolve_graph_task` was changed and it now returns `_ResolvedGraphTask` instead of modified `TaskSpec`.

Some of the component tests still expect ContainerOp and its attributes.
These tests will be changed later.

* Adapted the _python_op tests

* Fixed linter failure

I do not want to add any top-level kfp imports in this file to prevent circular references.

* Added docstrings

* FIxed the return type forward reference
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants