Support applying Lightweight Python components in Pipeline SDK #750

RobbeSneyders · 2024-01-02T15:31:21Z

We want to be able to apply Lightweight Python components as part of a pipeline just like we do with docker components.

from fondant.component import PandasTransformComponent
from fondant.pipeline import Pipeline

class MyComponent(PandasTransformComponent):

    def __init__(...):
        ...

    def transform(self, dataframe):
        ...

pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset.apply(
    MyComponent,
    consumes={},
    produces={},
    arguments={},
)

Lightweight Python components will require some additional arguments compared to docker components. I can think of two already:

image: docker image to run the code in.
dependencies: additional Python dependencies to install in the container before executing.

I see two options:

We extend the apply (and read and write) methods on the dataset / pipeline to support these arguments. Since they are not relevant for docker components, we might then want to split this into two separate apply methods for clarity.

We add a decorator on the Lightweight Python component which defines these options:

from fondant.pipeline import LightWeightComponent

@LightWeightComponent(image=..., dependencies=[...])
class MyComponent(PandasTransformComponent):
    ...

I think I would prefer the second option so we can keep a single apply interface.
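A minimal sketch of how the decorator option could work, assuming the decorator only attaches the extra options to the class and returns it unchanged. The names `lightweight_component` and `LightweightOptions`, the attribute `lightweight_options`, and the default image are hypothetical and not part of Fondant's API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Type


@dataclass
class LightweightOptions:
    """Hypothetical container for the options passed to the decorator."""

    image: str
    dependencies: List[str] = field(default_factory=list)


def lightweight_component(
    image: str = "python:3.10",  # assumed default, not Fondant's actual default
    dependencies: Optional[List[str]] = None,
) -> Callable[[Type], Type]:
    """Attach image and dependency options to a component class."""

    def wrapper(cls: Type) -> Type:
        # Store the options on the class so that apply() could read them later.
        cls.lightweight_options = LightweightOptions(
            image=image,
            dependencies=list(dependencies or []),
        )
        return cls

    return wrapper


@lightweight_component(image="python:3.10-slim", dependencies=["pandas"])
class MyComponent:
    def transform(self, dataframe):
        return dataframe
```

Because the class itself is returned, a single apply method could check for the attached options to tell a decorated class apart from a string or a path.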


GeorgesLorre · 2024-01-03T12:54:47Z

I also think a decorator will be the cleanest and clearest way of achieving this. This means that an .apply() can take a string (reusable component), path (local custom component) or a decorated class.

We may also need to think about the eager execution interface.

This is what I have now (calling .execute()):

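import dask.dataframe as dd
import pyarrow as pa
from fondant.component import DaskLoadComponent
from fondant.pipeline import Pipeline

pipeline = Pipeline(name="foo", description="bar", base_path="/foobar")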

class LoadFromParquett(DaskLoadComponent):
    def __init__(self, *_, **__):
        pass
    def load(self) -> dd.DataFrame:
        dask_df = dd.read_parquet("./foobar/sample.parquet", columns=["x", "y"])
        return dask_df

dataset1 = pipeline.execute(
    component=LoadFromParquett,
    produces={"x": pa.int32(), "y": pa.int32()},
)

But it might be nicer to keep .apply() and call .execute() on top of it:

dataset1 = pipeline.apply(
    LoadFromParquett,
    produces={"x": pa.int32(), "y": pa.int32()},
)

dataset1.execute(override_df=a_df)

That way we keep the pipeline definition as is and can do iterative development by executing individual components while building the pipeline.

RobbeSneyders · 2024-01-03T13:30:52Z

> I also think a decorator will be the cleanest and clearest way of achieving this. This means that an .apply() can take a string (reusable component), path (local custom component) or a decorated class.

+1

> but it might be nicer to keep .apply() and calling .execute() on top of it

Agree with keeping .apply(). I was thinking more towards an environment variable or argument to enable eager execution, but there are multiple options:

Environment variable: I like that the code is unchanged and can easily just be run non-eager as well.
Argument: I guess this argument should go on the pipeline. Downside is that the code is changed, but it is clear and explicit.
execute(): Similar to Dask, requires code changes. I guess we then also need to figure out which part of the graph needs to be executed (Dask executes from the start). What if a previous component was not executed?

One other thing is that if we want to support eager execution on different runners, we need to know the runner up front. In Apache Beam, for instance, you pass the runner to the pipeline when instantiating it.
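As an illustration of how the environment variable and argument options could coexist, here is a small sketch in which an explicit argument takes precedence and the environment variable acts as a fallback. The variable name `FONDANT_EAGER` and the helper `eager_enabled` are hypothetical.

```python
import os
from typing import Mapping, Optional

EAGER_ENV_VAR = "FONDANT_EAGER"  # hypothetical variable name


def eager_enabled(
    eager: Optional[bool] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Decide whether to execute eagerly.

    An explicit argument wins; otherwise fall back to the environment variable.
    """
    if eager is not None:
        return eager
    environ = os.environ if environ is None else environ
    return environ.get(EAGER_ENV_VAR, "").lower() in {"1", "true", "yes"}
```

With this shape, the pipeline code stays unchanged when the environment variable is used, while the argument remains available for cases where being explicit matters more.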

RobbeSneyders · 2024-01-08T10:50:28Z

I think this ticket can be limited to loading a Python component into a ComponentOp class.

If we look at the ComponentOp class, it requires two types of arguments to be instantiated:

A reference to the component
The keyword arguments from the apply function

Since we keep the same apply interface, we can just use the keyword arguments directly as well. What we need to implement is the reference to the component for Python components. For dockerized components, this reference is only used to get the component spec.

So this ticket needs to be able to translate a Python component into a component spec, which consists of the following necessary elements:

- A name
  We can get the name from the component class.
- A docker image
  If this is not a pre-built docker image, it actually consists of three parts:
  - The base image
    Provided by the user via the decorator as mentioned above, with a default provided by Fondant.
  - The dependencies
    Provided by the user via the decorator as mentioned above.
  - The script to execute
    The Python implementation of the component converted to a self-contained script (KfP example).
  We might want to introduce an abstraction in our code base which can contain either a pre-built image or these different parts. The runners then need to support executing the different implementations of this abstraction.
- Consumes / produces
  In the long term, we can try to infer this as much as possible (Validate consumes and infer produces for Lightweight Python components #752), but I think we can start with a default of additionalProperties: true and expect the user to override this with the consumes and produces keywords on the apply method.
- Arguments
  Inferring the arguments should be doable based on the signature of the component (Infer the arguments based on component __init__ arguments #751).
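The image abstraction and the default spec described above could be sketched roughly as follows. All names here (`PrebuiltImage`, `BuiltImage`, `run_steps`, `default_spec`) are hypothetical, the spec is a plain dictionary rather than Fondant's actual component spec class, and the steps are descriptive strings rather than real runner commands.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union


@dataclass
class PrebuiltImage:
    """A ready-to-run docker image, as used by dockerized components."""

    name: str


@dataclass
class BuiltImage:
    """The three parts of a Lightweight Python component image."""

    base_image: str
    script: str
    dependencies: List[str] = field(default_factory=list)


Image = Union[PrebuiltImage, BuiltImage]


def run_steps(image: Image) -> List[str]:
    """Describe what a runner would have to do for each kind of image."""
    if isinstance(image, PrebuiltImage):
        return [f"run {image.name}"]
    steps = [f"start from {image.base_image}"]
    if image.dependencies:
        steps.append("pip install " + " ".join(image.dependencies))
    steps.append("execute component script")
    return steps


def default_spec(component_cls: type, image: Image) -> Dict[str, Any]:
    """Build a permissive spec; consumes and produces accept anything by default."""
    return {
        "name": component_cls.__name__,
        "image": image,
        "consumes": {"additionalProperties": True},
        "produces": {"additionalProperties": True},
    }


class MyComponent:
    def transform(self, dataframe):
        return dataframe
```

Keeping both image variants behind one type means each runner only has to handle two cases, and the consumes and produces defaults can later be overridden by the keywords passed to apply.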

GeorgesLorre · 2024-01-08T12:59:23Z

Yes! This is very similar to what I wrote for the xmas demo:

https://github.com/RobbeSneyders/fondant-xmas/pull/3/files#diff-65a733e81e8bc30179d9f957e51a2c6e9e45bc2a4101d54d6d5072f98433a69aR703

Minus the docker image (my demo has eager execution)

Fixes #751.

This PR introduces functionality to infer the arguments from a `Component` class. The result is a dictionary with the argument names as keys and `Argument` instances as values, which is the format of [`component_spec.args`](https://github.com/ml6team/fondant/blob/8e828441eec8ff91074e5c8ccf16fe405b719594/src/fondant/core/component_spec.py#L193). We can leverage this behavior for Lightweight Python components as described in #750.

Did some TDD here, let me know if I missed any cases.
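A rough sketch of signature-based argument inference, using only the standard library's `inspect` module. The `Argument` dataclass below is a hypothetical stand-in rather than Fondant's own class, and skipping `*args` / `**kwargs` and recording unannotated parameters with type `None` are assumptions of this sketch.

```python
import inspect
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Argument:
    """Hypothetical stand-in for an argument entry in a component spec."""

    name: str
    type: Any
    default: Any = None
    optional: bool = False


def infer_arguments(component_cls: type) -> Dict[str, Argument]:
    """Map each __init__ parameter name to an Argument instance."""
    signature = inspect.signature(component_cls.__init__)
    arguments: Dict[str, Argument] = {}
    for name, param in signature.parameters.items():
        # Skip self and catch-all parameters; they carry no argument information.
        if name == "self" or param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue
        has_default = param.default is not inspect.Parameter.empty
        has_annotation = param.annotation is not inspect.Parameter.empty
        arguments[name] = Argument(
            name=name,
            type=param.annotation if has_annotation else None,
            default=param.default if has_default else None,
            optional=has_default,
        )
    return arguments


class MyComponent:
    def __init__(self, threshold: float, mode: str = "fast", **kwargs):
        self.threshold = threshold
        self.mode = mode
```

Calling `infer_arguments(MyComponent)` yields entries for `threshold` and `mode` only, with `mode` marked optional because it has a default.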

GeorgesLorre · 2024-01-10T15:56:48Z

RobbeSneyders mentioned this issue Jan 2, 2024

Lightweight Python components #558

Closed

RobbeSneyders assigned GeorgesLorre Jan 2, 2024

RobbeSneyders added the Core Core framework label Jan 2, 2024

RobbeSneyders mentioned this issue Jan 8, 2024

Add component argument inference #763

Merged

GeorgesLorre mentioned this issue Jan 10, 2024

Support applying Lightweight Python components in Pipeline SDK #770

Merged

RobbeSneyders linked a pull request Jan 16, 2024 that will close this issue

Support applying Lightweight Python components in Pipeline SDK #770

Merged

RobbeSneyders closed this as completed in #770 Jan 16, 2024
