Update getting started guide #816

Merged
merged 3 commits into from
Jan 26, 2024
24 changes: 17 additions & 7 deletions docs/components/custom_containerized_component.md
@@ -1,10 +1,17 @@
# Creating custom containerized components

Fondant makes it easy to build data preparation pipelines leveraging reusable components. Fondant
provides a lot of [components out of the box](https://github.com/ml6team/fondant/tree/main/components)
provides a lot
of [components out of the box](https://github.com/ml6team/fondant/tree/main/components)
, but you can also define your own custom containerized components.

To make sure containerized components are reusable, they should implement a single logical data processing
Containerized components are useful when you want to share the components within your organization
or community.
If you only need glue code to connect reusable components to your own data, we recommend
implementing a [Python component](../components/custom_python_component.md) instead.

To make sure containerized components are reusable, they should implement a single logical data
processing
step (like captioning images or removing Personally Identifiable Information [PII] from text).
If a component grows too large, consider splitting it into multiple separate components each
tackling one logical part.
@@ -18,15 +25,17 @@ To implement a custom containerized component, a couple of files need to be defi

## Fondant component specification

Each containerized Fondant component is defined by a specification which describes its interface. This
Each containerized Fondant component is defined by a specification which describes its interface.
This
specification is represented by a single `fondant_component.yaml` file. See the [component
specification page](../components/component_spec.md) for info on how to write the specification for
your component.

## Main.py script

The component script should be implemented in a `main.py` script in a folder called `src`.
Refer to the [main.py script](../components/components.md) section for more info on how to implement the
Refer to the [main.py script](../components/components.md) section for more info on how to implement
the
script.

Note that the `main.py` script can be split up into several Python scripts in case it would become
@@ -59,8 +68,8 @@ ENTRYPOINT ["fondant", "execute", "main"]
## Requirements.txt

A `requirements.txt` file lists the Python dependencies of the component. Note that any Fondant
component will always have `Fondant[component]` as the minimum requirement. It's important to also
pin the version of each dependency to make sure the component remains working as expected. Below is
component will always have `Fondant[component]` as the minimum requirement. It's important to also
pin the version of each dependency to make sure the component remains working as expected. Below is
an example of a component that relies on several Python libraries such as Pillow, PyTorch and
Transformers.

@@ -71,7 +80,8 @@ torch==2.0.1
transformers==4.29.2
```
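
The pinning advice above can be checked mechanically. The following is a minimal sketch, not part of Fondant: the helper name `unpinned_requirements` is hypothetical, and it assumes a simple `requirements.txt` where every pinned line uses `==` and `#` only starts a comment.

```python
from typing import List


def unpinned_requirements(requirements_text: str) -> List[str]:
    """Return the requirement lines that do not pin an exact version with `==`."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace (simplification:
        # a `#` inside a URL would also be treated as a comment start).
        line = raw_line.split("#", 1)[0].strip()
        if line and "==" not in line:
            unpinned.append(line)
    return unpinned


# Example input built from the entries visible in this guide.
example = """\
fondant[component]
torch==2.0.1
transformers==4.29.2
"""
result = unpinned_requirements(example)
print(result)
```

A check like this could run in CI before building the component image, so that an unpinned dependency is caught before it changes the component's behavior.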

Refer to this [section](publishing_components.md) to find out how to build and publish your components to use them in
Refer to this [section](publishing_components.md) to find out how to build and publish your
components to use them in
your own pipelines.


282 changes: 89 additions & 193 deletions docs/guides/implement_custom_components.md
@@ -4,185 +4,28 @@ This guide will teach you how to build custom components and integrate them in y

## Overview

In the [previous tutorial](/build_a_simple_pipeline.md), you learned how to create your first Fondant pipeline. While the
example demonstrates how to build a pipeline from reusable components, this is only the beginning.
In the [previous tutorial](/build_a_simple_pipeline.md), you learned how to create your first
Fondant pipeline. While the example demonstrates how to build a pipeline from reusable components,
this is only the beginning.

In this tutorial, we will guide you through the process of implementing your very own custom
component. We will illustrate this by building a transform component that filters images based on
file type.

This pipeline is an extension of the one introduced in the previous tutorial. After loading the
dataset from HuggingFace, it filters out any non-PNG files before downloading them. Finally, we
write the images to a local directory.

## Setting up the environment

We will be using the [local runner](../runners/local.md) to run this pipeline. To set up your local environment,
please refer to our [installation](installation.md) documentation.

## 1. Building a custom transform component

The typical file structure of a custom component looks like this:

```
|- custom_component
   |- src
   |  |- main.py
   |- Dockerfile
   |- fondant_component.yaml
   |- requirements.txt
```

It contains:

- **`src/main.py`**: The actual Python code to run.
- **`Dockerfile`**: The Dockerfile to package your component.
- **`fondant_component.yaml`**: The component specification defining the contract for the component.
- **`requirements.txt`**: Containing the Python requirements of your component.

Schematically, it can be represented as follows:

![component architecture](https://github.com/ml6team/fondant/blob/main/docs/art/guides/component.png?raw=true)

You can find a more detailed explanation [here](../components/custom_containerized_component.md).

### Creating the ComponentSpec

We start by creating the contract of our component:

```yaml title="fondant_component.yaml"
name: Filter file type
description: Component that filters on mime types
image: <my-registry>/filter_image_type:<version>

consumes:
  image_url:
    type: string

args:
  mime_type:
    description: The mime type to filter on
    type: str
```

It begins by specifying the component name, a brief description, and the component's Docker image.

!!! note "IMPORTANT"

    Note that you'll need your own container registry to host the image for your custom component.

The `consumes` section describes which data the component will consume. In this case, it will
read a single `"image_url"` column.

[//]: # (TODO: Use a transform instead of filter component here to keep it simple)

Since the component only filters the data, it will not create any new data. Fondant handles your
data efficiently by keeping track of the index along your pipeline. Only this index will be
updated when filtering data, which means that we don't need to define a `produces` section in the
component specification.

Finally, we define the arguments that the component will support. In this case, we only add a
single `mime_type` argument, which allows us to define which mime type should be filtered.

### Implementing the component

Now, it's time to implement the component logic. To do this, we'll create a `src/main.py` file.

We will subclass the `PandasTransformComponent` offered by Fondant. This is the most basic type
of component. The following two methods should be implemented:

- **`__init__()`**: This method will receive the arguments defined in your component
specification. Fondant also inserts some additional keyword arguments for more advanced use
cases. Be sure to include a `**kwargs` argument if you're not using those.
- **`transform()`**: This method receives a chunk of the input data as a Pandas `DataFrame`.
Fondant automatically chunks your data so you can process larger-than-memory data, and your
component is executed in parallel across the available cores.

```python title="src/main.py"
"""A component that filters images based on file type."""
import mimetypes

import pandas as pd
from fondant.component import PandasTransformComponent


class FileTypeFilter(PandasTransformComponent):

    def __init__(self, *, mime_type: str, **kwargs):
        """Custom component to filter on specific file type based on url

        Args:
            mime_type: The mime type to filter on (also defined in the component spec)
        """
        self.mime_type = mime_type

    @staticmethod
    def get_mime_type(url):
        """Guess mime type based on the file name"""
        mime_type, _ = mimetypes.guess_type(url)
        return mime_type

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        """Reduce dataframe to specific mime type"""
        # The column name matches the `image_url` field in the `consumes` section of the spec
        dataframe["mime_type"] = dataframe["image_url"].apply(self.get_mime_type)
        return dataframe[dataframe["mime_type"] == self.mime_type]
```
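
The filtering logic itself can be tried outside Fondant. The sketch below uses only the Python standard library; the helper `filter_urls` and the sample URLs are hypothetical and serve only to show how `mimetypes.guess_type` decides based on the file extension.

```python
import mimetypes
from typing import List, Optional


def guess_mime_type(url: str) -> Optional[str]:
    """Guess the MIME type from the file extension in the URL (None if unknown)."""
    mime_type, _ = mimetypes.guess_type(url)
    return mime_type


def filter_urls(urls: List[str], mime_type: str) -> List[str]:
    """Keep only the URLs whose guessed MIME type equals `mime_type`."""
    return [url for url in urls if guess_mime_type(url) == mime_type]


# Hypothetical sample URLs, for illustration only.
sample_urls = [
    "https://example.com/cat.png",
    "https://example.com/dog.jpg",
    "https://example.com/page",
]
png_urls = filter_urls(sample_urls, "image/png")
print(png_urls)
```

Note that the guess relies on the extension alone, so a URL without an extension is dropped by the filter even if it points to a PNG file.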

We return the filtered dataframe from the `transform` method, which Fondant will use to
automatically update the index. If we had specified any output fields in our component
contract, Fondant would extract and write those as well.

### Defining the requirements
Reusable components consume data in a specific format, defined in a data contract.
Therefore, it is often necessary to implement glue code to connect the reusable components to your
specific data. The easiest way to do this is to implement a **Python component**.

Our component uses two third-party dependencies: `pandas`, and `fondant`. `pandas` comes bundled
with `fondant` if you install it using the `component` extra though, so our `requirements.txt` will
look as follows:

```text title="requirements.txt"
fondant[component]
```

### Building the component

To use the component, it should be packaged into a Docker image, for which we need to define a
Dockerfile.

```bash title="Dockerfile"
FROM --platform=linux/amd64 python:3.8-slim

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Set the working directory to the component folder
WORKDIR /component/src

# Copy over src-files
COPY src/ .

ENTRYPOINT ["fondant", "execute", "main"]
```
In this tutorial, we will guide you through the process of implementing your very own custom
component. We will illustrate this by building a transform component that filters images based on
file type.

The entrypoint should be the `fondant execute` command which will execute your component.
If you want to share the component within your organization or even the community, take a look at
how to build [reusable components](../components/custom_containerized_component.md).

### Using the component
This pipeline is an extension of the one introduced in
the [previous tutorial](../guides/build_a_simple_pipeline.md).
Make sure you have completed the tutorial before diving into this one.

We will now update the pipeline we created in the [previous guide](/build_a_simple_pipeline.md)
to leverage our component.
In the last tutorial, we implemented this pipeline:

Our complete file structure looks as follows:
```
|- components
|  |- filter_image_type
|     |- src
|     |  |- main.py
|     |- Dockerfile
|     |- fondant_component.yaml
|     |- requirements.txt
|- pipeline.py
```

```python title="pipeline.py"
```python
from fondant.pipeline import Pipeline
import pyarrow as pa

@@ -199,22 +42,14 @@ dataset = pipeline.read(
    },
    produces={
        "alt_text": pa.string(),
        "image_url": pa.string(),
        "url": pa.string(),
        "license_location": pa.string(),
        "license_type": pa.string(),
        "webpage_url": pa.string(),
    },
)

# Our custom component
urls = dataset.apply(
    "components/filter_image_type",
    arguments={
        "mime_type": "image/png"
    }
)

images = urls.apply(
images = dataset.apply(
    "download_images",
)

@@ -229,21 +64,82 @@ english_images = images.apply(
)
```

Instead of providing the name of the component like we did with the reusable components, we now
provide the path to our custom component.
We want to extend the pipeline and apply a simple text transformation to the `alt_text`. Let's
assume that the `alt_text` is so important that it has to be transformed into uppercase
letters.

## Implement your Python component

Now, it's time to implement the component logic.

We will subclass the `PandasTransformComponent` offered by Fondant. This is the most basic type
of component. The following method should be implemented:

- **`transform()`**: This method receives a chunk of the input data as a Pandas `DataFrame`.
Fondant automatically chunks your data so you can process larger-than-memory data, and your
component is executed in parallel across the available cores.

```python
"""A component that transforms the alt text of the dataframe into uppercase."""
import pandas as pd
from fondant.component import PandasTransformComponent
from fondant.pipeline import lightweight_component


@lightweight_component
class UpperCaseTextComponent(PandasTransformComponent):

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        """Transform the alt text into upper case."""
        dataframe["alt_text"] = dataframe["alt_text"].apply(lambda x: x.upper())
        return dataframe
```

!!! note "IMPORTANT"

    Note that we have used a decorator `@lightweight_component`. This decorator is necessary to inform
    Fondant that this class is a Python component and can be used as a component in your pipeline.

We apply the uppercase transformation to the `alt_text` column of the dataframe. Afterward, we
return the transformed dataframe from the `transform` method, which Fondant will use to
automatically update the index.
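
The transformation can be checked on a small dataframe without running a pipeline. The sketch below is plain pandas, not Fondant code: the function name `uppercase_alt_text` and the sample rows are hypothetical. It also assumes that some rows may have no alt text, in which case `lambda x: x.upper()` would raise an `AttributeError`, whereas the vectorized `.str.upper()` leaves missing values untouched.

```python
import pandas as pd


def uppercase_alt_text(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the `alt_text` column uppercased; missing values stay missing."""
    result = dataframe.copy()
    # `.str.upper()` skips missing entries instead of raising on them.
    result["alt_text"] = result["alt_text"].str.upper()
    return result


# Hypothetical chunk of input data, including one row without alt text.
chunk = pd.DataFrame({"alt_text": ["a red bicycle", None, "Two dogs"]})
transformed = uppercase_alt_text(chunk)
print(transformed)
```

Working on a copy keeps the input chunk unchanged, which makes the function easier to test in isolation. Whether the extra copy is worthwhile inside a real component depends on the size of the chunks.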

The Python components provide an easy way to start with your component implementation. However, the
Python component implementation still allows you to define all advanced component configurations,
including installing extra requirements or defining component arguments. These concepts are more
advanced and not needed for quick exploration and experiments. You can find more information on
these topics in
the [documentation of the Python components](../components/custom_python_component.md).

### Using the component

Now that we have defined our Python component, we can start using it in our pipeline.
For instance, we can put this component at the end of our pipeline.

```python

uppercase_alt_text = english_images.apply(
    UpperCaseTextComponent
)

```

Instead of providing the name of the component, as we did with the reusable components,
we now provide the component implementation.

Now, you can execute the pipeline once more and examine the results. The final output should
exclusively consist of PNG images.
Now, you can execute the pipeline once more and examine the results. In the final output,
the `alt_text` is in uppercase.

We have designed the custom component to be easily adaptable. For example, if you wish to filter
out JPEG files, you can simply change the argument to `image/jpeg`, and your dataset will be
populated with JPEGs instead of PNGs.
Of course, it is debatable whether uppercasing the `alt_text` is genuinely useful. This is just a
simple, constructed example to showcase how to use Python components as glue code within your
pipeline, helping you connect reusable components to each other.

## Next steps

We now have a pipeline that downloads a dataset from the HuggingFace hub, filters the urls by
We now have a pipeline that downloads a dataset from the HuggingFace hub, filters the urls by
image type, downloads the images, and filters them by alt text language.

One final step still remaining, is to write the final dataset to its destination. You could for
instance use the [`write_to_hf_hub`](../components/hub.md#write_to_hugging_face_hub#description) component to write it to
One final step still remaining, is to write the final dataset to its destination. You could for
instance use the [`write_to_hf_hub`](../components/hub.md#write_to_hugging_face_hub#description)
component to write it to
the HuggingFace Hub, or create a custom `WriteComponent`.