Update getting started guide #816

Merged
merged 3 commits into from
Jan 26, 2024
13 changes: 10 additions & 3 deletions docs/components/custom_containerized_component.md
@@ -1,10 +1,17 @@
# Creating custom containerized components

Fondant makes it easy to build data preparation pipelines leveraging reusable components. Fondant
provides a lot of [components out of the box](https://github.com/ml6team/fondant/tree/main/components)
, but you can also define your own custom containerized components.
provides a lot
of [components out of the box](https://fondant.ai/en/latest/components/hub/), but you can also
define your own custom containerized components.

To make sure containerized components are reusable, they should implement a single logical data processing
Containerized components are useful when you want to share the components within your organization
or community.
If you don't need your component to be shareable, we recommend starting
with simpler [Python components](../components/custom_python_component.md) instead.

To make sure containerized components are reusable, they should implement a single logical data
processing
step (like captioning images or removing Personally Identifiable Information [PII] from text).
If a component grows too large, consider splitting it into multiple separate components each
tackling one logical part.
279 changes: 87 additions & 192 deletions docs/guides/implement_custom_components.md
@@ -4,185 +4,27 @@ This guide will teach you how to build custom components and integrate them in y

## Overview

In the [previous tutorial](/build_a_simple_pipeline.md), you learned how to create your first Fondant pipeline. While the
example demonstrates how to build a pipeline from reusable components, this is only the beginning.
In the [previous tutorial](/build_a_simple_pipeline.md), you learned how to create your first
Fondant pipeline. While the example demonstrates how to build a pipeline from reusable components,
this is only the beginning.

In this tutorial, we will guide you through the process of implementing your very own custom
component. We will illustrate this by building a transform component that filters images based on
file type.
Reusable components consume data in a specific format, defined in a data contract.
Therefore, it is often necessary to implement custom components to connect the reusable components to your
specific data. The easiest way to do this is to implement a **Lightweight Component**.

This pipeline is an extension of the one introduced in the previous tutorial. After loading the
dataset from HuggingFace, it filters out any non-PNG files before downloading them. Finally, we
write the images to a local directory.
In this tutorial, we will guide you through the process of implementing your very own custom
component. We will illustrate this by building a transform component that uppercases the `alt_text` of the image dataset.

## Setting up the environment
If you want to build a complex custom component or share the component within your organization or even the community,
take a look at how to build [reusable components](../components/custom_containerized_component.md).

We will be using the [local runner](../runners/local.md) to run this pipeline. To set up your local environment,
please refer to our [installation](installation.md) documentation.
This pipeline is an extension of the one introduced in
the [previous tutorial](../guides/build_a_simple_pipeline.md).
Make sure you have completed the tutorial before diving into this one.

## 1. Building a custom transform component
In the last tutorial, we implemented this pipeline:

The typical file structure of a custom component looks like this:

```
|- custom_component
   |- src
   |  |- main.py
   |- Dockerfile
   |- fondant_component.yaml
   |- requirements.txt
```

It contains:

- **`src/main.py`**: The actual Python code to run.
- **`Dockerfile`**: The Dockerfile to package your component.
- **`fondant_component.yaml`**: The component specification defining the contract for the component.
- **`requirements.txt`**: Containing the Python requirements of your component.

Schematically, it can be represented as follows:

![component architecture](https://github.com/ml6team/fondant/blob/main/docs/art/guides/component.png?raw=true)

You can find a more detailed explanation [here](../components/custom_containerized_component.md).

### Creating the ComponentSpec

We start by creating the contract of our component:

```yaml title="fondant_component.yaml"
name: Filter file type
description: Component that filters on mime types
image: <my-registry>/filter_image_type:<version>

consumes:
  image_url:
    type: string

args:
  mime_type:
    description: The mime type to filter on
    type: str
```

It begins by specifying the component name, a brief description, and component's Docker image.

!!! note "IMPORTANT"

    Note that you'll need your own container registry to host the image for your custom component.

The `consumes` section describes which data the component will consume. In this case, it will
read a single `"image_url"` column.

[//]: # (TODO: Use a transform instead of filter component here to keep it simple)

Since the component only filters the data, it will not create any new data. Fondant handles your
data efficiently by keeping track of the index along your pipeline. Only this index will be
updated when filtering data, which means that we don't need to define a `produces` section in the
component specification.

Finally, we define the arguments that the component will support. In this case, we only add a
single `mime_type` argument, which allows us to define which mime type should be filtered.

### Implementing the component

Now, it's time to implement the component logic. To do this, we'll create a `src/main.py` file.

We will subclass the `PandasTransformComponent` offered by Fondant. This is the most basic type
of component. The following two methods should be implemented:

- **`__init__()`**: This method will receive the arguments defined in your component
specification. Fondant also inserts some additional keyword arguments for more advanced use
cases. Be sure to include a `**kwargs` argument if you're not using those.
- **`transform()`**: This method receives a chunk of the input data as a Pandas `DataFrame`.
Fondant automatically chunks your data so you can process larger-than-memory data, and your
component is executed in parallel across the available cores.

```python title="src/main.py"
"""A component that filters images based on file type."""
import mimetypes

import pandas as pd
from fondant.component import PandasTransformComponent


class FileTypeFilter(PandasTransformComponent):

    def __init__(self, *, mime_type: str, **kwargs):
        """Custom component to filter on specific file type based on url

        Args:
            mime_type: The mime type to filter on (also defined in the component spec)
        """
        self.mime_type = mime_type

    @staticmethod
    def get_mime_type(url):
        """Guess mime type based on the file name"""
        mime_type, _ = mimetypes.guess_type(url)
        return mime_type

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        """Reduce dataframe to specific mime type"""
        # Read the `image_url` column declared under `consumes` in the component spec
        dataframe["mime_type"] = dataframe["image_url"].apply(self.get_mime_type)
        return dataframe[dataframe["mime_type"] == self.mime_type]
```

We return the filtered dataframe from the `transform` method, which Fondant will use to
automatically update the index. If we would have specified any output fields in our component
contract, Fondant would extract and write those as well.
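
The filtering logic itself does not depend on Fondant, so it can be tried in isolation. The following is a minimal sketch using only the standard library; the helper name `filter_by_mime_type` and the example URLs are assumptions made for illustration and are not part of the guide or of Fondant.

```python
import mimetypes
from typing import List, Optional


def guess_mime_type(url: str) -> Optional[str]:
    """Guess the mime type from the file extension in the URL."""
    mime_type, _ = mimetypes.guess_type(url)
    return mime_type


def filter_by_mime_type(urls: List[str], mime_type: str) -> List[str]:
    """Keep only the URLs whose guessed mime type matches `mime_type`."""
    return [url for url in urls if guess_mime_type(url) == mime_type]


# Hypothetical example URLs
urls = [
    "https://example.com/cat.png",
    "https://example.com/dog.jpg",
    "https://example.com/readme",  # no extension, so no mime type can be guessed
]
png_urls = filter_by_mime_type(urls, "image/png")
```

Because the guess is based on the file extension only, URLs without a recognizable extension are dropped by this filter regardless of what the server would actually return.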

### Defining the requirements

Our component uses two third-party dependencies: `pandas`, and `fondant`. `pandas` comes bundled
with `fondant` if you install it using the `component` extra though, so our `requirements.txt` will
look as follows:

```text title="requirements.txt"
fondant[component]
```

### Building the component

To use the component, it should be packaged into a Docker image, for which we need to define a
Dockerfile.

```bash title="Dockerfile"
FROM --platform=linux/amd64 python:3.8-slim

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Set the working directory to the component folder
WORKDIR /component/src

# Copy over src-files
COPY src/ .

ENTRYPOINT ["fondant", "execute", "main"]
```

The entrypoint should be the `fondant execute` command which will execute your component.

### Using the component

We will now update the pipeline we created in the [previous guide](/build_a_simple_pipeline.md)
to leverage our component.

Our complete file structure looks as follows:
```
|- components
|  |- filter_image_type
|     |- src
|     |  |- main.py
|     |- Dockerfile
|     |- fondant_component.yaml
|     |- requirements.txt
|- pipeline.py
```

```python title="pipeline.py"
```python
from fondant.pipeline import Pipeline
import pyarrow as pa

@@ -203,18 +45,10 @@ dataset = pipeline.read(
"license_location": pa.string(),
"license_type": pa.string(),
"webpage_url": pa.string(),
},
)

# Our custom component
urls = dataset.apply(
"components/filter_image_type",
arguments={
"mime_type": "image/png"
}
)

images = urls.apply(
images = dataset.apply(
"download_images",
)

@@ -229,21 +63,82 @@ english_images = images.apply(
)
```

Instead of providing the name of the component like we did with the reusable components, we now
provide the path to our custom component.
We want to extend the pipeline and apply a simple text transformation to the `alt_text`. Let's
consider that the `alt_text` is so important that the text has to be transformed into uppercase
letters.

## Implement your Lightweight component

Now, it's time to implement the component logic.

We will subclass the `PandasTransformComponent` offered by Fondant. This is the most basic type
of component. The following method should be implemented:

- **`transform()`**: This method receives a chunk of the input data as a Pandas `DataFrame`.
Fondant automatically chunks your data so you can process larger-than-memory data, and your
component is executed in parallel across the available cores.

```python
"""A component that transforms the alt text of the dataframe into uppercase."""
import pandas as pd
from fondant.component import PandasTransformComponent
from fondant.pipeline import lightweight_component


@lightweight_component
class UpperCaseTextComponent(PandasTransformComponent):

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        """Transform the alt text into upper case."""
        dataframe["alt_text"] = dataframe["alt_text"].apply(lambda x: x.upper())
        return dataframe
```

!!! note "IMPORTANT"

    Note that we have used a decorator `@lightweight_component`. This decorator is necessary to inform
    Fondant that this class is a Python component and can be used as a component in your pipeline.

We apply the uppercase transformation to the `alt_text` column of the dataframe. Afterward, we
return the transformed dataframe from the `transform` method, which Fondant will use to
automatically update the index.
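
The transformation can also be checked on a small dataframe outside of a pipeline. The sketch below assumes only `pandas`; the function name `uppercase_alt_text` and the sample rows are hypothetical. It uses the vectorized `.str.upper()` accessor instead of `apply`, a swapped-in variant that leaves missing values untouched rather than raising an error on them.

```python
import pandas as pd


def uppercase_alt_text(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the chunk with the `alt_text` column uppercased."""
    result = dataframe.copy()
    # `.str.upper()` skips missing values instead of failing on them
    result["alt_text"] = result["alt_text"].str.upper()
    return result


# Hypothetical chunk of input data, including a missing alt text
chunk = pd.DataFrame({"alt_text": ["a cat on a sofa", None, "Blue sky"]})
transformed = uppercase_alt_text(chunk)
```

Working on a copy keeps the input chunk unchanged, which makes the function easier to test; inside a component's `transform` method, modifying the dataframe in place as shown in the guide works as well.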

The Python components provide an easy way to start with your component implementation. However, the
Python component implementation still allows you to define all advanced component configurations,
including installing extra dependencies or defining component arguments. These concepts are more
advanced and not needed for quick exploration and experiments. You can find more information on
these topics in
the [documentation of the Python components](../components/custom_python_component.md).

### Using the component

Now that we have defined our Python component, we can start using it in our pipeline.
For instance, we can put this component at the end of our pipeline.

```python

uppercase_alt_text = english_images.apply(
UpperCaseTextComponent
)

```

Instead of providing the name of the component, as we did with the reusable components,
we now provide the component implementation.

Now, you can execute the pipeline once more and examine the results. The final output should
exclusively consist of PNG images.
Now, you can execute the pipeline once more and examine the results. In the final output,
the `alt_text` is in uppercase.

We have designed the custom component to be easily adaptable. For example, if you wish to filter
out JPEG files, you can simply change the argument to `image/jpeg`, and your dataset will be
populated with JPEGs instead of PNGs.
Of course, it is debatable whether uppercasing the `alt_text` is genuinely useful. This is just a
simple, contrived example to showcase how to use Python components as glue code within your
pipeline, helping you connect reusable components to each other.

## Next steps

We now have a pipeline that downloads a dataset from the HuggingFace hub, filters the urls by
We now have a pipeline that downloads a dataset from the HuggingFace hub, filters the urls by
image type, downloads the images, and filters them by alt text language.

One final step still remaining, is to write the final dataset to its destination. You could for
instance use the [`write_to_hf_hub`](../components/hub.md#write_to_hugging_face_hub#description) component to write it to
One final step still remaining, is to write the final dataset to its destination. You could for
instance use the [`write_to_hf_hub`](../components/hub.md#write_to_hugging_face_hub#description)
component to write it to
the HuggingFace Hub, or create a custom `WriteComponent`.