ml6team · mrchtr · Jan 26, 2024 · Jan 25, 2024 · Jan 25, 2024 · Jan 25, 2024
diff --git a/docs/components/custom_containerized_component.md b/docs/components/custom_containerized_component.md
@@ -1,10 +1,17 @@
 # Creating custom containerized components
 
 Fondant makes it easy to build data preparation pipelines leveraging reusable components. Fondant
-provides a lot of [components out of the box](https://github.com/ml6team/fondant/tree/main/components)
+provides a lot
+of [components out of the box](https://github.com/ml6team/fondant/tree/main/components)
 , but you can also define your own custom containerized components.
 
-To make sure co2ntainerized components are reusable, they should implement a single logical data processing
+Containerized components are useful when you want to share the components within your organization
+or community.
+If you need to implement the missing code to utilize reusable components, we recommend the
+implementation of [Python components](../components/custom_python_component.md).
+
+To make sure containerized components are reusable, they should implement a single logical data
+processing
 step (like captioning images or removing Personal Identifiable Information [PII] from text.)
 If a component grows too large, consider splitting it into multiple separate components each
 tackling one logical part.
@@ -18,15 +25,17 @@ To implement a custom containerized component, a couple of files need to be defi
 
 ## Fondant component specification
 
-Each containerized Fondant component is defined by a specification which describes its interface. This
+Each containerized Fondant component is defined by a specification which describes its interface.
+This
 specification is represented by a single `fondant_component.yaml` file. See the [component
 specification page](../components/component_spec.md) for info on how to write the specification for
 your component.
 
 ## Main.py script
 
 The component script should be implemented in a `main.py` script in a folder called `src`.
-Refer to the [main.py script](../components/components.md) section for more info on how to implement the
+Refer to the [main.py script](../components/components.md) section for more info on how to implement
+the
 script.
 
 Note that the `main.py` script can be split up into several Python scripts in case it would become
@@ -59,8 +68,8 @@ ENTRYPOINT ["fondant", "execute", "main"]
 ## Requirements.txt
 
 A `requirements.txt` file lists the Python dependencies of the component. Note that any Fondant
-component will always have `Fondant[component]` as the minimum requirement. It's important to also 
-pin the version of each dependency to make sure the component remains working as expected. Below is 
+component will always have `Fondant[component]` as the minimum requirement. It's important to also
+pin the version of each dependency to make sure the component remains working as expected. Below is
 an example of a component that relies on several Python libraries such as Pillow, PyTorch and
 Transformers.
 
@@ -71,7 +80,8 @@ torch==2.0.1
 transformers==4.29.2
 ```
 
-Refer to this [section](publishing_components.md) to find out how to build and publish your components to use them in 
+Refer to this [section](publishing_components.md) to find out how to build and publish your
+components to use them in
 your own pipelines.
 
 
diff --git a/docs/guides/implement_custom_components.md b/docs/guides/implement_custom_components.md
@@ -4,185 +4,28 @@ This guide will teach you how to build custom components and integrate them in y
 
 ## Overview
 
-In the [previous tutorial](/build_a_simple_pipeline.md), you learned how to create your first Fondant pipeline. While the 
-example demonstrates how to build a pipeline from reusable components, this is only the beginning.
+In the [previous tutorial](/build_a_simple_pipeline.md), you learned how to create your first
+Fondant pipeline. While the example demonstrates how to build a pipeline from reusable components,
+this is only the beginning.
 
-In this tutorial, we will guide you through the process of implementing your very own custom 
-component. We will illustrate this by building a transform component that filters images based on 
-file type.
-
-This pipeline is an extension of the one introduced in the previous tutorial. After loading the 
-dataset from HuggingFace, it filters out any non-PNG files before downloading them. Finally, we 
-write the images to a local directory.
-
-## Setting up the environment
-
-We will be using the [local runner](../runners/local.md) to run this pipelines. To set up your local environment,
-please refer to our [installation](installation.md) documentation.
-
-## 1. Building a custom transform component
-
-The typical file structure of a custom component looks like this:
-
-```
-|- custom_component
- |- src
- | |- main.py
- |- Dockerfile
- |- fondant_component.yaml
- |- requirements.txt
-```
-
-It contains:
-
-- **`src/main.py`**: The actual Python code to run.
-- **`Dockerfile`**: The Dockerfile to package your component.
-- **`fondant_component.yaml`**: The component specification defining the contract for the component.
-- **`requirements.txt`**: Containing the Python requirements of your component.
-
-Schematically, it can be represented as follows:
-
-![component architecture](https://github.com/ml6team/fondant/blob/main/docs/art/guides/component.png?raw=true)
-
-You can find a more detailed explanation [here](../components/custom_containerized_component.md).
-
-### Creating the ComponentSpec
-
-We start by creating the contract of our component:
-
-```yaml title="fondant_component.yaml"
-name: Filter file type
-description: Component that filters on mime types
-image: <my-registry>/filter_image_type:<version>
-
-consumes:
- image_url:
- type: string
-
-args:
- mime_type:
- description: The mime type to filter on
- type: str
-```
-
-It begins by specifying the component name, a brief description, and component's Docker image.
-
-!!! note "IMPORTANT"
-
- Note that you'll need your own container registry to host the image for you custom component
-
-The `consumes` section describes which data the component will consume. In this case, it will 
-read a single `"image_url"` column.
-
-[//]: # (TODO: Use a transform instead of filter component here to keep it simple)
-
-Since the component only filters the data, it will not create any new data. Fondant handles your 
-data efficiently by keeping track of the index along your pipeline. Only this index will be 
-updated when filtering data, which means that we don't need to define a `produces` section in the 
-component specification.
-
-Finally, we define the arguments that the component will support. In this case, we only add a 
-single `mime_type` argument, which allows us to define which mime type should be filtered.
-
-### Implementing the component
-
-Now, it's time to implement the component logic. To do this, we'll create a `src/main.py` file.
-
-We will subclass the `PandasTransformComponent` offered by Fondant. This is the most basic type 
-of component. The following two methods should be implemented:
-
-- **`__init__()`**: This method will receive the arguments define in your component 
- specification. Fondant also inserts some additional keyword arguments for more advanced use 
- cases. Be sure to include a `**kwargs` argument if you're not using those.
-- **`transform()`**: This method receives a chunk of the input data as a Pandas `DataFrame`. 
- Fondant automatically chunks your data so you can process larger-than-memory data, and your 
- component is executed in parallel across the available cores.
-
-```python title="src/main.py"
-"""A component that filters images based on file type."""
-import mimetypes
-
-import pandas as pd
-from fondant.component import PandasTransformComponent
-
-
-class FileTypeFilter(PandasTransformComponent):
-
- def __init__(self, *, mime_type: str, **kwargs):
- """Custom component to filter on specific file type based on url
-
- Args:
- mime_type: The mime type to filter on (also defined in the component spec)
- """
- self.mime_type = mime_type
-
- @staticmethod
- def get_mime_type(url):
- """Guess mime type based on the file name"""
- mime_type, _ = mimetypes.guess_type(url)
- return mime_type
-
- def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
- """Reduce dataframe to specific mime type"""
- dataframe["mime_type"] = dataframe["url"].apply(self.get_mime_type)
- return dataframe[dataframe["mime_type"] == self.mime_type]
-```
-
-We return the filtered dataframe from the `transform` method, which Fondant will use to 
-automatically update the index. If we would have specified any output fields in our component 
-contract, Fondant would extract and write those as well.
-
-### Defining the requirements
+Reusable components consume data in a specific format, defined in a data contract.
+Therefore, it is often necessary to implement glue code to connect the reusable components to your
+specific data. The easiest way to do this is to implement a **Python component**.
 
-Our component uses two third-party dependencies: `pandas`, and `fondant`. `pandas` comes bundled 
-with `fondant` if you install it using the `component` extra though, so our `requirements.txt` will 
-look as follows:
-
-```text title="requirements.txt"
-fondant[component]
-```
-
-### Building the component
-
-To use the component, it should be packaged into a Docker image, for which we need to define a 
-Dockerfile.
-
-```bash title="Dockerfile"
-FROM --platform=linux/amd64 python:3.8-slim
-
-# Install requirements
-COPY requirements.txt /
-RUN pip3 install --no-cache-dir -r requirements.txt
-
-# Set the working directory to the component folder
-WORKDIR /component/src
-
-# Copy over src-files
-COPY src/ .
-
-ENTRYPOINT ["fondant", "execute", "main"]
-```
+In this tutorial, we will guide you through the process of implementing your very own custom
+component. We will illustrate this by building a transform component that filters images based on
+file type.
 
-The entrypoint should be the `fondant execute` command which will execute your component.
+If you want to share the component within your organization or even the community, take a look at
+how to build [reusable components](../components/custom_containerized_component.md).
 
-### Using the component
+This pipeline is an extension of the one introduced in
+the [previous tutorial](../guides/build_a_simple_pipeline.md).
+Make sure you have completed the tutorial before diving into this one.
 
-We will now update the pipeline we created in the [previous guide](/build_a_simple_pipeline.md) 
-to leverage our component.
+In the last tutorial, we implemented this pipeline:
 
-Our complete file structure looks as follows:
-```
-|- components
-| |- filter_image_type
-| |- src
-| | |- main.py
-| |- Dockerfile
-| |- fondant_component.yaml
-| |- requirements.txt
-|- pipeline.py
-```
-
-```python title="pipeline.py"
+```python
 from fondant.pipeline import Pipeline
 import pyarrow as pa
 
@@ -199,22 +42,14 @@ dataset = pipeline.read(
  },
  produces={
  "alt_text": pa.string(),
- "image_url": pa.string(),
+ "url": pa.string(),
  "license_location": pa.string(),
  "license_type": pa.string(),
  "webpage_url": pa.string(),
- },
-)
-
-# Our custom component
-urls = dataset.apply(
- "components/filter_image_type",
- arguments={
- "mime_type": "image/png"
  }
 )
 
-images = urls.apply(
+images = dataset.apply(
  "download_images",
 )
 
@@ -229,21 +64,82 @@ english_images = images.apply(
 )
 ```
 
-Instead of providing the name of the component like we did with the reusable components, we now 
-provide the path to our custom component.
+We want to extend the pipeline and apply a simple text transformation to the `alt_text`. Let's
+consider that the `alt_text` is so important that the text has to be transformed into uppercase
+letters.
+
+## Implement your Python component
+
+Now, it's time to implement the component logic.
+
+We will subclass the `PandasTransformComponent` offered by Fondant. This is the most basic type
+of component. The following method should be implemented:
+
+- **`transform()`**: This method receives a chunk of the input data as a Pandas `DataFrame`.
+ Fondant automatically chunks your data you can process larger-than-memory data, and your
+ component is executed in parallel across the available cores.
+
+```python
+"""A component that transform the alt text of the dataframe into uppercase."""
+import pandas as pd
+from fondant.component import PandasTransformComponent
+from fondant.pipeline import lightweight_component
+
+
+@lightweight_component
+class UpperCaseTextComponent(PandasTransformComponent):
+
+ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
+ """Transform the alt text into upper case."""
+ dataframe["alt_text"] = dataframe["alt_text"].apply(lambda x: x.upper())
+ return dataframe
+```
+
+!!! note "IMPORTANT"
+
+ Note that we have used a decorator `@lightweight_component`. This decorator is necessary to inform
+ Fondant that this class is a Python component and can be used as a component in your pipeline.
+
+We apply the uppercase transformation to the `alt_text` column of the dataframe. Afterward, we
+return the transformed dataframe from the `transform` method, which Fondant will use to
+automatically update the index.
+
+The Python components provide an easy way to start with your component implementation. However, the
+Python component implementation still allows you to define all advanced component configurations,
+including installing extra arguments or defining component arguments. These concepts are more
+advanced and not needed for quick exploration and experiments. You can find more information on
+these topics in
+the [documentation of the Python components](../components/custom_python_component.md).
+
+### Using the component
+
+Now were we have defined our Python component we can start using it in our pipeline.
+For instance we can put this component at the end of our pipeline.
+
+```python
+
+upercase_alt_text = english_images.apply(
+ UpperCaseTextComponent
+)
+
+```
+
+Instead of providing the name of the component, as we did with the reusable components,
+we now provide the component implementation.
 
-Now, you can execute the pipeline once more and examine the results. The final output should 
-exclusively consist of PNG images.
+Now, you can execute the pipeline once more and examine the results. In the final output,
+the `alt_text` is in uppercase.
 
-We have designed the custom component to be easily adaptable. For example, if you wish to filter 
-out JPEG files, you can simply change the argument to `image/jpeg`, and your dataset will be 
-populated with JPEGs instead of PNGs
+Of course, it is debatable whether uppercasing the alt_text is genuinely useful. This is just a
+constructive and simple example to showcase how to use Python components as glue code within your
+pipeline, helping you connect reusable components to each other.
 
 ## Next steps
 
-We now have a pipeline that downloads a dataset from the HuggingFace hub, filters the urls by 
+We now have a pipeline that downloads a dataset from the HuggingFace hub, filters the urls by
 image type, downloads the images, and filters them by alt text language.
 
-One final step still remaining, is to write teh final dataset to its destination. You could for 
-instance use the [`write_to_hf_hub`](../components/hub.md#write_to_hugging_face_hub#description) component to write it to 
+One final step still remaining, is to write the final dataset to its destination. You could for
+instance use the [`write_to_hf_hub`](../components/hub.md#write_to_hugging_face_hub#description)
+component to write it to
 the HuggingFace Hub, or create a custom `WriteComponent`.