ml6team · mrchtr · Jan 26, 2024 · Jan 25, 2024 · Jan 25, 2024 · Jan 25, 2024
diff --git a/docs/components/custom_containerized_component.md b/docs/components/custom_containerized_component.md
@@ -1,10 +1,17 @@
 # Creating custom containerized components
 
 Fondant makes it easy to build data preparation pipelines leveraging reusable components. Fondant
-provides a lot of [components out of the box](https://github.com/ml6team/fondant/tree/main/components)
-, but you can also define your own custom containerized components.
+provides a lot
+of [components out of the box](https://fondant.ai/en/latest/components/hub/), but you can also
+define your own custom containerized components.
 
-To make sure co2ntainerized components are reusable, they should implement a single logical data processing
+Containerized components are useful when you want to share the components within your organization
+or community.
+If you don't need your component to be shareable, we recommend starting 
+with a simpler [Python components](../components/custom_python_component.md) instead.
+
+To make sure containerized components are reusable, they should implement a single logical data
+processing
 step (like captioning images or removing Personal Identifiable Information [PII] from text.)
 If a component grows too large, consider splitting it into multiple separate components each
 tackling one logical part.

diff --git a/docs/guides/implement_custom_components.md b/docs/guides/implement_custom_components.md
@@ -4,185 +4,27 @@ This guide will teach you how to build custom components and integrate them in y
 
 ## Overview
 
-In the [previous tutorial](/build_a_simple_pipeline.md), you learned how to create your first Fondant pipeline. While the 
-example demonstrates how to build a pipeline from reusable components, this is only the beginning.
+In the [previous tutorial](/build_a_simple_pipeline.md), you learned how to create your first
+Fondant pipeline. While the example demonstrates how to build a pipeline from reusable components,
+this is only the beginning.
 
-In this tutorial, we will guide you through the process of implementing your very own custom 
-component. We will illustrate this by building a transform component that filters images based on 
-file type.
+Reusable components consume data in a specific format, defined in a data contract.
+Therefore, it is often necessary to implement custom components to connect the reusable components to your
+specific data. The easiest way to do this is to implement a **Lightweight Component**. 
 
-This pipeline is an extension of the one introduced in the previous tutorial. After loading the 
-dataset from HuggingFace, it filters out any non-PNG files before downloading them. Finally, we 
-write the images to a local directory.
+In this tutorial, we will guide you through the process of implementing your very own custom
+component. We will illustrate this by building a transform component that uppercases the `alt_text` of the image dataset.
 
-## Setting up the environment
+If you want to build a complex custom component or share the component within your organization or even the community, 
+take a look at how to build [reusable components](../components/custom_containerized_component.md).
 
-We will be using the [local runner](../runners/local.md) to run this pipelines. To set up your local environment,
-please refer to our [installation](installation.md) documentation.
+This pipeline is an extension of the one introduced in
+the [previous tutorial](../guides/build_a_simple_pipeline.md).
+Make sure you have completed the tutorial before diving into this one.
 
-## 1. Building a custom transform component
+In the last tutorial, we implemented this pipeline:
 
-The typical file structure of a custom component looks like this:
-
-```
-|- custom_component
- |- src
- | |- main.py
- |- Dockerfile
- |- fondant_component.yaml
- |- requirements.txt
-```
-
-It contains:
-
-- **`src/main.py`**: The actual Python code to run.
-- **`Dockerfile`**: The Dockerfile to package your component.
-- **`fondant_component.yaml`**: The component specification defining the contract for the component.
-- **`requirements.txt`**: Containing the Python requirements of your component.
-
-Schematically, it can be represented as follows:
-
-![component architecture](https://github.com/ml6team/fondant/blob/main/docs/art/guides/component.png?raw=true)
-
-You can find a more detailed explanation [here](../components/custom_containerized_component.md).
-
-### Creating the ComponentSpec
-
-We start by creating the contract of our component:
-
-```yaml title="fondant_component.yaml"
-name: Filter file type
-description: Component that filters on mime types
-image: <my-registry>/filter_image_type:<version>
-
-consumes:
- image_url:
- type: string
-
-args:
- mime_type:
- description: The mime type to filter on
- type: str
-```
-
-It begins by specifying the component name, a brief description, and component's Docker image.
-
-!!! note "IMPORTANT"
-
- Note that you'll need your own container registry to host the image for you custom component
-
-The `consumes` section describes which data the component will consume. In this case, it will 
-read a single `"image_url"` column.
-
-[//]: # (TODO: Use a transform instead of filter component here to keep it simple)
-
-Since the component only filters the data, it will not create any new data. Fondant handles your 
-data efficiently by keeping track of the index along your pipeline. Only this index will be 
-updated when filtering data, which means that we don't need to define a `produces` section in the 
-component specification.
-
-Finally, we define the arguments that the component will support. In this case, we only add a 
-single `mime_type` argument, which allows us to define which mime type should be filtered.
-
-### Implementing the component
-
-Now, it's time to implement the component logic. To do this, we'll create a `src/main.py` file.
-
-We will subclass the `PandasTransformComponent` offered by Fondant. This is the most basic type 
-of component. The following two methods should be implemented:
-
-- **`__init__()`**: This method will receive the arguments define in your component 
- specification. Fondant also inserts some additional keyword arguments for more advanced use 
- cases. Be sure to include a `**kwargs` argument if you're not using those.
-- **`transform()`**: This method receives a chunk of the input data as a Pandas `DataFrame`. 
- Fondant automatically chunks your data so you can process larger-than-memory data, and your 
- component is executed in parallel across the available cores.
-
-```python title="src/main.py"
-"""A component that filters images based on file type."""
-import mimetypes
-
-import pandas as pd
-from fondant.component import PandasTransformComponent
-
-
-class FileTypeFilter(PandasTransformComponent):
-
- def __init__(self, *, mime_type: str, **kwargs):
- """Custom component to filter on specific file type based on url
-
- Args:
- mime_type: The mime type to filter on (also defined in the component spec)
- """
- self.mime_type = mime_type
-
- @staticmethod
- def get_mime_type(url):
- """Guess mime type based on the file name"""
- mime_type, _ = mimetypes.guess_type(url)
- return mime_type
-
- def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
- """Reduce dataframe to specific mime type"""
- dataframe["mime_type"] = dataframe["url"].apply(self.get_mime_type)
- return dataframe[dataframe["mime_type"] == self.mime_type]
-```
-
-We return the filtered dataframe from the `transform` method, which Fondant will use to 
-automatically update the index. If we would have specified any output fields in our component 
-contract, Fondant would extract and write those as well.
-
-### Defining the requirements
-
-Our component uses two third-party dependencies: `pandas`, and `fondant`. `pandas` comes bundled 
-with `fondant` if you install it using the `component` extra though, so our `requirements.txt` will 
-look as follows:
-
-```text title="requirements.txt"
-fondant[component]
-```
-
-### Building the component
-
-To use the component, it should be packaged into a Docker image, for which we need to define a 
-Dockerfile.
-
-```bash title="Dockerfile"
-FROM --platform=linux/amd64 python:3.8-slim
-
-# Install requirements
-COPY requirements.txt /
-RUN pip3 install --no-cache-dir -r requirements.txt
-
-# Set the working directory to the component folder
-WORKDIR /component/src
-
-# Copy over src-files
-COPY src/ .
-
-ENTRYPOINT ["fondant", "execute", "main"]
-```
-
-The entrypoint should be the `fondant execute` command which will execute your component.
-
-### Using the component
-
-We will now update the pipeline we created in the [previous guide](/build_a_simple_pipeline.md) 
-to leverage our component.
-
-Our complete file structure looks as follows:
-```
-|- components
-| |- filter_image_type
-| |- src
-| | |- main.py
-| |- Dockerfile
-| |- fondant_component.yaml
-| |- requirements.txt
-|- pipeline.py
-```
-
-```python title="pipeline.py"
+```python
 from fondant.pipeline import Pipeline
 import pyarrow as pa
 
@@ -203,18 +45,10 @@ dataset = pipeline.read(
  "license_location": pa.string(),
  "license_type": pa.string(),
  "webpage_url": pa.string(),
- },
-)
-
-# Our custom component
-urls = dataset.apply(
- "components/filter_image_type",
- arguments={
- "mime_type": "image/png"
  }
 )
 
-images = urls.apply(
+images = dataset.apply(
  "download_images",
 )
 
@@ -229,21 +63,82 @@ english_images = images.apply(
 )
 ```
 
-Instead of providing the name of the component like we did with the reusable components, we now 
-provide the path to our custom component.
+We want to extend the pipeline and apply a simple text transformation to the `alt_text`. Let's
+consider that the `alt_text` is so important that the text has to be transformed into uppercase
+letters.
+
+## Implement your Lightweight component
+
+Now, it's time to implement the component logic.
+
+We will subclass the `PandasTransformComponent` offered by Fondant. This is the most basic type
+of component. The following method should be implemented:
+
+- **`transform()`**: This method receives a chunk of the input data as a Pandas `DataFrame`.
+ Fondant automatically chunks your data you can process larger-than-memory data, and your
+ component is executed in parallel across the available cores.
+
+```python
+"""A component that transform the alt text of the dataframe into uppercase."""
+import pandas as pd
+from fondant.component import PandasTransformComponent
+from fondant.pipeline import lightweight_component
+
+
+@lightweight_component
+class UpperCaseTextComponent(PandasTransformComponent):
+
+ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
+ """Transform the alt text into upper case."""
+ dataframe["alt_text"] = dataframe["alt_text"].apply(lambda x: x.upper())
+ return dataframe
+```
+
+!!! note "IMPORTANT"
+
+ Note that we have used a decorator `@lightweight_component`. This decorator is necessary to inform
+ Fondant that this class is a Python component and can be used as a component in your pipeline.
+
+We apply the uppercase transformation to the `alt_text` column of the dataframe. Afterward, we
+return the transformed dataframe from the `transform` method, which Fondant will use to
+automatically update the index.
+
+The Python components provide an easy way to start with your component implementation. However, the
+Python component implementation still allows you to define all advanced component configurations,
+including installing extra arguments or defining component arguments. These concepts are more
+advanced and not needed for quick exploration and experiments. You can find more information on
+these topics in
+the [documentation of the Python components](../components/custom_python_component.md).
+
+### Using the component
+
+Now were we have defined our Python component we can start using it in our pipeline.
+For instance we can put this component at the end of our pipeline.
+
+```python
+
+uppercase_alt_text = english_images.apply(
+ UpperCaseTextComponent
+)
+
+```
+
+Instead of providing the name of the component, as we did with the reusable components,
+we now provide the component implementation.
 
-Now, you can execute the pipeline once more and examine the results. The final output should 
-exclusively consist of PNG images.
+Now, you can execute the pipeline once more and examine the results. In the final output,
+the `alt_text` is in uppercase.
 
-We have designed the custom component to be easily adaptable. For example, if you wish to filter 
-out JPEG files, you can simply change the argument to `image/jpeg`, and your dataset will be 
-populated with JPEGs instead of PNGs
+Of course, it is debatable whether uppercasing the alt_text is genuinely useful. This is just a
+constructive and simple example to showcase how to use Python components as glue code within your
+pipeline, helping you connect reusable components to each other.
 
 ## Next steps
 
-We now have a pipeline that downloads a dataset from the HuggingFace hub, filters the urls by 
+We now have a pipeline that downloads a dataset from the HuggingFace hub, filters the urls by
 image type, downloads the images, and filters them by alt text language.
 
-One final step still remaining, is to write teh final dataset to its destination. You could for 
-instance use the [`write_to_hf_hub`](../components/hub.md#write_to_hugging_face_hub#description) component to write it to 
+One final step still remaining, is to write the final dataset to its destination. You could for
+instance use the [`write_to_hf_hub`](../components/hub.md#write_to_hugging_face_hub#description)
+component to write it to
 the HuggingFace Hub, or create a custom `WriteComponent`.