Generic read write component #214

Merged · 48 commits · Jun 21, 2023

Commits (48)
f7e5dcb
add writer component
PhilippeMoussalli Jun 9, 2023
654a5a6
Merge branch 'main' into write-component-class
PhilippeMoussalli Jun 12, 2023
17c5c7e
modify component spec schema to accept default arguments
PhilippeMoussalli Jun 12, 2023
a4e44d2
enable adding default arguments
PhilippeMoussalli Jun 12, 2023
6d2d1ab
test adding default arguments
PhilippeMoussalli Jun 12, 2023
af597f8
fix mypy issue
PhilippeMoussalli Jun 12, 2023
c90b53b
correct docs
PhilippeMoussalli Jun 12, 2023
a8e55d8
update component spec
PhilippeMoussalli Jun 12, 2023
00dfa55
update docs
PhilippeMoussalli Jun 12, 2023
0fa694f
add optional field to schema
PhilippeMoussalli Jun 13, 2023
c9f4429
enable defining default arguments
PhilippeMoussalli Jun 13, 2023
5ee3b31
add relevant tests
PhilippeMoussalli Jun 13, 2023
919c21f
Merge branch 'main' into enable-optional-component-arguments
PhilippeMoussalli Jun 13, 2023
66fdb4c
add test file
PhilippeMoussalli Jun 13, 2023
61adc40
bugfix string bool
PhilippeMoussalli Jun 14, 2023
88b4729
change method of defining optionals
PhilippeMoussalli Jun 14, 2023
64614f7
make component spec optional
PhilippeMoussalli Jun 14, 2023
218b8ce
Merge branch 'main' into enable-optional-component-arguments
PhilippeMoussalli Jun 15, 2023
7c7ace7
implement PR feedback
PhilippeMoussalli Jun 15, 2023
3d5b42a
make load component generic
PhilippeMoussalli Jun 15, 2023
7d88861
make load component generic
PhilippeMoussalli Jun 15, 2023
e2445d1
Merge branch 'main' into generic_read_write_component
PhilippeMoussalli Jun 15, 2023
8b5a749
Merge branch 'main' into generic_read_write_component
PhilippeMoussalli Jun 16, 2023
eaa2dfa
Add build for local components
GeorgesLorre Jun 14, 2023
86f89b4
Update starcoder example to use the docker compiler
GeorgesLorre Jun 16, 2023
d160e34
Fix build script and run pipeline locally
GeorgesLorre Jun 16, 2023
49d80a6
Update tests
GeorgesLorre Jun 19, 2023
402866f
Fix bad merge
GeorgesLorre Jun 19, 2023
4a00829
enable passing custom spec path for reusable components
PhilippeMoussalli Jun 19, 2023
3579cef
Add manifest per component
GeorgesLorre Jun 19, 2023
c56a9ed
implement changes to starcoder pipeline
PhilippeMoussalli Jun 19, 2023
7c6d7a4
implement changes to controlnet pipeline
PhilippeMoussalli Jun 19, 2023
c741116
implement change to SD pipeline
PhilippeMoussalli Jun 19, 2023
e21755e
add missing test file
PhilippeMoussalli Jun 19, 2023
c97b405
Merge branch 'main' into generic_read_write_component
PhilippeMoussalli Jun 19, 2023
a2c8717
Merge branch 'feature/local-starcoder' into generic_read_write_component
PhilippeMoussalli Jun 19, 2023
421ef3e
modify compiler to handle lists and dict
PhilippeMoussalli Jun 19, 2023
c21f0ea
add datasets to write to hub req
PhilippeMoussalli Jun 19, 2023
94ecd32
Test pipeline on local runner and add options to load few rows
PhilippeMoussalli Jun 19, 2023
e731744
Test pipeline on local runner and add options to load few rows
PhilippeMoussalli Jun 19, 2023
2120a5c
modify composable component template
PhilippeMoussalli Jun 19, 2023
c1ec93c
add image resolution extraction component
PhilippeMoussalli Jun 20, 2023
316eff3
implement PR feedback
PhilippeMoussalli Jun 20, 2023
a47f42c
add documentation on generic components
PhilippeMoussalli Jun 20, 2023
2184ef6
typo
PhilippeMoussalli Jun 20, 2023
177bed7
unpin fixed version
PhilippeMoussalli Jun 20, 2023
543d673
Merge branch 'main' into generic_read_write_component
PhilippeMoussalli Jun 20, 2023
4ec35b4
address PR feedback
PhilippeMoussalli Jun 20, 2023
2 changes: 1 addition & 1 deletion components/filter_comments/Dockerfile
@@ -9,7 +9,7 @@ RUN apt-get update && \
COPY requirements.txt ./
RUN pip3 install --no-cache-dir -r requirements.txt

# Set the working directory to the compoent folder
# Set the working directory to the component folder
WORKDIR /component/src

# Copy over src-files
2 changes: 1 addition & 1 deletion components/filter_line_length/Dockerfile
@@ -9,7 +9,7 @@ RUN apt-get update && \
COPY requirements.txt ./
RUN pip3 install --no-cache-dir -r requirements.txt

# Set the working directory to the compoent folder
# Set the working directory to the component folder
WORKDIR /component/src

# Copy over src-files
2 changes: 1 addition & 1 deletion components/image_cropping/Dockerfile
@@ -9,7 +9,7 @@ RUN apt-get update && \
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Set the working directory to the compoent folder
# Set the working directory to the component folder
WORKDIR /component/src

# Copy over src-files
19 changes: 19 additions & 0 deletions components/image_resolution_extraction/fondant_component.yaml
@@ -0,0 +1,19 @@
name: Image resolution extraction
description: Component that extracts image resolution data from the images
image: ghcr.io/ml6team/image_resolution_extraction:latest

consumes:
images:
fields:
data:
type: binary

produces:
images:
fields:
width:
type: int16
height:
type: int16
data:
type: binary
Comment on lines +13 to +19

Contributor: For components that only add columns, is there a need to specify existing columns in the produces section?

cc @RobbeSneyders

Member: In theory not, but I'm not sure if it works in practice already if you leave them out.

4 changes: 4 additions & 0 deletions components/image_resolution_extraction/requirements.txt
@@ -0,0 +1,4 @@
fondant
pyarrow>=7.0
gcsfs==2023.4.0
imagesize==1.4.1
52 changes: 52 additions & 0 deletions components/image_resolution_extraction/src/main.py
@@ -0,0 +1,52 @@
"""This component filters images of the dataset based on image size (minimum height and width)."""
Contributor: Docstring to be updated

import io
import logging
import typing as t

import imagesize
import numpy as np
import pandas as pd

from fondant.component import PandasTransformComponent
from fondant.logger import configure_logging

configure_logging()
logger = logging.getLogger(__name__)


def extract_dimensions(images: bytes) -> t.Tuple[np.int16, np.int16]:
"""Extract the width and height of an image.

Args:
images: input dataframe with images_data column

Returns:
np.int16: width of the image
np.int16: height of the image
"""
width, height = imagesize.get(io.BytesIO(images))

return np.int16(width), np.int16(height)


class ImageResolutionExtractionComponent(PandasTransformComponent):
"""Component that extracts image dimensions."""

def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
"""
Args:
dataframe: Dask dataframe
Returns:
dataset.
"""
logger.info("Filtering dataset...")

dataframe[[("images", "width"), ("images", "height")]] = \
dataframe[[("images", "data")]].map(extract_dimensions)

return dataframe


if __name__ == "__main__":
component = ImageResolutionExtractionComponent.from_args()
component.run()
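
A minimal usage sketch of the extraction logic above, run on a toy pandas frame laid out with the (subset, field) columns that Fondant's pandas components receive. The in-memory JPEG and the zip-based splitting of widths and heights are illustrative assumptions, not code from this PR.

```python
import io

import imagesize
import pandas as pd
from PIL import Image

# Build a small in-memory JPEG to stand in for real image bytes.
buffer = io.BytesIO()
Image.new("RGB", (640, 480)).save(buffer, format="JPEG")
jpeg_bytes = buffer.getvalue()

# Toy frame using the (subset, field) column layout of Fondant pandas components.
dataframe = pd.DataFrame({("images", "data"): [jpeg_bytes]})

# imagesize determines dimensions from the file header without decoding the full image.
sizes = dataframe[("images", "data")].map(lambda data: imagesize.get(io.BytesIO(data)))
dataframe[("images", "width")], dataframe[("images", "height")] = zip(*sizes)

print(dataframe[("images", "width")].iloc[0], dataframe[("images", "height")].iloc[0])  # 640 480
```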
2 changes: 1 addition & 1 deletion components/image_resolution_filtering/Dockerfile
@@ -9,7 +9,7 @@ RUN apt-get update && \
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Set the working directory to the compoent folder
# Set the working directory to the component folder
WORKDIR /component/src

# Copy over src-files
24 changes: 14 additions & 10 deletions components/load_from_hf_hub/fondant_component.yaml
@@ -2,21 +2,25 @@ name: Load from hub
description: Component that loads a dataset from the hub
image: ghcr.io/ml6team/load_from_hf_hub:latest

produces:
images:
consumes:
dummy_variable: #TODO: fill in here
fields:
data:
type: binary
width:
type: int16
height:
type: int16
captions:
fields:
data:
type: string

args:
dataset_name:
description: Name of dataset on the hub
type: str
column_name_mapping:
description: Mapping of the consumed hub dataset to fondant column names
type: dict
PhilippeMoussalli marked this conversation as resolved.
image_column_names:
description: Optional argument, a list containing the original image column names in case the
dataset on the hub contains them. Used to format the image from HF hub format to a byte string.
type: list
default: None
n_rows_to_load:
Contributor: Suggested change — rename `n_rows_to_load` to `num_rows_to_load`.

Member: Haha, he just switched this from nb_rows_to_load on my request 😅

Contributor Author: I think they're both clear ;P I'll leave it as is for now.

description: Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale
type: int
default: None
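
A sketch of how the generic load component's new arguments could fit together in a pipeline. The `ComponentOp` import matches the docs touched in this PR, but the exact constructor arguments, the spec path, and the example dataset and column names are assumptions for illustration only.

```python
from fondant.pipeline import ComponentOp

# Hypothetical configuration of the generic load component; names and paths are
# illustrative and should be checked against the component spec above.
load_from_hub_op = ComponentOp(
    component_spec_path="components/load_from_hf_hub/fondant_component.yaml",
    arguments={
        "dataset_name": "lambdalabs/pokemon-blip-captions",
        # original hub column names -> fondant "{subset}_{field}" column names
        "column_name_mapping": {"image": "images_data", "text": "captions_data"},
        # original hub image columns, converted from the HF dict format to bytes
        "image_column_names": ["image"],
        # only load a handful of rows when testing the pipeline
        "n_rows_to_load": 100,
    },
)
```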
63 changes: 26 additions & 37 deletions components/load_from_hf_hub/src/main.py
@@ -1,10 +1,8 @@
"""This component loads a seed dataset from the hub."""
import io
import logging
import typing as t

import dask.dataframe as dd
import numpy as np
from PIL import Image

from fondant.component import LoadComponent
from fondant.logger import configure_logging
@@ -13,52 +11,43 @@
logger = logging.getLogger(__name__)


def extract_width(image_bytes):
# Decode image bytes to PIL Image object
pil_image = Image.open(io.BytesIO(image_bytes))
width = pil_image.size[0]

return np.int16(width)


def extract_height(image_bytes):
# Decode image bytes to PIL Image object
pil_image = Image.open(io.BytesIO(image_bytes))
height = pil_image.size[1]

return np.int16(height)


class LoadFromHubComponent(LoadComponent):
def load(self, *, dataset_name: str) -> dd.DataFrame:
def load(self,
*,
dataset_name: str,
Comment on lines +16 to +17

Member: Suggested change — move `dataset_name` before the `*`:

    dataset_name: str,
    *,

The dataset_name is the main argument and the subject to load, so it makes sense to include this as a positional argument.

This makes sense:

    component.load("my_dataset")

This doesn't:

    component.load("my_dataset", {"original_name": "new_name"}, ["original_name"], 5)

column_name_mapping: dict,
image_column_names: t.Optional[list],
n_rows_to_load: t.Optional[int]) -> dd.DataFrame:
"""
Args:
dataset_name: name of the dataset to load.

column_name_mapping: Mapping of the consumed hub dataset to fondant column names
image_column_names: A list containing the original hub image column names. Used to
format the image from HF hub format to a byte string
n_rows_to_load: optional argument that defines the number of rows to load. Useful for
testing pipeline runs on a small scale
Returns:
Dataset: HF dataset
Dataset: HF dataset.
"""
# 1) Load data, read as Dask dataframe
logger.info("Loading dataset from the hub...")
dask_df = dd.read_parquet(f"hf://datasets/{dataset_name}")

# 2) Rename columns
dask_df = dask_df.rename(
columns={"image": "images_data", "text": "captions_data"}
)
# 2) Make sure images are bytes instead of dicts
if image_column_names is not None:
for image_column_name in image_column_names:
dask_df[image_column_name] = dask_df[image_column_name].map(
lambda x: x["bytes"], meta=("bytes", bytes)
)

# 3) Rename columns
dask_df = dask_df.rename(columns=column_name_mapping)
Contributor: This doesn't create hierarchical columns right? Is this necessary given that we now use them?

Member: No, the columns are still stored as {subset}_{field} in parquet. They are only transformed to hierarchical columns in the Pandas component. (See the short sketch after this file's diff.)


# 3) Make sure images are bytes instead of dicts
dask_df["images_data"] = dask_df["images_data"].map(
lambda x: x["bytes"], meta=("bytes", bytes)
)
# 4) Optional: only return specific amount of rows

# 4) Add width and height columns
dask_df["images_width"] = dask_df["images_data"].map(
extract_width, meta=("images_width", int)
)
dask_df["images_height"] = dask_df["images_data"].map(
extract_height, meta=("images_height", int)
)
Comment on lines -56 to -61

Member: This is something we could still do for image columns right? Although then it needs to match the provided component spec as well.

Contributor Author: I'm not completely sure about that one; it might be that the original dataset already has this metadata and we're assuming that the user requires it. I was thinking about implementing another component that generates image metadata to get around this; it could be based on conditional arguments (e.g. estimate_width: True/False, ...).

Member: Yes, the downside is that it requires loading the image data into the component. We might be able to do this by only loading the first x bytes, but I'm not sure how this works with different image formats and whether it can be done performantly with parquet.

Contributor Author: Found this library: https://github.com/shibukawa/imagesize_py — tested it out and it seems to work pretty well.

if n_rows_to_load:
dask_df = dask_df.head(n_rows_to_load)
dask_df = dd.from_pandas(dask_df, npartitions=1)

return dask_df

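A small illustration of the two column layouts discussed in the thread above. The underscore-splitting shown here is a simplification for clarity, not Fondant's exact internal code.

```python
import pandas as pd

# Flat "{subset}_{field}" names, as stored in parquet and seen by Dask components.
flat = pd.DataFrame({"images_data": [b"\x00"], "captions_data": ["a caption"]})

# Hierarchical (subset, field) columns, as seen by pandas transform components.
hierarchical = flat.copy()
hierarchical.columns = pd.MultiIndex.from_tuples(
    [tuple(name.split("_", 1)) for name in flat.columns]
)

print(flat.columns.tolist())          # ['images_data', 'captions_data']
print(hierarchical.columns.tolist())  # [('images', 'data'), ('captions', 'data')]
```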
Empty file.
28 changes: 28 additions & 0 deletions components/write_to_hf_hub/fondant_component.yaml
@@ -0,0 +1,28 @@
name: Write to hub
description: Component that writes a dataset to the hub
image: ghcr.io/ml6team/write_to_hf_hub:latest

consumes:
dummy_variable: #TODO: fill in here
fields:
data:
type: binary

args:
hf_token:
description: The hugging face token used to write to the hub
type: str
username:
description: The username under which to upload the dataset
type: str
dataset_name:
description: The name of the dataset to upload
type: str
image_column_names:
description: A list containing the image column names. Used to format the images to HF hub format
type: list
default: None
column_name_mapping:
description: Mapping of the consumed fondant column names to the written hub column names
type: dict
default: None
components/write_to_hf_hub/requirements.txt
@@ -1,4 +1,5 @@
huggingface_hub==0.14.1
datasets==2.10.1
fondant
pyarrow>=7.0
Pillow==9.4.0
101 changes: 101 additions & 0 deletions components/write_to_hf_hub/src/main.py
@@ -0,0 +1,101 @@
"""This component writes an image dataset to the hub."""
import logging
import typing as t
from io import BytesIO

import dask.dataframe as dd
import datasets

# Define the schema for the struct using PyArrow
import huggingface_hub
from PIL import Image

from fondant.component import WriteComponent
from fondant.logger import configure_logging

configure_logging()
logger = logging.getLogger(__name__)


def convert_bytes_to_image(image_bytes: bytes, feature_encoder: datasets.Image) -> \
t.Dict[str, t.Any]:
"""
Function that converts image bytes to hf image format
Args:
image_bytes: the images as a bytestring
feature_encoder: hf image feature encoder
Returns:
HF image representation.
"""
image = Image.open(BytesIO(image_bytes))
image = feature_encoder.encode_example(image)
return image


class WriteToHubComponent(WriteComponent):
def write(
self,
dataframe: dd.DataFrame,
*,
hf_token: str,
username: str,
dataset_name: str,
image_column_names: t.Optional[list],
column_name_mapping: t.Optional[dict]
):
"""
Args:
dataframe: Dask dataframe
hf_token: The hugging face token used to write to the hub
username: The username under which to upload the dataset
dataset_name: The name of the dataset to upload
image_column_names: A list containing the subset image column names. Used to format the
image fields to HF hub format
column_name_mapping: Mapping of the consumed fondant column names to the written hub
column names.
"""
# login
huggingface_hub.login(token=hf_token)

# Create HF dataset repository
repo_id = f"{username}/{dataset_name}"
repo_path = f"hf://datasets/{repo_id}"
logger.info(f"Creating HF dataset repository under ID: '{repo_id}'")
huggingface_hub.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)

# Get columns to write and schema
write_columns = []
schema_dict = {}
for subset_name, subset in self.spec.consumes.items():
for field in subset.fields.values():
column_name = f"{subset_name}_{field.name}"
write_columns.append(column_name)
if image_column_names and column_name in image_column_names:
schema_dict[column_name] = datasets.Image()
else:
schema_dict[column_name] = datasets.Value(str(field.type.value))

schema = datasets.Features(schema_dict).arrow_schema
dataframe = dataframe[write_columns]

# Map image column to hf data format
feature_encoder = datasets.Image(decode=True)

if image_column_names is not None:
for image_column_name in image_column_names:
dataframe[image_column_name] = dataframe[image_column_name].map(
lambda x: convert_bytes_to_image(x, feature_encoder),
meta=(image_column_name, "object")
)

# Map column names to hf data format
if column_name_mapping:
Contributor: Suggested change — replace `if column_name_mapping:` with `if column_name_mapping is not None:`.

Member: If the mapping is empty, we don't need to do this either. (See the short sketch after this file's diff.)

dataframe = dataframe.rename(columns=column_name_mapping)

# Write dataset to the hub
dd.to_parquet(dataframe, path=f"{repo_path}/data", schema=schema)


if __name__ == "__main__":
component = WriteToHubComponent.from_args()
component.run()
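
A quick check of the point raised in the thread above: an empty dict is falsy in Python, so the `if column_name_mapping:` guard skips the rename for both `None` and `{}`, whereas `is not None` would still call rename on an empty mapping (a harmless no-op, just unnecessary work). The toy frame below is illustrative, not code from this PR.

```python
import pandas as pd

dataframe = pd.DataFrame({"images_data": [b"\x00"]})

for mapping in (None, {}, {"images_data": "image"}):
    if mapping:  # same guard as in the component above: skips None and {}
        dataframe = dataframe.rename(columns=mapping)

print(dataframe.columns.tolist())  # ['image'] — only the non-empty mapping renamed anything
```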
2 changes: 1 addition & 1 deletion docs/component_spec.md
@@ -138,7 +138,7 @@ args:
```

These arguments are passed in when the component is instantiated.
If an argument is not explicitly provided, the default value will be used instead if available.```
If an argument is not explicitly provided, the default value will be used instead if available.
```python
from fondant.pipeline import ComponentOp

2 changes: 1 addition & 1 deletion docs/custom_component.md
@@ -98,7 +98,7 @@ RUN apt-get update && \
COPY requirements.txt ./
RUN pip3 install --no-cache-dir -r requirements.txt

# Set the working directory to the compoent folder
# Set the working directory to the component folder
WORKDIR /component/src

# Copy over src-files and spec of the component