Generic read write component #214

Merged: 48 commits, Jun 21, 2023. Changes shown are from 42 of the 48 commits.

Commits
f7e5dcb
add writer component
PhilippeMoussalli Jun 9, 2023
654a5a6
Merge branch 'main' into write-component-class
PhilippeMoussalli Jun 12, 2023
17c5c7e
modify component spec schema to accept default arguments
PhilippeMoussalli Jun 12, 2023
a4e44d2
enable adding default arguments
PhilippeMoussalli Jun 12, 2023
6d2d1ab
test adding default arguments
PhilippeMoussalli Jun 12, 2023
af597f8
fix mypy issue
PhilippeMoussalli Jun 12, 2023
c90b53b
correct docs
PhilippeMoussalli Jun 12, 2023
a8e55d8
update component spec
PhilippeMoussalli Jun 12, 2023
00dfa55
update docs
PhilippeMoussalli Jun 12, 2023
0fa694f
add optional field to schema
PhilippeMoussalli Jun 13, 2023
c9f4429
enable defining default arguments
PhilippeMoussalli Jun 13, 2023
5ee3b31
add relevant tests
PhilippeMoussalli Jun 13, 2023
919c21f
Merge branch 'main' into enable-optional-component-arguments
PhilippeMoussalli Jun 13, 2023
66fdb4c
add test file
PhilippeMoussalli Jun 13, 2023
61adc40
bugfix string bool
PhilippeMoussalli Jun 14, 2023
88b4729
change method of defining optionals
PhilippeMoussalli Jun 14, 2023
64614f7
make component spec optional
PhilippeMoussalli Jun 14, 2023
218b8ce
Merge branch 'main' into enable-optional-component-arguments
PhilippeMoussalli Jun 15, 2023
7c7ace7
implement PR feedback
PhilippeMoussalli Jun 15, 2023
3d5b42a
make load component generic
PhilippeMoussalli Jun 15, 2023
7d88861
make load component generic
PhilippeMoussalli Jun 15, 2023
e2445d1
Merge branch 'main' into generic_read_write_component
PhilippeMoussalli Jun 15, 2023
8b5a749
Merge branch 'main' into generic_read_write_component
PhilippeMoussalli Jun 16, 2023
eaa2dfa
Add build for local components
GeorgesLorre Jun 14, 2023
86f89b4
Update starcoder example to use the docker compiler
GeorgesLorre Jun 16, 2023
d160e34
Fix build script and run pipeline locally
GeorgesLorre Jun 16, 2023
49d80a6
Update tests
GeorgesLorre Jun 19, 2023
402866f
Fix bad merge
GeorgesLorre Jun 19, 2023
4a00829
enable passing custom spec path for reusable components
PhilippeMoussalli Jun 19, 2023
3579cef
Add manifest per component
GeorgesLorre Jun 19, 2023
c56a9ed
implement changes to starcoder pipeline
PhilippeMoussalli Jun 19, 2023
7c6d7a4
implement changes to controlnet pipeline
PhilippeMoussalli Jun 19, 2023
c741116
implement change to SD pipeline
PhilippeMoussalli Jun 19, 2023
e21755e
add missing test file
PhilippeMoussalli Jun 19, 2023
c97b405
Merge branch 'main' into generic_read_write_component
PhilippeMoussalli Jun 19, 2023
a2c8717
Merge branch 'feature/local-starcoder' into generic_read_write_component
PhilippeMoussalli Jun 19, 2023
421ef3e
modify compiler to handle lists and dict
PhilippeMoussalli Jun 19, 2023
c21f0ea
add datasets to write to hub req
PhilippeMoussalli Jun 19, 2023
94ecd32
Test pipeline on local runner and add options to load few rows
PhilippeMoussalli Jun 19, 2023
e731744
Test pipeline on local runner and add options to load few rows
PhilippeMoussalli Jun 19, 2023
2120a5c
modify composable component template
PhilippeMoussalli Jun 19, 2023
c1ec93c
add image resolution extraction component
PhilippeMoussalli Jun 20, 2023
316eff3
implement PR feedback
PhilippeMoussalli Jun 20, 2023
a47f42c
add documentation on generic components
PhilippeMoussalli Jun 20, 2023
2184ef6
typo
PhilippeMoussalli Jun 20, 2023
177bed7
unpin fixed version
PhilippeMoussalli Jun 20, 2023
543d673
Merge branch 'main' into generic_read_write_component
PhilippeMoussalli Jun 20, 2023
4ec35b4
address PR feedback
PhilippeMoussalli Jun 20, 2023
6 changes: 3 additions & 3 deletions components/filter_comments/src/main.py
@@ -7,14 +7,14 @@
 import dask.dataframe as dd
 from utils.text_extraction import get_comments_to_code_ratio

-from fondant.component import TransformComponent
+from fondant.component import DaskTransformComponent
 from fondant.logger import configure_logging

 configure_logging()
 logger = logging.getLogger(__name__)


-class FilterCommentsComponent(TransformComponent):
+class FilterCommentsComponent(DaskTransformComponent):
     """Component that filters instances based on code to comments ratio."""

     def transform(
@@ -46,4 +46,4 @@ def transform(

 if __name__ == "__main__":
     component = FilterCommentsComponent.from_args()
-    component.run()
+    component.run()
6 changes: 3 additions & 3 deletions components/filter_line_length/src/main.py
@@ -3,14 +3,14 @@

 import dask.dataframe as dd

-from fondant.component import TransformComponent
+from fondant.component import DaskTransformComponent
 from fondant.logger import configure_logging

 configure_logging()
 logger = logging.getLogger(__name__)


-class FilterLineLengthComponent(TransformComponent):
+class FilterLineLengthComponent(DaskTransformComponent):
     """
     This component filters code based on a set of metadata associated with it:
     average line length, maximum line length and alphanum fraction.
@@ -44,4 +44,4 @@ def transform(

 if __name__ == "__main__":
     component = FilterLineLengthComponent.from_args()
-    component.run()
+    component.run()
@@ -9,7 +9,7 @@ RUN apt-get update && \
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Set the working directory to the component folder
# Set the working directory to the compoent folder
WORKDIR /component/src

# Copy over src-files
19 changes: 19 additions & 0 deletions components/image_resolution_extraction/fondant_component.yaml
@@ -0,0 +1,19 @@
name: Image resolution extraction
description: Component that extracts image resolution data from the images
image: ghcr.io/ml6team/image_resolution_extraction:latest

consumes:
  images:
    fields:
      data:
        type: binary

produces:
  images:
    fields:
      width:
        type: int16
      height:
        type: int16
      data:
        type: binary
Comment on lines +13 to +19

Contributor:
For components that only add columns, is there a need to specify existing columns in the produces section?

cc @RobbeSneyders

Member:
In theory not, but I'm not sure if it works in practice already if you leave them out.

4 changes: 4 additions & 0 deletions components/image_resolution_extraction/requirements.txt
@@ -0,0 +1,4 @@
fondant
pyarrow>=7.0
gcsfs==2023.4.0
imagesize==1.4.1
55 changes: 55 additions & 0 deletions components/image_resolution_extraction/src/main.py
@@ -0,0 +1,55 @@
"""This component filters images of the dataset based on image size (minimum height and width)."""
Contributor:
Docstring to be updated

import io
import logging
import typing as t

import dask.dataframe as dd
import imagesize
import numpy as np

from fondant.component import DaskTransformComponent
from fondant.logger import configure_logging

configure_logging()
logger = logging.getLogger(__name__)


def extract_dimensions(image_df: dd.DataFrame) -> t.Tuple[np.int16, np.int16]:
    """Extract the width and height of an image.

    Args:
        image_df (dd.DataFrame): input dataframe with images_data column

    Returns:
        np.int16: width of the image
        np.int16: height of the image
    """
    width, height = imagesize.get(io.BytesIO(image_df["images_data"]))

    return np.int16(width), np.int16(height)


class ImageResolutionExtractionComponent(DaskTransformComponent):
    """Component that extracts image dimensions."""

    def transform(self, *, dataframe: dd.DataFrame) -> dd.DataFrame:
        """
        Args:
            dataframe: Dask dataframe
        Returns:
            dataset.
        """
        logger.info("Length of the dataframe before filtering: %s", len(dataframe))

        logger.info("Filtering dataset...")

        dataframe[["images_width", "images_height"]] = \
            dataframe[["images_data"]].apply(extract_dimensions,
                                             axis=1, result_type="expand", meta={0: int, 1: int})

        return dataframe


if __name__ == "__main__":
    component = ImageResolutionExtractionComponent.from_args()
    component.run()
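
As a side note, the apply(..., result_type="expand") pattern used above can be tried out at the pandas level. The snippet below is only an illustration and not part of the PR: it builds two tiny PNGs with Pillow so the example is self-contained, and derives the width and height columns the same way the component does. In the component the identical call runs on a Dask dataframe, where the extra meta={0: int, 1: int} hint declares the dtypes of the two resulting columns up front.

# Illustrative only: pandas-level equivalent of the component's expand-apply.
import io

import imagesize
import pandas as pd
from PIL import Image


def make_png(width: int, height: int) -> bytes:
    """Generate a small in-memory PNG so the example needs no external files."""
    buffer = io.BytesIO()
    Image.new("RGB", (width, height)).save(buffer, format="PNG")
    return buffer.getvalue()


def extract_dimensions(row: pd.Series) -> tuple:
    # imagesize only reads the header bytes, so the image is never fully decoded.
    width, height = imagesize.get(io.BytesIO(row["images_data"]))
    return width, height


df = pd.DataFrame({"images_data": [make_png(640, 480), make_png(32, 64)]})

# result_type="expand" turns each returned tuple into two new columns.
df[["images_width", "images_height"]] = df[["images_data"]].apply(
    extract_dimensions, axis=1, result_type="expand",
)
print(df[["images_width", "images_height"]])  # (640, 480) and (32, 64)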
24 changes: 14 additions & 10 deletions components/load_from_hf_hub/fondant_component.yaml
@@ -2,21 +2,25 @@ name: Load from hub
description: Component that loads a dataset from the hub
image: ghcr.io/ml6team/load_from_hf_hub:latest

produces:
images:
consumes:
dummy_variable: #TODO: fill in here
fields:
data:
type: binary
width:
type: int16
height:
type: int16
captions:
fields:
data:
type: string

args:
dataset_name:
description: Name of dataset on the hub
type: str
column_name_mapping:
description: Mapping of the consumed hub dataset to fondant column names
type: dict
image_column_names:
description: A list containing the original hub image column names. Used to format the image
from HF hub format to a byte string
Contributor (@NielsRogge, Jun 20, 2023):
Suggested change:
-    description: A list containing the original hub image column names. Used to format the image
-      from HF hub format to a byte string
+    description: Optional argument, a list containing the original image column names in case the dataset on the hub contains them. Used to format the image from HF hub format to a byte string.

type: list
default: None
nb_rows_to_load:
description: Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale
type: int
default: None
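
Because the consumes section and several arguments above are intentionally generic, each pipeline is expected to fill them in for its own dataset. Below is a rough, hypothetical sketch of that wiring; the ComponentOp / Pipeline API and its parameter names are assumptions for illustration and are not shown in this diff, and the dataset name and column mapping are placeholders.

# Hypothetical usage sketch; the pipeline API shown here is assumed, not taken from this PR.
from fondant.pipeline import ComponentOp, Pipeline

pipeline = Pipeline(pipeline_name="example_pipeline", base_path="./artifacts")

load_from_hub_op = ComponentOp(
    component_spec_path="components/load_from_hf_hub/fondant_component.yaml",
    arguments={
        "dataset_name": "some-user/some-dataset",   # placeholder hub dataset
        "column_name_mapping": {                    # hub column -> fondant "{subset}_{field}" column
            "image": "images_data",
            "text": "captions_data",
        },
        "image_column_names": ["image"],            # columns stored in HF image format
        "nb_rows_to_load": 1000,                    # load a small slice for test runs
    },
)

pipeline.add_op(load_from_hub_op)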
63 changes: 26 additions & 37 deletions components/load_from_hf_hub/src/main.py
@@ -1,10 +1,8 @@
"""This component loads a seed dataset from the hub."""
import io
import logging
import typing as t

import dask.dataframe as dd
import numpy as np
from PIL import Image

from fondant.component import LoadComponent
from fondant.logger import configure_logging
@@ -13,52 +11,43 @@
logger = logging.getLogger(__name__)


def extract_width(image_bytes):
# Decode image bytes to PIL Image object
pil_image = Image.open(io.BytesIO(image_bytes))
width = pil_image.size[0]

return np.int16(width)


def extract_height(image_bytes):
# Decode image bytes to PIL Image object
pil_image = Image.open(io.BytesIO(image_bytes))
height = pil_image.size[1]

return np.int16(height)


class LoadFromHubComponent(LoadComponent):
def load(self, *, dataset_name: str) -> dd.DataFrame:
def load(self,
*,
dataset_name: str,
Comment on lines +16 to +17

Member:
Suggested change:
-        *,
-        dataset_name: str,
+        dataset_name: str,
+        *,

The dataset_name is the main argument and the subject to load, so it makes sense to include this as a positional argument.

This makes sense:

    component.load("my_dataset")

This doesn't:

    component.load("my_dataset", {"original_name": "new_name"}, ["original_name"], 5)

column_name_mapping: dict,
image_column_names: t.Optional[list],
nb_rows_to_load: t.Optional[int]) -> dd.DataFrame:
"""
Args:
dataset_name: name of the dataset to load.

column_name_mapping: Mapping of the consumed hub dataset to fondant column names
image_column_names: A list containing the original hub image column names. Used to
format the image from HF hub format to a byte string
nb_rows_to_load: optional argument that defines the number of rows to load. Useful for
testing pipeline runs on a small scale
Returns:
Dataset: HF dataset
Dataset: HF dataset.
"""
# 1) Load data, read as Dask dataframe
logger.info("Loading dataset from the hub...")
dask_df = dd.read_parquet(f"hf://datasets/{dataset_name}")

# 2) Rename columns
dask_df = dask_df.rename(
columns={"image": "images_data", "text": "captions_data"}
)
# 2) Make sure images are bytes instead of dicts
if image_column_names:
Contributor:
Suggested change:
-        if image_column_names:
+        if image_column_names is not None:

This is usually clearer

for image_column_name in image_column_names:
dask_df[image_column_name] = dask_df[image_column_name].map(
lambda x: x["bytes"], meta=("bytes", bytes)
)

# 3) Rename columns
dask_df = dask_df.rename(columns=column_name_mapping)
Contributor:
This doesn't create hierarchical columns right?

Is this necessary given that we now use them?

Member:
No, the columns are still stored as {subset}_{field} in parquet. They are only transformed to hierarchical columns in the Pandas component.


# 3) Make sure images are bytes instead of dicts
dask_df["images_data"] = dask_df["images_data"].map(
lambda x: x["bytes"], meta=("bytes", bytes)
)
# 4) Optional: only return specific amount of rows

# 4) Add width and height columns
dask_df["images_width"] = dask_df["images_data"].map(
extract_width, meta=("images_width", int)
)
dask_df["images_height"] = dask_df["images_data"].map(
extract_height, meta=("images_height", int)
)
Comment on lines -56 to -61

Member:
This is something we could still do for image columns right? Although then it needs to match the provided component spec as well.

Contributor (Author):
I'm not completely sure about that one, it might be that the original dataset has this metadata and we're assuming that the user requires it.

I was thinking about implementing another component that generates image metadata to get around this; it could be based on conditional arguments (e.g. estimate_width: True/False, ...).

Member:
Yes, the downside is that it requires loading the image data into the component. We might be able to do this by only loading the first x bytes, but I'm not sure how this works with different image formats and whether it can be done performantly with parquet.

Contributor (Author):
Found this library: https://github.com/shibukawa/imagesize_py
Tested it out and it seems to work pretty well.

if nb_rows_to_load:
dask_df = dask_df.head(nb_rows_to_load)
dask_df = dd.from_pandas(dask_df, npartitions=1)
Member:
Calling .head() and setting npartitions=1 only makes sense for small nb_rows_to_load, which might not be the case. What if I want to run on 1M rows for testing instead of 5B? 🙂

On the other hand, I'm having trouble finding a good way to do this using Dask without knowing the index labels. What is possible is sampling, or selecting a number of partitions, but this might be less usable for the user.

Contributor (Author):
Hmm, good point. I was considering small scale for end-to-end testing, but this might not always be the case. Maybe best to tackle this separately; created a ticket for it.


return dask_df

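As an aside on the nb_rows_to_load thread above: below is a minimal sketch of the sampling and partition-selection alternatives the reviewer mentions, under the assumption that an approximate row count is acceptable for test runs. This is not what the PR implements (the .head() approach was kept and a follow-up ticket created), and the helper name is illustrative.

import dask.dataframe as dd


def sample_rows(dask_df: dd.DataFrame, nb_rows_to_load: int) -> dd.DataFrame:
    """Approximate row limiting that also behaves sensibly for large nb_rows_to_load."""
    total_rows = len(dask_df)  # triggers a count over the dataset
    if total_rows <= nb_rows_to_load:
        return dask_df
    # Sampling spreads the selection over all partitions instead of collapsing
    # everything into a single pandas partition the way .head() does.
    return dask_df.sample(frac=nb_rows_to_load / total_rows, random_state=42)


# Selecting whole partitions is the other option mentioned above, e.g.
# dask_df.partitions[:n] keeps roughly n * rows-per-partition rows.
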
4 changes: 2 additions & 2 deletions components/pii_redaction/src/main.py
@@ -7,14 +7,14 @@
 from pii_detection import scan_pii
 from pii_redaction import redact_pii

-from fondant.component import TransformComponent
+from fondant.component import DaskTransformComponent
 from fondant.logger import configure_logging

 configure_logging()
 logger = logging.getLogger(__name__)


-class RemovePIIComponent(TransformComponent):
+class RemovePIIComponent(DaskTransformComponent):
     """Component that detects and redacts PII from code."""

     def transform(
Empty file.
28 changes: 28 additions & 0 deletions components/write_to_hf_hub/fondant_component.yaml
@@ -0,0 +1,28 @@
name: Write to hub
description: Component that writes a dataset to the hub
image: ghcr.io/ml6team/write_to_hf_hub:0.1.3

consumes:
  dummy_variable: #TODO: fill in here
    fields:
      data:
        type: binary

args:
  hf_token:
    description: The hugging face token used to write to the hub
    type: str
  username:
    description: The username under which to upload the dataset
    type: str
  dataset_name:
    description: The name of the dataset to upload
    type: str
  image_column_names:
    description: A list containing the image column names. Used to format to image to HF hub format
    type: list
    default: None
  column_name_mapping:
    description: Mapping of the consumed fondant column names to the written hub column names
    type: dict
    default: None
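
The write component's src/main.py is not included in this excerpt. Purely as a sketch of what the arguments above could drive, assuming the datasets and huggingface_hub packages from the component's requirements, a write step might look roughly like the following; the function name, the order of renaming versus image wrapping, and the repo id format are assumptions rather than the PR's actual implementation.

# Hypothetical sketch; not the PR's actual write_to_hf_hub implementation.
import dask.dataframe as dd
import datasets
import huggingface_hub


def write_dataframe_to_hub(dataframe: dd.DataFrame,
                           hf_token: str,
                           username: str,
                           dataset_name: str,
                           image_column_names: list,
                           column_name_mapping: dict) -> None:
    huggingface_hub.login(token=hf_token)

    # Rename fondant "{subset}_{field}" columns to the desired hub column names
    # and collect the result into pandas before handing it to `datasets`.
    pandas_df = dataframe.rename(columns=column_name_mapping or {}).compute()

    # Wrap raw image bytes in the dict layout the HF Image feature expects.
    for column_name in image_column_names or []:
        pandas_df[column_name] = pandas_df[column_name].map(
            lambda value: {"bytes": value, "path": None}
        )

    dataset = datasets.Dataset.from_pandas(pandas_df)
    for column_name in image_column_names or []:
        dataset = dataset.cast_column(column_name, datasets.Image())

    dataset.push_to_hub(f"{username}/{dataset_name}")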
@@ -1,5 +1,6 @@
huggingface_hub==0.14.1
fondant
datasets==2.10.1
fondant==0.1.3
pyarrow>=7.0
Pillow==9.4.0
gcsfs==2023.4.0