Generic read write component #214
Conversation
Thanks @PhilippeMoussalli. This should make it a lot easier for users to build a pipeline without having to build any custom component!
```yaml
args:
  dataset_name:
    description: Name of dataset on the hub
    type: str
  column_name_mapping:
    description: column to map the original column names of the input dataset to
```
Suggested change:
```diff
- description: column to map the original column names of the input dataset to
+ description: Mapping of the read hub column names to the produced fondant column names.
```
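As an illustration of how a user might wire this argument up, here is a rough sketch. `ComponentOp.from_registry` and `Pipeline.add_op` are assumed from the API discussed in this PR, and the dataset name is purely illustrative, so treat this as a sketch rather than the exact interface:

```python
from fondant.pipeline import ComponentOp, Pipeline

# Hypothetical usage of the generic read component; argument names are
# taken from this PR's discussion, not verified against the final API.
pipeline = Pipeline(pipeline_name="demo", base_path="/tmp/fondant")

load_from_hub = ComponentOp.from_registry(
    name="load_from_hf_hub",
    arguments={
        "dataset_name": "lambdalabs/pokemon-blip-captions",
        "column_name_mapping": {"image": "images_data", "text": "captions_data"},
    },
)

pipeline.add_op(load_from_hub)
```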
dask_df["images_width"] = dask_df["images_data"].map( | ||
extract_width, meta=("images_width", int) | ||
) | ||
dask_df["images_height"] = dask_df["images_data"].map( | ||
extract_height, meta=("images_height", int) | ||
) |
This is something we could still do for image columns right? Although then it needs to match the provided component spec as well.
I'm not completely sure about that one; it might be that the original dataset has this metadata and we're assuming that the user requires it.
I was thinking about implementing another component that generates image metadata to work around this. It could be based on conditional arguments (e.g. `estimate_width: True/False`, ...).
Yes, the downside is that it requires loading the image data into the component. We might be able to get around this by only loading the first x bytes, but I'm not sure how this works with different image formats and whether it can be done performantly with Parquet.
Found this library: https://github.com/shibukawa/imagesize_py
I tested it out and it seems to work pretty well.
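A minimal sketch of how the `extract_width`/`extract_height` helpers from the snippet above could use `imagesize` to read dimensions from the image header without decoding the pixels. The `io.BytesIO` usage assumes a recent `imagesize` version that accepts file-like objects:

```python
import io

import imagesize


def extract_width(image_bytes: bytes) -> int:
    """Read the width from the image header without decoding the pixels."""
    width, _ = imagesize.get(io.BytesIO(image_bytes))
    return width


def extract_height(image_bytes: bytes) -> int:
    """Read the height from the image header without decoding the pixels."""
    _, height = imagesize.get(io.BytesIO(image_bytes))
    return height
```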
Thanks @PhilippeMoussalli. Let's use the `PandasTransformComponent` for all new transform components from now on. I also left some smaller comments 🙂
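For reference, a minimal sketch of what such a component could look like. The module path and `transform` signature are assumed from the interface discussed here, so this is illustrative rather than the exact API:

```python
import pandas as pd

from fondant.component import PandasTransformComponent


class ExampleComponent(PandasTransformComponent):
    """Hypothetical transform component using the pandas interface."""

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        # Each Dask partition arrives here as a pandas DataFrame; return
        # the transformed partition with the same column layout.
        return dataframe
```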
Resolved review threads (outdated):
- examples/pipelines/finetune_stable_diffusion/components/load_from_hf_hub/fondant_component.yaml
- examples/pipelines/starcoder/components/load_from_hub/fondant_component.yaml
fondant/compiler.py (outdated)
This should probably be in Georges' PR.
Added as a suggestion
Force-pushed from 6dd333e to 30db0d8.
```python
"""This component filters images of the dataset based on image size (minimum height and width)."""
```
Docstring to be updated
```yaml
fields:
  width:
    type: int16
  height:
    type: int16
  data:
    type: binary
```
For components that only add columns, is there a need to specify existing columns in the `produces` section?
In theory no, but I'm not sure if it already works in practice if you leave them out.
```yaml
description: A list containing the original hub image column names. Used to format the image
  from HF hub format to a byte string
```
Suggested change:
```diff
- description: A list containing the original hub image column names. Used to format the image
-   from HF hub format to a byte string
+ description: Optional argument, a list containing the original image column names in case the dataset on the hub contains them. Used to format the image from HF hub format to a byte string.
```
```yaml
      from HF hub format to a byte string
    type: list
    default: None
  n_rows_to_load:
```
Suggested change:
```diff
- n_rows_to_load:
+ num_rows_to_load:
```
Haha, he just switched this from `nb_rows_to_load` on my request 😅
I think they're both clear ;P I'll leave it as is for now
columns={"image": "images_data", "text": "captions_data"} | ||
) | ||
# 2) Make sure images are bytes instead of dicts | ||
if image_column_names: |
Suggested change:
```diff
- if image_column_names:
+ if image_column_names is not None:
```
This is usually clearer.
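For context, a hedged sketch of what the truncated branch above might do: images loaded from the hub typically arrive as dicts holding a `bytes` key, and the component keeps only the raw bytes. The exact dataframe code is assumed, not taken from the PR:

```python
# Hypothetical continuation: pull the raw bytes out of the HF image dicts.
if image_column_names is not None:
    for column in image_column_names:
        dask_df[column] = dask_df[column].map(
            lambda image: image["bytes"], meta=(column, bytes)
        )
```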
```python
)

# 3) Rename columns
dask_df = dask_df.rename(columns=column_name_mapping)
```
This doesn't create hierarchical columns, right? Is this necessary given that we now use them?
No, the columns are still stored as `{subset}_{field}` in Parquet. They are only transformed to hierarchical columns in the Pandas component.
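To illustrate the distinction, a small self-contained example (my own, assuming the `{subset}_{field}` naming above) of how flat Parquet column names can be turned into hierarchical pandas columns:

```python
import pandas as pd

# Flat column names as they would be stored in Parquet.
df = pd.DataFrame({"images_data": [b"\x89PNG"], "images_width": [512]})

# Split each "subset_field" name into a (subset, field) pair.
df.columns = pd.MultiIndex.from_tuples(
    [tuple(name.split("_", 1)) for name in df.columns]
)

print(df["images"]["width"])  # fields are now nested under their subset
```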
```python
# Map image column to hf data format
feature_encoder = datasets.Image(decode=True)

if image_column_names:
```
Suggested change:
```diff
- if image_column_names:
+ if image_column_names is not None:
```
```python
)

# Map column names to hf data format
if column_name_mapping:
```
Suggested change:
```diff
- if column_name_mapping:
+ if column_name_mapping is not None:
```
If the mapping is empty, we don't need to do this either.
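A quick illustration of the trade-off being discussed (my own example): plain truthiness skips the rename for both `None` and an empty mapping, while the explicit check only skips `None`:

```python
column_name_mapping = {}

# Plain truthiness: skips the rename for both None and an empty mapping.
if column_name_mapping:
    print("rename columns")

# Explicit check: still enters the branch for an empty mapping.
if column_name_mapping is not None:
    print("rename columns")
```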
Great work @PhilippeMoussalli!
Draft PR for implementing generic read/write components.

Note: the changes from `feature/local-starcoder` were merged to this branch to make testing easier. Merge this PR after the local starcoder PR.

Things to do:
- [x] Modify the `from_registry` method to enable passing a path to a custom spec
- [x] Figure out how to present the spec template to the user. Should we keep the image/args and just add a todo in the `produces` and `consumes` sections (open for suggestions)?
- [x] Switch all pipelines over to the generic load/write components and remove the custom ones
- [x] Add components to affected pipelines to add missing metadata (e.g. width/height)
- [x] Add documentation