Generic read write component #214
Conversation
Thanks @PhilippeMoussalli. This should make it a lot easier for users to build a pipeline without having to build any custom component!
```yaml
args:
  dataset_name:
    description: Name of dataset on the hub
    type: str
  column_name_mapping:
    description: column to map the original column names of the input dataset to
```
Suggested change:
```diff
- description: column to map the original column names of the input dataset to
+ description: Mapping of the read hub column names to the produced fondant column names.
```
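As an illustration of how a user might wire this argument up, here is a rough sketch. `ComponentOp.from_registry` and `Pipeline.add_op` are assumed from the API discussed in this PR, and the dataset name is purely illustrative, so treat this as a sketch rather than the exact interface:

```python
from fondant.pipeline import ComponentOp, Pipeline

# Hypothetical usage of the generic read component; argument names are
# taken from this PR's discussion, not verified against the final API.
pipeline = Pipeline(pipeline_name="demo", base_path="/tmp/fondant")

load_from_hub = ComponentOp.from_registry(
    name="load_from_hf_hub",
    arguments={
        "dataset_name": "lambdalabs/pokemon-blip-captions",
        "column_name_mapping": {"image": "images_data", "text": "captions_data"},
    },
)

pipeline.add_op(load_from_hub)
```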
dask_df["images_width"] = dask_df["images_data"].map( | ||
extract_width, meta=("images_width", int) | ||
) | ||
dask_df["images_height"] = dask_df["images_data"].map( | ||
extract_height, meta=("images_height", int) | ||
) |
This is something we could still do for image columns right? Although then it needs to match the provided component spec as well.
I'm not completely sure about that one; it might be that the original dataset has this metadata and we're assuming that the user requires it.
I was thinking about implementing another component that generates image metadata to work around this. It could be based on conditional arguments (e.g. `estimate_width: True/False`, ...).
Yes, the downside is that it requires loading the image data into the component. We might be able to get around this by only loading the first x bytes, but I'm not sure how this works with different image formats and whether it can be done performantly with Parquet.
Found this library: https://github.com/shibukawa/imagesize_py
I tested it out and it seems to work pretty well.
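A minimal sketch of how the `extract_width`/`extract_height` helpers from the snippet above could use `imagesize` to read dimensions from the image header without decoding the pixels. The `io.BytesIO` usage assumes a recent `imagesize` version that accepts file-like objects:

```python
import io

import imagesize


def extract_width(image_bytes: bytes) -> int:
    """Read the width from the image header without decoding the pixels."""
    width, _ = imagesize.get(io.BytesIO(image_bytes))
    return width


def extract_height(image_bytes: bytes) -> int:
    """Read the height from the image header without decoding the pixels."""
    _, height = imagesize.get(io.BytesIO(image_bytes))
    return height
```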
Thanks @PhilippeMoussalli. Let's use the `PandasTransformComponent` for all new transform components from now on. I also left some smaller comments 🙂
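For reference, a minimal sketch of what such a component could look like. The module path and `transform` signature are assumed from the interface discussed here, so this is illustrative rather than the exact API:

```python
import pandas as pd

from fondant.component import PandasTransformComponent


class ExampleComponent(PandasTransformComponent):
    """Hypothetical transform component using the pandas interface."""

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        # Each Dask partition arrives here as a pandas DataFrame; return
        # the transformed partition with the same column layout.
        return dataframe
```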
Resolved review threads (outdated):
- examples/pipelines/finetune_stable_diffusion/components/load_from_hf_hub/fondant_component.yaml
- examples/pipelines/starcoder/components/load_from_hub/fondant_component.yaml
fondant/compiler.py (outdated)
This should probably be in Georges' PR.
Added as a suggestion
Force-pushed from 6dd333e to 30db0d8.
```python
"""This component filters images of the dataset based on image size (minimum height and width)."""
```
Docstring to be updated
```yaml
fields:
  width:
    type: int16
  height:
    type: int16
  data:
    type: binary
```
For components that only add columns, is there a need to specify existing columns in the `produces` section?
In theory no, but I'm not sure if it already works in practice if you leave them out.
```yaml
description: A list containing the original hub image column names. Used to format the image
  from HF hub format to a byte string
```
Suggested change:
```diff
- description: A list containing the original hub image column names. Used to format the image
-   from HF hub format to a byte string
+ description: Optional argument, a list containing the original image column names in case the dataset on the hub contains them. Used to format the image from HF hub format to a byte string.
```
```yaml
      from HF hub format to a byte string
    type: list
    default: None
  n_rows_to_load:
```
Suggested change:
```diff
- n_rows_to_load:
+ num_rows_to_load:
```
Haha, he just switched this from `nb_rows_to_load` on my request 😅
I think they're both clear ;P I'll leave it as is for now
columns={"image": "images_data", "text": "captions_data"} | ||
) | ||
# 2) Make sure images are bytes instead of dicts | ||
if image_column_names: |
Suggested change:
```diff
- if image_column_names:
+ if image_column_names is not None:
```
This is usually clearer.
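For context, a hedged sketch of what the truncated branch above might do: images loaded from the hub typically arrive as dicts holding a `bytes` key, and the component keeps only the raw bytes. The exact dataframe code is assumed, not taken from the PR:

```python
# Hypothetical continuation: pull the raw bytes out of the HF image dicts.
if image_column_names is not None:
    for column in image_column_names:
        dask_df[column] = dask_df[column].map(
            lambda image: image["bytes"], meta=(column, bytes)
        )
```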
```python
)

# 3) Rename columns
dask_df = dask_df.rename(columns=column_name_mapping)
```
This doesn't create hierarchical columns, right? Is this necessary given that we now use them?
No, the columns are still stored as `{subset}_{field}` in Parquet. They are only transformed to hierarchical columns in the Pandas component.
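To illustrate the distinction, a small self-contained example (my own, assuming the `{subset}_{field}` naming above) of how flat Parquet column names can be turned into hierarchical pandas columns:

```python
import pandas as pd

# Flat column names as they would be stored in Parquet.
df = pd.DataFrame({"images_data": [b"\x89PNG"], "images_width": [512]})

# Split each "subset_field" name into a (subset, field) pair.
df.columns = pd.MultiIndex.from_tuples(
    [tuple(name.split("_", 1)) for name in df.columns]
)

print(df["images"]["width"])  # fields are now nested under their subset
```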
```python
# Map image column to hf data format
feature_encoder = datasets.Image(decode=True)

if image_column_names:
```
Suggested change:
```diff
- if image_column_names:
+ if image_column_names is not None:
```
```python
)

# Map column names to hf data format
if column_name_mapping:
```
Suggested change:
```diff
- if column_name_mapping:
+ if column_name_mapping is not None:
```
If the mapping is empty, we don't need to do this either.
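A quick illustration of the trade-off being discussed (my own example): plain truthiness skips the rename for both `None` and an empty mapping, while the explicit check only skips `None`:

```python
column_name_mapping = {}

# Plain truthiness: skips the rename for both None and an empty mapping.
if column_name_mapping:
    print("rename columns")

# Explicit check: still enters the branch for an empty mapping.
if column_name_mapping is not None:
    print("rename columns")
```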
Great work @PhilippeMoussalli!
Draft PR for implementing generic read/write components.

Note: the changes from `feature/local-starcoder` were merged to this branch to make testing easier. Merge this PR after the local starcoder PR.

Things to do:
- [x] Modify the `from_registry` method to enable passing a path to a custom spec
- [x] Figure out how to present the spec template to the user. Should we keep the image/args and just add a todo in the `produces` and `consumes` sections (open for suggestions)?
- [x] Switch all pipelines over to the generic load/write components and remove the custom ones
- [x] Add components to affected pipelines to add missing metadata (e.g. width/height)
- [x] Add documentation