Implement a WriteComponent class #138
I think we need two levels of this:

-> Implementations of this `WriteComponent` class for specific sinks, e.g. `write_to_hub`. Ideally these can work by only taking arguments, so the user doesn't need to reimplement them. So those will be normal reusable components that we provide, and not handled in the backend of Fondant, right?

Yes indeed. I think we only need to provide the equivalent of the
PR that adds a writer class as discussed in #138. This enables us to write the final dataset without having to write the dataset and manifest, since no modification is made to the data. Next steps:
- Enable default and optional arguments in components. The optional arguments are needed to make the Reader/Writer components generic (e.g. write to hub requires special hf metadata to be attached to the image column in case there is any; the user needs to pass an optional argument specifying the column name of the image).
- Re-implement the load/write to hub components to make them more generic.
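As an illustration of the optional-argument idea, here is a hedged sketch of a generic writer with an optional image-column argument. The function name matches the component discussed here, but the signature, the `image_columns` parameter, the dataset id, and the returned structure are all assumptions, not the actual Fondant API:

```python
from typing import List, Optional


def write_to_hub(
    rows: List[dict],
    dataset_id: str,
    image_columns: Optional[List[str]] = None,  # optional arg: defaults to "no image columns"
) -> dict:
    """Hypothetical generic writer: optional arguments keep one component reusable."""
    columns = rows[0].keys()
    # Columns the user flagged get the special image treatment; the rest don't.
    features = {
        col: "Image" if col in (image_columns or []) else "Value"
        for col in columns
    }
    # A real implementation would attach `features` as hf metadata on the
    # schema before pushing; here we just return the resolved spec.
    return {"dataset_id": dataset_id, "features": features}


spec = write_to_hub(
    [{"image": b"\x00", "caption": "hi"}],
    dataset_id="some-user/some-dataset",  # hypothetical id
    image_columns=["image"],
)
print(spec["features"])  # -> {'image': 'Image', 'caption': 'Value'}
```

With `image_columns` left at its default, no column is treated as an image, which is what makes the same component usable for datasets without images.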
Problem Statement
Currently we have two types of components:
- `LoadComponent`: takes as input user arguments for loading a dataset (e.g. `dataset_id`) and loads the dataset from a remote location. This component also creates the initial manifest based on the metadata passed to the component and the component spec. The user is expected to change the column names to `subset_field` if the columns of the loaded dataset do not have this (this might change).
- `TransformComponent`: takes as input the evolved manifest based on the component spec and loads the dataset from the artifact registry. It also presents the dataframe to the user as `subset_field`.
Both components share some common functionalities based on the abstract run method:
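That shared structure could be sketched roughly as follows. This is a minimal sketch; apart from `run`, the class and attribute names are assumptions, not the actual Fondant code:

```python
from abc import ABC, abstractmethod


class Component(ABC):
    """Shared skeleton for all component types (hypothetical names)."""

    def __init__(self, spec: dict, **user_arguments):
        # Common setup shared by every component type.
        self.spec = spec
        self.user_arguments = user_arguments

    @abstractmethod
    def run(self) -> None:
        """Load/transform/write logic, implemented per component type."""


class TransformComponent(Component):
    """Example subclass: only needs to supply its own run()."""

    def run(self) -> None:
        self.ran = True


component = TransformComponent({"name": "demo"}, batch_size=8)
component.run()
print(component.ran)  # -> True
```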
What is still missing is a `WriteComponent` (this is currently implemented with a transform component). This component should take as an argument a `write_path` and write the dataset with an appropriate schema based on the component spec.

Proposed Approach
Let's take the `write_to_hub` component as an example. What we want is the following:
- Allow users to write the dataset with custom column names. This can be done by providing a mapping dict argument, similar to what Bert proposed for the loading component (link).
- The user returns the dataset with the modified names and we take care of writing it to the appropriate location.
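As a sketch of the mapping-dict idea (plain pandas is used here for illustration; Fondant itself works with Dask dataframes, and the argument name and column names are assumptions):

```python
import pandas as pd

# Hypothetical user-supplied mapping from the internal "subset_field"
# column names to the names wanted in the written dataset.
column_mapping = {"images_data": "image", "captions_text": "text"}

df = pd.DataFrame({"images_data": [b"\x00"], "captions_text": ["a caption"]})

# The framework applies the mapping before writing to the sink, so the
# user only has to supply the dict.
df = df.rename(columns=column_mapping)
print(list(df.columns))  # -> ['image', 'text']
```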
Two options here:

- Have the user call `to_parquet` themselves. Not ideal since they have to call the schema stuff and the `to_parquet` method, which we normally handle in the backend.
- Implement a `DaskDataSink` component similar to the `DaskDataWriter` (we have to change some names), where the biggest difference is that:

Only caveat here is that we're assuming that all the remote locations can be written to using the `to_parquet` method (example). This applies to both the cloud and hf. Also, it will require injecting some dependencies into the Fondant code (writing to `hf://` requires `hf_hub` as a dependency) in the `DaskDataSink` class. Wondering whether we should follow a similar approach for the `LoadComponent` if we want to introduce many sources.

Tasks
Implementation Steps/Tasks
Depends on the choice of solution.
Another thing that we will have to implement: change the schema passed to the `to_parquet` method from a `dict` to a `pyarrow.Schema` data type. This is mainly to support adding additional metadata to the schema, needed for the `write_to_hub` component.

Potential Impact
If Option B is introduced:
- The `Component` class will have to be adjusted by adding a new class and potentially modifying the `run` method (maybe we'll have to implement a separate run method per component type, since they won't share many common methods anymore if the `WriteComponent` is introduced), as well as `DataIo`.
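As an illustration of a separate per-type run method, here is a hedged sketch of what a `WriteComponent.run` might do; the class name matches this issue, but the method names, stubbed bodies, and path are assumptions:

```python
class WriteComponent:
    """Hypothetical terminal component.

    Unlike a TransformComponent's run(), nothing is written back to the
    artifact registry and the manifest does not evolve.
    """

    def __init__(self, write_path: str):
        self.write_path = write_path

    def read_dataframe(self):
        # Framework-side stub: would load the dataset from the artifact registry.
        return [{"image": b"\x00", "text": "hello"}]

    def write(self, rows) -> None:
        # Sink stub: would call to_parquet on self.write_path with the right schema.
        self.written = (self.write_path, len(rows))

    def run(self) -> None:
        # Terminal run(): read, sink, stop -- no manifest or dataset rewrite.
        self.write(self.read_dataframe())


component = WriteComponent("s3://some-bucket/output")  # hypothetical path
component.run()
print(component.written)  # -> ('s3://some-bucket/output', 1)
```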
Testing
Documentation
We will have to document all 3 types of components in `custom_component.md`.
Feedback and Suggestions
Dependent features
Additional Notes