Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a WriteComponent class #138

Closed
2 tasks done
PhilippeMoussalli opened this issue May 16, 2023 · 3 comments · Fixed by #196
Closed
2 tasks done

Implement a WriteComponent class #138

PhilippeMoussalli opened this issue May 16, 2023 · 3 comments · Fixed by #196
Assignees
Labels
Core Core framework

Comments

@PhilippeMoussalli
Copy link
Contributor

PhilippeMoussalli commented May 16, 2023

Problem Statement

Currently we have two types of components:

  • LoadComponent : takes as input user arguments for loading dataset (e.g. dataset_id) and loads the dataset from a remote location. This component also creates the initial manifest based on the metadata passed to the component and the component specs.
    The user is expected to change the column names to subset_field if the columns of the loaded dataset do not have this. (this might change.

  • TransformComponent takes as input the evolved manifest based on the component spec and loads the dataset from the artifact registry. it also presents the dataframe to the user as subset_field

Both components share some common functionalities based on the abstract run method:

  • loading the manifest
  • processing the dataset based on the load or transform function implemented by the user
  • evolving the manifest based on the component spec
  • writing the new datasets
  • uploading the manifest

What is still missing is a WriteComponent (this is currently implemented with a transform component). This component should take as an argument a write_path and write the dataset with an appropriate schema based on the component spec.

Proposed Approach

Let's take the write_to_hub component as an example. What we want is the following:

  • Allow users to write the dataset with custom columns names
    This can be done by providing a mapping dict argument similar to what Bert proposed for the loading component link

  • The user returns the dataset with the modified names and we take care of writing it to the appropriate location
    Two options here:

    • (Option A): User has to implement the writing functions themselves similar to how it's currently implemented here
      Not ideal since they have to call the schema stuff and the to_parquet method which we normally handle in the backend.
    • (Option B): User returns a url of where to write the data as well as the dataframe. We take care of writing the dataset to the appropriate location. However, this would require us to implement a DaskDataSink component similar to the DaskDataWriter (we have to change some names) where the biggest difference is that:
      • The write location is the one returned by the user and not based on the manifest (run_id + subset)
      • We do not write any subsets in this component
        Only caveat here is that we're assuming that all the remote locations can be sinked to using the to_parquet method (example). This applies to both the cloud and hf. Also, it will require injecting some dependencies into the Fondant code (writing to hf://) requires hf_hub as a dependency
      • We might want to introduce specific classes for the Data Sinks since it enables for a nicer isolation of the behavior of the components. E.g. the dataset needs to have a specific schema when writing to hub in order to render properly (see below). Where the specific classes would subclass a general DaskDataSink class. Wondering whether we should follow a similar approach for the LoadComponent if we want to introduce many sources.

Tasks

Preview Give feedback

Implementation Steps/Tasks

Depends on the choice of solution.

another thing that we will have to implement is change the schema passed to the to_parquet method from a dict to a pyarrow.Schema data type. This is mainly to support adding additional metadata to the schema needed for the write_to_hub component.

Image

Potential Impact

if Option B is introduced:

Component class will have to be adjusted by adding a new class and potentially modifying the run method (maybe we'll have to implement a separate run method per component type since they will share many common methods anymore if the WriteComponent is introduced ) as well as DataIo

Testing

Documentation

We will have to document all 3 types of component in the custom_component.md

Feedback and Suggestions

Dependent features

Additional Notes

@RobbeSneyders RobbeSneyders moved this from Backlog to Planned in Fondant development May 31, 2023
@RobbeSneyders RobbeSneyders changed the title Implement a DataWriter class Implement a WriteComponent class May 31, 2023
@PhilippeMoussalli PhilippeMoussalli moved this from Planned to Breakdown in Fondant development Jun 1, 2023
@PhilippeMoussalli PhilippeMoussalli self-assigned this Jun 1, 2023
@PhilippeMoussalli PhilippeMoussalli added the Core Core framework label Jun 5, 2023
@ml6team ml6team deleted a comment from PhilippeMoussalli Jun 6, 2023
@RobbeSneyders
Copy link
Member

I think we need two levels of this:

  • A general WriteComponent class, similar to the LoadComponent, which creates the skeleton for a component that only consumes data.
  • Implementations of this WriteComponent class for specific sinks, eg. write_to_hub. Ideally these can work by only taking arguments, so the user doesn't need to reimplement them.

@PhilippeMoussalli
Copy link
Contributor Author

-> Implementations of this WriteComponent class for specific sinks, eg. write_to_hub. Ideally these can work by only taking arguments, so the user doesn't need to reimplement them.

So those will be normal reusable components that we provide and not handled in the backend of Fondant right?

@RobbeSneyders
Copy link
Member

So those will be normal reusable components that we provide and not handled in the backend of Fondant right?

Yes indeed. I think we only need to provide the equivalent of the LoadComponent as part of Fondant itself.

@PhilippeMoussalli PhilippeMoussalli moved this from Breakdown to Ready for development in Fondant development Jun 9, 2023
@PhilippeMoussalli PhilippeMoussalli moved this from Ready for development to Validation in Fondant development Jun 12, 2023
@PhilippeMoussalli PhilippeMoussalli linked a pull request Jun 12, 2023 that will close this issue
PhilippeMoussalli added a commit that referenced this issue Jun 13, 2023
PR that adds a writer class as discussed in #138. 
This enables us to write the final dataset without having to write the
dataset and manifest since there is no modification made on the data.

Next steps: 
- Enable default and optional arguments in components. The optional
arguments are needed to make the Reader/Writer components generic (e.g.
Write to hub requires special hf metadata to be attached to the image
column in case there is any, user needs to pass an optional argument
specifying the columns name of the image)
- Re implement load/Write to hub component to make them more generic.
@github-project-automation github-project-automation bot moved this from Validation to Done in Fondant development Jun 13, 2023
Hakimovich99 pushed a commit that referenced this issue Oct 16, 2023
PR that adds a writer class as discussed in #138. 
This enables us to write the final dataset without having to write the
dataset and manifest since there is no modification made on the data.

Next steps: 
- Enable default and optional arguments in components. The optional
arguments are needed to make the Reader/Writer components generic (e.g.
Write to hub requires special hf metadata to be attached to the image
column in case there is any, user needs to pass an optional argument
specifying the columns name of the image)
- Re implement load/Write to hub component to make them more generic.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Core Core framework
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants