Implement a WriteComponent class #138
I think we need two levels of this:

-> Implementations of this `WriteComponent` class for specific sinks, e.g. `write_to_hub`. Ideally these can work by only taking arguments, so the user doesn't need to reimplement them. So those will be normal reusable components that we provide, and not handled in the backend of Fondant, right?

Yes indeed. I think we only need to provide the equivalent of the
PR that adds a writer class as discussed in #138. This enables us to write the final dataset without having to write the dataset and manifest, since no modification is made to the data. Next steps:
- Enable default and optional arguments in components. The optional arguments are needed to make the Reader/Writer components generic (e.g. write to hub requires special hf metadata to be attached to the image column in case there is any; the user needs to pass an optional argument specifying the column name of the image).
- Re-implement the load/write to hub components to make them more generic.
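As an illustration of the optional-argument idea, here is a hedged sketch of a generic writer with an optional image-column argument. The function name matches the component discussed here, but the signature, the `image_columns` parameter, the dataset id, and the returned structure are all assumptions, not the actual Fondant API:

```python
from typing import List, Optional


def write_to_hub(
    rows: List[dict],
    dataset_id: str,
    image_columns: Optional[List[str]] = None,  # optional arg: defaults to "no image columns"
) -> dict:
    """Hypothetical generic writer: optional arguments keep one component reusable."""
    columns = rows[0].keys()
    # Columns the user flagged get the special image treatment; the rest don't.
    features = {
        col: "Image" if col in (image_columns or []) else "Value"
        for col in columns
    }
    # A real implementation would attach `features` as hf metadata on the
    # schema before pushing; here we just return the resolved spec.
    return {"dataset_id": dataset_id, "features": features}


spec = write_to_hub(
    [{"image": b"\x00", "caption": "hi"}],
    dataset_id="some-user/some-dataset",  # hypothetical id
    image_columns=["image"],
)
print(spec["features"])  # -> {'image': 'Image', 'caption': 'Value'}
```

With `image_columns` left at its default, no column is treated as an image, which is what makes the same component usable for datasets without images.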
Problem Statement
Currently we have two types of components:
- `LoadComponent`: takes as input user arguments for loading a dataset (e.g. `dataset_id`) and loads the dataset from a remote location. This component also creates the initial manifest based on the metadata passed to the component and the component spec. The user is expected to change the column names to `subset_field` if the columns of the loaded dataset do not have this (this might change).
- `TransformComponent`: takes as input the evolved manifest based on the component spec and loads the dataset from the artifact registry. It also presents the dataframe to the user as `subset_field`.
Both components share some common functionalities based on the abstract run method:
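That shared structure could be sketched roughly as follows. This is a minimal sketch; apart from `run`, the class and attribute names are assumptions, not the actual Fondant code:

```python
from abc import ABC, abstractmethod


class Component(ABC):
    """Shared skeleton for all component types (hypothetical names)."""

    def __init__(self, spec: dict, **user_arguments):
        # Common setup shared by every component type.
        self.spec = spec
        self.user_arguments = user_arguments

    @abstractmethod
    def run(self) -> None:
        """Load/transform/write logic, implemented per component type."""


class TransformComponent(Component):
    """Example subclass: only needs to supply its own run()."""

    def run(self) -> None:
        self.ran = True


component = TransformComponent({"name": "demo"}, batch_size=8)
component.run()
print(component.ran)  # -> True
```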
What is still missing is a `WriteComponent` (this is currently implemented with a transform component). This component should take as an argument a `write_path` and write the dataset with an appropriate schema based on the component spec.

Proposed Approach
Let's take the `write_to_hub` component as an example. What we want is the following:
- Allow users to write the dataset with custom column names. This can be done by providing a mapping dict argument, similar to what Bert proposed for the loading component (link).
- The user returns the dataset with the modified names and we take care of writing it to the appropriate location.
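As a sketch of the mapping-dict idea (plain pandas is used here for illustration; Fondant itself works with Dask dataframes, and the argument name and column names are assumptions):

```python
import pandas as pd

# Hypothetical user-supplied mapping from the internal "subset_field"
# column names to the names wanted in the written dataset.
column_mapping = {"images_data": "image", "captions_text": "text"}

df = pd.DataFrame({"images_data": [b"\x00"], "captions_text": ["a caption"]})

# The framework applies the mapping before writing to the sink, so the
# user only has to supply the dict.
df = df.rename(columns=column_mapping)
print(list(df.columns))  # -> ['image', 'text']
```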
Two options here:

- Have the user call `to_parquet` themselves. Not ideal since they have to call the schema stuff and the `to_parquet` method, which we normally handle in the backend.
- Implement a `DaskDataSink` component similar to the `DaskDataWriter` (we have to change some names), where the biggest difference is that:

Only caveat here is that we're assuming that all the remote locations can be written to using the `to_parquet` method (example). This applies to both the cloud and hf. Also, it will require injecting some dependencies into the Fondant code (writing to `hf://` requires `hf_hub` as a dependency) in the `DaskDataSink` class. Wondering whether we should follow a similar approach for the `LoadComponent` if we want to introduce many sources.

Tasks
Implementation Steps/Tasks
Depends on the choice of solution.
Another thing that we will have to implement: change the schema passed to the `to_parquet` method from a `dict` to a `pyarrow.Schema` data type. This is mainly to support adding additional metadata to the schema, needed for the `write_to_hub` component.

Potential Impact
If Option B is introduced:
- The `Component` class will have to be adjusted by adding a new class and potentially modifying the `run` method (maybe we'll have to implement a separate run method per component type, since they won't share many common methods anymore if the `WriteComponent` is introduced), as well as `DataIo`.
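As an illustration of a separate per-type run method, here is a hedged sketch of what a `WriteComponent.run` might do; the class name matches this issue, but the method names, stubbed bodies, and path are assumptions:

```python
class WriteComponent:
    """Hypothetical terminal component.

    Unlike a TransformComponent's run(), nothing is written back to the
    artifact registry and the manifest does not evolve.
    """

    def __init__(self, write_path: str):
        self.write_path = write_path

    def read_dataframe(self):
        # Framework-side stub: would load the dataset from the artifact registry.
        return [{"image": b"\x00", "text": "hello"}]

    def write(self, rows) -> None:
        # Sink stub: would call to_parquet on self.write_path with the right schema.
        self.written = (self.write_path, len(rows))

    def run(self) -> None:
        # Terminal run(): read, sink, stop -- no manifest or dataset rewrite.
        self.write(self.read_dataframe())


component = WriteComponent("s3://some-bucket/output")  # hypothetical path
component.run()
print(component.written)  # -> ('s3://some-bucket/output', 1)
```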
Testing
Documentation
We will have to document all 3 types of components in `custom_component.md`.
Feedback and Suggestions
Dependent features
Additional Notes