-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create load from parquet #474
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The component does probably the job for now.
I am just curious about if we can reuse the load from hub component completely.
In the load_from_hub component we do something like this:
dask_df = dd.read_parquet(f"hf://datasets/{self.dataset_name}")
If I see it correctly, the only part that is different here is hf://
vs. gs//
or s3://
.
What do you think about to adjust the load from hub component to a more generic one?
dask.config.set({"dataframe.convert-string": False}) | ||
|
||
|
||
class LoadFromHubComponent(DaskLoadComponent): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rename it.
class LoadFromHubComponent(DaskLoadComponent): | |
class LoadFromParquet(DaskLoadComponent): |
|
||
def load(self) -> dd.DataFrame: | ||
# 1) Load data, read as Dask dataframe | ||
logger.info("Loading dataset from the hub...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.info("Loading dataset from the hub...") | |
logger.info("Loading dataset from file...") |
Fairpoint, there is not a lot of difference between them and it can be generalized. I would tackle this in another PR though since it might affect other pipelines (need to change the load from hub component name, examples, docs, ...) #475 |
PR that creates a generic component to load data stored as a parquet from remote storage Largely based on the load from hub component with some modification. Currently needed for the datacomp pipeline. Future modification could include loading data from local storage though we'd need to make sure that the data is mounted locally
PR that creates a generic component to load data stored as a parquet from remote storage
Largely based on the load from hub component with some modification. Currently needed for the datacomp pipeline.
Future modification could include loading data from local storage though we'd need to make sure that the data is mounted locally