Create load from parquet #474

Merged: 4 commits into main from create-load-from-parquet on Sep 28, 2023
Conversation

PhilippeMoussalli (Contributor):

PR that creates a generic component to load data stored as parquet files from remote storage.

It is largely based on the load from hub component, with some modifications, and is currently needed for the datacomp pipeline.

A future modification could be loading data from local storage, though we'd need to make sure that the data is mounted locally.
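
For reference, a minimal sketch of what such a component might look like, assuming the DaskLoadComponent base class seen in the review hunks below; the import path, the constructor signature, and the dataset_uri argument are assumptions for illustration, not the PR's actual code:

import logging

import dask.dataframe as dd
from fondant.component import DaskLoadComponent  # import path assumed

logger = logging.getLogger(__name__)


class LoadFromParquet(DaskLoadComponent):
    def __init__(self, *, dataset_uri: str) -> None:
        # Hypothetical argument: a full fsspec URI,
        # e.g. "gs://<bucket>/<path>" or "s3://<bucket>/<path>".
        self.dataset_uri = dataset_uri

    def load(self) -> dd.DataFrame:
        logger.info("Loading dataset from file...")
        # dd.read_parquet resolves the URI scheme via fsspec, so remote
        # storage works as long as the matching filesystem package
        # (gcsfs, s3fs, ...) is installed.
        return dd.read_parquet(self.dataset_uri)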

@mrchtr (Contributor) left a comment:

The component probably does the job for now.
I am just curious whether we can reuse the load from hub component completely.
In the load_from_hub component we do something like this:

dask_df = dd.read_parquet(f"hf://datasets/{self.dataset_name}")

If I see it correctly, the only part that is different here is hf:// vs. gs:// or s3://.
What do you think about adjusting the load from hub component into a more generic one?
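
For illustration, a sketch of the proposed generalization; the URIs below are hypothetical placeholders, and each scheme needs its fsspec backend installed (hf:// via huggingface_hub, gs:// via gcsfs, s3:// via s3fs):

import dask.dataframe as dd

# One generic read covers all three cases, since dd.read_parquet
# dispatches on the URI scheme through fsspec:
for uri in (
    "hf://datasets/example/dataset",  # hypothetical placeholder URIs
    "gs://example-bucket/data/",
    "s3://example-bucket/data/",
):
    dask_df = dd.read_parquet(uri)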

dask.config.set({"dataframe.convert-string": False})


class LoadFromHubComponent(DaskLoadComponent):

I'd rename it.

Suggested change:
- class LoadFromHubComponent(DaskLoadComponent):
+ class LoadFromParquet(DaskLoadComponent):


def load(self) -> dd.DataFrame:
# 1) Load data, read as Dask dataframe
logger.info("Loading dataset from the hub...")

Suggested change:
- logger.info("Loading dataset from the hub...")
+ logger.info("Loading dataset from file...")

@PhilippeMoussalli (Contributor, Author):

> The component probably does the job for now. I am just curious whether we can reuse the load from hub component completely. In the load_from_hub component we do something like this:
>
> dask_df = dd.read_parquet(f"hf://datasets/{self.dataset_name}")
>
> If I see it correctly, the only part that is different here is hf:// vs. gs:// or s3://. What do you think about adjusting the load from hub component into a more generic one?

Fair point, there is not a lot of difference between them, and it can be generalized. I would tackle this in another PR though, since it might affect other pipelines (we'd need to change the load from hub component name, examples, docs, ...): #475

@PhilippeMoussalli merged commit 7d3b881 into main on Sep 28, 2023 (5 checks passed)
@PhilippeMoussalli deleted the create-load-from-parquet branch on September 28, 2023 at 13:38
Hakimovich99 pushed a commit referencing this pull request on Oct 16, 2023.