Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider using dataset fragments instead of parquet uris #13

Open
wjones127 opened this issue May 12, 2023 · 2 comments
Open

Consider using dataset fragments instead of parquet uris #13

wjones127 opened this issue May 12, 2023 · 2 comments

Comments

@wjones127
Copy link

I saw this example and thought it might be a better way to load data.

Right now it looks like you rely on being able to just read from the Parquet files and load the partition values from HIVE-style directories. This isn't robust in two ways:

  1. HIVE-style directories aren't guaranteed in the Delta Lake format. The delta protocol states that "This directory format is only used to follow existing conventions and is not required by the protocol. Actual partition values for a file must be read from the transaction log." 1
  2. Deletion vectors and column mapping mean reading the parquet files as-is won't give you the correct data, once we start supporting reader protocols 2 and 3.

In the future, it would be best not to rely on reading from the file URIs and instead read from the dataset fragments, which will provide the correct data as the Delta Protocol continues to evolve.

Footnotes

  1. https://github.com/delta-io/delta/blob/master/PROTOCOL.md#data-files

@wjones127
Copy link
Author

BTW, this would allow column projection pushdown, based on this protocol: https://github.com/dask/dask/blob/12c7c10d0c15391a6522fe2dc7df191f8088967e/dask/dataframe/io/utils.py#L224

Maybe that same protocol will support filter pushdown too?

@wjones127
Copy link
Author

FYI I'm working towards standardizing the interface for PyArrow datasets to make it easier for engines to consume, including Dask. My research for that is how I found that. If interested, feel free to read and/or comment on this document. https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?usp=sharing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant