
Storage format for Delta #87

Open
hbhanawat opened this issue Jul 9, 2019 · 11 comments
Labels
enhancement New feature or request

Comments

@hbhanawat

The way Delta is designed, it is not tied to a particular storage format. The storage format can be abstracted out just like the storage layer. If the file format is made configurable, Delta's capabilities will be available for all file formats.

See the discussion here:
https://groups.google.com/forum/#!topic/delta-users/PI28n6kFfsE

@hbhanawat
Author

I will generate a PR for this soon.

@mukulmurthy mukulmurthy added the enhancement New feature or request label Jul 9, 2019
@marmbrus
Contributor

marmbrus commented Jul 9, 2019

Thanks for opening this issue! This has been a common request.

Before you jump into coding, it would be good to briefly discuss how you plan to design this so that all of the features of Delta continue to work together in an easy to understand way. Some questions that come to mind:

  • What formats do you plan to support? (ORC behaves very similarly to Parquet, so it seems like a good choice; on the other end of the spectrum, text probably does not make a lot of sense)
  • How do you expose this choice to users? Both through the DataFrame writer and DDL (see Support for registering Delta tables in the HiveMetastore #85).
  • How do the newly supported formats work with our schema evolution?

@hbhanawat
Author

Sure. Here are my initial thoughts.

Delta has multiple features. There are features that Delta can provide without support from the file format (ACID transactions, time travel), and there are features for which Delta needs support from the underlying file format (schema evolution). Features like ACID transactions and time travel are needed for any data lake write, irrespective of the file format.

What I was thinking is that Delta could support various file formats just like DataSourceV2 does. A file format would implement interfaces like SupportsSchemaEvolution, SupportsUpdates, etc. Based on what the file format has implemented, Delta can either throw an exception or execute the specific action on the file format. To start with, Delta can natively support Parquet. As the need arises and the community grows, more and more file formats can be added.
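A minimal sketch of the capability-interface idea in Python, using `typing.Protocol` to mirror the DataSourceV2 style. All class and method names here (`SupportsSchemaEvolution`, `evolve_schema`, `ParquetFormat`, etc.) are hypothetical illustrations, not Delta APIs:

```python
from typing import Protocol, runtime_checkable

# Hypothetical capability interfaces a file format may opt into.
@runtime_checkable
class SupportsSchemaEvolution(Protocol):
    def evolve_schema(self, new_columns: dict) -> None: ...

@runtime_checkable
class SupportsUpdates(Protocol):
    def update_rows(self, predicate: str, assignments: dict) -> None: ...

class ParquetFormat:
    """Natively supported format: implements both capabilities."""
    def __init__(self):
        self.schema = {}
    def evolve_schema(self, new_columns):
        self.schema.update(new_columns)
    def update_rows(self, predicate, assignments):
        pass  # a real implementation would rewrite the affected files

class TextFormat:
    """A format that implements neither capability."""

def evolve(fmt, new_columns):
    # Delta would check the capability and fail fast otherwise.
    if not isinstance(fmt, SupportsSchemaEvolution):
        raise NotImplementedError(
            f"{type(fmt).__name__} does not support schema evolution")
    fmt.evolve_schema(new_columns)
```

With this shape, `evolve(ParquetFormat(), {"c": "int"})` succeeds, while `evolve(TextFormat(), ...)` raises `NotImplementedError` instead of silently corrupting the table.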

Delta can provide the file format as an option in the Spark DataFrameWriter API. Maybe the option is needed only while creating the table or writing data for the first time. The DeltaLog can record the file format.
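One way the log could pin the format at table creation, sketched with plain JSON files. The log file name and the `metaData.format.provider` key are only illustrative of the idea of recording the format once, not a spec of the actual Delta log:

```python
import json
import os
import tempfile

def create_table(log_dir, file_format="parquet"):
    """Write a first log entry that records the table's data file format."""
    entry = {"metaData": {"format": {"provider": file_format}}}
    path = os.path.join(log_dir, "00000000000000000000.json")
    with open(path, "w") as f:
        json.dump(entry, f)
    return path

def table_format(log_dir):
    """Later writers read the recorded format instead of taking an option."""
    path = os.path.join(log_dir, "00000000000000000000.json")
    with open(path) as f:
        return json.load(f)["metaData"]["format"]["provider"]

# Example: create a table whose data files would be ORC.
log_dir = tempfile.mkdtemp()
create_table(log_dir, "orc")
print(table_format(log_dir))  # orc
```

Because the format is recorded in the log at creation time, subsequent writers cannot accidentally mix formats by passing a different option.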

Let me know what you think.

@hbhanawat
Author

@marmbrus Do you think we can take this path?

@dennyglee
Contributor

Sorry for not following up on this. Closing this due to inactivity. Please reopen if this issue is still relevant and/or reproducible on the latest version of Delta.

@iajoiner

iajoiner commented Oct 31, 2022

Can we reopen it? I think it would be really great to add ORC and Avro.

@iajoiner

@dennyglee

@dennyglee
Contributor

Hey @iajoiner - glad to re-open it; two quick questions:

  • Are you or anyone else here interested in helping out with the design and a PR for this?
  • Could you provide some additional context or info around why ORC and Avro are important to add?

The reason I'm asking is that the vast majority of the feedback we have received from past Delta Lake surveys, GitHub, Slack, and Google Groups was that Parquet is the primary file format (which is why we closed this issue).

@aersam

aersam commented Dec 2, 2022

Apache Feather V2 (= Arrow IPC) would be nice: https://arrow.apache.org/docs/python/feather.html
I think memory-mapping is a good use case, as it allows reading files with far fewer resources (https://arrow.apache.org/docs/python/ipc.html)
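The resource argument for memory-mapping can be illustrated with Python's stdlib `mmap` (this stands in for Arrow IPC's zero-copy reads; it is not pyarrow code): the OS pages in only the bytes actually touched, instead of loading the whole file into memory.

```python
import mmap
import os
import tempfile

# Write a 1 MB file, then read a 10-byte slice via memory-mapping.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"x" * 1_000_000)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Only the pages backing this slice are faulted in.
        slice_ = mm[500_000:500_010]

print(slice_)  # b'xxxxxxxxxx'
```

Arrow IPC files are laid out so that record batches can be consumed directly from such a mapping without deserialization, which is where the resource savings come from.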

@dennyglee
Contributor

Hey @aersam - let’s start a discussion on this in the delta-rs channel in the Delta Users Slack, since delta-rs already works with pyarrow and we have active discussions there around arrow/arrow2/DataFusion/Polars.

tdas pushed a commit to tdas/delta that referenced this issue May 31, 2023
Co-authored-by: gbrueckl <sourcetree@gbrueckl.at>
@aersam

aersam commented Jul 14, 2023

Hi guys
I still think IPC would be a good way to support memory-mapping. Microsoft uses a (so far) proprietary VORDER compression to make Power BI load Delta tables fast. I think the better solution would be memory-mapping using the IPC format. Or maybe we can get Microsoft to open-source their VORDER compression; you can help me vote for that at their Ideas site ;)
