Storage format for Delta #87
Comments
I will generate a PR for this soon. |
Thanks for opening this issue! This has been a common request. Before you jump into coding, it would be good to briefly discuss how you plan to design this so that all of the features of Delta continue to work together in an easy to understand way. Some questions that come to mind:
|
Sure. Here are my initial thoughts. Delta has multiple features. Some of them Delta can provide without support from the file format (ACID transactions, time travel), and some need support from the underlying file format (schema evolution). Features like ACID transactions and time travel are needed for any data lake write, irrespective of the file format. What I was thinking is that Delta can support various file formats, just like DataSourceV2. A file format implements interfaces like SupportsSchemaEvolution, SupportsUpdates, etc. Based on what the file format has implemented, Delta can either execute the specific action on the file format or throw an exception. To start with, Delta can natively support Parquet. As the need arises and the community grows, more file formats can be added. Delta can expose the file format as an option in the Spark DataFrameWriter API. Maybe the option is needed only when creating the table or writing the data for the first time. The DeltaLog can record the file format. Let me know what you think. |
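To make the proposal above concrete, here is a minimal sketch of the capability-interface idea, written in Python for brevity. It is not Delta code. The interface names SupportsSchemaEvolution and SupportsUpdates come from the comment; everything else (the method names, `ToyTable`, `AppendOnlyFormat`, `UnsupportedFeatureError`, and the `fileFormat` log key) is a hypothetical stand-in chosen for illustration.

```python
from abc import ABC, abstractmethod


class SupportsSchemaEvolution(ABC):
    """Capability named in the comment; the method signature is an assumption."""

    @abstractmethod
    def evolve_schema(self, old: dict, new: dict) -> dict: ...


class SupportsUpdates(ABC):
    """Capability named in the comment; the method signature is an assumption."""

    @abstractmethod
    def rewrite_file(self, path: str) -> str: ...


class UnsupportedFeatureError(Exception):
    """Hypothetical error raised when a format lacks a capability."""


class ParquetFormat(SupportsSchemaEvolution, SupportsUpdates):
    name = "parquet"

    def evolve_schema(self, old, new):
        # Toy merge: new columns are added, and a column present in both
        # takes the new type. Real schema evolution is more involved.
        return {**old, **new}

    def rewrite_file(self, path):
        # Toy stand-in for rewriting a data file during an update.
        return path + ".rewritten"


class AppendOnlyFormat:
    """Hypothetical format that implements no optional capability."""

    name = "append-only"


class ToyTable:
    """Toy table that records its file format once, at creation."""

    def __init__(self, file_format):
        self.file_format = file_format
        self.log = [{"fileFormat": file_format.name}]  # hypothetical log entry

    def _require(self, capability, feature):
        # Dispatch on what the format implements, otherwise fail loudly.
        if not isinstance(self.file_format, capability):
            raise UnsupportedFeatureError(
                f"{self.file_format.name} does not support {feature}"
            )

    def merge_schema(self, old, new):
        self._require(SupportsSchemaEvolution, "schema evolution")
        return self.file_format.evolve_schema(old, new)

    def update(self, path):
        self._require(SupportsUpdates, "updates")
        return self.file_format.rewrite_file(path)
```

With this shape, format-independent features such as the transaction log stay in the table layer, while each format-dependent feature is gated by a single capability check.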
@marmbrus Do you think we can take this path? |
Sorry for not following up on this. Closing this due to inactivity. Please reopen if this issue is still relevant and/or reproducible on the latest version of Delta. |
Can we reopen it? I think it will be really great to add ORC and Avro. |
Hey @iajoiner - glad to re-open it, the quick questions are:
The reason I'm asking is that the vast majority of the feedback we have received through past Delta Lake surveys, GitHub, Slack, and Google Groups is that Parquet is the primary file format (which is why we closed this issue). |
Apache Feather V2 (=Arrow IPC) would be nice: https://arrow.apache.org/docs/python/feather.html |
Hey @aersam - let’s start a discussion on this in the delta-rs channel in Delta Users Slack, as it’s already working with pyarrow and we have active discussions around arrow/arrow2/DataFusion/polars. |
Hi guys |
The way Delta is designed, it is not tied to a particular storage format. The storage format can be abstracted out, just like the storage. If the file format is made configurable, Delta's capabilities will be available with all file formats.
See the discussion here:
https://groups.google.com/forum/#!topic/delta-users/PI28n6kFfsE