
Storage format for Delta #87

Open
hbhanawat opened this issue Jul 9, 2019 · 11 comments
Labels
enhancement New feature or request

Comments

@hbhanawat

The way Delta is designed, it is not tied to a particular storage format. The storage format can be abstracted out just like the storage layer. If the file format is made configurable, Delta's capabilities will be available for all file formats.

See the discussion here:
https://groups.google.com/forum/#!topic/delta-users/PI28n6kFfsE

@hbhanawat
Author

I will generate a PR for this soon.

@mukulmurthy mukulmurthy added the enhancement New feature or request label Jul 9, 2019
@marmbrus
Contributor

marmbrus commented Jul 9, 2019

Thanks for opening this issue! This has been a common request.

Before you jump into coding, it would be good to briefly discuss how you plan to design this so that all of the features of Delta continue to work together in an easy to understand way. Some questions that come to mind:

  • What formats do you plan to support? (ORC behaves very similarly to Parquet, so it seems like a good choice; on the other end of the spectrum, text probably does not make a lot of sense)
  • How do you expose this choice to users? Both through the DataFrame writer and DDL (see Support for registering Delta tables in the HiveMetastore #85).
  • How do the newly supported formats work with our schema evolution?

@hbhanawat
Author

Sure. Here are my initial thoughts.

Delta has multiple features. There are features that Delta can provide without support from the file format (ACID transactions, time travel), and there are features for which Delta needs support from the underlying file format (schema evolution). Features like ACID transactions and time travel are needed for any data lake write, irrespective of the file format.

What I was thinking is that Delta could support various file formats just like DataSourceV2 does. A file format would implement interfaces like SupportsSchemaEvolution, SupportsUpdates, etc. Based on what the file format has implemented, Delta can either throw an exception or execute the specific action on the file format. To start with, Delta can natively support Parquet. As the need arises and the community grows, more and more file formats can be added.
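A minimal sketch of the capability-interface idea in Python, using `typing.Protocol` to mirror the DataSourceV2 style. All class and method names here (`SupportsSchemaEvolution`, `evolve_schema`, `ParquetFormat`, etc.) are hypothetical illustrations, not Delta APIs:

```python
from typing import Protocol, runtime_checkable

# Hypothetical capability interfaces a file format may opt into.
@runtime_checkable
class SupportsSchemaEvolution(Protocol):
    def evolve_schema(self, new_columns: dict) -> None: ...

@runtime_checkable
class SupportsUpdates(Protocol):
    def update_rows(self, predicate: str, assignments: dict) -> None: ...

class ParquetFormat:
    """Natively supported format: implements both capabilities."""
    def __init__(self):
        self.schema = {}
    def evolve_schema(self, new_columns):
        self.schema.update(new_columns)
    def update_rows(self, predicate, assignments):
        pass  # a real implementation would rewrite the affected files

class TextFormat:
    """A format that implements neither capability."""

def evolve(fmt, new_columns):
    # Delta would check the capability and fail fast otherwise.
    if not isinstance(fmt, SupportsSchemaEvolution):
        raise NotImplementedError(
            f"{type(fmt).__name__} does not support schema evolution")
    fmt.evolve_schema(new_columns)
```

With this shape, `evolve(ParquetFormat(), {"c": "int"})` succeeds, while `evolve(TextFormat(), ...)` raises `NotImplementedError` instead of silently corrupting the table.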

Delta can provide the file format as an option in the Spark DataFrameWriter API. Maybe the option is needed only while creating the table or writing data for the first time. The DeltaLog can record the file format.
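One way the log could pin the format at table creation, sketched with plain JSON files. The log file name and the `metaData.format.provider` key are only illustrative of the idea of recording the format once, not a spec of the actual Delta log:

```python
import json
import os
import tempfile

def create_table(log_dir, file_format="parquet"):
    """Write a first log entry that records the table's data file format."""
    entry = {"metaData": {"format": {"provider": file_format}}}
    path = os.path.join(log_dir, "00000000000000000000.json")
    with open(path, "w") as f:
        json.dump(entry, f)
    return path

def table_format(log_dir):
    """Later writers read the recorded format instead of taking an option."""
    path = os.path.join(log_dir, "00000000000000000000.json")
    with open(path) as f:
        return json.load(f)["metaData"]["format"]["provider"]

# Example: create a table whose data files would be ORC.
log_dir = tempfile.mkdtemp()
create_table(log_dir, "orc")
print(table_format(log_dir))  # orc
```

Because the format is recorded in the log at creation time, subsequent writers cannot accidentally mix formats by passing a different option.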

Let me know what you think.

@hbhanawat
Author

@marmbrus Do you think we can take this path?

@dennyglee
Contributor

Sorry for not following up on this. Closing this due to inactivity. Please reopen if this issue is still relevant and/or reproducible on the latest version of Delta.

@iajoiner

iajoiner commented Oct 31, 2022

Can we reopen it? I think it would be really great to add ORC and Avro.

@iajoiner

@dennyglee

@dennyglee
Contributor

Hey @iajoiner - glad to re-open it; two quick questions:

  • Are you or anyone else here interested in helping out with the design and a PR for this?
  • Could you provide some additional context or info around why ORC and Avro are important to add?

The reason I'm asking is that the vast majority of the feedback we have received from past Delta Lake surveys, GitHub, Slack, and Google Groups was that Parquet is the primary file format (which is why we closed this issue).

@aersam

aersam commented Dec 2, 2022

Apache Feather V2 (= Arrow IPC) would be nice: https://arrow.apache.org/docs/python/feather.html
I think memory-mapping is a good use case, as it allows reading files with far fewer resources (https://arrow.apache.org/docs/python/ipc.html)
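The resource argument for memory-mapping can be illustrated with Python's stdlib `mmap` (this stands in for Arrow IPC's zero-copy reads; it is not pyarrow code): the OS pages in only the bytes actually touched, instead of loading the whole file into memory.

```python
import mmap
import os
import tempfile

# Write a 1 MB file, then read a 10-byte slice via memory-mapping.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"x" * 1_000_000)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Only the pages backing this slice are faulted in.
        slice_ = mm[500_000:500_010]

print(slice_)  # b'xxxxxxxxxx'
```

Arrow IPC files are laid out so that record batches can be consumed directly from such a mapping without deserialization, which is where the resource savings come from.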

@dennyglee
Contributor

Hey @aersam - let’s start a discussion on this in the delta-rs channel in the Delta Users Slack, since delta-rs already works with pyarrow and we have active discussions there around arrow/arrow2/DataFusion/Polars.

tdas pushed a commit to tdas/delta that referenced this issue May 31, 2023
Co-authored-by: gbrueckl <sourcetree@gbrueckl.at>
@aersam

aersam commented Jul 14, 2023

Hi guys
I still think IPC would be a good way to support memory-mapping. Microsoft uses a (so far) proprietary VORDER compression to make Power BI load Delta tables fast. I think the better solution would be memory-mapping using the IPC format. Or maybe we can get Microsoft to open-source their VORDER compression; you can help me vote for that at their Ideas site ;)
