Roadmap 2022 H2 (discussion) #1307

dennyglee · 2022-08-02T04:56:45Z

This is a working issue for folks to provide feedback on the Delta Lake priorities spanning July to December 2022. With the release of Delta Lake 2.0, we wanted to take the opportunity to discuss other vital features for prioritization with the community, based on feedback from the Delta Users Slack, Google Groups, Community AMAs (on Delta Lake YouTube), the Roadmap 2022H2 (discussion), and more.

Note, tasks that are crossed out (i.e., ~~00~~) have been completed.

To review the Delta Rust roadmap only, please refer to https://go.delta.io/rust-roadmap for more information.

Priority 0

We will focus on these issues and continue to deliver parts (or all) of each issue over the next six months.

Issue	Category	Task	Description	Status
~~256~~	Flink	Flink Source	Build Flink source to read Delta tables in batch and streaming jobs	In Progress
238	Flink	Flink SQL + Table API + Catalog Support	After Flink Sink and Source, build support for Flink Catalog, SQL, and Table API	In Progress
411, 410	Flink	Productionize support for all cloud object stores	Make sure that Flink Sink can write robustly to S3, GCS, ADLS2 with full transactional guarantees	In Progress
~~610~~	Rust	Integrate with a common object-store abstraction from arrow / Rust ecosystem	This will allow us to provide a more convenient and performant API on the Rust and python side	In Progress
~~575~~	Rust	Support V2 writer protocol	Utilize the PyArrow-based writer function (write_deltalake) to support writer protocol V2 and object stores S3, GCS, and ADLS2.	In Progress
~~761~~	Rust	Expand write support for cloud object stores	Write to object stores S3, GCS, and ADLS2 from multiple clusters with full transactional guarantees	In Progress
	Rust	DAT Integration	Delta Acceptance Tests running in CI	In Progress
	Rust	Rust documentation	First pass at Rust docs	In Progress
	Rust	Rust blogging	Blog post for the Rust community	In Progress
632	Rust	Commit protocol	Fully protocol compliant optimistic commit protocol	In Progress
851	Rust	Rust writer	Refactor Rust writer API to be flexible for others wishing to build upon delta-rs	In Progress
~~1257~~	Spark	Release Delta 2.1 on Apache Spark 3.3	Ensure the latest version of Delta Lake works with the latest version of Apache Spark™	Released in Delta 2.1
~~1485~~	Spark	Support reading tables with Deletion Vectors	Allow reads on tables that have deletion vectors to mark rows in parquet files as removed.	Released in Delta 2.3
~~1408~~	Spark	Support Table Features protocol	Upgrade the protocol to use Table Features to specify the features needed to read/write to a table.	Released in Delta 2.3
~~1242~~	Spark	Support time travel SQL syntax	Delta currently supports time travel via Python and Scala APIs. We would like to extend support for the SQL syntax `VERSION AS OF` and `TIMESTAMP AS OF` in SELECT statements.	Released in Delta 2.1
	Standalone	Extend Delta Standalone for higher protocol versions	Extend Delta Standalone to support logs using higher protocol versions and advanced features like constraints, generated columns, column mapping, etc.	In Progress
	Standalone	Expand support for data skipping in Delta Standalone	Extend the current data skipping to skip files using column stats and more expressions	In Progress
	Website	Updated Delta Lake documentation	Move Delta Lake documentation to the website GitHub repo to allow easier community collaboration	In Progress
	Website	Consolidate all connector documentation	Consolidate docs of all connectors in the website GitHub repo	In Progress
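
As a rough illustration of what the "Commit protocol" row (issue 632) refers to: in the Delta protocol, a commit is a JSON file in `_delta_log` named after a zero-padded 20-digit version, and a writer may only create it if no file with that name exists yet. The sketch below is a minimal local-filesystem stand-in, not the delta-rs implementation. The helper names (`latest_version`, `try_commit`, `commit_with_retry`) are hypothetical, the `add` actions are simplified, and a real writer would also check the winning commit for logical conflicts before retrying.

```python
import json
import os
import tempfile


def latest_version(log_dir):
    """Return the highest committed version in log_dir, or -1 if there is none."""
    versions = [
        int(name.split(".")[0])
        for name in os.listdir(log_dir)
        if name.endswith(".json") and name.split(".")[0].isdigit()
    ]
    return max(versions, default=-1)


def try_commit(log_dir, version, actions):
    """Create the commit file for `version`; return False if it already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" fails if the file exists: a local stand-in for put-if-absent.
        with open(path, "x", encoding="utf-8") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
        return True
    except FileExistsError:
        return False


def commit_with_retry(log_dir, actions, max_attempts=5):
    """Try to commit at the next free version, retrying if another writer wins."""
    for _ in range(max_attempts):
        version = latest_version(log_dir) + 1
        if try_commit(log_dir, version, actions):
            return version
        # Assumption: a real writer re-reads the winning commit here and
        # aborts if it conflicts with its own changes. This sketch just retries.
    raise RuntimeError("could not commit after retries")


# Hypothetical, simplified add actions (real ones carry more fields).
log_dir = tempfile.mkdtemp()
first = commit_with_retry(log_dir, [{"add": {"path": "part-0000.parquet"}}])
second = commit_with_retry(log_dir, [{"add": {"path": "part-0001.parquet"}}])
# A writer that still believes version `first` is free loses the race.
lost_race = try_commit(log_dir, first, [{"add": {"path": "part-0002.parquet"}}])
```

Cloud object stores differ in whether they offer an atomic put-if-absent, which is part of why the multi-cluster write rows above are tracked separately.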

Priority 1

We should be able to deliver parts (or all) of each issue over the next six months.

Issue	Category	Task	Description	Status
4	Core	Delta Acceptance Testing (DAT)	With various languages interacting with the Delta protocol (e.g., Delta Standalone, Delta Spark, Delta Rust, Trino, etc.), we propose to have the same reference tables and library of reference tests to ensure all Delta APIs remain in compliance.	In Progress
1347	Core	Support Bloom filters	Improve query performance by utilizing bloom filters. The approach is TBD due to recent updates to Apache Parquet to support bloom filters.	Not Started
1387	Core	Enable Delta clone	Clones a source Delta table to a target destination at a specific version. A clone can be either deep or shallow: deep clones copy over the data from the source and shallow clones do not.	Shallow clone is released in Delta 2.3
	Delta connectors	GoLang Delta connector	Support GoLang reading a Delta Lake table natively	Not Started
	Delta connectors	Improve partition filtering in Power BI client	Improved partition filtering using built-in UI filters in Power BI	Not Started
	Delta connectors	Pulsar Source connector	Support Apache Pulsar reading a Delta Lake table natively	Not Started
	Flink	Column stats generation in Flink Sink	Make the Flink Delta sink generate column stats	Not Started
	Presto/Trino	Support higher protocol versions in Presto and Trino	Use Standalone to support higher protocol versions	Not Started
	Rust	Delta Rust API Updates	Update APIs and support more high-level operations on top of delta; this includes better conflict resolution	NA
	Rust	Better support for large logs	Better support for handling large Delta logs/snapshots	NA
	Sharing Connectors	GoLang Delta Sharing client	Support GoLang client for Delta Sharing	NA
	Sharing Connectors	R Delta Sharing client	Support R client for Delta Sharing	NA
1072	Spark	Support for Identity columns	Create an identity column that will be automatically assigned a unique and statistically increasing (or decreasing if the step is negative) value.	Not Started
	Spark	Support querying Change Data Feed (CDF) using SQL queries	To support querying CDF using SQL queries in Apache Spark, we need to allow custom TVFs to be resolved using injected rules.	Released in Delta 2.3
1156	Spark	Support Auto Compaction	Provide auto compaction functionality to simplify compaction tasks	In Progress
1198	Spark	Support Optimize Writes	Optimize Spark to Delta Lake writes	In Progress
~~1462~~	Spark	Enable converting from Iceberg to Delta	Enable converting parquet-backed Iceberg tables to Delta tables without rewriting parquet files.	Released in Delta 2.3
~~1464~~	Spark	Shallow clone Iceberg tables	Enable shallow cloning parquet-backed Iceberg tables following the Delta protocols without the need to copy all of the data.	Released in Delta 2.3
~~1349~~	Spark	Improve semantics of column mapping and Change Data Feed	Improve semantics of how column renames/drops (aka column mapping) interact with CDF and streaming	Released in Delta 2.3

Priority 2

Nice to have

Category	Task	Description	Status
Sharing	Share individual partitions	Support Sharing individual partitions in Delta Sharing	NA
Sharing Connectors	Rust Delta Sharing client	Support Rust client for Delta Sharing	NA
Sharing Connectors	Starburst/Trino Delta Sharing connector	Support Starburst/Trino client for Delta Sharing	NA
Sharing Connectors	Airflow Delta Sharing connector	Support sharing data from Airflow sensor	NA
Rust	Process	Release improvements	NA

History

2022-08-01: Initial creation
2022-08-02: Delta Sharing updates
2022-08-08: Include Identity columns in the roadmap
2022-09-13: Updated issues and added auto compaction, optimize writes, and bloom filters to the roadmap
2022-09-19: Updated to include Delta Clone
2022-09-22: Included working Delta Rust roadmap document
2022-10-26: Included updated Delta Rust roadmap in GitHub link
2022-10-27: Included converting and shallow cloning Iceberg to Delta

dennyglee · 2022-08-02T04:57:21Z

Note, we will be adding/updating the issue over the next few weeks but I'm a little behind schedule so thought I would get the roadmap discussion started ASAP. Thanks!

edfreeman · 2022-08-02T09:17:17Z

Hi folks. Can't see it explicitly mentioned so thought I'd ask - will identity columns support (i.e. writer version 6) be added in this H2 wave? That's a big feature we're keen to be able to use outside of Databricks, and it didn't quite make it into 2.0 by the looks of things.

dennyglee · 2022-08-08T22:13:09Z

> Hi folks. Can't see it explicitly mentioned so thought I'd ask - will identity columns support (i.e. writer version 6) be added in this H2 wave? That's a big feature we're keen to be able to use outside of Databricks, and it didn't quite make it into 2.0 by the looks of things.

Thanks for the call out @edfreeman - identity columns have been added :)

sezruby · 2022-08-18T01:40:51Z

Hi @dennyglee, what about Auto compaction and Optimize Write? I don't think the PRs are getting much attention for review / merge. Could you add them to the roadmap?

dennyglee · 2022-08-18T02:53:32Z

Oh good point! Let me get back to you on this shortly! Sorry about that!


keen85 · 2022-08-19T12:47:53Z

Hi @dennyglee, what about support for displaying the DDL of Delta tables (SHOW CREATE TABLE)?
#1032
#1255

dennyglee · 2022-08-19T15:29:35Z

> Hi @dennyglee, what about support for displaying DDL of delta tables (SHOW CREATE TABLE) #1032 #1255

Good call out @keen85 - let me check with @zpappa on this!

zpappa · 2022-08-19T17:33:51Z

> > Hi @dennyglee, what about support for displaying DDL of delta tables (SHOW CREATE TABLE) #1032 #1255
>
> Good call out @keen85 - let me check with @zpappa on this!

I have some minor style and test updates to make before this PR can be considered done. I can finish them today and we can try to pull them into 2.1.

dudzicp · 2022-08-25T20:46:40Z

How about the Delta caching that is present on Databricks? Is there a plan for such a feature?

tdas · 2022-08-25T21:27:19Z

"Delta caching" is actually a Databricks Runtime engine feature, not part of the format. Caching data on a processing engine's executor/worker nodes is something that can really be done well by the engine itself, not by a data format. It's unfortunate and confusing that we had marketed it under the "Delta" brand name, even though it's really not part of the "Delta Lake" storage format. So, in short, it's not really possible to open source that as part of Delta Lake.

djouallah · 2022-08-26T23:19:23Z

I have been experimenting with Delta Lake on Google Cloud and DuckDB, and it is very promising, but without a local SSD cache it will never be fast enough. Maybe we need a cache for Delta that is independent of the Databricks implementation. Delta knows which files need to be scanned, so keeping a local copy on the first call would be really useful, at least for the standalone reader.
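
To make the idea concrete, here is a minimal read-through cache sketch, assuming data files never change once written (which holds for Delta data files, since updates add new files instead of modifying old ones). `LocalFileCache` and `fake_fetch` are hypothetical names for illustration only. A real reader would plug in an object-store client and add eviction.

```python
import hashlib
import os
import tempfile


class LocalFileCache:
    """Read-through cache: fetch a remote file once, then serve the local copy."""

    def __init__(self, fetch, cache_dir):
        self.fetch = fetch          # hypothetical callable: remote path -> bytes
        self.cache_dir = cache_dir  # ideally a directory on local SSD

    def _local_path(self, remote_path):
        # Hash the remote path so any URI maps to a safe local file name.
        digest = hashlib.sha256(remote_path.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def read(self, remote_path):
        local = self._local_path(remote_path)
        if not os.path.exists(local):
            data = self.fetch(remote_path)
            tmp = local + ".tmp"
            with open(tmp, "wb") as f:
                f.write(data)
            os.replace(tmp, local)  # publish the cached copy atomically
        with open(local, "rb") as f:
            return f.read()


calls = []


def fake_fetch(path):
    """Stand-in for an object-store download; records each remote call."""
    calls.append(path)
    return b"bytes of " + path.encode("utf-8")


cache = LocalFileCache(fake_fetch, tempfile.mkdtemp())
first = cache.read("gs://bucket/table/part-0.parquet")   # goes to "remote"
second = cache.read("gs://bucket/table/part-0.parquet")  # served locally
```

Because the cache key is only the file path, this sketch relies entirely on file immutability. It has no invalidation logic.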

khwj · 2022-08-28T05:05:42Z

Hi @dennyglee, what about supporting analyze table #581 and Bloom filter indexes #1347?

MaksGS09 · 2022-08-31T10:35:27Z

Hi!
How about BigQuery integration?

sezruby · 2022-09-01T03:08:03Z

Hi @dennyglee any update?

dennyglee · 2022-09-07T22:36:01Z

Sorry about that @sezruby - yes, we will be adding these to the roadmap very shortly. Thanks for your patience (I’ve been out the last two weeks)

dennyglee · 2022-09-14T04:39:30Z

Some quick updates:

@sezruby Thanks for your patience - auto compaction and optimize writes have been added to the roadmap.
@khwj included bloom filters and added some comments directly to your issue
@MaksGS09 We're still determining the resources required for performant BQ integration
We are currently reviewing "#1032 Changes to support SHOW CREATE TABLE" (#1255) and "Optimize common case: SELECT COUNT(*) FROM Table Fix #1192" (#1377)

HTH!

p2bauer · 2022-09-18T23:12:57Z

Hi @dennyglee! I know it was on the 2022 H1 GitHub page, but I haven't seen any mention of clone functionality being moved into the open source library. Is there any update around that? I poked around the current source code but didn't really see it anywhere.

dennyglee · 2022-09-20T05:21:55Z

Thanks @p2bauer - great call out. I've added this to the roadmap and created issue #1387 to track this. HTH!

SanthoshPrasad · 2022-10-04T13:09:40Z

Hi @dennyglee , Is there any update on supporting higher protocol versions in Presto and Trino?

dennyglee · 2022-10-06T06:23:27Z

Great question @SanthoshPrasad - we've been working with the PrestoDB and Trino communities on this and we should have some updates on progress over the next couple of months. One of the ways we're doing this is through our DAT effort (Delta Acceptance Testing) so we can more cleanly document and clarify which APIs are on which protocol version. If you're interested in learning more on this, please join us in the #dat channel in Delta Users Slack. HTH!

dennyglee · 2022-10-26T06:21:43Z

Suggest we add "Airbyte Destination S3: add delta lake/delta table support" to the roadmap, as it's already part of the Delta Rust roadmap - WDYT?

melin · 2022-10-27T05:11:49Z

Support JDBC catalog
#1459

keen85 · 2022-10-30T11:24:26Z

I'd like to suggest adding "Register VACUUM in delta log" to the roadmap

dudzicp · 2022-11-09T14:02:38Z

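I know that on each commit, min/max values are calculated for each Parquet file and stored in the Delta log JSON, but how about adding more granularity to the existing data skipping mechanism by using Parquet page skipping?
Relevant links:

https://issues.apache.org/jira/browse/PARQUET-922

https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/

Would this be doable?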
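
For readers less familiar with the existing file-level mechanism, a minimal sketch follows. It assumes each `add` action in the log carries a JSON-encoded `stats` string with `minValues` and `maxValues` per column. The helper names (`may_contain`, `files_to_scan`) and the sample actions are hypothetical, and this is not how any particular Delta reader implements it.

```python
import json


def may_contain(stats, column, lower, upper):
    """Return True unless the file's min/max range rules out [lower, upper]."""
    col_min = stats.get("minValues", {}).get(column)
    col_max = stats.get("maxValues", {}).get(column)
    if col_min is None or col_max is None:
        return True  # no stats for this column: the file must be scanned
    return col_max >= lower and col_min <= upper


def files_to_scan(add_actions, column, lower, upper):
    """Keep only files whose stats overlap the filter `lower <= column <= upper`."""
    selected = []
    for action in add_actions:
        raw = action.get("stats")
        stats = json.loads(raw) if raw else {}
        if may_contain(stats, column, lower, upper):
            selected.append(action["path"])
    return selected


# Hypothetical, simplified add actions.
add_actions = [
    {"path": "part-0.parquet",
     "stats": json.dumps({"numRecords": 3, "minValues": {"id": 1}, "maxValues": {"id": 10}})},
    {"path": "part-1.parquet",
     "stats": json.dumps({"numRecords": 3, "minValues": {"id": 11}, "maxValues": {"id": 20}})},
    {"path": "part-2.parquet"},  # written without stats
]

selected = files_to_scan(add_actions, "id", 12, 15)
```

Page-level skipping, as proposed here, would apply the same kind of range test again inside each remaining file, using the per-page min/max values from the Parquet page index.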

dennyglee · 2022-11-15T05:37:18Z

> I know that each commit, min/max values are calculated for each parquet file and are present in the delta log json, but how about adding more granularity to existing data skipping mechanism, by using parquet page skipping? Relevant links:
>
> https://issues.apache.org/jira/browse/PARQUET-922
>
> https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
>
> Would this be doable?

@dudzicp Oh, could you please create a separate issue for this and we can discuss the specifics there? Thanks!

dudzicp · 2023-02-10T15:04:40Z

How about bucketing?

benbauer89 · 2023-02-21T13:53:07Z

Hey, I am interested in more details regarding https://delta.io/sharing/. It's stated that Presto and Trino are coming soon, but I could not really find any details or timelines. Please note that I am asking about Delta Sharing in the context of Unity Catalog in particular, and not necessarily about Delta and Trino/Presto integration.

dennyglee · 2023-03-11T18:30:11Z

Ahh, for Delta Sharing features within the context of UC, please ping the Databricks community. Thanks!

robertkossendey · 2023-03-12T19:39:49Z

Hey @dennyglee any updates on the Roadmap? :) I was creating some issues in the mack project (e.g. Python support for table property update) but wanted to make sure that the delta-spark team is not working on the things that I came up with already.

dennyglee · 2023-03-12T22:50:33Z

Thanks for your patience @robertkossendey - we're working on this but admittedly way behind schedule due to all of the various asks, eh?! Saying this, please continue working on mack project activities as those are the ones we're pretty sure make more sense for mack to address or at least if we plan to merge this into delta-spark, it'll be further out on the roadmap. Thanks for the ping, eh?!

robertkossendey · 2023-03-13T08:50:02Z

Good to know, thank you for the update @dennyglee!

felipepessoto · 2023-05-25T22:45:39Z

@dennyglee @allisonport-db do you have any updates on auto compact and optimize write?

dennyglee added the enhancement New feature or request label Aug 2, 2022

dennyglee pinned this issue Aug 2, 2022

dennyglee mentioned this issue Aug 2, 2022

Roadmap 2022 H1 (discussion) #920

Closed

keen85 mentioned this issue Aug 12, 2022

Add support for GENERATED ALWAYS AS IDENTITY in DeltaTableBuilder #1072

Closed

nkarpov mentioned this issue Aug 18, 2022

Are there plans to support merge on read mode #276

Open

chris-aeviator mentioned this issue Sep 2, 2022

Roadmap goodwillpunning/nodejs-sharing-client#9

Open

scottsand-db mentioned this issue Sep 8, 2022

[Feature Request] add WHEN NOT MATCHED BY SOURCE/TARGET clause suppoort #1364

Closed

3 tasks

zsxwing mentioned this issue Sep 14, 2022

[Feature Request] Enable LOAD DATA for delta tables. #1354

Open

3 tasks

keen85 mentioned this issue Oct 9, 2022

[BUG] metric numDeletedRows missing in Delta log when DELETING complete partition #1423

Closed

3 tasks

JonasJ-ap mentioned this issue Jan 6, 2023

Delta: Support Snapshot Delta Lake Table to Iceberg Table apache/iceberg#6449

Merged

robertkossendey mentioned this issue Mar 13, 2023

Provide an API to add constraints to a Delta Table via Python MrPowers/mack#106

Open

armckinney mentioned this issue May 30, 2023

[Feature Request] Enable Clone of Delta Lake tables #1387

Open

3 tasks

tdas unpinned this issue Jun 6, 2023

rasidhan mentioned this issue Oct 10, 2023

[Spark] Implement optimized write. #2145

Closed

5 tasks
