Roadmap 2022 H1 (discussion) #920
Comments
Coincidentally good timing, together with the Onehouse announcement yesterday!!!
Auto optimize would be a great addition too. From the Databricks docs it looks like there are two types: optimized writes and auto compaction.
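For reference, a minimal sketch of how those two modes are enabled via Databricks table properties (the property names are from the Databricks docs; the table name and a pre-existing Delta-enabled SparkSession `spark` are assumptions):

```python
# Hedged sketch: enable the two Databricks auto optimize modes on a table.
# Assumes a Delta-enabled SparkSession `spark`; `prod.events` is hypothetical.
spark.sql("""
    ALTER TABLE prod.events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',  -- optimized writes
        'delta.autoOptimize.autoCompact'   = 'true'   -- auto compaction
    )
""")
```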
Created an item for the BigQuery connector: https://github.com/delta-io/connectors/issues/282
It would be great if CDF were open sourced in the latest release. I'm really interested in this feature!
@novemberdude agree, add CDF please!
Thanks for your feedback @novemberdude and @Shadlezzz - we'll definitely take this into consideration!
Would love to see a built-in solution for implementing a retention policy / archiving Delta data on append-only tables - this would be a huge help for my team!
Based on your feedback from here and Slack, we've added CDF, Cloning, and Hive/Delta writer to the roadmap. HTH!
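For readers new to Change Data Feed, a hedged sketch of what reading it could look like, using the reader options Delta Lake eventually shipped (the table path and the `event_id` column are illustrative, and a Delta-enabled SparkSession `spark` is assumed):

```python
# Hedged sketch: read the Change Data Feed from a table's first version.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)   # or .option("startingTimestamp", ...)
    .load("/delta/events")          # hypothetical table path
)
# _change_type and _commit_version are CDF-provided metadata columns.
changes.select("event_id", "_change_type", "_commit_version").show()
```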
Hi @sliu4 - there are a number of possible solutions to what you're facing. Could you provide more context here, and/or don't hesitate to chime in on the Delta Users Slack. This may be worthy of us developing a PIP (project improvement proposal) so we can get more feedback on the design.
hi @dennyglee - we have several very large prod-level Delta tables that we would like to gradually archive/glacierize, and we also have some Delta data that we just want to delete outright after a certain period of time. I know from speaking to our Databricks reps that there are solutions we can implement ourselves. For the first case, we can set up a view on the prod-level data with a filter based on our glaciering schedule and advise users against querying the S3 path directly. For the second case, we can set up an automated job to do a DELETE and VACUUM. Implementing these solutions is possible but currently has to be done on a case-by-case basis. Ideally we'd like a built-in feature that can address this in a systematic way. We had hoped to use lifecycle policies and apply them across multiple buckets, but we know this doesn't play nicely with Delta and the transaction log.
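As a rough illustration of the second case described above (the scheduled DELETE + VACUUM job), a minimal PySpark sketch; the table name, `event_date` column, and 90-day window are assumptions, not details from this thread:

```python
from pyspark.sql import SparkSession

# Minimal sketch of a scheduled retention job for a Delta table.
spark = (
    SparkSession.builder.appName("delta-retention-job")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Logically delete expired rows (a transactional Delta operation).
spark.sql("""
    DELETE FROM prod.events
    WHERE event_date < date_sub(current_date(), 90)
""")

# Physically remove files no longer referenced by the transaction log;
# 168 hours (7 days) is Delta's default retention safety threshold.
spark.sql("VACUUM prod.events RETAIN 168 HOURS")
```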
Hi @sliu4 - this is super interesting, and while I do think the "devil is in the details", the concept of a lifecycle policy may in fact work well within the context of Delta's transaction log, since we would be able to use the transaction log to determine what to DELETE/VACUUM, etc. It implies that the lifecycle policy itself would categorize different tables at different policy granularity levels (e.g. HBI, MBI, LBI, or GDPR compliance), read the Delta transaction log, and then initiate the process within that context to ensure transactional consistency when running the lifecycle policy. Adding to this, you could probably utilize user metadata as a way to track a single lifecycle policy across multiple tables and/or create a policy table that includes the table/version numbers for the associated application of the policy. It would be worth diving into more - if you're up for it, ping me on Slack and we can find a time to dive in further, eh?! HTH! Denny
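To make the user-metadata idea concrete, a hedged sketch using Delta's commitInfo.userMetadata setting to stamp each commit with the policy that produced it (the policy name and table are hypothetical, and a Delta-enabled SparkSession `spark` is assumed):

```python
# Tag subsequent Delta commits with the lifecycle policy name.
spark.conf.set(
    "spark.databricks.delta.commitInfo.userMetadata",
    '{"lifecyclePolicy": "gdpr-90d"}'   # hypothetical policy identifier
)
spark.sql("DELETE FROM prod.events WHERE event_date < date_sub(current_date(), 90)")

# The stamp surfaces in the table history, so a policy table can join on
# (table, version) to audit which policy drove each commit.
(spark.sql("DESCRIBE HISTORY prod.events")
    .select("version", "timestamp", "operation", "userMetadata")
    .show(truncate=False))
```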
Following.
@dennyglee I see that you have …
Hey @hoffrocket - let me get back to you on the timing of this - thanks!
This feature is very important for data disaster recovery (DR).
Not to pile on, but also very keen on the …
We're trying our best @p2bauer - we're going to send out the proposed priorities in the next week or so for all of us to review and help prioritize. Thanks!
Closing this issue as we're working on #1307 |
This is the proposed Delta Lake 2022 H1 roadmap discussion thread. Below are the initially proposed items for the roadmap, to be completed by June 2022. We will also be sending out a survey (we will update this issue with the link) to get more feedback from the Delta Lake community!
Performance Optimizations
Based on the overwhelming feedback from the Delta Users Slack, Google Groups, Community AMAs (on Delta Lake YouTube), Delta Lake 2021H2 survey, and 2021H2 roadmap, we propose the following Delta Lake performance enhancements in the next two quarters.
Schema Operations
For this year, our focus will be on column mapping.
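As a hedged illustration of what column mapping enables, using the table properties and rename syntax Delta Lake eventually shipped (table and column names are made up, and a Delta-enabled SparkSession `spark` is assumed):

```python
# Hedged sketch: enable column mapping by name (requires a protocol upgrade).
spark.sql("""
    ALTER TABLE prod.events SET TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.minReaderVersion'   = '2',
        'delta.minWriterVersion'   = '5'
    )
""")
# With name-based mapping, columns can be renamed without rewriting data files.
spark.sql("ALTER TABLE prod.events RENAME COLUMN event_ts TO event_timestamp")
```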
Delta currently provides the replaceWhere option, but in various scenarios it is more convenient to specify which partitions to overwrite directly.
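For context, a minimal sketch of today's replaceWhere option, which atomically replaces only the rows matching a predicate (the path, columns, and `updates_df` DataFrame are illustrative):

```python
# Hedged sketch: overwrite only the January 2022 slice of a Delta table.
(updates_df.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere",
            "event_date >= '2022-01-01' AND event_date < '2022-02-01'")
    .save("/delta/events"))   # hypothetical table path
```

The proposed partition-overwrite support would let a write replace exactly the partitions present in the incoming data, without hand-crafting such predicates.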
Integrations
Extending from the recent releases of PrestoDB, Hive 3, and the Delta Sink for the Apache Flink DataStream API, we have additional integrations planned.
Operations Enhancements
Two very popular requests are planned for this half: Table Restore and S3 multi-cluster writes.
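As a hedged sketch of what Table Restore could look like, following the SQL form Delta Lake eventually shipped (the table name, version, and timestamp are illustrative, and a Delta-enabled SparkSession `spark` is assumed):

```python
# Hedged sketch: roll a table back to an earlier state by version number...
spark.sql("RESTORE TABLE prod.events TO VERSION AS OF 42")
# ...or by timestamp:
# spark.sql("RESTORE TABLE prod.events TO TIMESTAMP AS OF '2022-01-01'")
```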
Updates
If there are other issues that should be considered within this roadmap, let's have a discussion here or via the Delta Users Slack #deltalake-oss channel.