Support for Optimizing (file compaction) Delta Lake tables #927

Closed
vkorukanti opened this issue Feb 7, 2022 · 3 comments

@vkorukanti
Collaborator

Overview

Table optimize is an operation that rearranges data and/or metadata to speed up queries and/or reduce metadata size. Ways to accomplish this include compacting small files into larger files, ordering data by column, and clustering data along Z-order curves.
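
To make the file compaction idea concrete, here is a minimal sketch of how small files could be grouped into larger rewrite units. It is not the planner used by this work: the `DataFile` type, the `plan_compaction` function, and both size thresholds are hypothetical names chosen for illustration, and the grouping uses a plain first-fit-decreasing bin-packing heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str        # hypothetical file name, for illustration only
    size_bytes: int


def plan_compaction(files, target_bytes, min_file_bytes):
    """Group files smaller than min_file_bytes into bins of at most target_bytes.

    Each returned bin is a list of files that would be rewritten as one file.
    """
    # Only small files are candidates; consider the largest of them first.
    small = sorted(
        (f for f in files if f.size_bytes < min_file_bytes),
        key=lambda f: f.size_bytes,
        reverse=True,
    )
    bins = []
    for f in small:
        # First fit: place the file into the first bin that still has room.
        for b in bins:
            if sum(x.size_bytes for x in b) + f.size_bytes <= target_bytes:
                b.append(f)
                break
        else:
            bins.append([f])
    # Rewriting a bin that holds a single file gains nothing, so drop those.
    return [b for b in bins if len(b) > 1]
```

Files at or above `min_file_bytes` are left untouched, so the sketch only ever proposes rewriting files that are actually small.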

This work adds “OPTIMIZE (file compaction)”, as outlined in the Delta OSS 2022 H1 roadmap.

Requirements

  • Optimize should respect the transactional properties of the Delta table. That means it can run in parallel with reads and writes without violating any ACID properties.
  • If a conflict occurs during an optimize run, optimize should retry once before failing.
  • Option to select a subset of partitions in a table to optimize.
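
The retry and partition-selection requirements above can be sketched as follows. This is only an illustration under stated assumptions: `CommitConflict`, `optimize_with_retry`, and `select_partitions` are hypothetical names, not Delta Lake APIs, and the partition keys are shown as plain strings.

```python
class CommitConflict(Exception):
    """Hypothetical stand-in for a conflict with a concurrent transaction."""


def optimize_with_retry(run_once, max_attempts=2):
    """Call run_once(); on CommitConflict, retry until max_attempts is reached.

    With the default of 2 attempts this is "retry once before failing".
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_once()
        except CommitConflict:
            if attempt == max_attempts:
                raise  # second conflict: give up and surface the error


def select_partitions(files_by_partition, predicate):
    """Keep only the partitions whose key satisfies the predicate."""
    return {p: fs for p, fs in files_by_partition.items() if predicate(p)}
```

For example, `select_partitions(table, lambda p: p >= "date=2022-02-01")` would restrict the run to recent partitions of a table keyed by hypothetical `date=...` strings.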

Design Sketch

Design details are here.

Future Work

  • Support partial progress capture: Instead of committing all the file compaction job changes at the end of the job, commit these changes periodically to DeltaLog so that even if the job fails at least some progress is captured.
  • Support for Z-Order: Data clustering via multi-column locality-preserving space-filling curves with offline sorting.
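
As a rough illustration of the Z-order idea mentioned above, the sketch below interleaves the bits of two non-negative integer column values into a single sort key (a Morton code). It assumes the column values are already mapped to small integers; `z_value` is a hypothetical helper, not part of this work.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x land on even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y land on odd positions
    return z


# Sorting rows by z_value keeps rows that are close in both columns
# close together in the output order.
points = [(1, 1), (0, 1), (1, 0), (0, 0)]
ordered = sorted(points, key=lambda p: z_value(*p))
```

Sorting by this key, rather than by one column and then the other, is what lets data skipping work reasonably well on either column.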
@akshay26031996

Please share when the solution for this issue is getting released.

@akshatnair

@tdas is this merged or just closed?

@vkorukanti
Collaborator Author

@tdas is this merged or just closed?

This is merged as e366cc. This will be part of the Delta Lake 1.2 release.
