Implement optimize command #98
Databricks Delta Lake provides an optimize command that is incredibly useful for compacting small files (especially files written by streaming jobs, which are inevitably small). delta-rs should provide a similar optimize command.

I think a first pass could ignore the bin-packing and Z-ordering features provided by Databricks and simply combine small files into larger files while also setting `dataChange: false` in the delta log entry. An MVP might look like: compact files below a size threshold into larger files, then commit the rewrite with `dataChange: false` on the resulting log actions (see the sketch below). Files no longer relevant to the log may be cleaned up later by vacuum (see #97).
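To make the `dataChange: false` point concrete, here is a minimal sketch, not delta-rs code: the file names, sizes, and two-files-into-one scenario are made up for illustration, but the field names follow the Delta transaction protocol's add/remove actions. It shows the shape of the commit such an MVP might append to the log.

```python
import json
import time

now_ms = int(time.time() * 1000)

# dataChange is false on every action: the rewrite changes the table's
# physical layout, not its logical contents, so streaming consumers can
# skip this commit instead of reprocessing the data.
actions = [
    {"add": {
        "path": "part-00002-compacted.snappy.parquet",  # hypothetical output file
        "size": 268435456,
        "partitionValues": {},
        "modificationTime": now_ms,
        "dataChange": False,
    }},
    {"remove": {
        "path": "part-00000-small.snappy.parquet",  # compacted input 1
        "deletionTimestamp": now_ms,
        "dataChange": False,
    }},
    {"remove": {
        "path": "part-00001-small.snappy.parquet",  # compacted input 2
        "deletionTimestamp": now_ms,
        "dataChange": False,
    }},
]

# A Delta commit file is newline-delimited JSON, one action per line.
print("\n".join(json.dumps(a) for a in actions))
```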
Comments

While vacuum is a big step forward, it does not obviate the need for optimize. These two commands serve different purposes, and (looking further out) optimize will eventually become part of either a batch or background worker, continuously operating at lower priority, as is common to all relational engines.
@xianwill - The steps you've outlined sound good to me, and I think small file compaction via …
@wjones127 - oh, wow, looks like amazing progress is being made, so exciting!
@wjones127 - looks like #607 was merged!! Will it be relatively easy to expose this functionality via the Python bindings?
Resolved by #607.
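For anyone landing here later: a rough sketch of what invoking compaction through the Python bindings looks like. The exact surface has shifted between deltalake releases, so treat the `optimize.compact` method and its `target_size` parameter as illustrative of recent versions rather than a stable contract.

```python
from deltalake import DeltaTable

# Open an existing Delta table by its storage path.
dt = DeltaTable("path/to/table")

# Rewrite small files into files of roughly target_size bytes (256 MiB here).
# Recent deltalake releases expose this as DeltaTable.optimize.compact();
# it returns a metrics dict (files added/removed, partitions optimized, ...).
metrics = dt.optimize.compact(target_size=256 * 1024 * 1024)
print(metrics)
```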