Implement optimize command #98
Databricks Delta Lake provides an optimize command that is incredibly useful for compacting small files (especially files written by streaming jobs, which are inevitably small). delta-rs should provide a similar optimize command.

I think a first pass could ignore the bin-packing and Z-ordering features provided by Databricks and simply combine small files into larger files while also setting `dataChange: false` in the delta log entry. An MVP might look like: compact files below a size threshold into larger files, then commit the rewrite with `dataChange: false` on the resulting log actions (see the sketch below). Files no longer relevant to the log may be cleaned up later by vacuum (see #97).
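To make the `dataChange: false` point concrete, here is a minimal sketch, not delta-rs code: the file names, sizes, and two-files-into-one scenario are made up for illustration, but the field names follow the Delta transaction protocol's add/remove actions. It shows the shape of the commit such an MVP might append to the log.

```python
import json
import time

now_ms = int(time.time() * 1000)

# dataChange is false on every action: the rewrite changes the table's
# physical layout, not its logical contents, so streaming consumers can
# skip this commit instead of reprocessing the data.
actions = [
    {"add": {
        "path": "part-00002-compacted.snappy.parquet",  # hypothetical output file
        "size": 268435456,
        "partitionValues": {},
        "modificationTime": now_ms,
        "dataChange": False,
    }},
    {"remove": {
        "path": "part-00000-small.snappy.parquet",  # compacted input 1
        "deletionTimestamp": now_ms,
        "dataChange": False,
    }},
    {"remove": {
        "path": "part-00001-small.snappy.parquet",  # compacted input 2
        "deletionTimestamp": now_ms,
        "dataChange": False,
    }},
]

# A Delta commit file is newline-delimited JSON, one action per line.
print("\n".join(json.dumps(a) for a in actions))
```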
Comments

While vacuum is a big step forward, it does not obviate the need for optimize. These two commands serve different purposes, and (looking further out) optimize will eventually become part of either a batch or background worker, continuously operating at lower priority, as is common to all relational engines.
@xianwill - The steps you've outlined sound good to me, and I think small file compaction via …
@wjones127 - oh, wow, looks like amazing progress is being made, so exciting!
@wjones127 - looks like #607 was merged!! Will it be relatively easy to expose this functionality via the Python bindings?
Resolved by #607.
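For anyone landing here later: a rough sketch of what invoking compaction through the Python bindings looks like. The exact surface has shifted between deltalake releases, so treat the `optimize.compact` method and its `target_size` parameter as illustrative of recent versions rather than a stable contract.

```python
from deltalake import DeltaTable

# Open an existing Delta table by its storage path.
dt = DeltaTable("path/to/table")

# Rewrite small files into files of roughly target_size bytes (256 MiB here).
# Recent deltalake releases expose this as DeltaTable.optimize.compact();
# it returns a metrics dict (files added/removed, partitions optimized, ...).
metrics = dt.optimize.compact(target_size=256 * 1024 * 1024)
print(metrics)
```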