Support persistent compression state #4166

CAFxX · 2024-10-09T02:54:00Z

Is your feature request related to a problem? Please describe.

I have a tar archive containing periodic full backups of the same database. Each backup is very similar to the previous one, so a backup added to the same compressed tar archive ends up orders of magnitude smaller than the same file compressed on its own. Unfortunately, appending a new backup today (so that compression can exploit the redundancy between files) means decompressing the whole tar file, appending the new backup, and then compressing everything again, because no compressor I am aware of offers a way to persist (or even just reconstruct) the compression state of an existing stream.

Describe the solution you'd like
I would like a way to append data to an existing zstd stream that makes use of the state of the whole stream, so that the new data is compressed efficiently by exploiting redundancies with the data already present in the compressed stream.

The persistent compression state could very well be larger than the compressed stream: this is acceptable.

The persistent compression state does not need to be publicly documented, nor stable across versions or platforms. If the persisted compression state is invalid/corrupt, it should be ignored.

This could take the form of a --state STATE_FILE switch that could be used as follows:

# persist the compression state in STATE_FILE 
zstd --state STATE_FILE -o OUTPUT_FILE INPUT_FILE

# append data to OUTPUT_FILE using the state from STATE_FILE, persist the final compression state in STATE_FILE
zstd --state STATE_FILE --append -o OUTPUT_FILE INPUT_FILE
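To illustrate what the proposed state buys, here is a minimal sketch using Python's standard `zlib` module as a stand-in for zstd (an assumption made purely so the example runs without extra dependencies). The live compressor object plays the role of STATE_FILE, and the `backup1` / `backup2` payloads are hypothetical. Unlike the requested feature, this state only lives in memory and cannot be written to disk, which is exactly the gap described above.

```python
import random
import zlib

# Hypothetical stand-ins for two consecutive, nearly identical backups.
# Random bytes are incompressible on their own, so any gain on backup2
# must come from redundancy with backup1.
rng = random.Random(0)
backup1 = bytes(rng.getrandbits(8) for _ in range(8000))
backup2 = backup1[:4000] + b"one changed row" + backup1[4000:]

# The compressor object holds the match history ("compression state").
state = zlib.compressobj()
part1 = state.compress(backup1) + state.flush(zlib.Z_SYNC_FLUSH)

# "Append": the same state is reused, so backup2 can reference backup1.
part2 = state.compress(backup2) + state.flush(zlib.Z_SYNC_FLUSH)

# For comparison: backup2 compressed with no shared state.
standalone = zlib.compress(backup2)

# Both parts concatenated decode back to the two backups in order.
roundtrip = zlib.decompressobj().decompress(part1 + part2)

print("appended with state:", len(part2), "bytes")
print("standalone:         ", len(standalone), "bytes")
```

The sync flush only makes each part decodable up to that point; a real stream would still need to be finished once no more appends are expected.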

Ideally, it should also be possible to reconstruct (and persist) the compression state of an existing stream.

If a stream consists of multiple independent sections (e.g. because the stream is rsyncable, or because a section was appended without making use of the persistent compression state), the persistent state would only cover the section since the last state reset.

Describe alternatives you've considered
There are alternatives in the specific scenario I described above (e.g. do incremental backups, use a diff-like tool before compression, decompress+append+recompress, etc.) but they are not always practical or applicable in this or other scenarios.


Cyan4973 · 2024-11-05T17:49:18Z

There could be a way to append new data as part of the same stream as the existing compressed tar file;
however, the whole tar file would still need to be decompressed first.
The main benefit would be that the whole tar file would not need to be compressed again.
Such an approach wouldn't need an additional state.
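A rough sketch of this idea, again using Python's `zlib` rather than zstd (an assumption for the sake of a runnable example): the existing data is decompressed once to recover the history, and only the new data is compressed against it. The helper names `append_with_history` and `read_appended` are hypothetical. Note that this sketch emits a separate stream that depends on the old one, via zlib's preset-dictionary mechanism, rather than a true continuation of the same stream.

```python
import random
import zlib

WINDOW = 32 * 1024  # deflate can only reference the last 32 KiB of history


def append_with_history(existing: bytes, new_data: bytes) -> bytes:
    """Compress new_data against the tail of the decompressed existing data."""
    # Decompression is still required, but nothing old is recompressed.
    history = zlib.decompress(existing)[-WINDOW:]
    comp = zlib.compressobj(zdict=history)
    return comp.compress(new_data) + comp.flush()


def read_appended(existing: bytes, appended: bytes) -> bytes:
    """Decode a section produced by append_with_history."""
    history = zlib.decompress(existing)[-WINDOW:]
    decomp = zlib.decompressobj(zdict=history)
    return decomp.decompress(appended) + decomp.flush()


# Hypothetical payloads: two nearly identical backups.
rng = random.Random(0)
backup1 = bytes(rng.getrandbits(8) for _ in range(8000))
backup2 = backup1[:4000] + b"one changed row" + backup1[4000:]

existing = zlib.compress(backup1)
appended = append_with_history(existing, backup2)

print("appended section:", len(appended), "bytes")
print("standalone:      ", len(zlib.compress(backup2)), "bytes")
```

The cost profile matches the comment above: one decompression pass over the old data, no recompression of it, and no extra state file to keep around.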

Cyan4973 added the feature request label Oct 9, 2024
