Support persistent compression state #4166

Open
CAFxX opened this issue Oct 9, 2024 · 1 comment

CAFxX commented Oct 9, 2024

Is your feature request related to a problem? Please describe.

I have a tar archive containing periodic full backups of the same database. Each backup is very similar to the previous one, so each additional backup compressed as part of the same tar archive is orders of magnitude smaller, after compression, than the same file compressed standalone. Unfortunately, appending a new backup today (so that compression can exploit the redundancy between files) means decompressing the full tar file, appending the new backup to it, and then compressing the whole tar file again, because no compressor I am aware of can persist (or even just reconstruct) the compression state of an existing stream.
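To make the size difference concrete, here is a minimal sketch. It uses Python's standard `zlib` module as a stand-in for zstd (an assumption made only so the example runs anywhere), and the "backups" are small synthetic buffers chosen to fit inside DEFLATE's 32 KiB window. The absolute numbers mean nothing for real zstd archives; only the standalone-versus-together comparison is the point.

```python
import random
import zlib

rng = random.Random(0)

# Synthetic "backup": incompressible bytes, small enough for DEFLATE's 32 KiB window.
backup_1 = bytes(rng.getrandbits(8) for _ in range(8000))

# The next backup differs from the previous one in only a handful of bytes.
changed = bytearray(backup_1)
for pos in rng.sample(range(len(changed)), 20):
    changed[pos] ^= 0xFF
backup_2 = bytes(changed)

# Each backup compressed on its own: no redundancy between files can be used.
standalone = len(zlib.compress(backup_1)) + len(zlib.compress(backup_2))

# Both backups in one stream: the second one is mostly back-references to the first.
together = len(zlib.compress(backup_1 + backup_2))

print("standalone:", standalone, "bytes")
print("together:  ", together, "bytes")
```

The second backup costs almost nothing when it shares a stream with the first, which is exactly the saving that is lost when every new backup has to be compressed on its own.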

Describe the solution you'd like
I would like a way to append data to an existing zstd stream that makes use of the state of the whole stream, so that the new data can be compressed efficiently by exploiting redundancies with the data already present in the compressed stream.

The persistent compression state could very well be larger than the compressed stream: this is acceptable.

The persistent compression state does not need to be publicly documented, nor stable across versions or platforms. If the persisted compression state is invalid/corrupt, it should be ignored.

This could take the form of a --state STATE_FILE switch that could be used as follows:

# persist the compression state in STATE_FILE 
zstd --state STATE_FILE -o OUTPUT_FILE INPUT_FILE

# append data to OUTPUT_FILE using the state from STATE_FILE, persist the final compression state in STATE_FILE
zstd --state STATE_FILE --append -o OUTPUT_FILE INPUT_FILE

Ideally, it should also be possible to reconstruct (and persist) the compression state of an existing stream.

If a stream consists of multiple independent sections (e.g. because the stream is rsyncable, or because a section was appended without making use of the persistent compression state), the persistent state would cover only the section since the last state reset.

Describe alternatives you've considered
There are alternatives in the specific scenario I described above (e.g. incremental backups, running a diff-like tool before compression, decompress+append+recompress, etc.) but they are not always practical or applicable in this or other scenarios.

Additional context

Cyan4973 (Contributor) commented Nov 5, 2024

There could be a way to append new data as part of the same stream as the existing compressed tar file;
however, the whole tar file would still need to be decompressed first.
The main benefit would be that the whole tar file would not need to be compressed again.
Such an approach wouldn't need an additional state.
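
The trade-off described here (decompress to recover the history, but never recompress the old data) can be illustrated with a rough sketch. It is not zstd code and not necessarily what is being proposed: it uses Python's standard `zlib` module as a stand-in, it writes the new data as a separate segment rather than continuing the same stream, and the helper names `append_segment` and `read_all` are hypothetical. The history is rebuilt by decompressing the existing data and is passed to the compressor as a preset dictionary (`zdict`).

```python
import random
import zlib

WINDOW = 32 * 1024  # largest history DEFLATE can reference; zstd windows can be far larger


def append_segment(existing: bytes, new_data: bytes) -> bytes:
    """Compress only new_data, primed with the tail of the already-stored data."""
    # Decompression is still required: it is how the history is recovered.
    history = zlib.decompress(existing)[-WINDOW:]
    compressor = zlib.compressobj(zdict=history)
    return compressor.compress(new_data) + compressor.flush()


def read_all(existing: bytes, appended: bytes) -> bytes:
    """Recover old and new data; the reader has to rebuild the same history."""
    old = zlib.decompress(existing)
    decompressor = zlib.decompressobj(zdict=old[-WINDOW:])
    return old + decompressor.decompress(appended) + decompressor.flush()


rng = random.Random(0)
old_backup = bytes(rng.getrandbits(8) for _ in range(8000))
new_backup = old_backup[:4000] + b"changed" + old_backup[4000:]

existing = zlib.compress(old_backup)             # written once, never recompressed
appended = append_segment(existing, new_backup)  # only the new backup is compressed

print("new backup standalone:", len(zlib.compress(new_backup)), "bytes")
print("new backup appended:  ", len(appended), "bytes")
```

No state file is involved, which matches the last point above; the cost is that every append has to decompress the existing data first.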
