Add library and cli flags for file format with embedded dictionary #4036

Open
pmeenan opened this issue Apr 29, 2024 · 0 comments

pmeenan commented Apr 29, 2024

This is still in flight but I wanted to get some feedback from the tooling side before we go too far on the IETF spec for dictionary-compressed responses.

We are considering a new file/stream format that puts a 35-byte header in front of the compressed stream: a 3-byte magic signature (DCZ) followed by the 32-byte SHA-256 hash of the dictionary that was used to compress the resource.
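
As a rough illustration of that layout, here is a minimal Python sketch of building the prefix. The function name `build_dcz_header` is hypothetical, and the sketch assumes the magic signature is the three ASCII bytes `DCZ` immediately followed by the raw (not hex-encoded) digest; the proposal does not pin those details down yet.

```python
import hashlib

# Assumption: the magic signature is the three ASCII bytes "DCZ".
DCZ_MAGIC = b"DCZ"


def build_dcz_header(dictionary: bytes) -> bytes:
    """Return the proposed 35-byte prefix: magic + SHA-256 of the dictionary."""
    digest = hashlib.sha256(dictionary).digest()  # raw 32-byte digest
    return DCZ_MAGIC + digest
```

A compressor would write this prefix first and then the ordinary Zstandard frame(s) produced with the same dictionary.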

Currently the dictionary hash is sent in a separate HTTP response header, but there may be value in putting the hash in the file itself and removing the need for that extra header.

Ideally, if we go down this path, the Zstandard CLI and APIs would support generating and decompressing these streams directly, rather than requiring their output to be wrapped by additional tooling.

On compression:

  • Add a flag for generating "Dictionary-Compressed Zstandard"
  • Flag limits the compression window to the larger of 8 MB or 1.25 * the size of the dictionary, up to 128 MB
  • Flag requires a dictionary to be specified
  • Output stream is prefixed with DCZ + SHA-256 hash of the dictionary
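
The window rule in the list above can be sketched as a small helper. This is only an illustration: `dcz_window_limit` is a hypothetical name, "MB" is assumed to mean 2^20 bytes, and rounding 1.25 * the dictionary size up to a whole byte is also an assumption, since the proposal does not say how fractions are handled.

```python
MIB = 1 << 20            # assumption: "MB" in the proposal means 2**20 bytes
MIN_WINDOW = 8 * MIB     # lower bound from the proposal
MAX_WINDOW = 128 * MIB   # upper bound from the proposal


def dcz_window_limit(dict_size: int) -> int:
    """Larger of 8 MB or 1.25 * dict_size, capped at 128 MB."""
    # Integer ceiling of 1.25 * dict_size (rounding up is an assumption).
    scaled = (5 * dict_size + 3) // 4
    return min(max(MIN_WINDOW, scaled), MAX_WINDOW)
```

For example, a 16 MiB dictionary gives a 20 MiB limit, while any dictionary of 6.4 MiB or smaller stays at the 8 MiB floor.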

On decompression:

  • Add a flag for decompressing "Dictionary-Compressed Zstandard" (or autodetect from the magic signature)
  • Flag requires a dictionary be specified
  • Flag sets the maximum window size to the larger of 8 MB or 1.25 * the size of the dictionary, up to 128 MB
  • Decompression fails if the hash in the stream header doesn't match the hash of the provided dictionary
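
The hash check on the decompression side might look like the sketch below. The name `strip_dcz_header` and the choice of `ValueError` are illustrative only, and the layout (three ASCII bytes `DCZ` followed by the raw 32-byte digest) is an assumption about how the prefix would be encoded.

```python
import hashlib

MAGIC = b"DCZ"  # assumption: magic is the ASCII bytes "DCZ"
HEADER_LEN = len(MAGIC) + hashlib.sha256().digest_size  # 3 + 32 = 35


def strip_dcz_header(stream: bytes, dictionary: bytes) -> bytes:
    """Validate the prefix and return the remaining compressed payload."""
    if len(stream) < HEADER_LEN or not stream.startswith(MAGIC):
        raise ValueError("not a Dictionary-Compressed Zstandard stream")
    embedded = stream[len(MAGIC):HEADER_LEN]
    if embedded != hashlib.sha256(dictionary).digest():
        # The stream was compressed with a different dictionary: fail.
        raise ValueError("dictionary hash mismatch")
    return stream[HEADER_LEN:]
```

The returned bytes would then be handed to the regular Zstandard decoder together with the verified dictionary.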

Does this sound reasonable and make sense to add if we do go down the route of specifying a stream prefix for the dictionary-compressed streams? Are there any concerns/suggestions on the plan itself?
