
Questions about training dictionary functions #2566

Closed
ghost opened this issue Mar 30, 2021 · 5 comments

Comments

ghost commented Mar 30, 2021

Currently, the stable API has two functions for training a dictionary [1]:

  • ZDICT_trainFromBuffer()
  • ZDICT_finalizeDictionary()

ZDICT_finalizeDictionary()'s documentation says that specifying the compression level helps compression [2]:

 * The compression level can be set in `parameters`. You should pass the
 * compression level you expect to use in production. The statistics for each
 * compression level differ, so tuning the dictionary for the compression level
 * can help quite a bit.

But ZDICT_trainFromBuffer() uses ZSTD_CLEVEL_DEFAULT as the compression level [3]. Does the trained dictionary therefore have a sub-optimal effect at other compression levels?

If this is an API defect, maybe the ZDICT_finalizeDictionary() function could be allowed to accept NULL as the basis dictionary. In that case it would only train the dictionary, not finalize it. This would not break the API and is backward compatible. The ZDICT_trainFromBuffer() function could then be marked as obsolete.
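
As a concrete illustration of the quoted advice, here is a minimal sketch (assuming the stable zdict.h API; the wrapper name and the caller-provided buffers are made up for the example) of finalizing some dictionary content with the compression level expected in production:

```c
#include <string.h>   /* memset */
#include <zdict.h>    /* ZDICT_finalizeDictionary, ZDICT_params_t */

/* Sketch: given some trained or hand-picked dictionary content, build the
 * final dictionary (header + entropy tables) tuned for the compression
 * level that will actually be used in production. */
size_t finalize_for_production(void* dstDict, size_t dstCapacity,
                               const void* dictContent, size_t dictContentSize,
                               const void* samples, const size_t* sampleSizes,
                               unsigned nbSamples, int productionLevel)
{
    ZDICT_params_t params;
    memset(&params, 0, sizeof(params));
    params.compressionLevel = productionLevel;  /* tune statistics for this level */
    /* params.dictID left at 0 -> let the library pick a dictionary ID */

    return ZDICT_finalizeDictionary(dstDict, dstCapacity,
                                    dictContent, dictContentSize,
                                    samples, sampleSizes, nbSamples,
                                    params);
}
```

By contrast, ZDICT_trainFromBuffer() hard-codes ZSTD_CLEVEL_DEFAULT at this step [3], which is exactly what the question above is about.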

Moreover, it would be nice to describe the ZDICT_finalizeDictionary() function from the user's perspective, for example:

  • When should this function be used?
  • Why not re-train a dictionary on the old samples and the new samples together?
  • Can the user compose a raw content dictionary by hand and use it as the basis dictionary, e.g. by manually selecting the common parts of some JSON files? (A sketch of this follows the list.)
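
For the last bullet, here is a rough sketch of what a hand-composed raw content dictionary could look like. The JSON fragments and function names are invented for illustration, not an official recommendation:

```c
#include <zstd.h>

/* Hypothetical hand-picked content: substrings that recur in the JSON
 * documents we expect to compress, concatenated into one buffer. */
static const char handMadeContent[] =
    "\"status\":\"ok\",\"error\":null,"
    "\"timestamp\":\"user_id\":\"payload\":{";

/* Content without the dictionary magic number is accepted as a
 * "raw content" dictionary by the simple dictionary API. */
size_t compress_with_raw_dict(void* dst, size_t dstCapacity,
                              const void* src, size_t srcSize, int level)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    if (cctx == NULL) return (size_t)-1;   /* treat allocation failure as an error code */

    size_t const ret = ZSTD_compress_usingDict(cctx, dst, dstCapacity,
                                               src, srcSize,
                                               handMadeContent,
                                               sizeof(handMadeContent) - 1,
                                               level);
    ZSTD_freeCCtx(cctx);
    return ret;
}
```

The same buffer could also be passed to ZDICT_finalizeDictionary() as dictContent (as in the previous sketch) to obtain a full dictionary with entropy tables and a dictionary ID.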

[1] https://github.com/facebook/zstd/blob/v1.4.9/lib/dictBuilder/zdict.h#L40-L109
[2] https://github.com/facebook/zstd/blob/v1.4.9/lib/dictBuilder/zdict.h#L80-L83
[3] https://github.com/facebook/zstd/blob/v1.4.9/lib/dictBuilder/zdict.c#L1116

ghost commented Apr 10, 2021

These notes are collected from README.md, zdict.h, the format specification, and GitHub issues. Putting them together (for example, into a guide) might prevent common user misunderstandings.

  • Using a pre-trained zstd dictionary dramatically improves the compression ratio achievable on small data (a few KiB); the effect is largest on data smaller than 1 KiB.
  • A zstd dictionary has a negligible effect on large data (multi-MiB) compression.
  • Zstd dictionary is vulnerable.
  • Introduce the Dictionary ID, and the ranges reserved for future use.
  • A reasonable dictionary size is ~100 KiB. It's possible to select a smaller or larger size simply by specifying the dict_size argument.
  • It's recommended to provide a few thousand samples, though this can vary a lot.
  • It's recommended that the total size of all samples be roughly 100x the target dictionary size.
  • Dictionary training will fail if there are not enough samples to construct a dictionary, or if most of the samples are too small (< 8 bytes being the lower limit). If dictionary training fails, you should use zstd without a dictionary, as the dictionary would have been ineffective anyway (see the fallback sketch after this list).
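
To make the last point concrete, here is a small sketch of the suggested fallback (assuming the stable zdict.h API; the wrapper name and return convention are illustrative):

```c
#include <zdict.h>

/* Sketch: try to train a dictionary; if training fails (too few samples,
 * or most samples below the ~8 byte minimum), report "no dictionary" so
 * the caller can simply compress without one. */
size_t train_or_skip(void* dictBuffer, size_t dictCapacity,
                     const void* samples, const size_t* sampleSizes,
                     unsigned nbSamples)
{
    size_t const dictSize = ZDICT_trainFromBuffer(dictBuffer, dictCapacity,
                                                  samples, sampleSizes, nbSamples);
    if (ZDICT_isError(dictSize)) {
        /* e.g. log ZDICT_getErrorName(dictSize); a failed training run means
         * the dictionary would not have helped anyway. */
        return 0;   /* 0 = compress without a dictionary */
    }
    return dictSize;
}
```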

terrelln commented May 4, 2021

Zstd dictionary is vulnerable.

What do you mean by that?

ghost commented May 4, 2021 via email

Cyan4973 commented May 4, 2021

It depends on what is associated with the word "vulnerable". If it means that there are some known exploitable vulnerabilities, then it's too strong.

Essentially, an externally provided dictionary content, which could have been tampered with by a 3rd party, is as dangerous as any compressed payload sent to the decoder.

The good thing is, libzstd is now well protected against such an attack vector, as its dictionary entry point is heavily fuzzed before each release. So even an intentionally crafted dictionary stands little chance of triggering a direct out-of-buffer read/write scenario.
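
In practice, that means an externally supplied dictionary should be handled like any other untrusted input: load it through the normal API and check every return code. A minimal sketch (the function and variable names are invented for illustration):

```c
#include <zstd.h>

/* Sketch: decompress a payload with an externally supplied (untrusted)
 * dictionary, relying on libzstd's own validation and error codes. */
int decompress_untrusted(void* dst, size_t dstCapacity,
                         const void* src, size_t srcSize,
                         const void* dict, size_t dictSize)
{
    ZSTD_DCtx* const dctx = ZSTD_createDCtx();
    if (dctx == NULL) return -1;

    size_t const ret = ZSTD_decompress_usingDict(dctx, dst, dstCapacity,
                                                 src, srcSize, dict, dictSize);
    ZSTD_freeDCtx(dctx);
    return ZSTD_isError(ret) ? -1 : 0;   /* never trust, always check */
}
```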

terrelln added a commit to terrelln/zstd that referenced this issue May 6, 2021
The FAQ covers the questions asked in Issue facebook#2566. It first covers why
you would want to use a dictionary, then what a dictionary is, and
finally it tells you how to train a dictionary, and clarifies some of
the parameters.

There is definitely more that could be said about some of the advanced
trainers, but this should be a good start.
@senhuang42

Addressed in #2622
