
Questions about training dictionary functions #2566

Closed
ghost opened this issue Mar 30, 2021 · 5 comments

Comments

ghost commented Mar 30, 2021

Currently, the stable API has two functions for training a dictionary [1]:

  • ZDICT_trainFromBuffer()
  • ZDICT_finalizeDictionary()

ZDICT_finalizeDictionary()'s documentation says that specifying the compression level helps compression [2]:

 * The compression level can be set in `parameters`. You should pass the
 * compression level you expect to use in production. The statistics for each
 * compression level differ, so tuning the dictionary for the compression level
 * can help quite a bit.

But ZDICT_trainFromBuffer() uses ZSTD_CLEVEL_DEFAULT as the compression level [3]. Does the trained dictionary therefore have a sub-optimal effect at other compression levels?

If this is an API defect, maybe the ZDICT_finalizeDictionary() function could be allowed to accept NULL as the basis dictionary. In that case it would only train the dictionary, not finalize it. This would not break the API and is backward compatible. The ZDICT_trainFromBuffer() function could then be marked as obsolete.
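
As a concrete illustration of the quoted advice, here is a minimal sketch (assuming the stable zdict.h API; the wrapper name and the caller-provided buffers are made up for the example) of finalizing some dictionary content with the compression level expected in production:

```c
#include <string.h>   /* memset */
#include <zdict.h>    /* ZDICT_finalizeDictionary, ZDICT_params_t */

/* Sketch: given some trained or hand-picked dictionary content, build the
 * final dictionary (header + entropy tables) tuned for the compression
 * level that will actually be used in production. */
size_t finalize_for_production(void* dstDict, size_t dstCapacity,
                               const void* dictContent, size_t dictContentSize,
                               const void* samples, const size_t* sampleSizes,
                               unsigned nbSamples, int productionLevel)
{
    ZDICT_params_t params;
    memset(&params, 0, sizeof(params));
    params.compressionLevel = productionLevel;  /* tune statistics for this level */
    /* params.dictID left at 0 -> let the library pick a dictionary ID */

    return ZDICT_finalizeDictionary(dstDict, dstCapacity,
                                    dictContent, dictContentSize,
                                    samples, sampleSizes, nbSamples,
                                    params);
}
```

By contrast, ZDICT_trainFromBuffer() hard-codes ZSTD_CLEVEL_DEFAULT at this step [3], which is exactly what the question above is about.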

Moreover, it would be nice to describe the ZDICT_finalizeDictionary() function from the user's perspective, for example:

  • When should this function be used?
  • Why not re-train a dictionary on the old samples and the new samples together?
  • Can the user compose a raw content dictionary by hand and use it as the basis dictionary, e.g. by manually selecting the common parts of some JSON files? (A sketch of this follows the list.)
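
For the last bullet, here is a rough sketch of what a hand-composed raw content dictionary could look like. The JSON fragments and function names are invented for illustration, not an official recommendation:

```c
#include <zstd.h>

/* Hypothetical hand-picked content: substrings that recur in the JSON
 * documents we expect to compress, concatenated into one buffer. */
static const char handMadeContent[] =
    "\"status\":\"ok\",\"error\":null,"
    "\"timestamp\":\"user_id\":\"payload\":{";

/* Content without the dictionary magic number is accepted as a
 * "raw content" dictionary by the simple dictionary API. */
size_t compress_with_raw_dict(void* dst, size_t dstCapacity,
                              const void* src, size_t srcSize, int level)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    if (cctx == NULL) return (size_t)-1;   /* treat allocation failure as an error code */

    size_t const ret = ZSTD_compress_usingDict(cctx, dst, dstCapacity,
                                               src, srcSize,
                                               handMadeContent,
                                               sizeof(handMadeContent) - 1,
                                               level);
    ZSTD_freeCCtx(cctx);
    return ret;
}
```

The same buffer could also be passed to ZDICT_finalizeDictionary() as dictContent (as in the previous sketch) to obtain a full dictionary with entropy tables and a dictionary ID.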

[1] https://github.com/facebook/zstd/blob/v1.4.9/lib/dictBuilder/zdict.h#L40-L109
[2] https://github.com/facebook/zstd/blob/v1.4.9/lib/dictBuilder/zdict.h#L80-L83
[3] https://github.com/facebook/zstd/blob/v1.4.9/lib/dictBuilder/zdict.c#L1116

ghost commented Apr 10, 2021

These notes are collected from README.md, zdict.h, the format specification, and GitHub issues. Putting them together (for example, into a guide) might prevent common user misunderstandings.

  • Using a pre-trained zstd dictionary dramatically improves the compression ratio achievable on small data (a few KiB); the effect is largest on data smaller than 1 KiB.
  • A zstd dictionary has a negligible effect on large data (multi-MiB) compression.
  • Zstd dictionary is vulnerable.
  • Introduce the Dictionary ID, and the ranges reserved for future use.
  • A reasonable dictionary size is ~100 KiB. It's possible to select a smaller or larger size simply by specifying the dict_size argument.
  • It's recommended to provide a few thousand samples, though this can vary a lot.
  • It's recommended that the total size of all samples be roughly 100x the target dictionary size.
  • Dictionary training will fail if there are not enough samples to construct a dictionary, or if most of the samples are too small (< 8 bytes being the lower limit). If dictionary training fails, you should use zstd without a dictionary, as the dictionary would have been ineffective anyway (see the fallback sketch after this list).
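
To make the last point concrete, here is a small sketch of the suggested fallback (assuming the stable zdict.h API; the wrapper name and return convention are illustrative):

```c
#include <zdict.h>

/* Sketch: try to train a dictionary; if training fails (too few samples,
 * or most samples below the ~8 byte minimum), report "no dictionary" so
 * the caller can simply compress without one. */
size_t train_or_skip(void* dictBuffer, size_t dictCapacity,
                     const void* samples, const size_t* sampleSizes,
                     unsigned nbSamples)
{
    size_t const dictSize = ZDICT_trainFromBuffer(dictBuffer, dictCapacity,
                                                  samples, sampleSizes, nbSamples);
    if (ZDICT_isError(dictSize)) {
        /* e.g. log ZDICT_getErrorName(dictSize); a failed training run means
         * the dictionary would not have helped anyway. */
        return 0;   /* 0 = compress without a dictionary */
    }
    return dictSize;
}
```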

terrelln commented May 4, 2021

Zstd dictionary is vulnerable.

What do you mean by that?

ghost commented May 4, 2021 via email

Cyan4973 commented May 4, 2021

It depends on what is associated with the word "vulnerable". If it means that there are some known exploitable vulnerabilities, then it's too strong.

Essentially, an externally provided dictionary content, which could have been tampered with by a 3rd party, is as dangerous as any compressed payload sent to the decoder.

The good thing is, libzstd is now well protected against such an attack vector, as its dictionary entry point is heavily fuzzed before each release. So even an intentionally crafted dictionary stands little chance of triggering a direct out-of-buffer read/write scenario.
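
In practice, that means an externally supplied dictionary should be handled like any other untrusted input: load it through the normal API and check every return code. A minimal sketch (the function and variable names are invented for illustration):

```c
#include <zstd.h>

/* Sketch: decompress a payload with an externally supplied (untrusted)
 * dictionary, relying on libzstd's own validation and error codes. */
int decompress_untrusted(void* dst, size_t dstCapacity,
                         const void* src, size_t srcSize,
                         const void* dict, size_t dictSize)
{
    ZSTD_DCtx* const dctx = ZSTD_createDCtx();
    if (dctx == NULL) return -1;

    size_t const ret = ZSTD_decompress_usingDict(dctx, dst, dstCapacity,
                                                 src, srcSize, dict, dictSize);
    ZSTD_freeDCtx(dctx);
    return ZSTD_isError(ret) ? -1 : 0;   /* never trust, always check */
}
```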

terrelln added a commit to terrelln/zstd that referenced this issue May 6, 2021
The FAQ covers the questions asked in Issue facebook#2566. It first covers why
you would want to use a dictionary, then what a dictionary is, and
finally it tells you how to train a dictionary, and clarifies some of
the parameters.

There is definitely more that could be said about some of the advanced
trainers, but this should be a good start.
@senhuang42

Addressed in #2622
