[lib] Add ZSTD_c_deterministicRefPrefix #2616

terrelln · 2021-05-05T20:21:44Z

This flag forces zstd to always load the prefix in ext-dict mode, even
if it happens to be contiguous, to force determinism. It also applies to
dictionaries that are re-processed.
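
To make the source of the non-determinism concrete, here is a minimal sketch of the decision the flag overrides. It is a model, not zstd's internal code: `load_mode`, `is_contiguous`, and `pick_load_mode` are hypothetical names, and it assumes contiguity means the prefix ends exactly where the buffer being compressed (`src`) begins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of how a referenced prefix could be loaded. */
typedef enum { MODE_PREFIX, MODE_EXT_DICT } load_mode;

/* Returns 1 when the prefix ends exactly where the source begins. */
static int is_contiguous(const char *prefix, size_t prefix_size, const char *src)
{
    assert(prefix != NULL && src != NULL);
    return prefix + prefix_size == src;
}

/* Without the flag, the mode depends on where the caller's buffers happen
 * to sit in memory. With the flag set, ext-dict mode is always chosen, so
 * the choice no longer depends on buffer addresses. */
static load_mode pick_load_mode(const char *prefix, size_t prefix_size,
                                const char *src, int deterministic_ref_prefix)
{
    if (deterministic_ref_prefix)
        return MODE_EXT_DICT;
    return is_contiguous(prefix, prefix_size, src) ? MODE_PREFIX : MODE_EXT_DICT;
}
```

In this model, the same prefix and the same input can take two different paths depending only on memory layout, which is what the flag rules out.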

A determinism test case is also added, which fails without
ZSTD_c_deterministicRefPrefix and passes with it set.

Question: Should this be the default behavior? It isn't in this PR.

Cyan4973 · 2021-05-05T21:39:04Z

We have an internal use case currently,
which is enough to warrant the existence of this flag,
but I don't see a strong need to make the new behavior the default.
We can always do that later if it happens to be preferable.

Cyan4973 · 2021-05-05T21:54:09Z

lib/compress/zstd_compress.c

@@ -559,6 +559,11 @@ ZSTD_bounds ZSTD_cParam_getBounds(ZSTD_cParameter param)
        bounds.upperBound = (int)ZSTD_urm_enableRowMatchFinder;
        return bounds;

+    case ZSTD_c_deterministicRefPrefix:


Open question about the flag name, ZSTD_c_deterministicRefPrefix:
the name describes the objective pretty well, but not the how.

In general, this is a good rule to select a name.

However, in this case, I was wondering:
in which case would anyone select "no" for this property?
"Do you want a deterministic output?" "No, I'm fine with unpredictable randomness."

I guess, one only answers "no" if there is something in exchange.
In this case, the only answer I could think of would probably be more speed
(and I don't even know how much speed we are talking about, it could be negligible).

Even then, making the effort to benefit from this speed only makes sense if the user consistently organizes the memory layout so that the prefix always stands just before the destination buffer. In which case... the output is deterministic.
So it feels strange to have to answer "no" to ZSTD_c_deterministicRefPrefix in order to benefit from better speed by placing the output buffer right after the dictionary prefix.

So, I was wondering: presuming someone is interested in better speed and ensures that the dictionary prefix and output buffer are contiguous, what would be the "right" parameter name, one that would feel "correct"?

I could imagine a 3-stage parameter: always separated, always contiguous (in which case it must fail if the buffers are not contiguous), and automatic (hence non-deterministic).
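
A rough sketch of how such a three-way parameter might resolve, for discussion only. None of these names exist in zstd: `layout_policy`, `layout_result`, and `resolve_layout` are hypothetical, and the sketch assumes contiguity is judged against the start of the buffer being compressed (`src`).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-way policy; not part of the zstd API. */
typedef enum {
    LAYOUT_AUTO,              /* decide by address: layout-dependent output */
    LAYOUT_ALWAYS_SEPARATED,  /* always ext-dict mode: deterministic */
    LAYOUT_ALWAYS_CONTIGUOUS  /* always prefix mode: error if not contiguous */
} layout_policy;

typedef enum { RESULT_PREFIX, RESULT_EXT_DICT, RESULT_ERROR } layout_result;

/* Maps a policy and the actual buffer layout to the mode that would be used. */
static layout_result resolve_layout(layout_policy policy, const char *prefix,
                                    size_t prefix_size, const char *src)
{
    int contiguous;
    assert(prefix != NULL && src != NULL);
    contiguous = (prefix + prefix_size == src);

    switch (policy) {
    case LAYOUT_ALWAYS_SEPARATED:
        return RESULT_EXT_DICT;
    case LAYOUT_ALWAYS_CONTIGUOUS:
        /* The caller promised contiguity, so a mismatch is reported
         * instead of silently falling back to ext-dict mode. */
        return contiguous ? RESULT_PREFIX : RESULT_ERROR;
    case LAYOUT_AUTO:
    default:
        return contiguous ? RESULT_PREFIX : RESULT_EXT_DICT;
    }
}
```

Only the automatic policy lets the result vary with memory layout; the other two either produce a fixed mode or fail loudly.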

Thoughts?

I would be fine adding this; it would be simple to do. But it does add more complexity to the API, so there should be some gain, like speed.

I guess that means I should measure the speed difference between ext-dict & prefix dictionaries... I will do that, and if there is a significant gain, I can extend the API.

I recall measuring something like a 5% speed gain on a subset of the httparchive corpus with a 16 KB dictionary at level 1, for contiguous vs. non-contiguous.

Prefix-mode is ~2.5% faster at level 1, 1.3% faster at level 3, and 2.5% faster at level 5.

I think recovering that loss is not worth the complexity, especially because the user would probably have to copy their buffers to ensure they're contiguous. I will update the comments to explain the tradeoff a bit more.

Agreed with the outcome.

I'm surprised though that prefix mode (which I understand as "contiguous buffers") ends up being slower at level 5. The expectation was that it would be at least as fast, if only by a tiny margin, but not slower.

That was a typo; it is 2.5% faster.

facebook-github-bot added the CLA Signed label May 5, 2021

terrelln force-pushed the deterministic-dict branch 2 times, most recently from efea18d to 1a30e62 Compare May 5, 2021 21:02

Cyan4973 reviewed May 5, 2021

terrelln force-pushed the deterministic-dict branch from 1a30e62 to 172b4b6 Compare May 6, 2021 01:44

Cyan4973 approved these changes May 6, 2021

terrelln merged commit 207e33b into facebook:dev May 6, 2021
