Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct the block splitter mismatched repcodes detection. #2574

Merged
merged 1 commit into from
Apr 13, 2021

Conversation

senhuang42
Copy link
Contributor

The block splitter maintains separate repcode histories for both compression and decompression. Previously, only the repcode updating process factored in litLength==0, but not the check for mismatch itself.

This PR fixes that logic and should make the ZSTD_seqStore_resolveOffCodes() easier to read and make sense of, and corrects an incorrect (or easy to misinterpret) comment regarding what seqDef::offset actually means.

No performance impact.

this_branch: 1#silesia.tar       : 211957760 ->  72914525 (2.907), 161.8 MB/s ,1137.9 MB/s 
dev: 1#silesia.tar       : 211957760 ->  72914525 (2.907), 161.4 MB/s ,1140.0 MB/s

Test Plan: verifies that this fixes the OSS-fuzz error, and all previous failures remain fixed.

@senhuang42 senhuang42 changed the title Correct the block splitter mismatched repcodes detection comparison. Correct the block splitter mismatched repcodes detection. Apr 9, 2021
@@ -352,7 +352,7 @@ typedef enum {
* Private declarations
*********************************************/
typedef struct seqDef_s {
U32 offset; /* Offset code of the sequence */
U32 offset; /* offset == rawOffset + ZSTD_REP_NUM, or equivalently, offCode + 1 */
Copy link
Contributor

@Cyan4973 Cyan4973 Apr 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's
offset == rawOffset + ZSTD_REP_MOVE, or equivalently, offCode + 2

Copy link
Contributor Author

@senhuang42 senhuang42 Apr 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Cyan4973 Are the values contained within the rep array not considered to be raw offsets? I've found that I need to add 3 == ZSTD_REP_NUM to the value within rep to obtain the correct value to give seqDef::offset, and that ZSTD_updateRep() expects an offCode which appears to be offset - 1?

seqDef::offset can have value == 1 which should correspond to offCode == 0, right?

Copy link
Contributor

@Cyan4973 Cyan4973 Apr 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's one of the messiest confusing part of the code,
and it is probably worth a PR just dedicated to fix it.

Initially, offset was really that, an offset. See value 16, it means offset == 16.
Quite logically, the name offset was adopted to represent this value everywhere it's needed and stored.

There was an exception though, as 0 can't be a valid offset.
So it was refurbished to mean repCode.
Back then, there was only a single repcode value available.
So OK, offset means offset when the value > 0, and there is a special case when offset==0.
Still fair enough.

Later on, much later in the life of pre-v1.0 zstd, Przemyslaw added the ability to manage 3 repcodes.
This was beneficial for the high compression mode, resulting in small but measurable benefits.
It complexifies repcode history update, but Przemyslaw also worked that out pretty well, so speed was barely affected.

But hey, the associated variable was already called offset and used in multiple places over the code base,
so the name of the variable was left as is.
Except that now, it's no longer an "offset". It's effectively a sum type, that means either a repcode, or a offset.

At the time, it felt like a minor, localized, source of confusion and complexity, and we could live with it.
Sure. But as the code ages, and its nb of contributors increases, this little quirk feels less and less acceptable.

Are the values contained within the rep array not considered to be raw offsets?

I presume "rep array" means the 3-elts repcode history array ?
I don't remember precisely, but I would expect them to be raw offsets, since the rep array shouldn't contain repcodes.

However, I'm surprised you bring that up, as this comment is written in the context of seqDef, which corresponds to the Sequence Store. So I don't see a direct relation with the repcode history.

If I remember correctly, in the seqStore, the offset values [0, 1, 2] actually correspond to the 3 repcodes. Then starting with code 3, the "real" offset range starts, with this first valid offset being 1. This makes a difference of 2, which corresponds to ZSTD_REP_MOVE.

Copy link
Contributor Author

@senhuang42 senhuang42 Apr 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the explanation on this topic! Interesting stuff here...

If I remember correctly, in the seqStore, the offset values [0, 1, 2] actually correspond to the 3 repcodes. Then starting with code 3, the "real" offset range starts, with this first valid offset being 1. This makes a difference of 2, which corresponds to ZSTD_REP_MOVE.

I guess what I'm saying is that what I've observed is that in the seqStore, the offset values [1, 2, 3] actually correspond to the 3 repcodes, rather than [0, 1, 2] as you mentioned above. Thus, the "real" offset range starts with code 4.

If we look at zstd_fast.c for an example of converting a true raw offset into a seqStore->offset, we start by adding ZSTD_REP_MOVE to the raw offset value, getting an offCode, as you mentioned above:

offcode = offset_1 + ZSTD_REP_MOVE;

And then store offCode as such, with ZSTD_storeSeq():

ZSTD_storeSeq(seqStore, (size_t)(ip0-anchor), anchor, iend, offcode, mLength-MINMATCH);

But if we take a look at the actual implementation of ZSTD_storeSeq(), there's something funny here:

In ZSTD_storeSeq(), we increment what we actually store:

seqStorePtr->sequences[0].offset = offCode + 1;

So, we actually end up adding 3 == ZSTD_REP_NUM to the raw offset to obtain seqStore->offset, so seqStore->offset == rawOffset + 3 == offCode + 1, right?

This also matches up with the callsites of ZSTD_updateRep(), which expects an offCode, and are passed the value seqStore->offset - 1, seemingly implying that offCode+1 = seqStore->offset

It's not actually clear to me why we increment it. Maybe it is a left-over from back before repcodes existed as you mentioned, and never actually got fully fixed to support offset == 0 and so we still have that MIN(seqStore->offset) == 1 for some reason. Because the scheme that you described in your comment makes more sense than what currently seems to exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The offCode gets broken up into two pieces. The code nbBits = ZS_highbit32(offCode) and the nbBits extra bits. ZS_highbit32(0) = undefined & ZS_highbit32(1) = 0. So we need to make sure that offCode is never zero, so we add 1.

So on compression side, we use offset == 1, 2, 3 for repcodes. On decompression side it is offsets 0, 1, 2 for repcodes. They correspond to the same thing, since we've shifted up by one on the compression side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that makes a lot of sense, thanks for the clarification.

assert(offCode < ZSTD_REP_NUM);
if (adjustedOffCode == ZSTD_REP_NUM) {
/* litlength == 0 and offCode == 2 implies selection of first repcode - 1 */
return rep[0] - 1;
Copy link
Contributor

@Cyan4973 Cyan4973 Apr 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor :
for this to be valid, we must have rep[0] > 1.
This is probably guaranteed by construction.
Still, probably worth an assert(), just for debug cases.

Copy link
Contributor

@Cyan4973 Cyan4973 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor comments left.
Good fix, especially given it also improves code readability in the same effort.

@senhuang42 senhuang42 force-pushed the repcode_mismatch_detector_fix branch 2 times, most recently from 0b93627 to 550f76f Compare April 12, 2021 19:41
@senhuang42 senhuang42 merged commit 12c045f into facebook:dev Apr 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants