-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correct the block splitter mismatched repcodes detection. #2574
Correct the block splitter mismatched repcodes detection. #2574
Conversation
@@ -352,7 +352,7 @@ typedef enum { | |||
* Private declarations | |||
*********************************************/ | |||
typedef struct seqDef_s { | |||
U32 offset; /* Offset code of the sequence */ | |||
U32 offset; /* offset == rawOffset + ZSTD_REP_NUM, or equivalently, offCode + 1 */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's
offset == rawOffset + ZSTD_REP_MOVE, or equivalently, offCode + 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Cyan4973 Are the values contained within the rep
array not considered to be raw offsets? I've found that I need to add 3 == ZSTD_REP_NUM to the value within rep
to obtain the correct value to give seqDef::offset
, and that ZSTD_updateRep()
expects an offCode
which appears to be offset - 1
?
seqDef::offset
can have value == 1 which should correspond to offCode == 0
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's one of the messiest confusing part of the code,
and it is probably worth a PR just dedicated to fix it.
Initially, offset
was really that, an offset
. See value 16
, it means offset == 16
.
Quite logically, the name offset
was adopted to represent this value everywhere it's needed and stored.
There was an exception though, as 0
can't be a valid offset.
So it was refurbished to mean repCode
.
Back then, there was only a single repcode value available.
So OK, offset
means offset
when the value > 0
, and there is a special case when offset==0
.
Still fair enough.
Later on, much later in the life of pre-v1.0 zstd
, Przemyslaw added the ability to manage 3 repcodes.
This was beneficial for the high compression mode, resulting in small but measurable benefits.
It complexifies repcode history update, but Przemyslaw also worked that out pretty well, so speed was barely affected.
But hey, the associated variable was already called offset
and used in multiple places over the code base,
so the name of the variable was left as is.
Except that now, it's no longer an "offset". It's effectively a sum type, that means either a repcode
, or a offset.
At the time, it felt like a minor, localized, source of confusion and complexity, and we could live with it.
Sure. But as the code ages, and its nb of contributors increases, this little quirk feels less and less acceptable.
Are the values contained within the rep array not considered to be raw offsets?
I presume "rep array" means the 3-elts repcode history array ?
I don't remember precisely, but I would expect them to be raw offsets, since the rep
array shouldn't contain repcodes
.
However, I'm surprised you bring that up, as this comment is written in the context of seqDef
, which corresponds to the Sequence Store. So I don't see a direct relation with the repcode history.
If I remember correctly, in the seqStore
, the offset
values [0, 1, 2]
actually correspond to the 3 repcodes. Then starting with code 3
, the "real" offset range starts, with this first valid offset being 1. This makes a difference of 2, which corresponds to ZSTD_REP_MOVE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate the explanation on this topic! Interesting stuff here...
If I remember correctly, in the seqStore, the offset values [0, 1, 2] actually correspond to the 3 repcodes. Then starting with code 3, the "real" offset range starts, with this first valid offset being 1. This makes a difference of 2, which corresponds to ZSTD_REP_MOVE.
I guess what I'm saying is that what I've observed is that in the seqStore
, the offset
values [1, 2, 3]
actually correspond to the 3 repcodes, rather than [0, 1, 2]
as you mentioned above. Thus, the "real" offset range starts with code 4
.
If we look at zstd_fast.c
for an example of converting a true raw offset into a seqStore->offset
, we start by adding ZSTD_REP_MOVE
to the raw offset value, getting an offCode
, as you mentioned above:
Line 143 in 56421f3
offcode = offset_1 + ZSTD_REP_MOVE; |
And then store offCode
as such, with ZSTD_storeSeq()
:
Line 152 in 56421f3
ZSTD_storeSeq(seqStore, (size_t)(ip0-anchor), anchor, iend, offcode, mLength-MINMATCH); |
But if we take a look at the actual implementation of ZSTD_storeSeq()
, there's something funny here:
In ZSTD_storeSeq()
, we increment what we actually store:
zstd/lib/compress/zstd_compress_internal.h
Line 619 in 56421f3
seqStorePtr->sequences[0].offset = offCode + 1; |
So, we actually end up adding 3 == ZSTD_REP_NUM
to the raw offset to obtain seqStore->offset
, so seqStore->offset == rawOffset + 3 == offCode + 1
, right?
This also matches up with the callsites of ZSTD_updateRep()
, which expects an offCode
, and are passed the value seqStore->offset - 1
, seemingly implying that offCode+1 = seqStore->offset
It's not actually clear to me why we increment it. Maybe it is a left-over from back before repcodes existed as you mentioned, and never actually got fully fixed to support offset == 0
and so we still have that MIN(seqStore->offset) == 1
for some reason. Because the scheme that you described in your comment makes more sense than what currently seems to exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The offCode
gets broken up into two pieces. The code nbBits = ZS_highbit32(offCode)
and the nbBits
extra bits. ZS_highbit32(0) = undefined & ZS_highbit32(1) = 0
. So we need to make sure that offCode
is never zero, so we add 1.
So on compression side, we use offset == 1, 2, 3 for repcodes. On decompression side it is offsets 0, 1, 2 for repcodes. They correspond to the same thing, since we've shifted up by one on the compression side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that makes a lot of sense, thanks for the clarification.
assert(offCode < ZSTD_REP_NUM); | ||
if (adjustedOffCode == ZSTD_REP_NUM) { | ||
/* litlength == 0 and offCode == 2 implies selection of first repcode - 1 */ | ||
return rep[0] - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor :
for this to be valid, we must have rep[0] > 1
.
This is probably guaranteed by construction.
Still, probably worth an assert()
, just for debug cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some minor comments left.
Good fix, especially given it also improves code readability in the same effort.
0b93627
to
550f76f
Compare
The block splitter maintains separate repcode histories for both compression and decompression. Previously, only the repcode updating process factored in
litLength==0
, but not the check for mismatch itself.This PR fixes that logic and should make the
ZSTD_seqStore_resolveOffCodes()
easier to read and make sense of, and corrects an incorrect (or easy to misinterpret) comment regarding whatseqDef::offset
actually means.No performance impact.
Test Plan: verifies that this fixes the OSS-fuzz error, and all previous failures remain fixed.