Recursive block splitting #2447
Conversation
@@ -1796,6 +1798,13 @@ ZSTDLIB_API size_t ZSTD_CCtx_refPrefix_advanced(ZSTD_CCtx* cctx, const void* pre
 */
#define ZSTD_c_validateSequences ZSTD_c_experimentalParam12

/* ZSTD_c_splitBlocks
 * Default is 0 == disabled. Set to 1 to enable block splitting. */
Just FYI: depending on the way we will want to integrate this capability into normal compression levels, we may need an additional "auto" value (which would likely use the 0 slot), which would be interpreted as "enabled" or "disabled" by zstd, depending on compression level (or more likely based on compression strategy). There is no urgency to decide that now.
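A minimal sketch of how such a three-state parameter could resolve (the enum names, helper name, and strategy cutoff below are illustrative assumptions, not zstd's actual API):

```c
#include <assert.h>

/* Hypothetical three-state parameter: 0 = "auto" lets the library decide
 * from the compression strategy, as suggested above. The names and the
 * cutoff (btopt-class strategies, i.e. 7 and up) are illustrative only. */
typedef enum { SPLIT_AUTO = 0, SPLIT_DISABLE = 1, SPLIT_ENABLE = 2 } splitMode_e;

static int resolveSplitBlocks(splitMode_e mode, int strategy)
{
    if (mode == SPLIT_ENABLE)  return 1;
    if (mode == SPLIT_DISABLE) return 0;
    /* auto: enable only for strong strategies */
    return strategy >= 7;
}
```

This mirrors how other auto-resolved parameters (e.g. LDM) collapse an "auto" slot into a concrete on/off decision once the strategy is known.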
Force-pushed 50a0e71 to 818c2da
lib/compress/zstd_compress.c (Outdated)
if (cSeqsSize == 0) {
    cSize = ZSTD_noCompressBlock(op, dstCapacity, ip, srcSize, lastBlock);
    FORWARD_IF_ERROR(cSize, "Nocompress block failed");
    DEBUGLOG(4, "Writing out nocompress block, size: %zu", cSize);
} else if (cSeqsSize == 1) {
    cSize = ZSTD_rleCompressBlock(op, dstCapacity, *ip, srcSize, lastBlock);
    FORWARD_IF_ERROR(cSize, "RLE compress block failed");
    DEBUGLOG(4, "Writing out RLE block, size: %zu", cSize);
} else {
It is potentially a bug to use either of these compression modes.
These modes change the repcode history. So it would only be safe to use them if you can guarantee that repcodes don't reference sequences in this block. A sufficient condition would be to check that the next 3 sequences aren't repcodes (and there are at least 3 remaining sequences). A safer approach would be to not allow them at all.
Can you add a test case that catches this, or verify that the oss-fuzz fuzzers can catch this?
I believe you don't see this trivially because you don't split blocks of less than 300 sequences. That makes no-compress blocks or rle blocks less likely, but not impossible.
So the fuzzer seems to catch this fairly easily when I reduce this constant down to 20, but when it's at 300, running it for a few hours doesn't catch it (though I guess at some point it might get caught anyway).
For a unit test, I'm having trouble recreating this scenario - a block of 300 sequences would need to be generated such that the first 150 sequences are incompressible, followed by a sequence that uses repcode history to reference an offset from one of the 150 sequences (corrupted), followed by 150 very compressible sequences (to incentivize block splitting).
How would someone generate a series of 150 incompressible sequences?
Maybe https://en.wikipedia.org/wiki/De_Bruijn_sequence with 4-byte runs of zeros interspersed (for matches)?
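As a sketch of that suggestion: a De Bruijn sequence B(k, n) contains every length-n string over a k-symbol alphabet exactly once, so it produces literals with no repeated n-grams (the interleaved 4-byte zero runs for matches are left out here). This uses the standard recursive Lyndon-word construction; buffer sizes assume small k^n.

```c
#include <stddef.h>

/* Standard recursive construction of the de Bruijn sequence B(k, n):
 * concatenates Lyndon words whose length divides n. Output length is k^n. */
static void dbRec(int t, int p, int k, int n,
                  unsigned char* a, unsigned char* out, size_t* len)
{
    if (t > n) {
        if (n % p == 0) {
            for (int j = 1; j <= p; j++) out[(*len)++] = a[j];
        }
    } else {
        a[t] = a[t - p];
        dbRec(t + 1, p, k, n, a, out, len);
        for (int c = a[t - p] + 1; c < k; c++) {
            a[t] = (unsigned char)c;
            dbRec(t + 1, t, k, n, a, out, len);
        }
    }
}

/* Fills out with B(k, n); out must hold k^n bytes (small n assumed). */
static size_t deBruijn(int k, int n, unsigned char* out)
{
    unsigned char a[64] = {0};
    size_t len = 0;
    dbRec(1, 1, k, n, a, out, &len);
    return len;
}
```

For example, B(2, 3) is the 8-byte sequence 00010111; running with k = 256 and n = 2 or 3 would give byte-level literals with no repeated n-grams.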
cSizeChunk = ZSTD_compressSequences_singleBlock(zc, &chunkSeqStore, op, dstCapacity, ip, srcBytes, lastBlockActual);
FORWARD_IF_ERROR(cSizeChunk, "Compressing chunk failed!");
ZSTD_memcpy(zc->blockState.nextCBlock->rep, zc->blockState.prevCBlock->rep, sizeof(U32)*ZSTD_REP_NUM);
What is this call for? You have already called ZSTD_confirmRepcodesAndEntropy().
If I remove this then we can't roundtrip anymore - I believe that's because if a nocompress/RLE block is emitted, then we should still copy over the repcodes so that the decoder can still reference the repcodes that refer to offsets from the block before the nocompress/RLE.
I was playing around with your block splitter when I ran into those errors. I was trying this algorithm, which is a slight variation on your split that tries several candidate split positions instead of just the midpoint. That shows that there is room for gains, depending on the file. Once you get all the infra for supporting block splitting working, do you intend to continue looking into stronger splitting algorithms?
#define MIN_SEQUENCES_BLOCK_SPLITTING 19
static void ZSTD_deriveBlockSplitsHelper(seqStoreSplits* splits, size_t startIdx, size_t endIdx,
const ZSTD_CCtx* zc, const seqStore_t* origSeqStore) {
seqStore_t fullSeqStoreChunk;
seqStore_t firstHalfSeqStore;
seqStore_t secondHalfSeqStore;
size_t estimatedOriginalSize;
size_t numChunks = 19;
size_t chunkSize = (endIdx - startIdx) / numChunks;
size_t minCost;
size_t bestSplit;
if (endIdx - startIdx < MIN_SEQUENCES_BLOCK_SPLITTING || splits->idx >= MAX_NB_SPLITS) {
return;
}
ZSTD_deriveSeqStoreChunk(&fullSeqStoreChunk, origSeqStore, startIdx, endIdx);
estimatedOriginalSize = ZSTD_buildEntropyStatisticsAndEstimateSubBlockSize(&fullSeqStoreChunk, zc);
if (ZSTD_isError(estimatedOriginalSize)) {
return;
}
minCost = estimatedOriginalSize;
for (size_t s = 1; s < numChunks; ++s) {
size_t estimatedFirstHalfSize;
size_t estimatedSecondHalfSize;
size_t midIdx = startIdx + s * chunkSize;
if (midIdx <= startIdx || midIdx >= endIdx)
continue;
ZSTD_deriveSeqStoreChunk(&firstHalfSeqStore, origSeqStore, startIdx, midIdx);
ZSTD_deriveSeqStoreChunk(&secondHalfSeqStore, origSeqStore, midIdx, endIdx);
estimatedFirstHalfSize = ZSTD_buildEntropyStatisticsAndEstimateSubBlockSize(&firstHalfSeqStore, zc);
estimatedSecondHalfSize = ZSTD_buildEntropyStatisticsAndEstimateSubBlockSize(&secondHalfSeqStore, zc);
if (ZSTD_isError(estimatedOriginalSize) || ZSTD_isError(estimatedFirstHalfSize) || ZSTD_isError(estimatedSecondHalfSize)) {
continue;
}
if (estimatedFirstHalfSize + estimatedSecondHalfSize < minCost) {
minCost = estimatedFirstHalfSize + estimatedSecondHalfSize;
bestSplit = midIdx;
}
}
if (minCost < estimatedOriginalSize) {
size_t const midIdx = bestSplit;
ZSTD_deriveBlockSplitsHelper(splits, startIdx, midIdx, zc, origSeqStore);
splits->splitLocations[splits->idx] = (U32)midIdx;
splits->idx++;
ZSTD_deriveBlockSplitsHelper(splits, midIdx, endIdx, zc, origSeqStore);
}
}
Yep, though any potential alternative block splitting methods probably warrant a separate PR, since I think this one by itself seems fast enough to have demonstrated "free" value for the higher compression levels, and is probably usable at the upper-medium ones too.
Force-pushed 6ca36d0 to af3d679
Force-pushed 06d0d03 to 6153280
const ZSTD_fseCTables_t* prevEntropy, ZSTD_fseCTables_t* nextEntropy,
BYTE* dst, const BYTE* const dstEnd,
ZSTD_strategy strategy, unsigned* countWorkspace,
void* entropyWorkspace, size_t entropyWkspSize) {
Comment: when a function has so many parameters, it warrants some caution. Maybe there are ways to reduce that number. There is nothing obviously wrong, nor any obvious solution, so I'll just keep an eye on this topic while reading the rest of the code.
const ZSTD_fseCTables_t* prevEntropy, ZSTD_fseCTables_t* nextEntropy,
BYTE* dst, const BYTE* const dstEnd,
ZSTD_strategy strategy, unsigned* countWorkspace,
void* entropyWorkspace, size_t entropyWkspSize) {
Document how large entropyWkspSize must be for the function to be successful.
lib/compress/zstd_compress.c (Outdated)
U32 MLtype;
BYTE* seqHead = op++;
/* build stats for sequences */
entropyStatisticsSize = ZSTD_buildSequencesStatistics(seqStorePtr, nbSeq,
The scope of this variable entropyStatisticsSize seems limited to this block. Prefer declaring it here, so that it ends its lifetime by the end of the block. General rule: the shorter the lifetime, the better.
lib/compress/zstd_compress.c (Outdated)
/* ZSTD_buildSequencesStatistics() is guaranteed to overwrite these values */
U32 LLtype = set_basic;
U32 Offtype = set_basic;
U32 MLtype = set_basic;
In this invocation of ZSTD_buildSequencesStatistics(), LLtype, Offtype and MLtype are initialized. But in the previous invocation, they were not initialized. This suggests that one of these two formulations is incorrect: either these values are read, and they should be initialized, or they are not, and initializing them makes it appear their initial value matters.
Reading the code of ZSTD_buildSequencesStatistics(), it appears they are not read, so they don't need to be initialized.
This underlines a classical problem with in+out variables (by reference): we don't know, from the function signature, whether they need to be initialized before being passed to the function. Prefer returning these 3 values as part of the function's return statement. It will make it clear that they are only out values.
lib/compress/zstd_compress.c (Outdated)
*/
MEM_STATIC size_t
ZSTD_buildSequencesStatistics(seqStore_t* seqStorePtr, size_t nbSeq,
                              U32* LLtype, U32* Offtype, U32* MLtype, size_t* lastCountSize,
It seems that LLtype, Offtype and MLtype are simple scalar out values. Prefer passing them as part of the return statement of the function (which therefore must be upgraded to a struct), so that it's clear they are purely out values, not in+out.
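The refactor suggested above might look like the following sketch (simplified, hypothetical signatures - the real ZSTD_buildSequencesStatistics() takes many more parameters, and the placeholder body stands in for the actual entropy analysis):

```c
#include <stddef.h>

/* Sketch: return the three symbol-table modes in a struct instead of via
 * U32* pointer parameters, so the signature itself says they are pure
 * outputs. All names and the placeholder logic are illustrative. */
typedef unsigned U32;

typedef struct {
    U32 LLtype;    /* literal-length table encoding mode  */
    U32 Offtype;   /* offset table encoding mode          */
    U32 MLtype;    /* match-length table encoding mode    */
    size_t size;   /* bytes written, or a zstd error code */
} SeqStats;

static SeqStats buildSequencesStatistics(size_t nbSeq)
{
    SeqStats r;
    r.LLtype = r.Offtype = r.MLtype = 0;  /* stand-in for set_basic */
    r.size = nbSeq;                       /* stand-in for real work */
    return r;
}
```

With this shape, callers cannot accidentally pre-initialize the fields, and the "do these need to be set before the call?" question disappears from every call site.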
@@ -2378,7 +2438,7 @@ ZSTD_entropyCompressSequences(seqStore_t* seqStorePtr,
                              void* dst, size_t dstCapacity,
                              size_t srcSize,
                              void* entropyWorkspace, size_t entropyWkspSize,
-                             int bmi2)
+                             int bmi2, U32 const canEmitUncompressed)
Interesting, so you want to be able to forbid ZSTD_entropyCompressSequences() from sending non-compressed sequences, therefore including cases where it's not considered advantageous.
- Is that just a test feature, or a permanent one?
- If permanent, could you explain in which cases this control is necessary?
- The specification doesn't require the compressed data to be smaller than the uncompressed one. But a compressed block must still be smaller than a full block (a.k.a. 128 KB for the general case). The full block also includes the literals block. What would happen in this case?
So this is a permanent feature - the reason I include this here is that I want to avoid sending uncompressed/RLE blocks when it would destroy repcode history in a bad way while block splitting. i.e., if the next block's first three sequences contain a repcode, we shouldn't try to emit an uncompressed or RLE block, since then the next block's repcode sequence will reference a sequence that is now "lost" as part of an uncompressed or RLE block and therefore invalid.
I suppose in this case it could be possible that we emit a "compressed" block that is larger than an uncompressed one - basically a similar limitation to superblocks, though I'm not quite sure how we could get around this while block splitting.
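A sketch of that safety check (hypothetical and simplified - zstd's seqStore encodes offsets differently, but the idea is to flag repcode offsets among the next chunk's first ZSTD_REP_NUM sequences):

```c
#include <stddef.h>

/* Simplified stand-in for a sequence: offCode values 1..3 mark a repcode
 * here, mirroring ZSTD_REP_NUM == 3. Illustrative, not zstd's layout. */
typedef struct { unsigned offCode; } miniSeq;

/* Returns 1 if any of the first three sequences uses a repcode, in which
 * case the previous chunk must not be emitted as uncompressed/RLE (those
 * block types do not advance repcode history). */
static int nextChunkUsesRepcodes(const miniSeq* seqs, size_t nbSeqs)
{
    size_t const n = nbSeqs < 3 ? nbSeqs : 3;
    for (size_t i = 0; i < n; i++) {
        if (seqs[i].offCode >= 1 && seqs[i].offCode <= 3) return 1;
    }
    return 0;
}
```

Only the first three sequences need checking because a repcode can only reach back ZSTD_REP_NUM entries into the offset history.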
I still have one question on this point: can you guarantee that this mechanism (forcing the output to be compressed even when disadvantageous) is not going to produce a block size larger than a full block? If yes, can you add an assert()?
I don't think I can guarantee that, at least from this particular codepath.
One thing we could potentially do is: if it turns out that our split, compressed block's total size is larger than 128 KB, then what we could do is trigger a re-compression using the traditional method (single block). Do you think that'd be sufficient? Is it accurate to say that the upper bound on a zstd-format block is then 131075 bytes? (128K raw block + 3 byte block header).
Is it accurate to say that the upper bound on a compressed block is then 131075 bytes? (128K + 3 byte block header).
Well, depending on where the size is measured, yes this can be correct.
Note though that there are several places in the code which do not include the block header cost.
One thing we could potentially do is: if it turns out that our split, compressed block's total size is larger than 128 KB, then what we could do is trigger a re-compression using the traditional method (single block). Do you think that'd be sufficient?
Yes.
That being said, the limitation is stricter than that: no individual block can be > 128 KB.
If the original block is split, and the sum of both parts is > 128 KB, this is not a "problem", in the sense that it's still valid (inefficient sure, but valid).
But if one block is > 128 KB, then it's beyond spec, and can break the decoder.
That's what we want to avoid.
Okay good to know - I believe in this case we'd need to go with this stricter limitation of not allowing the split blocks cumulatively to be over 128K (and in so doing prevent any single one from being over 128K).
This is because the only case in which we'd potentially emit a compressed block larger than 128K is when we forcibly disable uncompressed or RLE blocks because doing so would erroneously destroy the repcode history. So in this situation the following two statements are simultaneously true: 1) emitting a compressed block instead of uncompressed/RLE block is necessary to maintain repcode history 2) emitting a compressed block may exceed the 128K limit.
Thus, we have to basically re-compress in case statement 2) turns out to be true (we can't do it "halfway"). The latest commit adds in this functionality.
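The guard described above boils down to a comparison like this sketch (the constants match the format's 128 KB maximum block size and 3-byte block header; the helper name is illustrative):

```c
#include <stddef.h>

#define BLOCKSIZE_MAX     (128 * 1024)  /* max decompressed block size  */
#define BLOCK_HEADER_SIZE 3             /* per-block header, in bytes   */

/* Returns 1 when the summed size of the split blocks exceeds what one
 * raw 128 KB block plus header would cost, meaning the input should be
 * recompressed as a single block instead. */
static int mustRecompressAsSingleBlock(const size_t* splitSizes, size_t nbSplits)
{
    size_t total = 0;
    for (size_t i = 0; i < nbSplits; i++) total += splitSizes[i];
    return total > (size_t)(BLOCKSIZE_MAX + BLOCK_HEADER_SIZE);
}
```

Because each split block covers a strict subset of the 128 KB input, keeping the cumulative total under this bound also keeps every individual split block within the spec limit.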
ZSTD_entropyCTables_t* nextEntropy,
const ZSTD_CCtx_params* cctxParams,
ZSTD_entropyCTablesMetadata_t* entropyMetadata,
void* workspace, size_t wkspSize)
Minor: specify what should be the (minimum) value of wkspSize for this function to work properly.
lib/compress/zstd_compress.c (Outdated)
/* Returns the size estimate for the FSE-compressed symbols (of, ml, ll) of a block */
static size_t ZSTD_estimateBlockSize_symbolType(symbolEncodingType_e type,
                    const BYTE* codeTable, unsigned maxCode,
                    size_t nbSeq, const FSE_CTable* fseCTable,
Minor: parameter order. nbSeq is, if I understand correctly, the actual size of the codeTable buffer, so it should be placed alongside it.
lib/compress/zstd_compress.c (Outdated)
assert(max <= defaultMax);
cSymbolTypeSizeEstimateInBits = max <= defaultMax
    ? ZSTD_crossEntropyCost(defaultNorm, defaultNormLog, countWksp, max)
    : ERROR(GENERIC);
The error case shouldn't be necessary. As per the assert() before, this case should never happen (even in the presence of bogus input data).
return cSeqSizeEstimate + sequencesSectionHeaderSize;
}

/* Returns the size estimate for a given stream of literals, of, ll, ml */
Question: do you have a debug mode in which the estimation and the real size of the block are compared? If yes, do you have results?
I'll add this into the debuglogs. While developing this/eyeballing: the literals size estimate was typically almost exact (1-4 bytes off) while the sequences size estimate was a bit further off.
An excerpt of results from the beginning of silesia.tar:
../lib/compress/zstd_compress.c: Estimated size: 8108 actual size: 8158
../lib/compress/zstd_compress.c: Estimated size: 6107 actual size: 6129
../lib/compress/zstd_compress.c: Estimated size: 10062 actual size: 10058
../lib/compress/zstd_compress.c: Estimated size: 19766 actual size: 19718
../lib/compress/zstd_compress.c: Estimated size: 20583 actual size: 20464
../lib/compress/zstd_compress.c: Estimated size: 22150 actual size: 22076
../lib/compress/zstd_compress.c: Estimated size: 19845 actual size: 19756
../lib/compress/zstd_compress.c: Estimated size: 21328 actual size: 21261
../lib/compress/zstd_compress.c: Estimated size: 17800 actual size: 17706
../lib/compress/zstd_compress.c: Estimated size: 19530 actual size: 19484
for (i = 0; i < nbSeqs; ++i) {
    seqDef seq = seqStore->sequencesStart[i];
    literalsBytes += seq.litLength;
    if (i == seqStore->longLengthPos && seqStore->longLengthID == 1) {
Minor, and not caused by this PR: maybe it would have been better to employ an enum for the values of longLengthID, because at this stage the meaning of 1 is a bit lost (it requires accessing the relevant documentation in zstd_internal.h).
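A sketch of the suggested enum (names are illustrative; in zstd's internals, 1 denotes a long literal length and 2 a long match length, because both lengths are stored in 16 bits in seqDef):

```c
#include <stddef.h>

/* Illustrative enum replacing longLengthID's magic numbers 0/1/2. */
typedef enum {
    llt_none = 0,           /* no long length in this seqStore        */
    llt_literalLength = 1,  /* one sequence has a long literal length */
    llt_matchLength = 2     /* one sequence has a long match length   */
} longLengthType_e;

/* Mirrors the accumulation in the loop above: the sequence flagged as the
 * long-literal-length position contributes an extra 0x10000 bytes, since
 * litLength only stores the low 16 bits. */
static size_t literalsBytesOf(size_t litLength, longLengthType_e llt, int atLongPos)
{
    size_t bytes = litLength;
    if (atLongPos && llt == llt_literalLength) bytes += 0x10000;
    return bytes;
}
```

With the enum, the `longLengthID == 1` check above would read as `longLengthID == llt_literalLength`, which is self-documenting at the call site.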
Force-pushed e56351e to 6db18d6
@Cyan4973 some stats on the accuracy of the block splitter: Generally the false positive rate (splitting when we shouldn't have) seems pretty low:
Additionally, some benchmarks that include decompression speed, since higher density of block boundaries would imply slower decompression:
It looks like in general, we can get around ~0.8 of a compression level gain for a more or less fixed-cost compression speed slowdown that becomes hardly noticeable at higher levels, as well as a minor, but maybe noticeable decompression speed slowdown. I'm of the opinion that this should definitely be enabled on the higher compression level settings, though the medium levels are up for debate, and I'd expect row hash to also change how we'd think about the medium levels with respect to the block splitter, considering their significant speedup.
Force-pushed 6db18d6 to 0aac768
I agree. Enabling this feature by default for high compression levels is a good strategy. Question:
The former case - it means that the block was determined to be split and we split it, but it turns out we should not have. The criterion is simply that the 128K split block has a larger or equal compressed size to the 128K unsplit block. Here's some more info about the number of bytes that could be saved had we recompressed the "incorrectly" split block as a single block, and it doesn't seem very large.
To make this even better we'd need to add more things to the algorithm itself, i.e. addressing the latter case you mentioned:
To do this, we would just have to lower the threshold, into negative value territory. That being said, no need to address that in this PR.
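That knob could look like this sketch (hypothetical helper; in the actual code the estimates come from ZSTD_buildEntropyStatisticsAndEstimateSubBlockSize()):

```c
/* Split only when the estimated savings clear a threshold. Threshold 0 is
 * the behavior discussed above; a negative threshold accepts splits that
 * the estimator thinks lose a few bytes, compensating for estimator error. */
static int shouldSplit(long estOriginal, long estSplitSum, long threshold)
{
    return (estOriginal - estSplitSum) > threshold;
}
```

For example, a threshold of -10 would still split a block whose two halves are estimated to cost up to 9 bytes more than the unsplit block.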
I've set this to conservatively enable by default. I've integrated it in the same style/codepaths as ldm getting enabled by default.
Force-pushed c5668f3 to b924e5a
@@ -3370,6 +3371,14 @@ static size_t ZSTD_compressBlock_splitBlock_internal(ZSTD_CCtx* zc, void* dst, s
    cSize += cSizeChunk;
    currSeqStore = nextSeqStore;
}

if (cSize > ZSTD_BLOCKSIZE_MAX + ZSTD_blockHeaderSize) {
A few things I'm not sure to understand here:
- are you certain that, at this place in the code, cSize includes the block header size?
- the limitation regarding the maximum compressed block size relates to the complete block. Thus, in compressed format, it includes both the literals and sequence sub-blocks. Yet, below, I only notice ZSTD_compressSequences
- Yes, ZSTD_compressSequences_singleBlock() includes the block header, and this cSize measurement includes all block headers from split blocks.
- ZSTD_compressSequences_singleBlock(): this function compresses both the literals and the sequences (I had named it in the style of the existing ZSTD_entropyCompressSequences(), which also compresses literals and sequences). Maybe a better name would be ZSTD_compressSeqStore_singleBlock(), and rename the existing ZSTD_entropyCompressSequences() to ZSTD_entropyCompressSeqStore()?
Yes, the new name is much better for clarity
Force-pushed 81ae4b1 to 5b566eb
Most recent push is just to improve and clean up the git commit history - will merge once tests look okay.
Hey @senhuang42, can you please move the commented items into a workspace? Block splitting is using too much stack space.
You can test it with:
cd contrib/linux-kernel
CC=clang make -j test
That test is failing currently. Before this PR I see Maximum stack size: 2056, after I see Maximum stack size: 3096.
seqStore_t fullSeqStoreChunk;
seqStore_t firstHalfSeqStore;
seqStore_t secondHalfSeqStore;
Each seqStore_t is 80 bytes, so this is 240 bytes of stack space. That is too much.
size_t cSize = 0;
const BYTE* ip = (const BYTE*)src;
BYTE* op = (BYTE*)dst;
U32 partitions[MAX_NB_SPLITS];
This is 196*4 = 784 bytes of stack space, that is too much.
seqStore_t nextSeqStore;
seqStore_t currSeqStore;
This is another 160 bytes of stack space.
* Returns the estimated compressed size of the seqStore, or a zstd error.
*/
static size_t ZSTD_buildEntropyStatisticsAndEstimateSubBlockSize(seqStore_t* seqStore, const ZSTD_CCtx* zc) {
    ZSTD_entropyCTablesMetadata_t entropyMetadata;
This is a few hundred bytes of stack space, which is too much.
Seems like the best thing to do would be to define a new struct that holds all these things, like a block split context or something, and allocate that in the cctx.
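A sketch of that idea (stand-in types and sizes; in the real code the struct would live inside ZSTD_CCtx's workspace rather than a plain calloc):

```c
#include <stdlib.h>

#define MAX_NB_SPLITS 196

/* Stand-in for zstd's 80-byte seqStore_t. */
typedef struct { unsigned char opaque[80]; } seqStore_t;

/* Gathers the large objects previously on the stack into one context
 * struct, allocated once and reused for every block, so the recursive
 * splitting path uses O(1) stack. */
typedef struct {
    seqStore_t fullSeqStoreChunk;
    seqStore_t firstHalfSeqStore;
    seqStore_t secondHalfSeqStore;
    seqStore_t currSeqStore;
    seqStore_t nextSeqStore;
    unsigned partitions[MAX_NB_SPLITS];
} blockSplitCtx;

/* Minimal stand-in for the compression context that owns the split ctx. */
typedef struct { blockSplitCtx* splitCtx; } miniCCtx;

static int miniCCtx_init(miniCCtx* cctx)
{
    cctx->splitCtx = (blockSplitCtx*)calloc(1, sizeof(blockSplitCtx));
    return cctx->splitCtx != NULL;
}

static void miniCCtx_free(miniCCtx* cctx)
{
    free(cctx->splitCtx);
    cctx->splitCtx = NULL;
}
```

This moves roughly 400 bytes of seqStore_t objects plus the 784-byte partitions array off the stack, which is what the linux-kernel stack-size test above is checking.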
New experimental param, controlled by ZSTD_c_splitBlocks.
Some rudimentary benchmarks (fastest of 4 runs with fullbench.c). (Decompression speed not measured, though I'd assume that to be a bit slower since there's more blocks.)
calgary.tar (has files < 128K, should expect obvious benefit from splitting): dev vs. block_splitter (benchmark table not preserved)
silesia.tar (less obvious benefit from splitting): dev vs. block_splitter (benchmark table not preserved)
Currently this is not particularly optimized for speed, but it seems that the speed impact for relative gain is not too bad at the middle levels (and stays at the very least on the speed/ratio curve), and the speed impact at high levels is negligible.
Follow-ups: silesia.tar and calgary.tar (would not be surprised if some of the fuzzers fail)