
Recursive block splitting #2447

Merged · 15 commits · Mar 24, 2021

Conversation

@senhuang42 (Contributor) commented Dec 30, 2020

New experimental param, controlled by ZSTD_c_splitBlocks.

Some rudimentary benchmarks (fastest of 4 runs with fullbench.c):
(Decompression speed not measured, though I'd assume it to be a bit slower, since there are more blocks.)

calgary.tar (has files < 128K, should expect obvious benefit from splitting)

| level | ratio dev | ratio block_splitter | ratio %gain | speed dev | speed block_splitter |
|------:|----------:|---------------------:|------------:|----------:|---------------------:|
| 3 | 3.075 | 3.104 | 0.94% | 195.4 MB/s | 129.9 MB/s |
| 6 | 3.199 | 3.224 | 0.78% | 88.6 MB/s | 69.5 MB/s |
| 9 | 3.345 | 3.374 | 0.87% | 36.8 MB/s | 33.4 MB/s |
| 13 | 3.408 | 3.438 | 0.88% | 13.0 MB/s | 12.5 MB/s |
| 16 | 3.542 | 3.570 | 0.78% | 7.4 MB/s | 7.3 MB/s |
| 19 | 3.644 | 3.663 | 0.52% | 3.9 MB/s | 4.0 MB/s |
| 22 | 3.646 | 3.664 | 0.49% | 3.3 MB/s | 3.4 MB/s |

silesia.tar (less obvious benefit from splitting)

| level | ratio dev | ratio block_splitter | ratio %gain | speed dev | speed block_splitter |
|------:|----------:|---------------------:|------------:|----------:|---------------------:|
| 6 | 3.365 | 3.380 | 0.44% | 99.6 MB/s | 79.2 MB/s |
| 13 | 3.647 | 3.666 | 0.52% | 11.9 MB/s | 10.8 MB/s |
| 19 | 3.979 | 3.990 | 0.27% | 2.9 MB/s | 2.8 MB/s |

Currently this is not particularly optimized for speed, but the speed cost relative to the ratio gain is not too bad at the middle levels (it stays at the very least on the speed/ratio curve), and the speed impact at high levels is negligible.

Follow-ups:

  • Needs more testing beyond the simple unit test and local round-trip experiments with silesia.tar and calgary.tar (it would not be surprising if some of the fuzzers fail).
  • Improve speed.
  • Explore other ways to get minor gains (e.g. nudging the exact split boundary locations).
  • Try to get acceptable speed out of the optimal-algorithm splitter version (though this recursive solution comes surprisingly close).
  • Take another look at the estimation functions (currently they are slightly altered versions of what superblocks uses) and make sure they are correct.
  • Look into breaking the 128 KB block boundaries.

@@ -1796,6 +1798,13 @@ ZSTDLIB_API size_t ZSTD_CCtx_refPrefix_advanced(ZSTD_CCtx* cctx, const void* pre
*/
#define ZSTD_c_validateSequences ZSTD_c_experimentalParam12

/* ZSTD_c_splitBlocks
* Default is 0 == disabled. Set to 1 to enable block splitting.
Contributor:

Just FYI: depending on how we eventually want to integrate this capability into the normal compression levels, we may need an additional "auto" value (which would likely use the 0 slot). It would be interpreted as "enabled" or "disabled" by zstd, depending on the compression level (or, more likely, on the compression strategy).

There is no urgency to decide that now.
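
A minimal sketch of what such a three-value scheme could look like (hypothetical names; this PR only defines 0 = disabled and 1 = enabled, though zstd later introduced a similar ZSTD_paramSwitch_e for exactly this pattern):

```c
/* Hypothetical sketch only: reserve the 0 slot for "auto" and let zstd
 * decide from the compression strategy, as suggested above. */
typedef enum {
    ZSTD_bs_auto = 0,      /* zstd decides, e.g. based on ZSTD_strategy */
    ZSTD_bs_enable = 1,
    ZSTD_bs_disable = 2
} ZSTD_blockSplit_e;
```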

@senhuang42 force-pushed the block_splitter_v2 branch 4 times, most recently from 50a0e71 to 818c2da on January 11, 2021
Comment on lines 3157 to 3223
if (cSeqsSize == 0) {
cSize = ZSTD_noCompressBlock(op, dstCapacity, ip, srcSize, lastBlock);
FORWARD_IF_ERROR(cSize, "Nocompress block failed");
DEBUGLOG(4, "Writing out nocompress block, size: %zu", cSize);
} else if (cSeqsSize == 1) {
cSize = ZSTD_rleCompressBlock(op, dstCapacity, *ip, srcSize, lastBlock);
FORWARD_IF_ERROR(cSize, "RLE compress block failed");
DEBUGLOG(4, "Writing out RLE block, size: %zu", cSize);
} else {
Contributor:

It is potentially a bug to use either of these compression modes.

These modes change the repcode history. So it would only be safe to use them if you can guarantee that repcodes don't reference sequences in this block. A sufficient condition would be to check that the next 3 sequences aren't repcodes (and there are at least 3 remaining sequences). A safer approach would be to not allow them at all.
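
A minimal sketch of the sufficient condition described above (hypothetical helper, not this PR's code; it assumes the seqStore convention of this era, where offset values 1..ZSTD_REP_NUM in a seqDef denote repcodes):

```c
/* Hypothetical sketch: returns 1 if emitting a no-compress/RLE block now
 * could corrupt repcode history, i.e. if any of the next ZSTD_REP_NUM
 * sequences is a repcode (or too few sequences remain to be sure). */
static int ZSTD_nextSeqsMayUseRepcodes(const seqStore_t* nextSeqs)
{
    size_t const nbSeq = (size_t)(nextSeqs->sequences - nextSeqs->sequencesStart);
    size_t i;
    if (nbSeq < ZSTD_REP_NUM) return 1;  /* not enough lookahead: assume unsafe */
    for (i = 0; i < ZSTD_REP_NUM; ++i) {
        if (nextSeqs->sequencesStart[i].offset <= ZSTD_REP_NUM)
            return 1;  /* a repcode could reference data in the current block */
    }
    return 0;
}
```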

Contributor:

Can you add a test case that catches this, or verify that the oss-fuzz fuzzers can catch this?

I believe you don't see this trivially because you don't split blocks of fewer than 300 sequences. That makes no-compress or RLE blocks less likely, but not impossible.

Contributor Author (@senhuang42) commented Mar 1, 2021:

So the fuzzer seems to catch this fairly easily when I reduce this constant down to 20, but at 300, running it for a few hours doesn't catch it (though I guess at some point it might get caught anyway).

For a unit test, I'm having trouble recreating this scenario: a block of 300 sequences would need to be generated such that the first 150 sequences are incompressible, followed by a sequence that uses repcode history to reference an offset from one of those 150 sequences (corrupted), followed by 150 very compressible sequences (to incentivize block splitting).

How would someone generate a series of 150 incompressible sequences?

Contributor:

Maybe https://en.wikipedia.org/wiki/De_Bruijn_sequence with 4-byte runs of zeros interspersed (for matches)?
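
A minimal sketch of one way to build such an input (an illustration of the idea, not a true De Bruijn construction): high-entropy bytes so the literals stay incompressible, with short zero runs interspersed so the match finder still produces sequences.

```c
#include <stddef.h>
#include <stdint.h>

/* Simple xorshift PRNG: cheap, deterministic high-entropy bytes. */
static uint32_t xorshift32(uint32_t* s)
{
    uint32_t x = *s;
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return *s = x;
}

/* Fill buf with mostly random bytes, inserting a 4-byte zero run every
 * 64 bytes so the compressor still emits (mostly incompressible) sequences. */
static void fillNoisyWithMatches(uint8_t* buf, size_t size)
{
    uint32_t rng = 0x9E3779B9u;
    size_t i = 0;
    while (i < size) {
        if ((i % 64) == 60 && i + 4 <= size) {
            buf[i] = buf[i+1] = buf[i+2] = buf[i+3] = 0;
            i += 4;
        } else {
            buf[i++] = (uint8_t)xorshift32(&rng);
        }
    }
}
```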


cSizeChunk = ZSTD_compressSequences_singleBlock(zc, &chunkSeqStore, op, dstCapacity, ip, srcBytes, lastBlockActual);
FORWARD_IF_ERROR(cSizeChunk, "Compressing chunk failed!");
ZSTD_memcpy(zc->blockState.nextCBlock->rep, zc->blockState.prevCBlock->rep, sizeof(U32)*ZSTD_REP_NUM);
Contributor:

What is this call for? You have already called ZSTD_confirmRepcodesAndEntropy().

Contributor Author:

If I remove this then we can't round-trip anymore. I believe that's because, if a nocompress/RLE block is emitted, we should still copy over the repcodes so that the decoder can still resolve repcodes that refer to offsets from the block before the nocompress/RLE one.

@terrelln (Contributor):
I was playing around with your block splitter when I ran into those errors. I was trying the algorithm below, a slight variation on your split, which tries numChunks split points instead of 2. Obviously it is significantly slower, but I saw improved compression ratios:

| file | +ratio% |
|------|--------:|
| mozilla | +0.4% |
| calgary | +0.15% |
| cantrbry | +0.1% |
| kennedy | +0.3% |
| seqs | +5.5% |

That shows that there is room for gains, depending on the file. Once you get all the infrastructure for supporting block splitting working, do you intend to continue looking into stronger splitting algorithms?

#define MIN_SEQUENCES_BLOCK_SPLITTING 19
static void ZSTD_deriveBlockSplitsHelper(seqStoreSplits* splits, size_t startIdx, size_t endIdx,
                                         const ZSTD_CCtx* zc, const seqStore_t* origSeqStore) {
    seqStore_t fullSeqStoreChunk;
    seqStore_t firstHalfSeqStore;
    seqStore_t secondHalfSeqStore;
    size_t estimatedOriginalSize;
    size_t numChunks = 19;
    size_t chunkSize = (endIdx - startIdx) / numChunks;
    size_t minCost;
    size_t bestSplit = 0;  /* only read after minCost has been lowered, i.e. after assignment */

    if (endIdx - startIdx < MIN_SEQUENCES_BLOCK_SPLITTING || splits->idx >= MAX_NB_SPLITS) {
        return;
    }
    /* Estimate the cost of emitting [startIdx, endIdx) as a single chunk */
    ZSTD_deriveSeqStoreChunk(&fullSeqStoreChunk, origSeqStore, startIdx, endIdx);
    estimatedOriginalSize = ZSTD_buildEntropyStatisticsAndEstimateSubBlockSize(&fullSeqStoreChunk, zc);
    if (ZSTD_isError(estimatedOriginalSize)) {
        return;
    }
    minCost = estimatedOriginalSize;

    /* Try numChunks-1 evenly spaced split points, keeping the cheapest */
    for (size_t s = 1; s < numChunks; ++s) {
        size_t estimatedFirstHalfSize;
        size_t estimatedSecondHalfSize;
        size_t midIdx = startIdx + s * chunkSize;
        if (midIdx <= startIdx || midIdx >= endIdx)
            continue;
        ZSTD_deriveSeqStoreChunk(&firstHalfSeqStore, origSeqStore, startIdx, midIdx);
        ZSTD_deriveSeqStoreChunk(&secondHalfSeqStore, origSeqStore, midIdx, endIdx);
        estimatedFirstHalfSize = ZSTD_buildEntropyStatisticsAndEstimateSubBlockSize(&firstHalfSeqStore, zc);
        estimatedSecondHalfSize = ZSTD_buildEntropyStatisticsAndEstimateSubBlockSize(&secondHalfSeqStore, zc);
        if (ZSTD_isError(estimatedFirstHalfSize) || ZSTD_isError(estimatedSecondHalfSize)) {
            continue;
        }
        if (estimatedFirstHalfSize + estimatedSecondHalfSize < minCost) {
            minCost = estimatedFirstHalfSize + estimatedSecondHalfSize;
            bestSplit = midIdx;
        }
    }
    /* Recurse on both halves only if some split beat the unsplit estimate */
    if (minCost < estimatedOriginalSize) {
        size_t const midIdx = bestSplit;
        ZSTD_deriveBlockSplitsHelper(splits, startIdx, midIdx, zc, origSeqStore);
        splits->splitLocations[splits->idx] = (U32)midIdx;
        splits->idx++;
        ZSTD_deriveBlockSplitsHelper(splits, midIdx, endIdx, zc, origSeqStore);
    }
}

@senhuang42 (Contributor Author):

> That shows that there is room for gains, depending on the file. Once you get all the infrastructure for supporting block splitting working, do you intend to continue looking into stronger splitting algorithms?

Yep, though any potential alternative block splitting methods probably warrant a separate PR, since this one by itself seems fast enough to have demonstrated "free" value at the higher compression levels, and is probably usable at the upper-medium ones too.

@senhuang42 force-pushed the block_splitter_v2 branch 2 times, most recently from 6ca36d0 to af3d679 on February 26, 2021
@senhuang42 force-pushed the block_splitter_v2 branch 3 times, most recently from 06d0d03 to 6153280 on March 9, 2021
const ZSTD_fseCTables_t* prevEntropy, ZSTD_fseCTables_t* nextEntropy,
BYTE* dst, const BYTE* const dstEnd,
ZSTD_strategy strategy, unsigned* countWorkspace,
void* entropyWorkspace, size_t entropyWkspSize) {
Contributor:

Comment: when a function has this many parameters, it warrants some caution. Maybe there are ways to reduce that number.

There is nothing obviously wrong, nor any obvious solution, so I'll just keep an eye on this topic while reading the rest of the code.

const ZSTD_fseCTables_t* prevEntropy, ZSTD_fseCTables_t* nextEntropy,
BYTE* dst, const BYTE* const dstEnd,
ZSTD_strategy strategy, unsigned* countWorkspace,
void* entropyWorkspace, size_t entropyWkspSize) {
Contributor:

Document how large entropyWkspSize must be for the function to succeed.

U32 MLtype;
BYTE* seqHead = op++;
/* build stats for sequences */
entropyStatisticsSize = ZSTD_buildSequencesStatistics(seqStorePtr, nbSeq,
Contributor:

The scope of the variable entropyStatisticsSize seems limited to this block. Prefer declaring it here, so that its lifetime ends at the end of the block. General rule: the shorter the lifetime, the better.
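
A sketch of the suggested scoping (illustrative; the trailing arguments and the `op +=` use are assumed for illustration):

```c
{   /* declared in the block where it is used; lifetime ends at the brace */
    size_t const entropyStatisticsSize =
        ZSTD_buildSequencesStatistics(seqStorePtr, nbSeq /* , ... */);
    FORWARD_IF_ERROR(entropyStatisticsSize, "ZSTD_buildSequencesStatistics failed");
    op += entropyStatisticsSize;
}   /* entropyStatisticsSize is no longer accessible here */
```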

/* ZSTD_buildSequencesStatistics() is guaranteed to overwrite these values */
U32 LLtype = set_basic;
U32 Offtype = set_basic;
U32 MLtype = set_basic;
Contributor (@Cyan4973) commented Mar 11, 2021:

In this invocation of ZSTD_buildSequencesStatistics(), LLtype, Offtype and MLtype are initialized. But in the previous invocation, they were not.

This suggests that one of these 2 formulations is incorrect: either these values are read, and they should be initialized, or they are not, and initializing them makes it appear that their initial values matter.

Reading the code of ZSTD_buildSequencesStatistics(), it appears they are not read, so they don't need to be initialized.

This underlines a classic problem with in+out variables (by reference): we can't tell, from the function signature, whether they need to be initialized before being passed to the function.

Prefer returning these 3 values as part of the function's return statement. That will make it clear that they are only out values.

*/
MEM_STATIC size_t
ZSTD_buildSequencesStatistics(seqStore_t* seqStorePtr, size_t nbSeq,
U32* LLtype, U32* Offtype, U32* MLtype, size_t* lastCountSize,
Contributor:

It seems that LLtype, Offtype and MLtype are simple scalar out values. Prefer passing them as part of the return statement of the function (which would therefore need to be upgraded to a struct), so that it's clear they are purely out values, not in+out.
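
A minimal sketch of that refactor (hypothetical names; zstd did later adopt a similar struct-returning shape):

```c
/* Hypothetical sketch: bundle the pure out-values into the return type,
 * so the signature no longer suggests in+out semantics. */
typedef struct {
    U32 LLtype;
    U32 Offtype;
    U32 MLtype;
    size_t size;  /* bytes written for the table descriptions, or a zstd error code */
} ZSTD_symbolEncodingTypeStats_t;

static ZSTD_symbolEncodingTypeStats_t
ZSTD_buildSequencesStatistics_sketch(const seqStore_t* seqStorePtr, size_t nbSeq /* , ... */)
{
    ZSTD_symbolEncodingTypeStats_t stats = { set_basic, set_basic, set_basic, 0 };
    /* ... analyze literal-length / offset / match-length codes, build the
     * FSE tables, and overwrite the stats fields accordingly ... */
    (void)seqStorePtr; (void)nbSeq;  /* sketch only */
    return stats;
}
```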

@@ -2378,7 +2438,7 @@ ZSTD_entropyCompressSequences(seqStore_t* seqStorePtr,
void* dst, size_t dstCapacity,
size_t srcSize,
void* entropyWorkspace, size_t entropyWkspSize,
int bmi2)
int bmi2, U32 const canEmitUncompressed)
Contributor (@Cyan4973) commented Mar 11, 2021:

Interesting, so you want to be able to forbid ZSTD_entropyCompressSequences() from emitting non-compressed sequences, including in cases where compression isn't considered advantageous.

  1. Is that just a test feature, or a permanent one?
  2. If permanent, could you explain in which cases this control is necessary?
  3. The specification doesn't require the compressed data to be smaller than the uncompressed data. But a compressed block must still be smaller than a full block (i.e. 128 KB in the general case), and the full block also includes the literals section. What would happen in this case?

Contributor Author:

So this is a permanent feature. The reason I include it here is that I want to avoid emitting uncompressed/RLE blocks when doing so would destroy repcode history while block splitting. I.e., if the next block's first three sequences contain a repcode, we shouldn't emit an uncompressed or RLE block, since the next block's repcode sequence would then reference a sequence that is "lost" inside the uncompressed or RLE block, and therefore be invalid.

I suppose that in this case, it's possible that we emit a "compressed" block that is larger than the uncompressed one, basically a similar limitation to superblocks, though I'm not quite sure how we could get around it while block splitting.

Contributor:

I still have one question on this point:

Can you guarantee that this mechanism (forcing the output to be compressed even when disadvantageous) is not going to produce a block larger than a full block? If yes, can you add an assert()?

Contributor Author (@senhuang42) commented Mar 22, 2021:

I don't think I can guarantee that, at least from this particular codepath.

One thing we could potentially do: if it turns out that our split, compressed blocks' total size is larger than 128 KB, trigger a re-compression using the traditional method (single block). Do you think that'd be sufficient? Is it accurate to say that the upper bound on a zstd-format block is then 131075 bytes (128 KB raw block + 3-byte block header)?

Contributor:

> Is it accurate to say that the upper bound on a zstd-format block is then 131075 bytes (128 KB raw block + 3-byte block header)?

Well, depending on where the size is measured, yes, this can be correct.

Note though that there are several places in the code which do not include the block header cost.

> One thing we could potentially do: if it turns out that our split, compressed blocks' total size is larger than 128 KB, trigger a re-compression using the traditional method (single block). Do you think that'd be sufficient?

Yes.

That being said, the limitation is stricter than that: no individual block can be > 128 KB.

If the original block is split, and the sum of both parts is > 128 KB, this is not a "problem", in the sense that it's still valid (inefficient, sure, but valid). But if one block is > 128 KB, then it's beyond spec, and can break the decoder. That's what we want to avoid.

Contributor Author (@senhuang42) commented Mar 22, 2021:

Okay, good to know. I believe in this case we'd need to go with the stricter limitation of not allowing the split blocks to cumulatively exceed 128 KB (which in turn prevents any single one from exceeding 128 KB).

This is because the only case in which we'd potentially emit a compressed block larger than 128 KB is when we forcibly disable uncompressed or RLE blocks, because emitting them would erroneously destroy the repcode history. So in that situation, the following two statements are simultaneously true: 1) emitting a compressed block instead of an uncompressed/RLE block is necessary to maintain repcode history; 2) emitting a compressed block may exceed the 128 KB limit.

Thus, we basically have to re-compress in case statement 2) turns out to be true (we can't do it "halfway"). The latest commit adds this functionality.
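
A minimal sketch of that rollback (illustrative; ZSTD_compressBlock_traditional is a hypothetical name for the regular single-block path, while the PR's actual check appears in a diff excerpt further below):

```c
/* Hypothetical sketch: if the sum of the split sub-blocks exceeds what a
 * single block may occupy, discard the split result and recompress the
 * original 128 KB block through the regular single-block path. */
size_t cSize = ZSTD_compressBlock_splitBlock_internal(zc, dst, dstCapacity, src, srcSize, lastBlock);
if (!ZSTD_isError(cSize) && cSize > ZSTD_BLOCKSIZE_MAX + ZSTD_blockHeaderSize) {
    cSize = ZSTD_compressBlock_traditional(zc, dst, dstCapacity, src, srcSize, lastBlock);
}
```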

ZSTD_entropyCTables_t* nextEntropy,
const ZSTD_CCtx_params* cctxParams,
ZSTD_entropyCTablesMetadata_t* entropyMetadata,
void* workspace, size_t wkspSize)
Contributor:

Minor: specify what the (minimum) wkspSize should be for this function to work properly.

/* Returns the size estimate for the FSE-compressed symbols (of, ml, ll) of a block */
static size_t ZSTD_estimateBlockSize_symbolType(symbolEncodingType_e type,
const BYTE* codeTable, unsigned maxCode,
size_t nbSeq, const FSE_CTable* fseCTable,
Contributor:

Minor: parameter order. nbSeq is, if I understand correctly, the actual size of the codeTable buffer, so it should be placed alongside it.

assert(max <= defaultMax);
cSymbolTypeSizeEstimateInBits = max <= defaultMax
? ZSTD_crossEntropyCost(defaultNorm, defaultNormLog, countWksp, max)
: ERROR(GENERIC);
Contributor (@Cyan4973) commented Mar 11, 2021:

The error case shouldn't be necessary. Per the assert() before it, this case should never happen (even in the presence of bogus input data).

return cSeqSizeEstimate + sequencesSectionHeaderSize;
}

/* Returns the size estimate for a given stream of literals, of, ll, ml */
Contributor:

Question: do you have a debug mode in which the estimated and the real size of the block are compared? If yes, do you have results?

Contributor Author:

I'll add this to the debug logs. While developing this and eyeballing the output: the literals size estimate was typically almost exact (1-4 bytes off), while the sequences size estimate was a bit further off.

An excerpt of results from the beginning of silesia.tar:

../lib/compress/zstd_compress.c: Estimated size: 8108 actual size: 8158 
../lib/compress/zstd_compress.c: Estimated size: 6107 actual size: 6129 
../lib/compress/zstd_compress.c: Estimated size: 10062 actual size: 10058 
../lib/compress/zstd_compress.c: Estimated size: 19766 actual size: 19718 
../lib/compress/zstd_compress.c: Estimated size: 20583 actual size: 20464 
../lib/compress/zstd_compress.c: Estimated size: 22150 actual size: 22076 
../lib/compress/zstd_compress.c: Estimated size: 19845 actual size: 19756 
../lib/compress/zstd_compress.c: Estimated size: 21328 actual size: 21261 
../lib/compress/zstd_compress.c: Estimated size: 17800 actual size: 17706 
../lib/compress/zstd_compress.c: Estimated size: 19530 actual size: 19484

for (i = 0; i < nbSeqs; ++i) {
seqDef seq = seqStore->sequencesStart[i];
literalsBytes += seq.litLength;
if (i == seqStore->longLengthPos && seqStore->longLengthID == 1) {
Contributor:

Minor, and not caused by this PR: maybe it would have been better to employ an enum for the values of longLengthID, because at this stage the meaning of 1 is a bit lost (it requires consulting the relevant documentation in zstd_internal.h).
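
A sketch of the suggested enum (illustrative names; zstd later introduced a similar ZSTD_longLengthType_e):

```c
/* Illustrative only: gives the magic values of longLengthID readable names. */
typedef enum {
    ZSTD_llt_none = 0,           /* no long-length correction */
    ZSTD_llt_literalLength = 1,  /* longLengthPos marks an out-of-range literal length */
    ZSTD_llt_matchLength = 2     /* longLengthPos marks an out-of-range match length */
} ZSTD_longLengthType_e;
```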

@senhuang42 force-pushed the block_splitter_v2 branch 2 times, most recently from e56351e to 6db18d6 on March 17, 2021
@senhuang42 (Contributor Author) commented Mar 19, 2021:

@Cyan4973 some stats on the accuracy of the block splitter:

Generally the false positive rate (splitting when we shouldn't have) seems pretty low:

silesia.tar

| level | no split | right | wrong | false_pos_rate |
|------:|---------:|------:|------:|---------------:|
| 9 | 1118 | 496 | 4 | 0.8% |
| 13 | 1120 | 488 | 10 | 2.0% |
| 19 | 1193 | 416 | 9 | 2.1% |
| 22 | 1200 | 414 | 4 | 0.95% |

Additionally, some benchmarks that include decompression speed, since a higher density of block boundaries implies slower decompression:

no block splitting:
 9#silesia.tar       : 211950592 ->  60174265 (3.522),  35.6 MB/s ,1136.9 MB/s 
10#silesia.tar       : 211950592 ->  59525124 (3.561),  30.4 MB/s ,1133.4 MB/s 
11#silesia.tar       : 211950592 ->  59230515 (3.578),  24.9 MB/s ,1133.8 MB/s 
12#silesia.tar       : 211950592 ->  58769028 (3.607),  17.2 MB/s ,1144.9 MB/s 
13#silesia.tar       : 211950592 ->  58076720 (3.649),  11.8 MB/s ,1166.1 MB/s 
14#silesia.tar       : 211950592 ->  57565058 (3.682),  9.90 MB/s ,1173.4 MB/s 
15#silesia.tar       : 211950592 ->  57158512 (3.708),  7.60 MB/s ,1182.9 MB/s 
16#silesia.tar       : 211950592 ->  55685499 (3.806),  5.88 MB/s ,1149.0 MB/s 
17#silesia.tar       : 211950592 ->  54591166 (3.883),  4.63 MB/s ,1118.9 MB/s 
18#silesia.tar       : 211950592 ->  53698939 (3.947),  3.71 MB/s ,1067.9 MB/s 
19#silesia.tar       : 211950592 ->  53263176 (3.979),  3.04 MB/s ,1054.7 MB/s

with block splitting:
 9#silesia.tar       : 211950592 ->  59761275 (3.547),  31.9 MB/s ,1084.0 MB/s 
10#silesia.tar       : 211950592 ->  59128797 (3.585),  27.5 MB/s ,1078.3 MB/s 
11#silesia.tar       : 211950592 ->  58838943 (3.602),  23.0 MB/s ,1079.3 MB/s 
12#silesia.tar       : 211950592 ->  58380963 (3.630),  16.4 MB/s ,1088.8 MB/s 
13#silesia.tar       : 211950592 ->  57686254 (3.674),  11.4 MB/s ,1107.5 MB/s 
14#silesia.tar       : 211950592 ->  57181298 (3.707),  9.63 MB/s ,1115.8 MB/s 
15#silesia.tar       : 211950592 ->  56777472 (3.733),  7.41 MB/s ,1124.2 MB/s 
16#silesia.tar       : 211950592 ->  55338421 (3.830),  5.77 MB/s ,1092.2 MB/s 
17#silesia.tar       : 211950592 ->  54270477 (3.905),  4.57 MB/s ,1067.3 MB/s 
18#silesia.tar       : 211950592 ->  53428123 (3.967),  3.69 MB/s ,1031.5 MB/s 
19#silesia.tar       : 211950592 ->  52998717 (3.999),  3.01 MB/s ,1020.6 MB/s

In general, it looks like we can get around ~0.8 of a compression level's worth of ratio gain, for a more or less fixed compression speed cost that becomes hardly noticeable at higher levels, plus a minor but perhaps noticeable decompression speed slowdown. I'm of the opinion that this should definitely be enabled at the higher compression levels. The medium levels are up for debate, and I'd expect row hash to also change how we think about the medium levels with respect to the block splitter, given its significant speedup.

@Cyan4973 (Contributor):

I agree. Enabling this feature by default for high compression levels is a good strategy. Looking at the benchmark figures, the cutoff seems to be around the btlazy2 strategy.

Question: in the first report, what does wrong mean?

  • It determined that the block should be split, but splitting actually results in a worse compression ratio?
    • Is that "strictly worse", or just not good enough (according to some criterion)?
  • Or it determined that the block should not be split, but splitting would actually have improved the compression ratio?

@senhuang42 (Contributor Author) commented Mar 19, 2021:

> in the first report, what does wrong mean?

The former case: the block was determined to be split and we split it, but it turns out we should not have. The criterion is simply that the split 128K block has a compressed size larger than or equal to that of the unsplit 128K block.

Here's some more info on the number of bytes that could have been saved had we re-compressed each "incorrectly" split block as a single block; it doesn't seem very large:

silesia.tar
-level 9: potential savings: 78 bytes
-level 13: potential savings: 92 bytes
-level 19: potential savings: 159 bytes
-level 22: potential savings: 159 bytes

To make this even better, we'd need to add more to the algorithm itself, i.e. address the latter case you mentioned: discover potential savings we didn't find.

@Cyan4973 (Contributor):

> we'd need to add more to the algorithm itself, i.e. address the latter case you mentioned: discover potential savings we didn't find.

To do this, we would just have to lower the threshold, into negative-value territory. It would cover situations where the estimator believes that splitting is not advantageous, while actual compression shows a gain. Of course, such a mechanism implies an ability to "roll back" the split, because it will now be a "bad" decision in a larger number of cases. It also implies a larger CPU cost.

That being said, there is no need to address that in this PR. These are just thoughts for a future follow-up.

@senhuang42 (Contributor Author) commented Mar 22, 2021:

I've set this to be conservatively enabled by default: for strategy >= btopt and a window size of at least 128 KB (the default block size).

I've integrated it in the same style/codepaths used to enable LDM by default.
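
A minimal sketch of that gating (hypothetical helper name; ZSTD_btopt and ZSTD_BLOCKSIZELOG_MAX, which is 17, i.e. log2 of the 128 KB block size, are existing zstd constants):

```c
/* Hypothetical sketch: resolve the default the way LDM does, enabling
 * block splitting only for strong strategies and windows large enough
 * to hold at least one full 128 KB block. */
static int ZSTD_blockSplitterEnabled(const ZSTD_CCtx_params* cctxParams)
{
    return (cctxParams->cParams.strategy >= ZSTD_btopt)
        && (cctxParams->cParams.windowLog >= ZSTD_BLOCKSIZELOG_MAX);
}
```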

@@ -3370,6 +3371,14 @@ static size_t ZSTD_compressBlock_splitBlock_internal(ZSTD_CCtx* zc, void* dst, s
cSize += cSizeChunk;
currSeqStore = nextSeqStore;
}

if (cSize > ZSTD_BLOCKSIZE_MAX + ZSTD_blockHeaderSize) {
Contributor (@Cyan4973) commented Mar 22, 2021:

A few things I'm not sure I understand here:

  1. Are you certain that, at this place in the code, cSize includes the block header size?
  2. The limitation on the maximum compressed block size relates to the complete block; in the compressed format, that includes both the literals and sequences sub-blocks. Yet, below, I only notice ZSTD_compressSequences.

Contributor Author (@senhuang42) commented Mar 22, 2021:

  1. Yes, ZSTD_compressSequences_singleBlock() includes the block header, and this cSize measurement includes all block headers from the split blocks.
  2. ZSTD_compressSequences_singleBlock() compresses both the literals and the sequences (I named it in the style of the existing ZSTD_entropyCompressSequences(), which also compresses literals and sequences). Maybe a better name would be ZSTD_compressSeqStore_singleBlock(), and rename the existing ZSTD_entropyCompressSequences() to ZSTD_entropyCompressSeqStore()?

Contributor:

Yes, the new names are much better for clarity.

@senhuang42 (Contributor Author):

The most recent push just improves and cleans up the git commit history. I will merge once tests look okay.

@senhuang42 merged commit bf542c8 into facebook:dev on Mar 24, 2021
Contributor @terrelln left a comment:

Hey @senhuang42, can you please move the commented items into a workspace? Block splitting is using too much stack space.

You can test it with:

cd contrib/linux-kernel
CC=clang make -j test

That test is failing currently. Before this PR I see Maximum stack size: 2056; after, I see Maximum stack size: 3096.

Comment on lines +3256 to +3258
seqStore_t fullSeqStoreChunk;
seqStore_t firstHalfSeqStore;
seqStore_t secondHalfSeqStore;
Contributor:

Each seqStore_t is 80 bytes, so this is 240 bytes of stack space. That is too much.

size_t cSize = 0;
const BYTE* ip = (const BYTE*)src;
BYTE* op = (BYTE*)dst;
U32 partitions[MAX_NB_SPLITS];
Contributor:

This is 196*4 = 784 bytes of stack space; that is too much.

Comment on lines +3329 to +3330
seqStore_t nextSeqStore;
seqStore_t currSeqStore;
Contributor:

This is another 160 bytes of stack space.

* Returns the estimated compressed size of the seqStore, or a zstd error.
*/
static size_t ZSTD_buildEntropyStatisticsAndEstimateSubBlockSize(seqStore_t* seqStore, const ZSTD_CCtx* zc) {
ZSTD_entropyCTablesMetadata_t entropyMetadata;
Contributor:

This is a few hundred bytes of stack space, which is too much.

@senhuang42 (Contributor Author) commented Sep 15, 2021:

Seems like the best thing to do would be to define a new struct that holds all these things, like a block split context or something, and allocate it as part of the cctx.
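
A minimal sketch of that fix (illustrative; zstd later added a similar ZSTD_blockSplitCtx to its cctx): gather every large temporary flagged above into one struct owned by the context, so the recursive splitting code adds no per-call stack cost.

```c
/* Hypothetical sketch: all large block-splitting temporaries live in the
 * cctx instead of on the stack of the (recursive) splitting functions. */
typedef struct {
    seqStore_t fullSeqStoreChunk;
    seqStore_t firstHalfSeqStore;
    seqStore_t secondHalfSeqStore;
    seqStore_t currSeqStore;
    seqStore_t nextSeqStore;
    U32 partitions[MAX_NB_SPLITS];
    ZSTD_entropyCTablesMetadata_t entropyMetadata;
} ZSTD_blockSplitCtx;

/* In ZSTD_CCtx: add `ZSTD_blockSplitCtx blockSplitCtx;`, allocated once
 * with the context and reused across blocks. */
```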
