I'm not 100% sure that block splitter is working properly #3204

Closed
thatch opened this issue Jul 18, 2022 · 4 comments

Comments

@thatch

thatch commented Jul 18, 2022

Describe the bug

This is not a correctness issue. It simply appears to be leaving a lot of compression ratio unattained when faced with real-world data.

Side notes:

  • zstd is 1.5.2-3 from Arch Linux
  • python-zstandard is 0.18.0 using the C backend (backend_c)
  • Since the block splitter is an experimental parameter, I don't think I can enable it separately in the zstd CLI or in the python-zstandard compression parameters. I am using level 19 on the assumption that it is always on at that level (see the sketch below)...
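A minimal sketch of that setup check, assuming only the documented python-zstandard API (the attribute names below come from its docs, not from this issue):

```python
import zstandard

# Which backend and libzstd version are actually in use.
print(zstandard.backend)       # "cext" when the C backend is active
print(zstandard.ZSTD_VERSION)  # version tuple of the bundled libzstd

# Inspect what level 19 maps to; note there is no public knob here for
# the experimental block splitter, hence the level-19 assumption above.
params = zstandard.ZstdCompressionParameters.from_level(19)
print(params.window_log, params.strategy)
```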

from #2832
uncompressed.bin https://gist.github.com/thatch/853de8089e17462f2e500a7f87c76736
zstd -19: 179956 bytes
trivial subdivision in python: 129343 bytes

from #3014
test.bin https://gist.github.com/thatch/c2c2cc7304208423543edc38f2b7ed76
zstd -19: 2688424 bytes
trivial subdivision in python: 1021806 bytes

Expected behavior
Ratio to be as good as (or a little better than) my trivial code.

For trivial random input with blocks of known entropy, it does appear to work. I have some evidence that this affects ELF files, which isn't super surprising because they likely have 4k-aligned sections.

@terrelln
Contributor

Thanks for the report! We know our block splitter is pretty trivial right now, and we hope to find time to work on it in the future. This case in particular is probably an interaction between block splitting and the optimal parser, which makes decisions based on what it estimates particular matches cost. Block splitting changes those costs, but we don't re-parse based on the newly split blocks.

trivial subdivision in python: 1021806 bytes

How do you sub-divide in Python? I want to make sure that we can exactly reproduce both the good & bad cases, so we can investigate exactly what is going on to cause this huge regression.

@thatch
Author

thatch commented Jul 25, 2022

How do you sub-divide in Python?

Not very easily, unfortunately. See https://gist.github.com/thatch/c2e8aa3d34a7a548687f43813f9cb788

Usage: python compress_subdivide.py -19 input_filename.bin

After reading 128KB, it tries compressing that whole block, as well as 64+64, 32+32+32+32, etc. (uniformly sized sub-blocks within a 128KB block). In initial testing there appeared to be a single minimum [sizes looked concave up], hence the "break", which avoids the worst of the performance penalty.

Note that the script will never choose something like 32+64+32, or 32+128+... (altering the stride), which are other potential wins if the determination is cheap enough [I think you might do the former, but not the latter]. Because I'm trying to keep accurate estimates of what a block would take to compress, including the use of backreferences, I have to prime the compressor with a fair amount of data, which must be recompressed after every context reset.
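For reference, a minimal sketch of the search strategy (not the gist itself). Unlike the real script, it compresses each sub-block as an independent frame, so it ignores backreferences into preceding data and only approximates the true per-block costs; the filename is a placeholder.

```python
import zstandard

BLOCK = 128 * 1024  # one full-size zstd block

def best_uniform_split(chunk: bytes, level: int = 19) -> int:
    """Compressed size of the best uniform split of one 128KB chunk."""
    cctx = zstandard.ZstdCompressor(level=level)
    best = len(cctx.compress(chunk))  # try the unsplit chunk first
    size = len(chunk) // 2
    while size >= 4 * 1024:  # arbitrary lower bound for the sketch
        total = sum(
            len(cctx.compress(chunk[i:i + size]))
            for i in range(0, len(chunk), size)
        )
        if total >= best:
            break  # sizes looked concave up: stop at the first minimum
        best = total
        size //= 2
    return best

with open("input_filename.bin", "rb") as f:
    data = f.read()
print(sum(best_uniform_split(data[i:i + BLOCK])
          for i in range(0, len(data), BLOCK)))
```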

If there were an API to insert an already-compressed block/window worth of data, or to copy a context, scripts like this would be a lot more performant. But I think in your shoes I'd say "zstd should be able to at least match this internally" and keep the public API simple.

@Cyan4973
Contributor

Cyan4973 commented Jun 24, 2024

I'm currently revisiting this topic.
After investigation, the main reason for the substantial compression ratio differences reported in this issue, for both #2832 and #3014, is not block splitting per se, but an indirect side effect.

The ratio difference comes from subtle reinforcement effects in the optimal parser's cost evaluation function. Minuscule differences in initial conditions can tip the cost function in one direction or the other, and splitting blocks differently changes those initial conditions. That is why trying multiple size combinations increases the odds of tipping the cost function in the better direction (at a CPU cost, obviously).

The situation was improved in v1.5.6, and the compression ratio is now more stable and closer to the best scenario. Using the samples mentioned above, uncompressed.bin now compresses to 131,410 bytes and test.bin to 831,424 bytes, which is in line with or better than the results reported here using the Python multi-split-compare script. These results were achieved without modifying the current block splitting function.
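For anyone who wants to re-check these numbers, a minimal sketch: it assumes the two samples from the gists above were downloaded locally, and one-shot frame sizes may differ slightly from the zstd CLI's output.

```python
import zstandard

# The results above require a libzstd with the v1.5.6 improvements.
assert zstandard.ZSTD_VERSION >= (1, 5, 6)

for name in ("uncompressed.bin", "test.bin"):
    with open(name, "rb") as f:
        data = f.read()
    size = len(zstandard.ZstdCompressor(level=19).compress(data))
    print(f"{name}: {size:,} bytes")
```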

This doesn't mean that the block splitting function is perfect, but improvements to it are generally expected to be in the ~1% range, save for unusually catastrophic scenarios.

@Cyan4973 Cyan4973 self-assigned this Jun 24, 2024
@Cyan4973
Contributor

Cyan4973 commented Oct 3, 2024

fixed in v1.5.6

@Cyan4973 Cyan4973 closed this as completed Oct 3, 2024