I'm not 100% sure that block splitter is working properly #3204

Closed
thatch opened this issue Jul 18, 2022 · 4 comments

Comments

@thatch

thatch commented Jul 18, 2022

Describe the bug

This is not a correctness issue. It simply appears to be leaving a lot of compression ratio unattained when faced with real-world data.

Side notes:

  • zstd is 1.5.2-3 from Arch Linux
  • python-zstandard is 0.18.0 using the C backend (backend_c)
  • Since the block splitter is an experimental parameter, I don't think I can enable it separately in the zstd CLI or in the python-zstandard compression parameters. I am using level 19 on the assumption that it is always on at that level (see the sketch below)...
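A minimal sketch of that setup check, assuming only the documented python-zstandard API (the attribute names below come from its docs, not from this issue):

```python
import zstandard

# Which backend and libzstd version are actually in use.
print(zstandard.backend)       # "cext" when the C backend is active
print(zstandard.ZSTD_VERSION)  # version tuple of the bundled libzstd

# Inspect what level 19 maps to; note there is no public knob here for
# the experimental block splitter, hence the level-19 assumption above.
params = zstandard.ZstdCompressionParameters.from_level(19)
print(params.window_log, params.strategy)
```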

from #2832
uncompressed.bin https://gist.github.com/thatch/853de8089e17462f2e500a7f87c76736
zstd -19: 179956 bytes
trivial subdivision in python: 129343 bytes

from #3014
test.bin https://gist.github.com/thatch/c2c2cc7304208423543edc38f2b7ed76
zstd -19: 2688424 bytes
trivial subdivision in python: 1021806 bytes

Expected behavior
Ratio to be as good as (or a little better than) my trivial code.

For trivial random input with blocks of known entropy, it does appear to work. I have some evidence that this affects ELF files, which isn't super surprising because they likely have 4k-aligned sections.

@terrelln
Contributor

Thanks for the report! We know our block splitter is pretty trivial right now, and we hope to find time to work on it in the future. This case in particular is probably an interaction between block splitting and the optimal parser, which makes decisions based on what it estimates particular matches cost. Block splitting changes those costs, but we don't re-parse based on the newly split blocks.

trivial subdivision in python: 1021806 bytes

How do you sub-divide in Python? I want to make sure that we can exactly reproduce both the good & bad cases, so we can investigate exactly what is going on to cause this huge regression.

@thatch
Author

thatch commented Jul 25, 2022

How do you sub-divide in Python?

Not very easily, unfortunately. See https://gist.github.com/thatch/c2e8aa3d34a7a548687f43813f9cb788

Usage: python compress_subdivide.py -19 input_filename.bin

After reading 128KB, it tries compressing that whole block, as well as 64+64, 32+32+32+32, etc. (uniformly sized sub-blocks within a 128KB block). In initial testing there appeared to be a single minimum [sizes looked concave up], hence the "break", which avoids the worst of the performance penalty.

Note that the script will never choose something like 32+64+32, or 32+128+... (altering the stride), which are other potential wins if the determination is cheap enough [I think you might do the former, but not the latter]. Because I'm trying to keep accurate estimates of what a block would take to compress, including the use of backreferences, I have to prime the compressor with a fair amount of data, which must be recompressed after every context reset.
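For reference, a minimal sketch of the search strategy (not the gist itself). Unlike the real script, it compresses each sub-block as an independent frame, so it ignores backreferences into preceding data and only approximates the true per-block costs; the filename is a placeholder.

```python
import zstandard

BLOCK = 128 * 1024  # one full-size zstd block

def best_uniform_split(chunk: bytes, level: int = 19) -> int:
    """Compressed size of the best uniform split of one 128KB chunk."""
    cctx = zstandard.ZstdCompressor(level=level)
    best = len(cctx.compress(chunk))  # try the unsplit chunk first
    size = len(chunk) // 2
    while size >= 4 * 1024:  # arbitrary lower bound for the sketch
        total = sum(
            len(cctx.compress(chunk[i:i + size]))
            for i in range(0, len(chunk), size)
        )
        if total >= best:
            break  # sizes looked concave up: stop at the first minimum
        best = total
        size //= 2
    return best

with open("input_filename.bin", "rb") as f:
    data = f.read()
print(sum(best_uniform_split(data[i:i + BLOCK])
          for i in range(0, len(data), BLOCK)))
```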

If there were an API to insert an already-compressed block/window worth of data, or to copy a context, scripts like this would be a lot more performant. But I think in your shoes I'd say "zstd should be able to at least match this internally" and keep the public API simple.

@Cyan4973
Contributor

Cyan4973 commented Jun 24, 2024

I'm currently revisiting this topic.
After investigation, the main reason for the substantial compression ratio differences reported in this issue, for both #2832 and #3014, is not block splitting per se, but an indirect side effect.

The ratio difference comes from subtle reinforcement effects in the optimal parser's cost evaluation function. Minuscule differences in initial conditions can tip the cost function in one direction or the other, and splitting blocks differently changes those initial conditions. That is why trying multiple size combinations increases the odds of tipping the cost function in the better direction (at a CPU cost, obviously).

The situation was improved in v1.5.6, and the compression ratio is now more stable and closer to the best scenario. Using the samples mentioned above, uncompressed.bin now compresses to 131,410 bytes and test.bin to 831,424 bytes, which is in line with or better than the results reported here using the Python multi-split-compare script. These results were achieved without modifying the current block splitting function.
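For anyone who wants to re-check these numbers, a minimal sketch: it assumes the two samples from the gists above were downloaded locally, and one-shot frame sizes may differ slightly from the zstd CLI's output.

```python
import zstandard

# The results above require a libzstd with the v1.5.6 improvements.
assert zstandard.ZSTD_VERSION >= (1, 5, 6)

for name in ("uncompressed.bin", "test.bin"):
    with open(name, "rb") as f:
        data = f.read()
    size = len(zstandard.ZstdCompressor(level=19).compress(data))
    print(f"{name}: {size:,} bytes")
```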

This doesn't mean that the block splitting function is perfect, but improvements to it are generally expected to be in the ~1% range, save for unusually catastrophic scenarios.

@Cyan4973 Cyan4973 self-assigned this Jun 24, 2024
@Cyan4973
Contributor

Cyan4973 commented Oct 3, 2024

fixed in v1.5.6

@Cyan4973 Cyan4973 closed this as completed Oct 3, 2024