I'm not 100% sure that block splitter is working properly #3204
Comments
Thanks for the report! We know our block splitter is pretty trivial right now, and hope to find some time to work on it in the future. This case in particular is probably an interaction between block splitting and the optimal parser, which makes decisions based on its estimates of the cost of particular matches. Block splitting changes those costs, but we don't re-parse based on the newly split blocks.
How do you sub-divide in Python? I want to make sure that we can exactly reproduce both the good & bad cases, so we can investigate exactly what is going on to cause this huge regression.
Not very easily, unfortunately. See https://gist.github.com/thatch/c2e8aa3d34a7a548687f43813f9cb788

Usage: after reading 128 KB, it tries that as a single block, as well as 64+64, 32+32+32+32, etc. (uniformly sized pieces within a 128 KB block). In initial testing there appeared to be a single minimum (sizes looked concave up), hence the "break", which avoids the worst of the performance penalty. Note that the script will never choose something like 32+64+32, or 32+128+... (altering the stride), which are other potential wins if the determination is cheap enough (I think you might do the former, but not the latter).

Because I'm trying to keep accurate estimates of what the block would take to compress, including the use of backreferences, I have to prime it with a fair amount of data, which must be recompressed after every context reset. If there were an API to insert an already-compressed block/window worth of data, or to copy a context, this would be a lot more performant for scripts like this. But I think in your shoes I'd say "zstd should be able to at least match this internally" and keep the simple public API.
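A minimal sketch of the uniform-subdivision search described above. This is not the gist's code: zlib stands in for zstd so the sketch is dependency-free, and `best_uniform_split`, `compressed_size`, and `min_piece` are hypothetical names, not anything from the script.

```python
import zlib


def compressed_size(data: bytes, level: int = 9) -> int:
    # zlib is a stand-in codec; the actual script measures zstd output,
    # primed with preceding data so backreferences are accounted for.
    return len(zlib.compress(data, level))


def best_uniform_split(block: bytes, min_piece: int = 4096) -> tuple[int, int]:
    """Try the whole block, then 2, 4, 8, ... uniformly sized pieces,
    keeping the best total compressed size.  Returns (piece_count, size).
    Stops at the first split that is worse, per the observation above
    that the size curve appeared concave up."""
    best_n, best_size = 1, compressed_size(block)
    n = 2
    while len(block) // n >= min_piece:
        piece = len(block) // n
        total = sum(compressed_size(block[i:i + piece])
                    for i in range(0, len(block), piece))
        if total >= best_size:
            break  # concave-up assumption: finer splits won't help either
        best_n, best_size = n, total
        n *= 2
    return best_n, best_size
```

Note that, as the comment says, this only ever halves the stride uniformly; it never considers mixed sizes like 32+64+32.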
I'm currently revisiting this topic. The ratio difference comes from subtle reinforcement effects in the optimal parser's cost evaluation function. Minuscule differences in initial conditions can throw the cost function in one direction or the other, and splitting blocks differently has the same effect. That is why trying multiple size combinations increases the odds of throwing the cost function in the better direction (at a CPU cost, obviously). The situation was improved in

It doesn't mean that the block splitting function is perfect, but improvements in this function are generally expected to be in the ~1% range, save unusually catastrophic scenarios.
fixed in
Describe the bug
This is not a correctness issue. It simply appears to leave a lot of compression ratio unattained when faced with real-world data.
Side notes:

From #2832 (uncompressed.bin, https://gist.github.com/thatch/853de8089e17462f2e500a7f87c76736):
- zstd -19: 179956 bytes
- trivial subdivision in Python: 129343 bytes

From #3014 (test.bin, https://gist.github.com/thatch/c2c2cc7304208423543edc38f2b7ed76):
- zstd -19: 2688424 bytes
- trivial subdivision in Python: 1021806 bytes
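The byte counts above compare one-shot compression against summing independently compressed uniform pieces. A hedged sketch of that measurement methodology, again with zlib as a stand-in (so absolute numbers will not match the zstd figures above, and `whole_vs_split` and its parameters are hypothetical names):

```python
import zlib


def whole_vs_split(data: bytes, piece: int = 32 * 1024, level: int = 9) -> tuple[int, int]:
    """Return (one-shot compressed size, sum of per-piece compressed sizes).
    Splitting costs each piece its own header and resets backreference
    context, so it only wins when the pieces are genuinely heterogeneous."""
    whole = len(zlib.compress(data, level))
    split = sum(len(zlib.compress(data[i:i + piece], level))
                for i in range(0, len(data), piece))
    return whole, split
```

In the report, the zstd -19 one-shot result lost to the trivial subdivision by wide margins (e.g. 179956 vs 129343 bytes on the #2832 sample), which is what makes the regression notable.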
Expected behavior
Ratio as good as (or a little better than) my trivial code.
For trivial random input with blocks of known entropy, it does appear to work. I have some evidence that this affects ELF files, which isn't super surprising because they likely have 4k-aligned sections.
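The kind of "trivial random input with blocks of known entropy" mentioned above can be generated along these lines (a hypothetical sketch, not the reporter's actual test data): alternating incompressible and maximally compressible blocks give a splitter obvious boundaries to find.

```python
import os


def make_test_input(n_blocks: int = 8, block: int = 128 * 1024) -> bytes:
    """Alternate ~8 bits/byte (random) and ~0 bits/byte (constant) blocks,
    so per-block entropy is known in advance."""
    out = bytearray()
    for i in range(n_blocks):
        if i % 2 == 0:
            out += os.urandom(block)   # incompressible block
        else:
            out += bytes([i]) * block  # highly compressible block
    return bytes(out)
```

A splitter that handles this case well but regresses on ELF-like data would be consistent with the 4k-aligned-sections observation above.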