-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Possible data loss at stream end in -w4 -l2 #16
Comments
The encoder is correct, the problem is in the decoder. The tag bit (1) plus the window size (4) plus the lookahead size (2) is only 7 bits, and the decoder can end up in a state where it has 7 bits of unprocessed data, but the end of input stream takes precedence and it prematurely indicates that it has completed. This will not happen under any other configuration because everything else will use at least 8 bits, causing things to be buffered correctly. |
I'm considering closing this by increasing the minimum lookahead size to 3, rather than adding special cases. Thoughts, @unixdj? |
Resolved by 15ebadd . The minimum lookahead needs to be increased to 3, because otherwise a state can occur at the end of the input whose interpretation is ambiguous, leading to either a few bytes being discarded or the last byte being repeated incorrectly. |
Increasing minimum lookahead size fixes the issue, of course. The problem with the encoder is that it encodes "a" as [0xb0 0x80], which, if the decoder is fixed to handle the case above, may be interpreted by the decoder as:
and decoded as "aa". The encoder, however, would never produce such stream (though it would make sense to). So in order to handle this correctly, there are two solutions: (a) Make the decoder disregard backreferences shorter than the encoder would produce. I would say that option (b) seems most conceptuallly sound. |
Indeed. I discovered that "aa" issue while investigating the root cause of the original issue, as well. (b) seems like a good choice, and IIRC you've already done some investigation along those lines. If that covers all the cases, then the minimum length won't need to be increased after all. Option (a) reminded me that there may be an opportunity to improve the compression ratio slightly (breaking reverse-compatibility), added as issue #17. |
Yes, I've tried (b), and I reckon it covers all the corner cases. Here's the code: |
I got an excellent bug report from @unixdj with detailed, minimal steps to reproduce data loss at the end of the stream:
A regression test has been added.
This is not currently known to occur under any other combinations of settings except -w4 -l2, the absolute minimum.
A change that fixes this without breaking reverse compatibility would be strongly preferred.
The text was updated successfully, but these errors were encountered: