Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible data loss at stream end in -w4 -l2 #16

Closed
silentbicycle opened this issue May 24, 2015 · 6 comments
Closed

Possible data loss at stream end in -w4 -l2 #16

silentbicycle opened this issue May 24, 2015 · 6 comments
Labels
Milestone

Comments

@silentbicycle
Copy link
Collaborator

I got an excellent bug report from @unixdj with detailed, minimal steps to reproduce data loss at the end of the stream:

$ echo -n aaaa | ./heatshrink -e -w4 -l2 | ./heatshrink -d -w4 -l2
a   # should be "aaaa"

A regression test has been added.

This is not currently known to occur under any other combinations of settings except -w4 -l2, the absolute minimum.

A change that fixes this without breaking reverse compatibility would be strongly preferred.

@silentbicycle
Copy link
Collaborator Author

The encoder is correct, the problem is in the decoder. The tag bit (1) plus the window size (4) plus the lookahead size (2) is only 7 bits, and the decoder can end up in a state where it has 7 bits of unprocessed data, but the end of input stream takes precedence and it prematurely indicates that it has completed.

This will not happen under any other configuration because everything else will use at least 8 bits, causing things to be buffered correctly.

@silentbicycle
Copy link
Collaborator Author

I'm considering closing this by increasing the minimum lookahead size to 3, rather than adding special cases. Thoughts, @unixdj?

@silentbicycle
Copy link
Collaborator Author

Resolved by 15ebadd . The minimum lookahead needs to be increased to 3, because otherwise a state can occur at the end of the input whose interpretation is ambiguous, leading to either a few bytes being discarded or the last byte being repeated incorrectly.

@unixdj
Copy link
Contributor

unixdj commented May 24, 2015

Increasing minimum lookahead size fixes the issue, of course.

The problem with the encoder is that it encodes "a" as [0xb0 0x80], which, if the decoder is fixed to handle the case above, may be interpreted by the decoder as:

1        tag bit: literal follows
01100001 'a'
0        tag bit: backref
0000     backref index 1
00       backref count 1

and decoded as "aa". The encoder, however, would never produce such stream (though it would make sense to). So in order to handle this correctly, there are two solutions:

(a) Make the decoder disregard backreferences shorter than the encoder would produce.
(b) Make the encoder fill the empty bits in the last byte with ones instead of zeroes, so that the decoder would finish in the HSDS_YIELD_LITERAL state.

I would say that option (b) seems most conceptuallly sound.

@silentbicycle
Copy link
Collaborator Author

Indeed. I discovered that "aa" issue while investigating the root cause of the original issue, as well.

(b) seems like a good choice, and IIRC you've already done some investigation along those lines. If that covers all the cases, then the minimum length won't need to be increased after all.

Option (a) reminded me that there may be an opportunity to improve the compression ratio slightly (breaking reverse-compatibility), added as issue #17.

@silentbicycle silentbicycle reopened this May 24, 2015
@unixdj
Copy link
Contributor

unixdj commented May 24, 2015

Yes, I've tried (b), and I reckon it covers all the corner cases. Here's the code:

master...unixdj:eof42

@silentbicycle silentbicycle added this to the v0.4.0 milestone Jun 7, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants