Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reducing break even point may lower compression ratio #32

Open
unixdj opened this issue Jun 8, 2015 · 2 comments
Open

Reducing break even point may lower compression ratio #32

unixdj opened this issue Jun 8, 2015 · 2 comments

Comments

@unixdj
Copy link
Contributor

unixdj commented Jun 8, 2015

Reducing the break even point, as done in f533992, should theoretically improve compression ratio. However, in some cases emitting a literal or two followed by a backref would make the encoding overall shorter than two shorter backrefs that the (naturally greedy) encoder finds. E.g., when compressing kennedy.xls from the Canterbury Corpus with -w 8 -l 7 with different break even points, one can observe many such occurences (- is break even point 3, + is 2):

 ## step_search, scan @ +15 (271/512), input size 256
 ss Match not found
 ## step_search, scan @ +16 (272/512), input size 256
-ss Match not found
-## step_search, scan @ +17 (273/512), input size 256
-ss Found match of 6 bytes at 8
+ss Found match of 2 bytes at 1
+## step_search, scan @ +18 (274/512), input size 256
+ss Found match of 5 bytes at 8
 ## step_search, scan @ +23 (279/512), input size 256
 ss Match not found
 ## step_search, scan @ +24 (280/512), input size 256

It would be nice to try to find a way to optimize this within the contraints of an embedded environment. If possible, without backtracking.

@unixdj
Copy link
Contributor Author

unixdj commented Jun 8, 2015

A quick bechmark shows that the following moification of step_search() seems to improve compression ratio:

match = find_match(here);
if (match.found && find_match(here + 1).len <= match.len) {
    emit_backref();
else
    emit_literal();

(here + 1 seems to work better than here + break_even_point / 9 for some reason.)

Time to code it without calling find_match() twice as much. It's also completely meaningless when backref encoding is not longer than literal encoding.

@unixdj
Copy link
Contributor Author

unixdj commented Jun 24, 2015

I haven't had much time to play with it lately, but here's an implementation of the above:
https://github.com/atomicobject/heatshrink/compare/master...unixdj:look-forward?expand=1
It's still really slow. Really really slow. But compression ratio is higher.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant