User control over minimum match length #57

Closed

jkbonfield opened this issue Oct 25, 2019 · 9 comments
@jkbonfield

jkbonfield commented Oct 25, 2019

One of zlib's features is the strategy option, which permits Z_DEFAULT_STRATEGY vs Z_FILTERED. The latter raises the minimum match length, which favours Huffman coding over LZ matches for short repeats. On some data sets this can have a big impact.
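For reference, the strategy is selected via zlib's deflateInit2(); a minimal sketch (windowBits 15+16 requests gzip framing, error handling kept terse):

#include <string.h>
#include <zlib.h>

/* Minimal sketch: gzip-compress a buffer using Z_FILTERED instead of
 * Z_DEFAULT_STRATEGY.  8 is the default memLevel. */
static int compress_filtered(const unsigned char *in, size_t in_len,
                             unsigned char *out, size_t *out_len, int level)
{
    z_stream strm;

    memset(&strm, 0, sizeof(strm));
    if (deflateInit2(&strm, level, Z_DEFLATED, 15 + 16, 8,
                     Z_FILTERED) != Z_OK)
        return -1;
    strm.next_in = (Bytef *)in;
    strm.avail_in = (uInt)in_len;
    strm.next_out = out;
    strm.avail_out = (uInt)*out_len;
    if (deflate(&strm, Z_FINISH) != Z_STREAM_END) {
        deflateEnd(&strm);
        return -1;
    }
    *out_len = strm.total_out;
    return deflateEnd(&strm) == Z_OK ? 0 : -1;
}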

The same logic can work for libdeflate, e.g. with and without this tiny patch:

diff --git a/lib/deflate_compress.c b/lib/deflate_compress.c
index 9ecb8b6..5e7890a 100644
--- a/lib/deflate_compress.c
+++ b/lib/deflate_compress.c
@@ -2099,7 +2099,7 @@ deflate_compress_lazy(struct libdeflate_compressor * restrict c,
                                                               &cur_offset);
                        in_next += 1;
 
-                       if (cur_len < DEFLATE_MIN_MATCH_LEN) {
+                       if (cur_len < DEFLATE_MIN_MATCH_LEN+4) {
                                /* No match found.  Choose a literal.  */
                                deflate_choose_literal(c, *(in_next - 1), &litrunlen);
                                observe_literal(&c->split_stats, *(in_next - 1));
;# Original
@ gen1c[compress.../libdeflate]; ./gzip < /tmp/rn|wc -c
5544678
@ gen1c[compress.../libdeflate]; ./gzip -9 < /tmp/rn|wc -c
5352317

;# Patched
@ gen1c[compress.../libdeflate]; ./gzip < /tmp/rn|wc -c
5039812

;# Zlib
@ gen1c[compress.../libdeflate]; ~/work/compression/zlib_test Z_DEFAULT_STRATEGY < /tmp/rn|wc -c
5513639
@ gen1c[compress.../libdeflate]; ~/work/compression/zlib_test Z_FILTERED < /tmp/rn|wc -c
5118920
@ gen1c[compress.../libdeflate]; ~/work/compression/zlib_test Z_FILTERED 9 < /tmp/rn|wc -c
4973229

;# 7z mx9 gets it to 4646598

This data is a series of DNA sequencing read identifiers, looking like:

VP2-06:112:H7LNDMCVY:1:1124:21694:10473
VP2-06:112:H7LNDMCVY:1:1158:23665:6370
VP2-06:112:H7LNDMCVY:1:1219:23746:16250
VP2-06:112:H7LNDMCVY:1:1243:16414:36119
VP2-06:112:H7LNDMCVY:1:1251:6253:36119
VP2-06:112:H7LNDMCVY:1:1324:31412:16595
VP2-06:112:H7LNDMCVY:1:1431:22119:16125
VP2-06:112:H7LNDMCVY:1:2152:28881:21512
VP2-06:112:H7LNDMCVY:1:2207:21287:33567
VP2-06:112:H7LNDMCVY:1:2219:32651:24940
...

There are periodic duplicates because the data is paired, typically 50-150 lines apart.
There are lots of spurious matches in the rather random-looking numbers (X and Y locations on a slide), which we're not interested in matching, while the main prefix is critical. I think exposing the minimum match length (albeit needed in multiple places, and I haven't a clue about the btree version) is a simple tweak to get more out of the library.
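For illustration, user control might look like the following. Note that libdeflate_compressor_set_min_match_len() is invented here purely to show the shape of the proposal; it is not a real libdeflate function (the alloc/compress/free calls are the real API):

#include <libdeflate.h>

/* Hypothetical: the setter below does not exist in libdeflate; it just
 * sketches the proposed option.  in/in_len/out/out_avail are assumed
 * to be caller-provided buffers. */
struct libdeflate_compressor *c = libdeflate_alloc_compressor(6);
libdeflate_compressor_set_min_match_len(c, 7);   /* invented setter */
size_t out_len = libdeflate_gzip_compress(c, in, in_len, out, out_avail);
libdeflate_free_compressor(c);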

Does this sound like something you'd consider adding?

Edit: with optimal parsing disabled and gzip -9, it gets 4977440 using a min match of 7. Pretty close to zlib (I think that has a min match of 5 though).

Edit: Fixed formatting.

@adamkewley

Just so I understand: this optimization is good for DNA files because a larger minimum match length causes individual literals to be encoded with a more efficient Huffman code (because the resulting tree will only have LZ matches > ~8 in length in the lcode tree), and DNA reads are mostly random-looking data?

@chrchang

This issue concerns Illumina sequencing read identifiers, not the actual DNA data in the reads.

@adamkewley

Sorry, I was confused by the formatting a little. So in this particular case you see an improvement when setting the min match length because it will LZ the head of the ID (which repeats heavily) and then Huffman-code the tail. Interesting.

@ebiggers
Owner

Do you have any sample files available that I could test with? I'd like to investigate whether this case could be handled better automatically using some heuristics. Adding more options increases API complexity, so it's generally something I'd like to avoid.

@jkbonfield
Author

Many thanks for the quick reply. A heuristic, if possible, would be better still. I was just thinking about what would be simple (and offering to do it if it's something you'd want). I don't understand the code well enough to work out where to put heuristics, though.

Take a look at the data in https://github.com/jkbonfield/htscodecs-corpus/tree/master/names

A quick run through all the *.names files with and without a +4 hack as above:

Original

$ for i in tests/htscodecs-corpus/names/*.names;do ~/ftp/compression/libdeflate/gzip < $i 2>/dev/null|wc -c;done
72201
79313
101113
55364
100884
53751
75087
31300
59432
19211
177192

With change:

$ for i in tests/htscodecs-corpus/names/*.names;do ~/ftp/compression/libdeflate/gzip < $i 2>/dev/null|wc -c;done
66852
72401
92062
50348
93563
54053
68551
32222
54043
17307
161436

It generally helps a lot, but not universally.
On this data I've wondered whether a custom algorithm that produces deflate format using purely line-oriented prefix matching would give the best speed/size tradeoff; a rough sketch of the idea follows. (However, for the slower modes of my work I have a dedicated non-deflate name compressor anyway, so the impetus to be clever isn't huge.)
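To make the idea concrete (a sketch, not libdeflate code): a deflate-producing version would emit the shared prefix as a single LZ match back to the previous line and the remainder as literals, leaving the random x/y fields to Huffman coding.

#include <stddef.h>
#include <stdio.h>

/* Length of the common prefix of two NUL-terminated lines. */
static size_t common_prefix(const char *a, const char *b)
{
    size_t n = 0;
    while (a[n] != '\0' && a[n] == b[n])
        n++;
    return n;
}

int main(void)
{
    const char *prev = "VP2-06:112:H7LNDMCVY:1:1124:21694:10473";
    const char *cur  = "VP2-06:112:H7LNDMCVY:1:1158:23665:6370";
    size_t p = common_prefix(prev, cur);

    /* Encode cur as (match of length p against prev) + literal tail. */
    printf("prefix=%zu tail=%s\n", p, cur + p);
    return 0;
}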

@jkbonfield
Author

jkbonfield commented Oct 25, 2019

> Sorry, I was confused by the formatting a little. So in this particular case you see an improvement when setting the min match length because it will LZ the head of the ID (which repeats heavily) and then Huffman-code the tail. Interesting.

I haven't analysed quite what it's doing, but my thinking is that these strings are basically a common prefix (instrument name) followed by :lane:tile:x:y. There is some dictionary component to the tile, but the x and y are largely random. There may still be lots of matching substrings in the x/y fields simply because, with only 10 digit characters, short runs have a good chance of matching at random (1 in 100 if we include a colon + 2 digits).

It's quite easy for parsers to generate an LZ match for "y newline prefix:lane" instead of just "prefix:lane", possibly at the expense of a longer match distance that isn't really justified when the extra bytes only match by chance. (I assume that's the sort of trade-off the optimal parser is trying to judge.) Changing the min match length doesn't stop that, so clearly that's not what my change achieved. I did try hacking it to require matches to start at a newline, but that turned out to be tragic, so I canned that idea quickly.

@jkbonfield
Author

jkbonfield commented Oct 28, 2019

Another test, this time with DNA sequence data, also shows zlib's Z_DEFAULT_STRATEGY vs Z_FILTERED to be a win. I'll put the data file at ftp://ftp.sanger.ac.uk/pub/users/jkb/seq

Level  Z_DEFAULT  Z_FILTERED  Default+Opt  Default-Opt  MIN_10 -Opt  7za
1      11185034   11185034    8807052      8807052      6767390
2      10172931   10172931    7672585      7672585      6329695
3      9300280    9300280     7453628      7453628      6208820
4      7606177    6803282     7310028      7310028      6102399
5      6904720    6537932     6765513      6765513      6027264
6      6658313    6447436     6600476      6600476      5897747
7      6592924    6440727     6494304      6494304      5831461
8      6448238    6325678     6514868      6434674      5790594
9      6427283    6303307     6217224      6385173      5751458      5722540
10     -          -           6086605      -            -
11     -          -           5977583      -            -
12     -          -           5909323      -            -
That shows zlib with Z_DEFAULT_STRATEGY vs Z_FILTERED; libdeflate as cloned (optimal parsing enabled for levels 8-12); libdeflate with SUPPORT_NEAR_OPTIMAL_PARSING disabled (levels 8 and 9 only); libdeflate with SUPPORT_NEAR_OPTIMAL_PARSING disabled and +7 on the existing min match len of 3; and finally max compression from 7zip in gzip mode.

You can see that by default libdeflate beats zlib's defaults. Zlib gains a bit using Z_FILTERED, but it's still beaten once we use the higher levels of libdeflate. However, libdeflate with a longer minimum match length trounces everything bar 7za, and it gets pretty close to matching that too.

For what it's worth, I see the same benefit in zlib itself. Z_FILTERED changes the min match from 3 to 5. Bumping it to 10 manually gives 5935354 at level 6 and 5773148 at level 9, so again there's potential for manual tweaking there too.
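For reference, that behaviour lives in zlib's deflate.c (in the deflate_slow() path); paraphrased from memory, so worth checking against the actual source, but the check is roughly:

/* Paraphrase of zlib's Z_FILTERED handling: matches of length <= 5 are
 * demoted to literals by forcing the length below MIN_MATCH (3).
 * Changing the 5 to 10 reproduces the manual tweak described above. */
if (s->match_length <= 5 && s->strategy == Z_FILTERED) {
    s->match_length = MIN_MATCH - 1;  /* discard the match */
}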

PS. I don't know how to limit the minimum match length for the optimal parser, so no numbers are presented for it.

ebiggers added commits that referenced this issue Dec 31, 2021

In the greedy and lazy compressors, automatically increase the minimum
match length from the default of 3 if the data doesn't contain many
different literals.  This greatly improves the compression ratio of
levels 1-9 on certain types of data, such as DNA sequencing data, while
not worsening the ratio on other types of data.

The near-optimal compressor (used by compression levels 10-12) continues
to use a minimum match length of 3, since it already did a better job at
deciding when short matches are worthwhile.  (The method for setting the
initial costs needs improvement; later commits address that.)

Resolves #57
@ebiggers
Owner

ebiggers commented Jan 1, 2022

I implemented some heuristics which handle setting the minimum match length automatically. Commit 69a7ca0 is the main one.
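The gist of the heuristic, as a minimal sketch (illustrative only; the sampling and thresholds below are invented and differ from the actual commit): count how many distinct byte values a sample of the input contains, and raise the minimum match length when the alphabet is small, since literals then Huffman-code cheaply.

#include <stddef.h>
#include <stdbool.h>

/* Illustrative sketch, not the code from commit 69a7ca0: pick a minimum
 * match length from the number of distinct byte values in a sample of
 * the input.  A small alphabet (e.g. DNA bases or read-name characters)
 * means literals are cheap, so short matches rarely pay for themselves. */
static unsigned
choose_min_match_len(const unsigned char *data, size_t len)
{
    bool seen[256] = { false };
    unsigned distinct = 0;
    size_t i, sample = len < 4096 ? len : 4096;

    for (i = 0; i < sample; i++) {
        if (!seen[data[i]]) {
            seen[data[i]] = true;
            distinct++;
        }
    }
    if (distinct <= 16)
        return 9;   /* tiny alphabet: only longer matches pay off */
    if (distinct <= 64)
        return 5;
    return 3;       /* DEFLATE's usual minimum */
}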

Here are the results for compressing all the .names files from https://github.com/jkbonfield/htscodecs-corpus/tree/master/names (as suggested by #57 (comment)), concatenated together (original size 33986678 bytes). Table entries show compressed size in bytes / compression time in milliseconds; "new" is commit 3dca7de and "old" is commit 047aa84:

Level  libdeflate (new)  libdeflate (old)  zlib
1      5212311 / 124     5384620 / 114     6007496 / 180
2      4965955 / 152     5283019 / 125     5856632 / 200
3      4934629 / 168     5229218 / 140     5598722 / 293
4      4850526 / 185     5142765 / 164     5223622 / 317
5      4608971 / 221     4902344 / 206     5011470 / 463
6      4505713 / 280     4804306 / 271     4927452 / 765
7      4453372 / 417     4758233 / 395     4749052 / 1049
8      4326029 / 760     4975789 / 1225    4658839 / 2557
9      4305630 / 1222    4643199 / 1759    4658374 / 3009
10     4044222 / 3471    4314207 / 2749    N/A
11     4023112 / 4762    4100868 / 4899    N/A
12     4019345 / 5893    4048534 / 6219    N/A

Here are the results for the seq file mentioned in #57 (comment). (Original size 109258164 bytes; it contains 301-character DNA sequences.)

Level  libdeflate (new)  libdeflate (old)  zlib
1      8762885 / 284     8807034 / 281     11185034 / 464
2      7389469 / 317     7672567 / 291     10172931 / 526
3      6519681 / 380     7453610 / 314     9300280 / 811
4      6107118 / 502     7310010 / 382     7606177 / 816
5      6042739 / 511     6765495 / 407     6904720 / 1318
6      5916795 / 906     6600458 / 569     6658313 / 2961
7      5869025 / 2268    6494286 / 1137    6592924 / 5258
8      5769527 / 4326    6514850 / 4014    6448238 / 13671
9      5764232 / 5057    6217206 / 4907    6427283 / 14969
10     5727261 / 7158    6086587 / 6346    N/A
11     5669294 / 11322   5977565 / 8700    N/A
12     5601831 / 21063   5909305 / 12232   N/A

Results for TAIR10_chr1.fas (Arabidopsis thaliana chromosome 1; original size 30812908 bytes):

Level  libdeflate (new)  libdeflate (old)  zlib
1      9622710 / 163     9622848 / 167     10434288 / 370
2      9445449 / 238     9505670 / 210     10055812 / 498
3      8621168 / 390     9370333 / 264     9650765 / 1016
4      8125898 / 689     9230411 / 376     9649402 / 733
5      8122981 / 697     9212240 / 431     9396040 / 1863
6      8080866 / 1476    9047177 / 679     9081950 / 5066
7      8120894 / 3832    8860922 / 1546    8931774 / 8950
8      8137945 / 6482    8642301 / 3975    8730647 / 25025
9      8135049 / 7257    8358494 / 5054    8723843 / 26177
10     7951675 / 4353    8343450 / 5255    N/A
11     7951495 / 4999    8304799 / 6008    N/A
12     7950784 / 5585    8296509 / 6865    N/A

Results for a FASTQ file containing Illumina sequencing reads (original size 146798341 bytes):

Level  libdeflate (new)  libdeflate (old)  zlib
1      66241681 / 1008   68846711 / 1008   71346659 / 1993
2      65232261 / 1168   67777691 / 1142   70120245 / 2379
3      64525976 / 1458   66962260 / 1426   68239283 / 3870
4      64210758 / 1612   66104860 / 1879   66418997 / 3361
5      63297095 / 1847   65360530 / 2071   65578003 / 6554
6      62579781 / 2594   64673920 / 2854   64258311 / 14491
7      61960388 / 4458   64029845 / 4601   63481420 / 21404
8      61733834 / 7801   62099740 / 12120  63056558 / 37828
9      61683688 / 9269   60705152 / 15939  62951205 / 45619
10     59923123 / 16291  60470044 / 16975  N/A
11     59900738 / 19499  60032207 / 20433  N/A
12     59889501 / 23012  59956692 / 23556  N/A

Note: I improved the algorithms in ways other than the min_match_len heuristic, but that is the main one that matters here. In cases where min_match_len improved the compression ratio a lot at levels 1-7, a performance regression is expected, due to the increased number of literals used. Also, the regression in compression ratio at level 9 on the last file is expected, since the algorithm used at levels 8-9 changed to a faster one that avoids very bad results with certain types of files.

@jkbonfield
Author

Many thanks. That looks like great work. :-)
