User control over minimum match length #57

Closed

jkbonfield opened this issue Oct 25, 2019 · 9 comments
@jkbonfield

jkbonfield commented Oct 25, 2019

One of zlib's features is the strategy option, which permits Z_DEFAULT_STRATEGY vs Z_FILTERED. The latter raises the minimum match length, which favours Huffman coding over LZ matches for short repeats. On some data sets this can have a big impact.
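For reference, the strategy is selected via zlib's deflateInit2(); a minimal sketch (windowBits 15+16 requests gzip framing, error handling kept terse):

#include <string.h>
#include <zlib.h>

/* Minimal sketch: gzip-compress a buffer using Z_FILTERED instead of
 * Z_DEFAULT_STRATEGY.  8 is the default memLevel. */
static int compress_filtered(const unsigned char *in, size_t in_len,
                             unsigned char *out, size_t *out_len, int level)
{
    z_stream strm;

    memset(&strm, 0, sizeof(strm));
    if (deflateInit2(&strm, level, Z_DEFLATED, 15 + 16, 8,
                     Z_FILTERED) != Z_OK)
        return -1;
    strm.next_in = (Bytef *)in;
    strm.avail_in = (uInt)in_len;
    strm.next_out = out;
    strm.avail_out = (uInt)*out_len;
    if (deflate(&strm, Z_FINISH) != Z_STREAM_END) {
        deflateEnd(&strm);
        return -1;
    }
    *out_len = strm.total_out;
    return deflateEnd(&strm) == Z_OK ? 0 : -1;
}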

The same logic can work for libdeflate, e.g. with and without this tiny patch:

diff --git a/lib/deflate_compress.c b/lib/deflate_compress.c
index 9ecb8b6..5e7890a 100644
--- a/lib/deflate_compress.c
+++ b/lib/deflate_compress.c
@@ -2099,7 +2099,7 @@ deflate_compress_lazy(struct libdeflate_compressor * restrict c,
                                                               &cur_offset);
                        in_next += 1;
 
-                       if (cur_len < DEFLATE_MIN_MATCH_LEN) {
+                       if (cur_len < DEFLATE_MIN_MATCH_LEN+4) {
                                /* No match found.  Choose a literal.  */
                                deflate_choose_literal(c, *(in_next - 1), &litrunlen);
                                observe_literal(&c->split_stats, *(in_next - 1));
;# Original
@ gen1c[compress.../libdeflate]; ./gzip < /tmp/rn|wc -c
5544678
@ gen1c[compress.../libdeflate]; ./gzip -9 < /tmp/rn|wc -c
5352317

;# Patched
@ gen1c[compress.../libdeflate]; ./gzip < /tmp/rn|wc -c
5039812

;# Zlib
@ gen1c[compress.../libdeflate]; ~/work/compression/zlib_test Z_DEFAULT_STRATEGY < /tmp/rn|wc -c
5513639
@ gen1c[compress.../libdeflate]; ~/work/compression/zlib_test Z_FILTERED < /tmp/rn|wc -c
5118920
@ gen1c[compress.../libdeflate]; ~/work/compression/zlib_test Z_FILTERED 9 < /tmp/rn|wc -c
4973229

;# 7z mx9 gets it to 4646598

This data is a series of DNA sequencing read identifiers, looking like:

VP2-06:112:H7LNDMCVY:1:1124:21694:10473
VP2-06:112:H7LNDMCVY:1:1158:23665:6370
VP2-06:112:H7LNDMCVY:1:1219:23746:16250
VP2-06:112:H7LNDMCVY:1:1243:16414:36119
VP2-06:112:H7LNDMCVY:1:1251:6253:36119
VP2-06:112:H7LNDMCVY:1:1324:31412:16595
VP2-06:112:H7LNDMCVY:1:1431:22119:16125
VP2-06:112:H7LNDMCVY:1:2152:28881:21512
VP2-06:112:H7LNDMCVY:1:2207:21287:33567
VP2-06:112:H7LNDMCVY:1:2219:32651:24940
...

There are periodic duplicates because the data is paired, typically 50-150 lines apart.
There are lots of spurious matches in the rather random-looking numbers (X and Y locations on a slide), which we're not interested in matching, while the main prefix is critical. I think exposing the minimum match length (albeit needed in multiple places, and I haven't a clue about the btree version) is a simple tweak to get more out of the library.
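For illustration, user control might look like the following. Note that libdeflate_compressor_set_min_match_len() is invented here purely to show the shape of the proposal; it is not a real libdeflate function (the alloc/compress/free calls are the real API):

#include <libdeflate.h>

/* Hypothetical: the setter below does not exist in libdeflate; it just
 * sketches the proposed option.  in/in_len/out/out_avail are assumed
 * to be caller-provided buffers. */
struct libdeflate_compressor *c = libdeflate_alloc_compressor(6);
libdeflate_compressor_set_min_match_len(c, 7);   /* invented setter */
size_t out_len = libdeflate_gzip_compress(c, in, in_len, out, out_avail);
libdeflate_free_compressor(c);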

Does this sound like something you'd consider adding?

Edit: with optimal parsing disabled and gzip -9, it gets 4977440 using a min match of 7. Pretty close to zlib (I think that has a min match of 5 though).

Edit: Fixed formatting.

@adamkewley

Just so I understand: this optimization is good for DNA files because a larger minimum match length causes individual literals to be encoded with a more efficient Huffman code (because the resulting tree will only have LZ matches > ~8 in length in the lcode tree), and DNA reads are mostly random-looking data?

@chrchang

This issue concerns Illumina sequencing read identifiers, not the actual DNA data in the reads.

@adamkewley

Sorry, I was confused by the formatting a little. So in this particular case you see an improvement when setting the min match length because it will LZ the head of the ID (which repeats heavily) and then Huffman-code the tail. Interesting.

@ebiggers
Owner

Do you have any sample files available that I could test with? I'd like to investigate whether this case could be handled better automatically using some heuristics. Adding more options increases API complexity, so it's generally something I'd like to avoid.

@jkbonfield
Author

Many thanks for the quick reply. A heuristic, if possible, would be better still. I was just thinking about what would be simple (and offering to do it if it's something you'd want). I don't understand the code well enough to work out where to put heuristics, though.

Take a look at the data in https://github.com/jkbonfield/htscodecs-corpus/tree/master/names

A quick run through all the *.names files with and without a +4 hack as above:

Original

$ for i in tests/htscodecs-corpus/names/*.names;do ~/ftp/compression/libdeflate/gzip < $i 2>/dev/null|wc -c;done
72201
79313
101113
55364
100884
53751
75087
31300
59432
19211
177192

With change:

$ for i in tests/htscodecs-corpus/names/*.names;do ~/ftp/compression/libdeflate/gzip < $i 2>/dev/null|wc -c;done
66852
72401
92062
50348
93563
54053
68551
32222
54043
17307
161436

It generally helps a lot, but not universally.
On this data I've wondered whether a custom algorithm that produces deflate format using purely line-oriented prefix matching would give the best speed/size tradeoff; a rough sketch of the idea follows. (However, for the slower modes of my work I have a dedicated non-deflate name compressor anyway, so the impetus to be clever isn't huge.)
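To make the idea concrete (a sketch, not libdeflate code): a deflate-producing version would emit the shared prefix as a single LZ match back to the previous line and the remainder as literals, leaving the random x/y fields to Huffman coding.

#include <stddef.h>
#include <stdio.h>

/* Length of the common prefix of two NUL-terminated lines. */
static size_t common_prefix(const char *a, const char *b)
{
    size_t n = 0;
    while (a[n] != '\0' && a[n] == b[n])
        n++;
    return n;
}

int main(void)
{
    const char *prev = "VP2-06:112:H7LNDMCVY:1:1124:21694:10473";
    const char *cur  = "VP2-06:112:H7LNDMCVY:1:1158:23665:6370";
    size_t p = common_prefix(prev, cur);

    /* Encode cur as (match of length p against prev) + literal tail. */
    printf("prefix=%zu tail=%s\n", p, cur + p);
    return 0;
}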

@jkbonfield
Author

jkbonfield commented Oct 25, 2019

> Sorry, I was confused by the formatting a little. So in this particular case you see an improvement when setting the min match length because it will LZ the head of the ID (which repeats heavily) and then Huffman-code the tail. Interesting.

I haven't analysed quite what it's doing, but my thinking is that these strings are basically a common prefix (instrument name) followed by :lane:tile:x:y. There is some dictionary component to the tile, but the x and y are largely random. There may still be lots of matching substrings in the x/y fields simply because, with only 10 digit characters, short runs have a good chance of matching at random (1 in 100 if we include a colon + 2 digits).

It's quite easy for parsers to generate an LZ match for "y newline prefix:lane" instead of just "prefix:lane", possibly at the expense of a longer match distance that isn't really justified when the extra bytes only match by chance. (I assume that's the sort of trade-off the optimal parser is trying to judge.) Changing the min match length doesn't stop that, so clearly that's not what my change achieved. I did try hacking it to require matches to start at a newline, but that turned out to be tragic, so I canned that idea quickly.

@jkbonfield
Author

jkbonfield commented Oct 28, 2019

Another test, this time with DNA sequence data, also shows zlib's Z_DEFAULT_STRATEGY vs Z_FILTERED to be a win. I'll put the data file at ftp://ftp.sanger.ac.uk/pub/users/jkb/seq

Level  Z_DEFAULT  Z_FILTERED  Default+Opt  Default-Opt  MIN_10 -Opt  7za
1      11185034   11185034    8807052      8807052      6767390
2      10172931   10172931    7672585      7672585      6329695
3      9300280    9300280     7453628      7453628      6208820
4      7606177    6803282     7310028      7310028      6102399
5      6904720    6537932     6765513      6765513      6027264
6      6658313    6447436     6600476      6600476      5897747
7      6592924    6440727     6494304      6494304      5831461
8      6448238    6325678     6514868      6434674      5790594
9      6427283    6303307     6217224      6385173      5751458      5722540
10     -          -           6086605      -            -
11     -          -           5977583      -            -
12     -          -           5909323      -            -
That shows zlib with Z_DEFAULT_STRATEGY vs Z_FILTERED; libdeflate as cloned (optimal parsing enabled for levels 8-12); libdeflate with SUPPORT_NEAR_OPTIMAL_PARSING disabled (levels 8 and 9 only); libdeflate with SUPPORT_NEAR_OPTIMAL_PARSING disabled and +7 on the existing min match len of 3; and finally max compression from 7zip in gzip mode.

You can see that by default libdeflate beats zlib's defaults. Zlib gains a bit using Z_FILTERED, but it's still beaten once we use the higher levels of libdeflate. However, libdeflate with a longer minimum match length trounces everything bar 7za, and it gets pretty close to matching that too.

For what it's worth, I see the same benefit in zlib itself. Z_FILTERED changes the min match from 3 to 5. Bumping it to 10 manually gives 5935354 at level 6 and 5773148 at level 9, so again there's potential for manual tweaking there too.
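For reference, that behaviour lives in zlib's deflate.c (in the deflate_slow() path); paraphrased from memory, so worth checking against the actual source, but the check is roughly:

/* Paraphrase of zlib's Z_FILTERED handling: matches of length <= 5 are
 * demoted to literals by forcing the length below MIN_MATCH (3).
 * Changing the 5 to 10 reproduces the manual tweak described above. */
if (s->match_length <= 5 && s->strategy == Z_FILTERED) {
    s->match_length = MIN_MATCH - 1;  /* discard the match */
}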

PS. I don't know how to limit the minimum match length for the optimal parser, so no numbers are presented for it.

ebiggers added commits that referenced this issue Dec 31, 2021

In the greedy and lazy compressors, automatically increase the minimum
match length from the default of 3 if the data doesn't contain many
different literals.  This greatly improves the compression ratio of
levels 1-9 on certain types of data, such as DNA sequencing data, while
not worsening the ratio on other types of data.

The near-optimal compressor (used by compression levels 10-12) continues
to use a minimum match length of 3, since it already did a better job at
deciding when short matches are worthwhile.  (The method for setting the
initial costs needs improvement; later commits address that.)

Resolves #57
@ebiggers
Owner

ebiggers commented Jan 1, 2022

I implemented some heuristics which handle setting the minimum match length automatically. Commit 69a7ca0 is the main one.
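The gist of the heuristic, as a minimal sketch (illustrative only; the sampling and thresholds below are invented and differ from the actual commit): count how many distinct byte values a sample of the input contains, and raise the minimum match length when the alphabet is small, since literals then Huffman-code cheaply.

#include <stddef.h>
#include <stdbool.h>

/* Illustrative sketch, not the code from commit 69a7ca0: pick a minimum
 * match length from the number of distinct byte values in a sample of
 * the input.  A small alphabet (e.g. DNA bases or read-name characters)
 * means literals are cheap, so short matches rarely pay for themselves. */
static unsigned
choose_min_match_len(const unsigned char *data, size_t len)
{
    bool seen[256] = { false };
    unsigned distinct = 0;
    size_t i, sample = len < 4096 ? len : 4096;

    for (i = 0; i < sample; i++) {
        if (!seen[data[i]]) {
            seen[data[i]] = true;
            distinct++;
        }
    }
    if (distinct <= 16)
        return 9;   /* tiny alphabet: only longer matches pay off */
    if (distinct <= 64)
        return 5;
    return 3;       /* DEFLATE's usual minimum */
}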

Here are the results for compressing all the .names files from https://github.com/jkbonfield/htscodecs-corpus/tree/master/names (as suggested by #57 (comment)), concatenated together (original size 33986678 bytes). Table entries show compressed size in bytes / compression time in milliseconds; "new" is commit 3dca7de and "old" is commit 047aa84:

Level  libdeflate (new)  libdeflate (old)  zlib
1      5212311 / 124     5384620 / 114     6007496 / 180
2      4965955 / 152     5283019 / 125     5856632 / 200
3      4934629 / 168     5229218 / 140     5598722 / 293
4      4850526 / 185     5142765 / 164     5223622 / 317
5      4608971 / 221     4902344 / 206     5011470 / 463
6      4505713 / 280     4804306 / 271     4927452 / 765
7      4453372 / 417     4758233 / 395     4749052 / 1049
8      4326029 / 760     4975789 / 1225    4658839 / 2557
9      4305630 / 1222    4643199 / 1759    4658374 / 3009
10     4044222 / 3471    4314207 / 2749    N/A
11     4023112 / 4762    4100868 / 4899    N/A
12     4019345 / 5893    4048534 / 6219    N/A

Here are the results for the seq file mentioned in #57 (comment). (Original size 109258164 bytes; it contains 301-character DNA sequences.)

Level  libdeflate (new)  libdeflate (old)  zlib
1      8762885 / 284     8807034 / 281     11185034 / 464
2      7389469 / 317     7672567 / 291     10172931 / 526
3      6519681 / 380     7453610 / 314     9300280 / 811
4      6107118 / 502     7310010 / 382     7606177 / 816
5      6042739 / 511     6765495 / 407     6904720 / 1318
6      5916795 / 906     6600458 / 569     6658313 / 2961
7      5869025 / 2268    6494286 / 1137    6592924 / 5258
8      5769527 / 4326    6514850 / 4014    6448238 / 13671
9      5764232 / 5057    6217206 / 4907    6427283 / 14969
10     5727261 / 7158    6086587 / 6346    N/A
11     5669294 / 11322   5977565 / 8700    N/A
12     5601831 / 21063   5909305 / 12232   N/A

Results for TAIR10_chr1.fas (Arabidopsis thaliana chromosome 1; original size 30812908 bytes):

Level  libdeflate (new)  libdeflate (old)  zlib
1      9622710 / 163     9622848 / 167     10434288 / 370
2      9445449 / 238     9505670 / 210     10055812 / 498
3      8621168 / 390     9370333 / 264     9650765 / 1016
4      8125898 / 689     9230411 / 376     9649402 / 733
5      8122981 / 697     9212240 / 431     9396040 / 1863
6      8080866 / 1476    9047177 / 679     9081950 / 5066
7      8120894 / 3832    8860922 / 1546    8931774 / 8950
8      8137945 / 6482    8642301 / 3975    8730647 / 25025
9      8135049 / 7257    8358494 / 5054    8723843 / 26177
10     7951675 / 4353    8343450 / 5255    N/A
11     7951495 / 4999    8304799 / 6008    N/A
12     7950784 / 5585    8296509 / 6865    N/A

Results for a FASTQ file containing Illumina sequencing reads (original size 146798341 bytes):

Level  libdeflate (new)  libdeflate (old)  zlib
1      66241681 / 1008   68846711 / 1008   71346659 / 1993
2      65232261 / 1168   67777691 / 1142   70120245 / 2379
3      64525976 / 1458   66962260 / 1426   68239283 / 3870
4      64210758 / 1612   66104860 / 1879   66418997 / 3361
5      63297095 / 1847   65360530 / 2071   65578003 / 6554
6      62579781 / 2594   64673920 / 2854   64258311 / 14491
7      61960388 / 4458   64029845 / 4601   63481420 / 21404
8      61733834 / 7801   62099740 / 12120  63056558 / 37828
9      61683688 / 9269   60705152 / 15939  62951205 / 45619
10     59923123 / 16291  60470044 / 16975  N/A
11     59900738 / 19499  60032207 / 20433  N/A
12     59889501 / 23012  59956692 / 23556  N/A

Note: I improved the algorithms in ways other than the min_match_len heuristic, but that is the main one that matters here. In cases where min_match_len improved the compression ratio a lot at levels 1-7, a performance regression is expected, due to the increased number of literals used. Also, the regression in compression ratio at level 9 on the last file is expected, since the algorithm used at levels 8-9 changed to a faster one that avoids very bad results with certain types of files.

@jkbonfield
Author

Many thanks. That looks like great work. :-)
