User control over minimum match length #57
Just so I understand: this optimization is good for DNA files because raising the minimum match length causes individual literals to be encoded with a more efficient Huffman code (because the resulting lcode tree will only have LZ matches > ~8 in length), and DNA reads are mostly random-looking data?
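As a back-of-envelope check of that intuition (the bit costs below are rough assumptions for illustration, not measured from any encoder or file in this thread): a DEFLATE match costs roughly a length code plus a distance code, while literals over a 4-symbol DNA alphabet Huffman-code at about 2 bits each, so a match has to cover quite a few literals before it pays for itself.

```python
import math

# Rough DEFLATE cost model.  These bit counts are assumptions chosen
# for illustration, not taken from a real compressed stream.
literal_bits = math.log2(4)   # ~2 bits/literal for an A/C/G/T alphabet
match_bits = 8 + 15           # ~length code + distance code, 32 KiB window

# A match shorter than this many symbols loses to plain literals.
break_even = match_bits / literal_bits
print(break_even)
```

Under these assumed costs the break-even point is around 11 symbols, which is at least in the same ballpark as the ~8 figure above.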
This issue concerns Illumina-sequencing read identifiers, not the actual DNA data in the read.
Sorry, I was confused by the formatting a little. So, in this particular case you see an improvement when setting the min match length because it will LZ the head of the ID (which repeats heavily) and then Huffman the tail. Interesting.
Do you have any sample files available I could test with? I'd like to investigate whether this case could be handled better automatically using some heuristics. Adding more options increases API complexity, so it's generally something I'd like to avoid.
Many thanks for the quick reply. A heuristic, if possible, would be better still. I was just thinking about what would be simple (and offering to do it if it's something you'd want); I don't understand it well enough to work out where to put heuristics, though. Take a look at the data in https://github.com/jkbonfield/htscodecs-corpus/tree/master/names. A quick run through all the *.names files with and without a +4 hack as above: Original
With change:
It generally helps a lot, but not universally.
I haven't analysed quite what it's doing, but my thinking was that the form of these strings is basically a common prefix (instrument name) followed by varying numeric fields. It's quite easy for parsers to generate an LZ match for "Y newline prefix:lane" instead of just "prefix:lane", possibly at the expense of a longer distance that isn't really justified for a random chance match. (I assume that's what the optimal parser is trying to judge, though.) Changing the min match length doesn't stop that, so clearly that's not what my change achieved. I did try hacking it to require a newline starting character, but that turned out to be tragic, so I canned that idea quickly.
Another test, this time with DNA sequence data, also shows zlib's Z_FILTERED to be a win over Z_DEFAULT_STRATEGY. I'll put the data file in ftp://ftp.sanger.ac.uk/pub/users/jkb/seq
That shows zlib with Z_DEFAULT_STRATEGY vs Z_FILTERED; libdeflate as cloned (optimal parsing enabled for levels 8-12); libdeflate with SUPPORT_NEAR_OPTIMAL_PARSING disabled (levels 8-9 only); libdeflate with SUPPORT_NEAR_OPTIMAL_PARSING disabled and +7 on the existing min_match len of 3; and finally max compression of 7zip in gzip mode. You can see that by default libdeflate beats zlib's defaults. Zlib gains a bit using Z_FILTERED, but it's still beaten when we use higher levels of libdeflate. However, libdeflate with a longer minimum match length trounces everything bar 7za, and it gets pretty close to matching that too.

For what it's worth, I see the same benefits in zlib. Z_FILTERED changes min match from 3 to 5. Bumping it to 10 manually shows level 6 gives 5935354 and level 9 gives 5773148, so again we see potential for manual tweakage there too.

PS. I don't know how to limit the minimum match length for the optimal parser, so no numbers have been presented there.
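For anyone wanting to reproduce the strategy comparison without building anything, Python's bundled zlib module exposes the same strategy constants as the C library. The synthetic A/C/G/T data here is a stand-in for the actual file on the FTP site, so the sizes will differ from the numbers quoted above.

```python
import random
import zlib

# Synthetic stand-in for DNA sequence data: a small alphabet, so short
# chance matches are common.  Not the actual file from this thread.
random.seed(0)
data = bytes(random.choice(b"ACGT") for _ in range(200_000))

def deflate(buf, strategy):
    # wbits=15 and memLevel=8 are zlib's defaults; only strategy varies.
    c = zlib.compressobj(9, zlib.DEFLATED, 15, 8, strategy)
    return c.compress(buf) + c.flush()

default_out = deflate(data, zlib.Z_DEFAULT_STRATEGY)
filtered_out = deflate(data, zlib.Z_FILTERED)
print("default: ", len(default_out))
print("filtered:", len(filtered_out))
```

Z_FILTERED tells zlib's deflate to drop short matches in favour of literals, which is exactly the knob being discussed here.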
In the greedy and lazy compressors, automatically increase the minimum match length from the default of 3 if the data doesn't contain many different literals. This greatly improves the compression ratio of levels 1-9 on certain types of data, such as DNA sequencing data, while not worsening the ratio on other types of data. The near-optimal compressor (used by compression levels 10-12) continues to use a minimum match length of 3, since it already did a much better job at deciding when short matches are worthwhile. Resolves #57
In the greedy and lazy compressors, automatically increase the minimum match length from the default of 3 if the data doesn't contain many different literals. This greatly improves the compression ratio of levels 1-9 on certain types of data, such as DNA sequencing data, while not worsening the ratio on other types of data. The near-optimal compressor (used by compression levels 10-12) continues to use a minimum match length of 3, since it already did a better job at deciding when short matches are worthwhile. (The method for setting the initial costs needs improvement; later commits address that.) Resolves #57
I implemented some heuristics which handle setting the minimum match length automatically. Commit 69a7ca0 is the main one. Here are the results for compressing all the
Here are the results for the
Results for
Results for a FASTQ file containing Illumina sequencing reads (original size 146798341 bytes):
Note: I improved the algorithms in ways other than the min_match_len heuristic, but that is the main one that matters here. In cases where min_match_len greatly improved the compression ratio at levels 1-7, a performance regression is expected, due to the increased number of literals used. Also, the regression in compression ratio at level 9 on the last file is expected, since the algorithm used at levels 8-9 was changed to a faster one that avoids very bad results with certain types of files.
Many thanks. That looks like great work. :-)
One of zlib's features is the strategy option, which permits Z_DEFAULT_STRATEGY vs Z_FILTERED. The latter changes the minimum match length, which correspondingly favours Huffman coding over LZ for small matches. On some data sets it can have a big impact.
The same logic can work for libdeflate, e.g. with and without this tiny patch:
This data is a series of DNA sequencing read identifiers, looking like:
There are periodic duplicates as the data is paired, typically 50-150 lines apart.
There are lots of spurious matches in the rather random-looking numbers (X and Y locations on a slide) which we're not interested in deflating, while the main prefix is critical. I think exposing the minimum match length (albeit in multiple places, and I haven't a clue about the btree version) is a simple tweak to gain more out of the library.
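A toy illustration of the effect (a simplified greedy parser written for this report, nothing like libdeflate's actual matchfinder): raising the minimum match length turns the short spurious matches inside the numeric fields back into literals, while the long shared-prefix matches survive.

```python
import random

def greedy_parse(data, min_match_len, window=32768):
    """Toy greedy LZ77 parse; returns (literal_count, match_count).

    Brute-force longest-match search -- fine for a small demo.
    """
    n = len(data)
    i = 0
    literals = matches = 0
    while i < n:
        best_len = 0
        for j in range(max(0, i - window), i):
            l = 0
            while i + l < n and l < 258 and data[j + l] == data[i + l]:
                l += 1
            if l > best_len:
                best_len = l
        if best_len >= min_match_len:
            matches += 1
            i += best_len
        else:
            literals += 1
            i += 1
    return literals, matches

# Synthetic read-ID-like lines: a shared prefix, then random-ish
# numeric fields (made up for this demo, not real Illumina output).
random.seed(1)
ids = "".join(
    f"@SIM001:1:FC706VJ:1:{random.randint(1, 2)}"
    f":{random.randint(1000, 99999)}:{random.randint(1000, 99999)}\n"
    for _ in range(30)
).encode()

for m in (3, 7):
    print(m, greedy_parse(ids, m))
```

Running this, the higher minimum match length shifts the parse toward literals on the numeric tails while the repeated prefix is still matched, which is the behaviour the request is after.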
Does this sound like something you'd consider adding?
Edit: with optimal parsing disabled and gzip -9 it gets 4977440 using a min match of 7. Pretty close to zlib (I think that has a min match of 5, though).
Edit: Fixed formatting.