Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block finder for Fixed Huffman blocks #20

Open
mxmlnkn opened this issue Jun 16, 2023 · 0 comments
Open

Block finder for Fixed Huffman blocks #20

mxmlnkn opened this issue Jun 16, 2023 · 0 comments
Labels
enhancement New feature or request low priority This can wait

Comments

@mxmlnkn
Copy link
Owner

mxmlnkn commented Jun 16, 2023

Deflate blocks with Fixed Huffman codings are currently ignored because only three bits (final bit, block type = 0b01) can be checked and the final block bit check might also be dropped as mentioned in #8 (comment). Testing only 2 bits would only filter 75% of the bit positions resulting in an unusably large false positive rate.

However, there is some more redundancy because the Fixed Huffman tree is non-optimal:

Literal/length values 286-287 will never actually occur in the compressed data, but participate in the code construction.

Therefore, we can do further filtering without having to wait for the EOB symbol and checking the header of the following block. This reduce false positives by 2/287 per decoded symbol (7-9 bits). After checking 128 symbols from a random data input like this, the check would only succeed 41%. This begins to become significant but is vastly more expensive than the header bit checks. Note how we need to check 128 * ~8 bits to filter 60% vs. checking a single header bit to filter 50%! We would need to check roughly 1024 or more symbols to get a usable false positive rate. Note that blocks are mostly sized 16-128 KiB, so this would still be a magnitude or two cheaper than decompressing the whole block and checking the next header. Note also that these are only very rough estimates and the EOB byte might have to be added to this false positive rate calculation to yield 3/287 filtering per symbol, because after after encountering the EOB bit, checking for a valid following block also becomes comparably cheap.

There are multiple things which might make such a check much cheaper than it sounds:

  • Because we are only interested in whether the following symbol is one of the forbidden ones or not, we don't need to store the decoded symbol, which reduces the Huffman decoding LUT by less than half (the other half being the number of bits to skip to the next symbol). This makes double symbol caching more feasible and might even enable triple-symbol validity check (the actual code lengths would only be queried from the other double-symbol-cached code length LUT when the triple-symbol validity LUT returns true (single bit). Note though, that I remember bit-LUTs being slower than 1-bit stored per byte-sized entry LUTs, probably because testing for a single bit inside a bytes makes querying the LUT more expensive. This approach might be more trouble than it is worth.
    The LUT for the Fixed Huffmann tree can be optimized even further because it is highly regular:

    So in theory you could set up a lookup table with 32 entries to quickly read in the codes

    00000 00   .. 00101 11     7 bits  + 256   -> (256..279)
    00110 000  .. 10111 111    8 bits  -  48   -> (  0..144)
    11000 000  .. 11000 111    8 bits  +  78   -> (280..287)
    11001 0000 .. 11111 1111   9 bits  - 256   -> (144..255)
    
  • Alternatively, we could avoid duplicate work by applying the knowledge of a single failed 1024+ symbol check to mean that any deflate block boundary starting at any of the encountered code boundaries will also fail! This could reduce the amount of work thousand-fold! And all of this could be stored in a single bit-vector and could therefore be compressed well. The index of such a vector would be the bit position in the stream and the value whether it encountered an invalid symbol when starting to decode the Huffman codes from there.
    Assuming that Huffman decoding is not too self-stabilizing, this would mean that after failing to start from approx. 8 different bit positions, we would end up with most of the bit positions being sieved out. However, it might happen, that decoding from another position might end up on a previously processed "path" of Huffman codes... Experiments for this would be very much helpful.
    All of these might be accelerated with SIMD(-like) instructions, applying bit-wise or/and on 64 or more bits at once and for finding candidate start positions and such. For example for cross-referencing possible Fixed block start offsets (test for 0b01) with all those crossed-out positions from a failed Huffman decoding run. And after 8 failed runs, we might not even check every bit but we could check whole bytes against being zero assuming that almost all positions have been crossed out.

As interesting as all of this is... I don't even have a file against which to test this if it works because Fixed Huffman blocks are rare. I would have to call zlib myself with the right arguments or create a single Fixed Huffman block using a very small file or igzip -0 and then concatenate those several times.

The only use-case it might solve would be files compressed with igzip -0, which create a very large single Fixed Huffman block. However, for that I would have to skip the deflate block header tests because I would have to start in the middle of such a deflate block to parallelize it. Currently, it does not fit into the whole architecture of rapidgzip, which assumed start at deflate block boundaries. We cannot start in the middle of a Dynamic Huffman block because we don't know the tree but this is not a problem for a Fixed Huffman block stream because we know the tree!

@mxmlnkn mxmlnkn added enhancement New feature or request low priority This can wait labels Jun 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request low priority This can wait
Projects
None yet
Development

No branches or pull requests

1 participant