WIP: S2++ #846
If the first bytes of a block are `0x40, 0x00` (repeat, length 4), this indicates that all [Copy with 4-byte offset (11)](https://github.com/google/snappy/blob/main/format_description.txt#L106) tags use 3-byte offsets instead for the remainder of the block. There can be no literals before this tag and no repeats before a match as specified above. This will only trigger on this exact tag.

> These are like the copies with 2-byte offsets (see previous subsection),
> except that the offset is stored as a 24-bit integer instead of a
> 16-bit integer (and thus will occupy three bytes).

When in this mode the maximum backreference offset is 16777215. This *cannot* be combined with dictionaries.
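A minimal Go sketch of how a decoder might recognize the marker and read a 24-bit offset. All function and constant names are hypothetical. The little-endian byte order is an assumption carried over from Snappy's existing 2-byte and 4-byte offsets. Only the marker bytes and the 16777215 limit come from the text above.

```go
package main

// Hypothetical names throughout; only the marker bytes and the 24-bit
// offset width are taken from the description above.

// maxOffset24 is the largest value a 24-bit offset field can hold.
const maxOffset24 = 1<<24 - 1

// hasOffset24Marker reports whether a block begins with the exact
// two-byte tag 0x40, 0x00 that switches the block to 3-byte offsets.
func hasOffset24Marker(block []byte) bool {
	return len(block) >= 2 && block[0] == 0x40 && block[1] == 0x00
}

// readOffset24 decodes a 3-byte offset from the start of b.
// Little-endian order is an assumption (it matches Snappy's other offsets).
func readOffset24(b []byte) (offset uint32, ok bool) {
	if len(b) < 3 {
		return 0, false
	}
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16, true
}
```

Because the marker is only valid as the very first tag, a decoder would need to run this check once per block, before the main decode loop.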
Attempted offset delta encoding -16 to 16, length 1-16. Extremely small hit rate. Not worth the complexity.
Experiment with using 1 bit from copy long offset to indicate repeats. Limits long offsets to length 32, down from 64, forcing a repeat. Repeat lengths are encoded as:
Copy lengths are encoded as:
So gains mainly depend on how many repeats there are compared to long offsets (with long lengths). Only sofia has a reasonable regression. Only inputs with many long offsets of length 32->64 should expect a regression. This can remove the change from literals=63, which makes the change cleaner. OP updated.
Added variable length encoding to TagCopy2 as well. Good improvement and simplifies encoding decisions.
Using 1 more bit for length in TagCopy4 gives a very reasonable improvement. Hard to make simple, though.
Experimenting with using Copy1 length 11 as indicator for extra (length-64): Negative means percentage smaller output: (combined table below) Undecided...
Using 10 bits (max 1024) for offset: Negative means percentage smaller output: (table below) Again, inconclusive....
Using 10 bit lengths + 4 bits length and last length, read 1 additional byte (16 base)
Offset before length extension? Probably faster for decoding - but maybe more tedious for encoding?
Minimum offset 1 will eliminate a lot of 0 checks.
Using one more bit for extra Copy4 length +28 is just too good to leave.
Length after offset.
Allow 0-3 literals in copy depending on uncompressed position. Table updated.
(project will be published under minio)
Aim
Improve the encoding method of S2 while keeping it backwards compatible for reading, with the following constraints:
Only changes that provide significant improvements with no decompression speed penalty will be considered.
No reduction in seek functionality is accepted.
Method
Fixes the biggest mistake in Snappy (though extremely rarely used in Snappy) - and also implements more efficient repeat codes.
If the first bytes of a block are `0x80, 0x00, 0x00` (copy, 2 byte offset = 0), this indicates that all Copy with 2-byte offset (10) and Copy with 4-byte offset (11) tags change.
There can be no literals before this tag and no repeats before a match as specified above.
This will only trigger on this exact tag.
Discussion
Blocks below 64K do not need to add this, and it will just be 3 wasted bytes.
65536 could be added to the base value, but having a 16MB maximum backreference seems neater.
Using a 3 byte indicator, since a block can start with an initial repeat. Having this as the first block will always be invalid in current decoders.
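The size consideration above can be sketched in Go as follows. This is only an illustration: the function names are hypothetical, and treating "more than 64K" as the cut-off is an assumption about where exactly the boundary sits. The marker bytes are the ones given under Method.

```go
package main

// Hypothetical sketch; names and the exact threshold are assumptions.

// offsetModeMarker is the 3-byte indicator described under Method.
var offsetModeMarker = [3]byte{0x80, 0x00, 0x00}

// needsMarker reports whether a block is large enough that a
// back-reference could exceed what a 16-bit offset can express.
func needsMarker(blockLen int) bool {
	return blockLen > 64<<10
}

// appendBlockStart appends the marker to dst only when the block is
// larger than 64K, so smaller blocks do not pay the 3 extra bytes.
func appendBlockStart(dst []byte, blockLen int) []byte {
	if needsMarker(blockLen) {
		dst = append(dst, offsetModeMarker[:]...)
	}
	return dst
}
```

With this sketch, `appendBlockStart(nil, 4<<20)` yields the 3 marker bytes, while a 1000-byte block gets nothing prepended.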
Seems like the encoder can unconditionally enable this when block is >64K. Sizes below are with it enabled for all blocks, 4MB blocks.
Pretty much always better unless just storing.
Consider if old repeat codes should be disabled in this mode (probably). Yes
Sizes
Percentages are calculated as reduction in output size and reduction as percentage of input size.
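One way to read the two percentages, sketched in Go. The helper name and parameters are hypothetical, and the sign convention (positive means smaller output) is an assumption. The function also assumes non-zero sizes.

```go
package main

// sizeReduction is a hypothetical helper. It returns the bytes saved by
// the new encoding as a percentage of the previous output size and as a
// percentage of the uncompressed input size. Positive values mean the
// new output is smaller (an assumed sign convention).
func sizeReduction(inputSize, oldSize, newSize int) (ofOutput, ofInput float64) {
	saved := float64(oldSize - newSize)
	ofOutput = 100 * saved / float64(oldSize)
	ofInput = 100 * saved / float64(inputSize)
	return ofOutput, ofInput
}
```

For example, a 1000-byte input that previously compressed to 500 bytes and now compresses to 450 bytes gives a 10% reduction in output size and a 5% reduction relative to input size.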