Skip to content

Commit

Permalink
Merge pull request #3514 from facebook/spec_huffman
Browse files Browse the repository at this point in the history
Clarify zstd specification for Huffman blocks
  • Loading branch information
Cyan4973 authored Feb 23, 2023
2 parents 395a2c5 + 832f559 commit e1ab691
Showing 1 changed file with 16 additions and 7 deletions.
23 changes: 16 additions & 7 deletions doc/zstd_compression_format.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ Distribution of this document is unlimited.

### Version

0.3.7 (2020-12-09)
0.3.8 (2023-02-18)


Introduction
Expand Down Expand Up @@ -470,6 +470,7 @@ This field uses 2 lowest bits of first byte, describing 4 different block types
repeated `Regenerated_Size` times.
- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
starting with a Huffman tree description.
In this mode, there are at least 2 different literals represented in the Huffman tree description.
See details below.
- `Treeless_Literals_Block` - This is a Huffman-compressed block,
using Huffman tree _from previous Huffman-compressed literals block_.
Expand Down Expand Up @@ -566,6 +567,7 @@ or from a dictionary.

### `Huffman_Tree_Description`
This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
The tree describes the weights of all literals symbols that can be present in the literals block, at least 2 and up to 256.
The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
The size of `Huffman_Tree_Description` is determined during decoding process,
it must be used to determine where streams begin.
Expand Down Expand Up @@ -1197,7 +1199,7 @@ Huffman Coding
--------------
Zstandard Huffman-coded streams are read backwards,
similar to the FSE bitstreams.
Therefore, to find the start of the bitstream, it is therefore to
Therefore, to find the start of the bitstream, it is required to
know the offset of the last byte of the Huffman-coded stream.

After writing the last bit containing information, the compressor
Expand Down Expand Up @@ -1239,9 +1241,15 @@ Transformation from `Weight` to `Number_of_Bits` follows this formula :
```
Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
```
The last symbol's `Weight` is deduced from previously decoded ones,
by completing to the nearest power of 2.
This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
When a literal value is not present, it receives a `Weight` of 0.
The least frequent symbol receives a `Weight` of 1.
Consequently, the `Weight` 1 is necessarily present.
The most frequent symbol receives a `Weight` anywhere between 1 and 11 (max).
The last symbol's `Weight` is deduced from previously retrieved Weights,
by completing to the nearest power of 2. It's necessarily non 0.
If it's not possible to reach a clean power of 2 with a single `Weight` value,
the Huffman Tree Description is considered invalid.
This final power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
`Max_Number_of_Bits` must be <= 11,
otherwise the representation is considered corrupted.

Expand All @@ -1254,7 +1262,7 @@ Let's presume the following Huffman tree must be described :

The tree depth is 4, since its longest elements uses 4 bits
(longest elements are the one with smallest frequency).
Value `5` will not be listed, as it can be determined from values for 0-4,
Literal value `5` will not be listed, as it can be determined from previous values 0-4,
nor will values above `5` as they are all 0.
Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
Weight formula is :
Expand All @@ -1274,7 +1282,7 @@ The `Weight` of `5` can be determined by advancing to the next power of 2.
The sum of `2^(Weight-1)` (excluding 0's) is :
`8 + 4 + 2 + 0 + 1 = 15`.
Nearest larger power of 2 value is 16.
Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 16-15 = 1`.
Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`.

#### Huffman Tree header

Expand Down Expand Up @@ -1683,6 +1691,7 @@ or at least provide a meaningful error code explaining for which reason it canno

Version changes
---------------
- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.
- 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878
- 0.3.6 : clarifications for Dictionary_ID
- 0.3.5 : clarifications for Block_Maximum_Size
Expand Down

0 comments on commit e1ab691

Please sign in to comment.