Fix dictionary training huffman segfault and small speed improvement #2773
Conversation
```diff
@@ -2,18 +2,18 @@ Data, Config, Method,
 silesia.tar, level -5, compress simple, 6738593
 silesia.tar, level -3, compress simple, 6446372
 silesia.tar, level -1, compress simple, 6186042
-silesia.tar, level 0, compress simple, 4861423
+silesia.tar, level 0, compress simple, 4861424
```
Why does this change?
Isn't level 0 == level 3, and therefore it uses Huffman?
Indeed, level 0 == level 3, and level 3 changes by the same amount.
Though it's somewhat unclear why this PR would introduce a (very minor) 1-byte regression on multiple samples. The expectation was that it impacts huffman sorting speed only, not the outcome of the sort operation, which means the output should be identical?
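For reference, the zstd public API documents level 0 as "use the default level", and ZSTD_CLEVEL_DEFAULT is 3, so the two levels run the same compressor. A minimal check of that equivalence (a sketch; the input string and buffer sizes are arbitrary):

```c
#include <stdio.h>
#include <string.h>
#include <zstd.h>

int main(void) {
    const char src[] = "an input buffer, repeated repeated repeated repeated";
    char dst0[256], dst3[256];
    /* Level 0 means "default level" in the zstd API, and the default is 3. */
    size_t n0 = ZSTD_compress(dst0, sizeof dst0, src, sizeof src, 0);
    size_t n3 = ZSTD_compress(dst3, sizeof dst3, src, sizeof src, ZSTD_CLEVEL_DEFAULT);
    if (ZSTD_isError(n0) || ZSTD_isError(n3)) return 1;
    printf("identical: %d\n", n0 == n3 && memcmp(dst0, dst3, n0) == 0);
    return 0;
}
```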
Oh, I see now that the question was probably referring to the change itself rather than to level 0 specifically.
The changes do affect the outcome of the sort operation. Because we now accept a larger range of counts, [0, 166] (previously [0, 114]), that get distinct buckets, symbols with counts in [115, 165] are no longer sorted via quicksort but are inserted directly into their buckets. They can therefore end up in a different order than before, which can affect the header generation.
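To make this concrete, here is a hypothetical, self-contained sketch of this kind of hybrid sort (the 166 cutoff matches the PR's new distinct-bucket range, but the code structure is illustrative, not zstd's actual HUF_sort): small counts go into exact per-count buckets via a counting sort, while the few symbols with large counts are quicksorted.

```c
#include <stdlib.h>
#include <string.h>

#define DISTINCT_COUNT_CUTOFF 166  /* was 114 before this PR */

typedef struct { unsigned char symbol; unsigned count; } SymCount;

static int cmpByCountDesc(const void* a, const void* b) {
    unsigned ca = ((const SymCount*)a)->count;
    unsigned cb = ((const SymCount*)b)->count;
    return (ca < cb) - (ca > cb);
}

/* Sort symbols by descending count; assumes n <= 256 (one entry per
 * byte value). Counts below the cutoff each get an exact bucket (a
 * counting sort, no comparisons); the rare large counts are gathered
 * and quicksorted. Ties inside a bucket keep input order, which need
 * not match what a full quicksort would have produced -- hence a
 * different, but equally valid, Huffman header. */
static void sortByCountDesc(SymCount* syms, size_t n) {
    SymCount large[256], out[256];
    size_t nLarge = 0, pos, i;
    size_t bucketCount[DISTINCT_COUNT_CUTOFF];
    size_t bucketStart[DISTINCT_COUNT_CUTOFF];
    int c;
    memset(bucketCount, 0, sizeof(bucketCount));

    for (i = 0; i < n; i++) {
        if (syms[i].count < DISTINCT_COUNT_CUTOFF) bucketCount[syms[i].count]++;
        else large[nLarge++] = syms[i];
    }
    /* Large counts come first (descending). */
    qsort(large, nLarge, sizeof(SymCount), cmpByCountDesc);
    memcpy(out, large, nLarge * sizeof(SymCount));
    /* Lay out exact buckets from high count down to low. */
    pos = nLarge;
    for (c = DISTINCT_COUNT_CUTOFF - 1; c >= 0; c--) {
        bucketStart[c] = pos;
        pos += bucketCount[c];
    }
    for (i = 0; i < n; i++) {
        if (syms[i].count < DISTINCT_COUNT_CUTOFF)
            out[bucketStart[syms[i].count]++] = syms[i];
    }
    memcpy(syms, out, n * sizeof(SymCount));
}
```

Raising the cutoff moves more symbols out of the quicksort path and into the O(1) bucket-insert path, which is where the speed gain described below comes from.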
The branch was updated from a457a12 to d45d0ad.
Since dictionary training can result in very large counts for a single symbol, we increase the number of log2 buckets from 17 to 32 and increase the sorting scratch space to 192 elements.
This also improves speed, since the number of distinct count buckets grows from 114 to 166.
Test: also add a test checking that we can train on a large corpus of both high and low compressibility without failing. (This test fails on dev.)
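A sketch of what such a test could look like using the public ZDICT API (the harness, corpus sizes, and data generators here are illustrative, not the PR's actual test code):

```c
#include <stdlib.h>
#include <string.h>
#include <zdict.h>

/* Train a dictionary from equally sized samples; returns 1 on success.
 * Before this fix, very large single-symbol counts from a large corpus
 * could overflow the old sort layout and segfault. */
static int tryTrain(const char* corpus, size_t corpusSize, unsigned nbSamples) {
    char dict[16 << 10];
    size_t* sampleSizes = malloc(nbSamples * sizeof(size_t));
    unsigned i;
    size_t r;
    if (sampleSizes == NULL) return 0;
    for (i = 0; i < nbSamples; i++) sampleSizes[i] = corpusSize / nbSamples;
    r = ZDICT_trainFromBuffer(dict, sizeof dict, corpus, sampleSizes, nbSamples);
    free(sampleSizes);
    return !ZDICT_isError(r);
}

int main(void) {
    const size_t corpusSize = 10 << 20;  /* 10 MB */
    const unsigned nbSamples = 1000;
    char* buf = malloc(corpusSize);
    size_t i;
    int okHigh, okLow;
    if (buf == NULL) return 1;
    /* High compressibility: a short repeating pattern. */
    for (i = 0; i < corpusSize; i++) buf[i] = (char)('a' + (i % 8));
    okHigh = tryTrain(buf, corpusSize, nbSamples);
    /* Low compressibility: pseudo-random bytes. */
    for (i = 0; i < corpusSize; i++) buf[i] = (char)rand();
    okLow = tryTrain(buf, corpusSize, nbSamples);
    free(buf);
    return (okHigh && okLow) ? 0 : 1;
}
```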