
Fix dictionary training huffman segfault and small speed improvement #2773

Merged: 3 commits into facebook:dev on Sep 14, 2021

Conversation

senhuang42 (Contributor) commented Sep 9, 2021:

Since dictionary training can result in very large counts for a single symbol, we increase the number of log2 buckets from 17 to 32, and increase the sorting scratch space to 192 elements.

This actually increases speed as well, since the number of distinct count buckets also grows, from 114 to 166.
So now:

gcc
4KB blocks: enwik7: 153MB/s -> 159MB/s
4KB blocks: silesia: 194MB/s -> 200MB/s

Test: Also add a test to check that we can train on a high and low compressibility large corpus without failing. (This test fails on dev)

@senhuang42 senhuang42 changed the title Fix dictionary training huffman segfault Fix dictionary training huffman segfault and small speed improvement Sep 9, 2021
@@ -2,18 +2,18 @@ Data, Config, Method,
silesia.tar, level -5, compress simple, 6738593
silesia.tar, level -3, compress simple, 6446372
silesia.tar, level -1, compress simple, 6186042
-silesia.tar, level 0, compress simple, 4861423
+silesia.tar, level 0, compress simple, 4861424
Contributor commented:
Why does this change?

senhuang42 (Contributor, Author) commented Sep 9, 2021:

Isn't level 0 == level 3, and therefore uses huffman?

Contributor commented:

Indeed, level 0 == level 3, and level 3 changes by the same amount.

Though it's somewhat unclear why this PR would introduce a (very minor) 1-byte regression on multiple samples. The expectation was that it impacts only the Huffman sorting speed, not the outcome of the sort operation, which means the output should be identical?

senhuang42 (Contributor, Author) commented:

Oh, I see now that the question was probably referring to the change itself rather than to level 0 specifically.

The changes do affect the outcome of the sort operation. Because we now give distinct buckets to a larger range of counts, [0, 165] (previously [0, 114]), symbols with counts in [115, 165] are no longer sorted via quicksort but are directly inserted into buckets. So they can end up in a different order than before, which can affect the header generation.

@senhuang42 senhuang42 merged commit 29f595e into facebook:dev Sep 14, 2021