Fix dictionary training huffman segfault and small speed improvement #2773
Conversation
```diff
@@ -2,18 +2,18 @@ Data, Config, Method,
 silesia.tar, level -5, compress simple, 6738593
 silesia.tar, level -3, compress simple, 6446372
 silesia.tar, level -1, compress simple, 6186042
-silesia.tar, level 0, compress simple, 4861423
+silesia.tar, level 0, compress simple, 4861424
```
Why does this change?
Isn't level 0 == level 3, and therefore it uses Huffman?
Indeed, level 0 == level 3, and level 3 changes by the same amount.
Though it's somewhat unclear why this PR would introduce a (very minor) 1-byte regression on multiple samples. The expectation was that it impacts huffman sorting speed only, not the outcome of the sort operation, which means the output should be identical?
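For reference, the zstd public API documents level 0 as "use the default level", and ZSTD_CLEVEL_DEFAULT is 3, so the two levels run the same compressor. A minimal check of that equivalence (a sketch; the input string and buffer sizes are arbitrary):

```c
#include <stdio.h>
#include <string.h>
#include <zstd.h>

int main(void) {
    const char src[] = "an input buffer, repeated repeated repeated repeated";
    char dst0[256], dst3[256];
    /* Level 0 means "default level" in the zstd API, and the default is 3. */
    size_t n0 = ZSTD_compress(dst0, sizeof dst0, src, sizeof src, 0);
    size_t n3 = ZSTD_compress(dst3, sizeof dst3, src, sizeof src, ZSTD_CLEVEL_DEFAULT);
    if (ZSTD_isError(n0) || ZSTD_isError(n3)) return 1;
    printf("identical: %d\n", n0 == n3 && memcmp(dst0, dst3, n0) == 0);
    return 0;
}
```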
Oh, I see now that the question was probably referring to the change itself rather than to level 0 specifically.
The changes do affect the outcome of the sort operation. Because we now accept a larger range of counts, [0, 166] (previously [0, 114]), that get distinct buckets, symbols with counts in [115, 165] are no longer sorted via quicksort but are inserted directly into their buckets. They can therefore end up in a different order than before, which can affect the header generation.
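To make this concrete, here is a hypothetical, self-contained sketch of this kind of hybrid sort (the 166 cutoff matches the PR's new distinct-bucket range, but the code structure is illustrative, not zstd's actual HUF_sort): small counts go into exact per-count buckets via a counting sort, while the few symbols with large counts are quicksorted.

```c
#include <stdlib.h>
#include <string.h>

#define DISTINCT_COUNT_CUTOFF 166  /* was 114 before this PR */

typedef struct { unsigned char symbol; unsigned count; } SymCount;

static int cmpByCountDesc(const void* a, const void* b) {
    unsigned ca = ((const SymCount*)a)->count;
    unsigned cb = ((const SymCount*)b)->count;
    return (ca < cb) - (ca > cb);
}

/* Sort symbols by descending count; assumes n <= 256 (one entry per
 * byte value). Counts below the cutoff each get an exact bucket (a
 * counting sort, no comparisons); the rare large counts are gathered
 * and quicksorted. Ties inside a bucket keep input order, which need
 * not match what a full quicksort would have produced -- hence a
 * different, but equally valid, Huffman header. */
static void sortByCountDesc(SymCount* syms, size_t n) {
    SymCount large[256], out[256];
    size_t nLarge = 0, pos, i;
    size_t bucketCount[DISTINCT_COUNT_CUTOFF];
    size_t bucketStart[DISTINCT_COUNT_CUTOFF];
    int c;
    memset(bucketCount, 0, sizeof(bucketCount));

    for (i = 0; i < n; i++) {
        if (syms[i].count < DISTINCT_COUNT_CUTOFF) bucketCount[syms[i].count]++;
        else large[nLarge++] = syms[i];
    }
    /* Large counts come first (descending). */
    qsort(large, nLarge, sizeof(SymCount), cmpByCountDesc);
    memcpy(out, large, nLarge * sizeof(SymCount));
    /* Lay out exact buckets from high count down to low. */
    pos = nLarge;
    for (c = DISTINCT_COUNT_CUTOFF - 1; c >= 0; c--) {
        bucketStart[c] = pos;
        pos += bucketCount[c];
    }
    for (i = 0; i < n; i++) {
        if (syms[i].count < DISTINCT_COUNT_CUTOFF)
            out[bucketStart[syms[i].count]++] = syms[i];
    }
    memcpy(syms, out, n * sizeof(SymCount));
}
```

Raising the cutoff moves more symbols out of the quicksort path and into the O(1) bucket-insert path, which is where the speed gain described below comes from.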
The branch was updated from a457a12 to d45d0ad.
Since dictionary training can result in very large counts for a single symbol, we increase the number of log2 buckets from 17 to 32 and increase the sorting scratch space to 192 elements.
This also improves speed, since the number of distinct count buckets grows from 114 to 166.
Test: also add a test checking that we can train on a large corpus of both high and low compressibility without failing. (This test fails on dev.)
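A sketch of what such a test could look like using the public ZDICT API (the harness, corpus sizes, and data generators here are illustrative, not the PR's actual test code):

```c
#include <stdlib.h>
#include <string.h>
#include <zdict.h>

/* Train a dictionary from equally sized samples; returns 1 on success.
 * Before this fix, very large single-symbol counts from a large corpus
 * could overflow the old sort layout and segfault. */
static int tryTrain(const char* corpus, size_t corpusSize, unsigned nbSamples) {
    char dict[16 << 10];
    size_t* sampleSizes = malloc(nbSamples * sizeof(size_t));
    unsigned i;
    size_t r;
    if (sampleSizes == NULL) return 0;
    for (i = 0; i < nbSamples; i++) sampleSizes[i] = corpusSize / nbSamples;
    r = ZDICT_trainFromBuffer(dict, sizeof dict, corpus, sampleSizes, nbSamples);
    free(sampleSizes);
    return !ZDICT_isError(r);
}

int main(void) {
    const size_t corpusSize = 10 << 20;  /* 10 MB */
    const unsigned nbSamples = 1000;
    char* buf = malloc(corpusSize);
    size_t i;
    int okHigh, okLow;
    if (buf == NULL) return 1;
    /* High compressibility: a short repeating pattern. */
    for (i = 0; i < corpusSize; i++) buf[i] = (char)('a' + (i % 8));
    okHigh = tryTrain(buf, corpusSize, nbSamples);
    /* Low compressibility: pseudo-random bytes. */
    for (i = 0; i < corpusSize; i++) buf[i] = (char)rand();
    okLow = tryTrain(buf, corpusSize, nbSamples);
    free(buf);
    return (okHigh && okLow) ? 0 : 1;
}
```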