[HUF] Improve Huffman sorting algorithm #2732
Conversation
lib/compress/huf_compress.c
```c
    } else if (bucketSize <= 128) {
        HUF_insertionSort(huffNode + bucketStartIdx, bucketSize);
    } else {
        HUF_simpleQuickSort(huffNode + bucketStartIdx, 0, bucketSize-1);
    }
```
I'd recommend just using a single sort function. Generally, fast quick sorts will look at array size, and for small array sizes will fall back to insertion sort. So the quick sort could detect arrays of < K (K ~5 maybe) elements and fall back to insertion sort.
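For illustration, here is a minimal sketch of such a hybrid sort in C. All names (`nodeElt_ex`, `hybridQuickSort_ex`) and the cutoff constant are hypothetical stand-ins, not the PR's actual implementation:

```c
/* Simplified stand-in for zstd's nodeElt; names here are illustrative. */
typedef struct { unsigned count; unsigned symbol; } nodeElt_ex;

/* Illustrative cutoff; the PR experimented with 6, 8, 16, and 32. */
#define INSERTION_SORT_THRESHOLD_EX 6

/* Sort arr[0..size-1] by descending count. */
static void insertionSort_ex(nodeElt_ex* arr, int size) {
    for (int i = 1; i < size; i++) {
        nodeElt_ex key = arr[i];
        int j = i - 1;
        while (j >= 0 && arr[j].count < key.count) {
            arr[j + 1] = arr[j];
            j--;
        }
        arr[j + 1] = key;
    }
}

/* Quicksort on arr[lo..hi] that hands small partitions to insertion sort.
 * Recursing on the smaller side and looping on the larger bounds the
 * stack depth to O(log n). */
static void hybridQuickSort_ex(nodeElt_ex* arr, int lo, int hi) {
    while (lo < hi) {
        if (hi - lo + 1 <= INSERTION_SORT_THRESHOLD_EX) {
            insertionSort_ex(arr + lo, hi - lo + 1);
            return;
        }
        {   /* Lomuto partition around arr[hi], descending order. */
            nodeElt_ex pivot = arr[hi];
            int i = lo - 1;
            for (int j = lo; j < hi; j++) {
                if (arr[j].count >= pivot.count) {
                    nodeElt_ex tmp; i++;
                    tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp;
                }
            }
            {   int const p = i + 1;
                nodeElt_ex tmp = arr[p]; arr[p] = arr[hi]; arr[hi] = tmp;
                if (p - lo < hi - p) { hybridQuickSort_ex(arr, lo, p - 1); lo = p + 1; }
                else                 { hybridQuickSort_ex(arr, p + 1, hi); hi = p - 1; }
            }
        }
    }
}

int main(void) {
    nodeElt_ex nodes[] = { {3,'a'}, {7,'b'}, {1,'c'}, {7,'d'}, {2,'e'} };
    hybridQuickSort_ex(nodes, 0, 4);
    /* nodes are now in descending count order */
    return 0;
}
```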
So I think it's a net positive change, but it's hard to tell, really. Experimentally, 6 elements seems like an okay cutoff point. Using insertion sort within quicksort for calls with fewer than 6 elements, we see the following (cfeccb43 being the version with your suggestion):
| Benchmark | Config | Dataset | Ratio (5accf234) | Ratio (cfeccb43) | Ratio (cfeccb43 - 5accf234) | Speed MB/s (5accf234) | Speed MB/s (cfeccb43) | Speed MB/s (cfeccb43 - 5accf234) |
|---|---|---|---|---|---|---|---|---|
| compress | level_1 | enwik7 | 2.43 | 2.43 | -0.0% | 260.90 | 259.36 | -0.6% |
| compress | level_1 | enwik7_1k | 1.75 | 1.75 | 0.0% | 115.55 | 117.16 | 1.4% |
| compress | level_1 | silesia | 2.88 | 2.88 | 0.0% | 349.45 | 347.77 | -0.5% |
| compress | level_1 | silesia_16k | 2.62 | 2.62 | -0.0% | 247.51 | 249.12 | 0.7% |
| compress | level_1 | silesia_4k | 2.31 | 2.31 | 0.0% | 204.78 | 205.35 | 0.3% |
| compress | level_3 | enwik7 | 2.79 | 2.79 | 0.0% | 144.61 | 142.33 | -1.6% |
| compress | level_3 | enwik7_1k | 1.75 | 1.75 | -0.0% | 87.10 | 87.85 | 0.9% |
| compress | level_3 | silesia | 3.18 | 3.18 | 0.0% | 190.72 | 187.68 | -1.6% |
| compress | level_3 | silesia_16k | 2.67 | 2.67 | 0.0% | 182.81 | 183.10 | 0.2% |
| compress | level_3 | silesia_4k | 2.35 | 2.35 | -0.0% | 154.32 | 154.96 | 0.4% |
| compress | level_7 | enwik7 | 3.05 | 3.05 | 0.0% | 66.42 | 66.32 | -0.1% |
| compress | level_7 | enwik7_1k | 1.79 | 1.79 | -0.0% | 52.59 | 53.06 | 0.9% |
| compress | level_7 | silesia | 3.45 | 3.45 | -0.0% | 87.01 | 86.72 | -0.3% |
| compress | level_7 | silesia_16k | 2.80 | 2.80 | -0.0% | 51.72 | 52.20 | 0.9% |
| compress | level_7 | silesia_4k | 2.42 | 2.42 | 0.0% | 62.42 | 62.49 | 0.1% |
| compress_literals | level_1 | enwik7 | 1.49 | 1.49 | -0.0% | 783.73 | 778.03 | -0.7% |
| compress_literals | level_1 | enwik7_1k | 1.46 | 1.46 | 0.0% | 264.67 | 271.91 | 2.7% |
| compress_literals | level_1 | silesia | 1.30 | 1.30 | 0.0% | 604.35 | 604.73 | 0.1% |
| compress_literals | level_1 | silesia_16k | 1.29 | 1.29 | 0.0% | 558.75 | 557.81 | -0.2% |
| compress_literals | level_1 | silesia_1k | 1.28 | 1.28 | 0.0% | 265.02 | 272.08 | 2.7% |
| compress_literals | level_1 | silesia_4k | 1.29 | 1.29 | 0.0% | 276.64 | 283.87 | 2.6% |
| compress_literals | level_3 | enwik7 | 1.34 | 1.34 | 0.0% | 640.60 | 630.37 | -1.6% |
| compress_literals | level_3 | enwik7_1k | 1.40 | 1.40 | -0.0% | 201.50 | 205.74 | 2.1% |
| compress_literals | level_3 | silesia | 1.17 | 1.17 | -0.0% | 499.56 | 501.60 | 0.4% |
| compress_literals | level_3 | silesia_16k | 1.21 | 1.21 | 0.0% | 492.30 | 492.93 | 0.1% |
| compress_literals | level_3 | silesia_1k | 1.23 | 1.23 | 0.0% | 223.39 | 228.90 | 2.5% |
| compress_literals | level_3 | silesia_4k | 1.24 | 1.24 | 0.0% | 239.28 | 244.97 | 2.4% |
| compress_literals | level_7 | enwik7 | 1.29 | 1.29 | 0.0% | 567.40 | 557.35 | -1.8% |
| compress_literals | level_7 | enwik7_1k | 1.39 | 1.39 | -0.0% | 189.78 | 193.95 | 2.2% |
| compress_literals | level_7 | silesia | 1.15 | 1.15 | -0.0% | 461.80 | 464.98 | 0.7% |
| compress_literals | level_7 | silesia_16k | 1.21 | 1.21 | -0.0% | 477.96 | 478.90 | 0.2% |
| compress_literals | level_7 | silesia_1k | 1.22 | 1.22 | 0.0% | 215.32 | 220.73 | 2.5% |
| compress_literals | level_7 | silesia_4k | 1.23 | 1.23 | 0.0% | 232.33 | 237.94 | 2.4% |
The main issue seems to be that end-to-end compression on normal-sized blocks is consistently slower by a bit. That doesn't really make sense, since literals compression is typically still a bit faster, so it might just be mostly noise, but it happening 6/6 times when we compress 128K blocks is worrying.
Update: 32 elements actually seems like a better breakpoint. I've compared 8, 16, and 32 element thresholds. You can see the raw data here: https://pastebin.com/raw/jWd7Gu8R
I will try and see if some of the insertion sort optimizations in glibc's `qsort()` provide any additional benefit.
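One such optimization, sketched here from memory rather than copied from glibc's source, is a sentinel insertion sort: after the maximum element (for this descending order) is swapped to the front, no key can shift past index 0, so the inner loop can drop its bounds check. Names are again hypothetical, not from the PR:

```c
/* Same illustrative element type as the sketch above. */
typedef struct { unsigned count; unsigned symbol; } nodeElt_ex2;

/* Sentinel insertion sort (descending): first swap the maximum into
 * position 0; afterwards the inner loop needs no j >= 0 check, because
 * arr[0].count >= key.count always terminates it. */
static void insertionSortSentinel_ex(nodeElt_ex2* arr, int size) {
    if (size < 2) return;
    {   int maxIdx = 0;
        for (int i = 1; i < size; i++)
            if (arr[i].count > arr[maxIdx].count) maxIdx = i;
        {   nodeElt_ex2 tmp = arr[0]; arr[0] = arr[maxIdx]; arr[maxIdx] = tmp; }
    }
    /* arr[0..1] is now sorted, so start inserting from index 2. */
    for (int i = 2; i < size; i++) {
        nodeElt_ex2 key = arr[i];
        int j = i - 1;
        while (arr[j].count < key.count) {  /* sentinel guarantees termination */
            arr[j + 1] = arr[j];
            j--;
        }
        arr[j + 1] = key;
    }
}
```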
```diff
-    U32 base;
-    U32 curr;
+    U16 base;
+    U16 curr;
```
Note: accessing 32-bit values tends to be faster than accessing 16-bit values on modern CPU architectures.
I'm not sure whether this impact is large enough to be measurable, but it's easy enough to test.
Interesting, I'm actually seeing (slightly) better performance when using 16-bit values.
Well, there is also a benefit: a smaller memory footprint for the associated array, resulting in reduced L1 cache occupation.
So the cache benefit might outweigh the access deficit (`movzwl` vs `movl`).
In any case, it's the measurement that matters, and ensuring that the 16-bit version has better (or equivalent) speed is the right move.
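To make the footprint argument concrete, a small standalone sketch; the entry count below is illustrative, not the real rankPosition table size, and the struct names are hypothetical:

```c
#include <stdio.h>
#include <stdint.h>

typedef struct { uint32_t base; uint32_t curr; } rankPos32_ex; /* 8 bytes */
typedef struct { uint16_t base; uint16_t curr; } rankPos16_ex; /* 4 bytes */

int main(void) {
    enum { TABLE_SIZE_EX = 192 };  /* illustrative entry count */
    /* Halving the element size halves the table's cache footprint:
     * fewer L1 lines touched per sort, at the cost of movzwl vs movl loads. */
    printf("U32 table: %zu bytes\n", sizeof(rankPos32_ex) * TABLE_SIZE_EX);
    printf("U16 table: %zu bytes\n", sizeof(rankPos16_ex) * TABLE_SIZE_EX);
    return 0;
}
```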
lib/compress/huf_compress.c
```c
static U32 HUF_getIndex(U32 const count) {
    return (count < RANK_POSITION_DISTINCT_COUNT_CUTOFF)
        ? count
        : BIT_highbit32(count+1) + RANK_POSITION_LOG_BUCKETS_BEGIN;
}
```
You shouldn't need a `+1` here, right? Since we know the count isn't zero.
When I removed the `+1`, I noticed a 0.4-0.6% slowdown on level 1, 4KB-block silesia.tar, so I left it in; it seems like it may be a code alignment issue? I guess it would be better not to leave it in just for the sake of that perturbation.
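For context, here is a standalone sketch of the bucketing scheme under discussion, with the `+1` omitted as suggested. The constants and `highbit32_ex` are hypothetical stand-ins for `RANK_POSITION_DISTINCT_COUNT_CUTOFF`, `RANK_POSITION_LOG_BUCKETS_BEGIN`, and `BIT_highbit32`; the real values live in huf_compress.c:

```c
#include <stdio.h>
#include <stdint.h>

/* Stand-in for BIT_highbit32: index of the highest set bit. Undefined for 0,
 * which is the case the +1 would guard against if count could ever be 0. */
static unsigned highbit32_ex(uint32_t v) {
    unsigned r = 0;
    while (v >>= 1) r++;
    return r;
}

/* Hypothetical constants, for illustration only. */
#define DISTINCT_COUNT_CUTOFF_EX 32
#define LOG_BUCKETS_BEGIN_EX 27

/* Small counts each get their own bucket; large counts share log2 buckets. */
static uint32_t getIndex_ex(uint32_t count) {
    return (count < DISTINCT_COUNT_CUTOFF_EX)
        ? count
        : highbit32_ex(count) + LOG_BUCKETS_BEGIN_EX;
}

int main(void) {
    uint32_t samples[] = { 1, 31, 32, 63, 64, 1000000 };
    for (int i = 0; i < 6; i++)
        printf("count=%u -> bucket %u\n", samples[i], getIndex_ex(samples[i]));
    return 0;
}
```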
```diff
@@ -2,19 +2,19 @@ Data, Config, Method,
 silesia.tar, level -5, compress simple, 6738593
 silesia.tar, level -3, compress simple, 6446372
 silesia.tar, level -1, compress simple, 6186042
 silesia.tar, level 0, compress simple, 4861425
```
My understanding of this PR is that it improves the speed of the Huffman sorting stage, resulting in faster compression, visible mostly on small blocks.
However, the regression tests show that it also (slightly) alters compression ratio results, suggesting there is more to this PR than just improving the speed of the sorting stage.
Can you explain this effect?
Now that we use a hybrid quicksort/insertion sort instead of a raw insertion sort, the sorting algorithm is not stable, so symbols with the same frequency might end up in different positions than they would with the raw insertion sort. It doesn't matter for correctness, though (Huffman only cares about the frequency), and the result is still deterministic.
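As a toy illustration of the stability point (not zstd code): sorting by count alone leaves the relative order of equal-count symbols up to the sort algorithm, yet any such order yields the same Huffman code lengths, since those depend only on the counts.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct { unsigned count; unsigned char symbol; } sym_ex;

/* Compare by count only (descending): equal-count symbols are "tied",
 * and an unstable sort may order them either way. */
static int byCountDesc(const void* a, const void* b) {
    unsigned ca = ((const sym_ex*)a)->count, cb = ((const sym_ex*)b)->count;
    return (ca < cb) - (ca > cb);
}

int main(void) {
    /* 'b' and 'c' have equal counts; their relative order after sorting
     * may differ from the input order, but the Huffman weights assigned
     * to them would be identical either way. */
    sym_ex tab[] = { {5,'a'}, {3,'b'}, {3,'c'}, {9,'d'} };
    qsort(tab, 4, sizeof(sym_ex), byCountDesc);
    for (int i = 0; i < 4; i++) printf("%c:%u ", tab[i].symbol, tab[i].count);
    printf("\n");
    return 0;
}
```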
> is still deterministic
I think that's the important point. As long as it's deterministic, I'm fine with the very small impact.
Since the counts are exactly the same for the symbols that are shuffled, we would expect the difference in compressed size to come from the Huffman header, not the stream; I guess it comes from the weights appearing in a different order in the FSE-compressed weights.
I wonder if there is some (small) opportunity to gain compression ratio by using a heuristic to decide which symbol should get the higher weight, to get a better FSE ratio, e.g. preferring to give higher weights to smaller symbols.
Yeah, I also expect the difference to come from the FSE encoding of the Huffman header.
The problem here is that the succession of state values is rather chaotic, and therefore doesn't lend itself well to any kind of "cheap heuristic" prediction.
If brute force is the only way to discover the smallest succession of states, then considering the tiny differences at stake, the effort looks outsized compared to the benefit.
This PR improves the current Huffman sorting strategy, mostly on small blocks, by doing the following:
Using @terrelln's nifty benchmarking tool, we see the following improvements: