
[HUF] Improve Huffman sorting algorithm #2732

Merged 1 commit into facebook:dev on Aug 4, 2021

Conversation

senhuang42 (Contributor) commented Jul 26, 2021

This PR improves the current Huffman sorting strategy, mostly on small blocks, by:

  • Decoupling bucketing and sorting into separate steps.
  • Skipping attempts to sort buckets that are 0- or 1-sized (e.g. silesia/dickens has many 0-count symbols).
  • Using 128 buckets, split between distinct-count buckets that never need sorting and the typical log2 bucketing.
    • This reduces the number of buckets we actually need to sort to at most 11, and reduces the average number of elements per bucket.
  • Using quicksort.

Using @terrelln's nifty benchmarking tool, we see the following improvements:

| Benchmark | Config | Dataset | Ratio (dev) | Ratio (huf_sort_improvement) | Ratio (huf_sort_improvement - dev) | Speed MB/s (dev) | Speed MB/s (huf_sort_improvement) | Speed MB/s (huf_sort_improvement - dev) |
|---|---|---|---|---|---|---|---|---|
| compress | level_1 | silesia | 2.88 | 2.88 | -0.0% | 347.03 | 347.49 | 0.1% |
| compress | level_1 | silesia_4k | 2.31 | 2.31 | 0.0% | 183.02 | 200.93 | 9.8% |
| compress_literals | level_1 | silesia | 1.28 | 1.28 | -0.0% | 755.42 | 780.19 | 3.3% |
| compress_literals | level_1 | silesia_4k | 1.30 | 1.30 | 0.0% | 233.50 | 330.14 | 41.4% |
| compress_literals | level_7 | silesia | 1.11 | 1.11 | -0.0% | 677.14 | 746.20 | 10.2% |
  • Nearly 10% speedup for 4KB block compression, and a 40% literal compression speedup.
  • Neutral end-to-end compression speed on 128K blocks, with improvements to literal compression.
  • Very small perturbations to compressed size (since quicksort is not stable).

Comment on lines 518 to 595
} else if (bucketSize <= 128) {
HUF_insertionSort(huffNode + bucketStartIdx, bucketSize);
} else {
HUF_simpleQuickSort(huffNode + bucketStartIdx, 0, bucketSize-1);
}
Contributor
I'd recommend just using a single sort function. Generally, fast quick sorts will look at array size, and for small array sizes will fall back to insertion sort. So the quick sort could detect arrays of < K (K ~5 maybe) elements and fall back to insertion sort.

senhuang42 (Author) Jul 26, 2021

So I think it's a net positive change, but it's hard to tell, really. Experimentally, 6 elements seems like an okay cutoff point. Using insertion sort within quicksort for calls of fewer than 6 elements, we see the following (cfeccb43 being the version with your suggestion):

| Benchmark | Config | Dataset | Ratio (5accf234) | Ratio (cfeccb43) | Ratio (cfeccb43 - 5accf234) | Speed MB/s (5accf234) | Speed MB/s (cfeccb43) | Speed MB/s (cfeccb43 - 5accf234) |
|---|---|---|---|---|---|---|---|---|
| compress | level_1 | enwik7 | 2.43 | 2.43 | -0.0% | 260.90 | 259.36 | -0.6% |
| compress | level_1 | enwik7_1k | 1.75 | 1.75 | 0.0% | 115.55 | 117.16 | 1.4% |
| compress | level_1 | silesia | 2.88 | 2.88 | 0.0% | 349.45 | 347.77 | -0.5% |
| compress | level_1 | silesia_16k | 2.62 | 2.62 | -0.0% | 247.51 | 249.12 | 0.7% |
| compress | level_1 | silesia_4k | 2.31 | 2.31 | 0.0% | 204.78 | 205.35 | 0.3% |
| compress | level_3 | enwik7 | 2.79 | 2.79 | 0.0% | 144.61 | 142.33 | -1.6% |
| compress | level_3 | enwik7_1k | 1.75 | 1.75 | -0.0% | 87.10 | 87.85 | 0.9% |
| compress | level_3 | silesia | 3.18 | 3.18 | 0.0% | 190.72 | 187.68 | -1.6% |
| compress | level_3 | silesia_16k | 2.67 | 2.67 | 0.0% | 182.81 | 183.10 | 0.2% |
| compress | level_3 | silesia_4k | 2.35 | 2.35 | -0.0% | 154.32 | 154.96 | 0.4% |
| compress | level_7 | enwik7 | 3.05 | 3.05 | 0.0% | 66.42 | 66.32 | -0.1% |
| compress | level_7 | enwik7_1k | 1.79 | 1.79 | -0.0% | 52.59 | 53.06 | 0.9% |
| compress | level_7 | silesia | 3.45 | 3.45 | -0.0% | 87.01 | 86.72 | -0.3% |
| compress | level_7 | silesia_16k | 2.80 | 2.80 | -0.0% | 51.72 | 52.20 | 0.9% |
| compress | level_7 | silesia_4k | 2.42 | 2.42 | 0.0% | 62.42 | 62.49 | 0.1% |
| compress_literals | level_1 | enwik7 | 1.49 | 1.49 | -0.0% | 783.73 | 778.03 | -0.7% |
| compress_literals | level_1 | enwik7_1k | 1.46 | 1.46 | 0.0% | 264.67 | 271.91 | 2.7% |
| compress_literals | level_1 | silesia | 1.30 | 1.30 | 0.0% | 604.35 | 604.73 | 0.1% |
| compress_literals | level_1 | silesia_16k | 1.29 | 1.29 | 0.0% | 558.75 | 557.81 | -0.2% |
| compress_literals | level_1 | silesia_1k | 1.28 | 1.28 | 0.0% | 265.02 | 272.08 | 2.7% |
| compress_literals | level_1 | silesia_4k | 1.29 | 1.29 | 0.0% | 276.64 | 283.87 | 2.6% |
| compress_literals | level_3 | enwik7 | 1.34 | 1.34 | 0.0% | 640.60 | 630.37 | -1.6% |
| compress_literals | level_3 | enwik7_1k | 1.40 | 1.40 | -0.0% | 201.50 | 205.74 | 2.1% |
| compress_literals | level_3 | silesia | 1.17 | 1.17 | -0.0% | 499.56 | 501.60 | 0.4% |
| compress_literals | level_3 | silesia_16k | 1.21 | 1.21 | 0.0% | 492.30 | 492.93 | 0.1% |
| compress_literals | level_3 | silesia_1k | 1.23 | 1.23 | 0.0% | 223.39 | 228.90 | 2.5% |
| compress_literals | level_3 | silesia_4k | 1.24 | 1.24 | 0.0% | 239.28 | 244.97 | 2.4% |
| compress_literals | level_7 | enwik7 | 1.29 | 1.29 | 0.0% | 567.40 | 557.35 | -1.8% |
| compress_literals | level_7 | enwik7_1k | 1.39 | 1.39 | -0.0% | 189.78 | 193.95 | 2.2% |
| compress_literals | level_7 | silesia | 1.15 | 1.15 | -0.0% | 461.80 | 464.98 | 0.7% |
| compress_literals | level_7 | silesia_16k | 1.21 | 1.21 | -0.0% | 477.96 | 478.90 | 0.2% |
| compress_literals | level_7 | silesia_1k | 1.22 | 1.22 | 0.0% | 215.32 | 220.73 | 2.5% |
| compress_literals | level_7 | silesia_4k | 1.23 | 1.23 | 0.0% | 232.33 | 237.94 | 2.4% |

The main issue is that e2e compression on normal-sized blocks seems consistently slower by a small margin. That doesn't really make sense, since literals compression is typically still slightly faster, so it might just be mostly noise, but seeing it in 6/6 runs that compress 128K blocks is worrying.

Contributor Author

Update: 32 elements actually seems like a better breakpoint. I've compared 8, 16, and 32 element thresholds. You can see the raw data here: https://pastebin.com/raw/jWd7Gu8R

I will try and see whether some of the insertion sort optimizations in glibc's qsort() provide any additional benefit.

-U32 base;
-U32 curr;
+U16 base;
+U16 curr;
Cyan4973 (Contributor) Jul 27, 2021

note: accessing 32-bit values tends to be faster than accessing 16-bit values on modern CPU architectures.

Not sure whether the impact is large enough to be measurable, but it's easy enough to test.

senhuang42 (Author) Jul 27, 2021

Interesting, I'm actually seeing (slightly) better performance when using 16-bit values.

Cyan4973 (Contributor) Jul 27, 2021

Well, there is also a benefit: the smaller memory footprint of the associated array reduces L1 cache occupancy. So the cache benefit might outweigh the access deficit (movzwl vs movl).

In any case, it's the measurement that matters, and ensuring that the 16-bit version has better (or equivalent) speed is the right move.

senhuang42 force-pushed the huf_sort_improvement branch 4 times, most recently from dd5d6f5 to fa214e2 on July 27, 2021
senhuang42 changed the title from [RFC] Improve Huffman sorting algorithm to [HUF] Improve Huffman sorting algorithm on Jul 27, 2021
senhuang42 force-pushed the huf_sort_improvement branch 2 times, most recently from 6836e0f to 58a72b6 on July 28, 2021
static U32 HUF_getIndex(U32 const count) {
    return (count < RANK_POSITION_DISTINCT_COUNT_CUTOFF)
        ? count
        : BIT_highbit32(count+1) + RANK_POSITION_LOG_BUCKETS_BEGIN;
}
Contributor

You shouldn't need a +1 here, right? Since we know the count isn't zero.

senhuang42 (Author) Jul 28, 2021

When I did this, I noticed a 0.4-0.6% slowdown on level 1, 4KB block silesia.tar, so I left it in; it seems like maybe a code alignment issue? I guess it would be better not to leave it in just for the sake of that perturbation.

senhuang42 force-pushed the huf_sort_improvement branch 9 times, most recently from fc47d62 to 8c21c18 on August 2, 2021
@@ -2,19 +2,19 @@ Data, Config, Method,
silesia.tar, level -5, compress simple, 6738593
silesia.tar, level -3, compress simple, 6446372
silesia.tar, level -1, compress simple, 6186042
silesia.tar, level 0, compress simple, 4861425
Contributor

My understanding of this PR is that it improves the speed of the Huffman sorting stage, resulting in better compression speed, visible on small blocks.

However, regression tests show that it also (slightly) alters compression ratio results, suggesting that there is more to this PR than just improving the speed of the sorting stage.

Can you explain this effect?

Contributor Author

Now that we use a hybrid quicksort/insertion sort instead of pure insertion sort, the sorting algorithm is not stable, so symbols with the same frequency might end up in different positions than they would under a pure insertion sort. It doesn't matter for correctness (Huffman theoretically only cares about the frequencies), and the result is still deterministic.

Contributor

> is still deterministic

I think that's the important point. As long as it's deterministic, I'm fine with the very small impact.

Contributor

Since the counts are exactly the same for symbols that are shuffled, we would expect the difference in compressed size to come from the Huffman header, not the stream. I guess from weights appearing in a different order in the FSE-compressed weights.

I wonder if there is some (small) opportunity to gain compression ratio by using a heuristic to predict which symbol to prefer to give a higher weight to, to get a better FSE ratio. E.g. prefer to give higher weights to smaller symbols.

Contributor

Yeah, I also expect the difference to come from the FSE encoding of the Huffman header.

The problem here is that the succession of state values is rather chaotic, and therefore doesn't lend itself well to any kind of "cheap heuristic" prediction.
If we're left with only brute force to discover the smallest succession of states, then considering the tiny differences at stake, the effort looks outsized compared to the benefit.

senhuang42 merged commit f9b0340 into facebook:dev on Aug 4, 2021