
[HUF] Improve Huffman sorting algorithm #2732

Merged 1 commit into facebook:dev on Aug 4, 2021

Conversation

senhuang42 (Contributor) commented Jul 26, 2021

This PR improves the current Huffman sorting strategy, mostly on small blocks, by:

  • Decoupling bucketing and sorting into separate steps.
  • Skipping attempts to sort buckets that are 0- or 1-sized (e.g. silesia/dickens has many 0-count symbols).
  • Using 128 buckets, split between distinct-count buckets that never need sorting and the typical log2 bucketing.
    • This reduces the number of buckets we actually need to sort to at most 11, and reduces the average number of elements per bucket.
  • Using quicksort.

Using @terrelln's nifty benchmarking tool, we see the following improvements:

| Benchmark | Config | Dataset | Ratio (dev) | Ratio (huf_sort_improvement) | Ratio (huf_sort_improvement - dev) | Speed MB/s (dev) | Speed MB/s (huf_sort_improvement) | Speed MB/s (huf_sort_improvement - dev) |
|---|---|---|---|---|---|---|---|---|
| compress | level_1 | silesia | 2.88 | 2.88 | -0.0% | 347.03 | 347.49 | 0.1% |
| compress | level_1 | silesia_4k | 2.31 | 2.31 | 0.0% | 183.02 | 200.93 | 9.8% |
| compress_literals | level_1 | silesia | 1.28 | 1.28 | -0.0% | 755.42 | 780.19 | 3.3% |
| compress_literals | level_1 | silesia_4k | 1.30 | 1.30 | 0.0% | 233.50 | 330.14 | 41.4% |
| compress_literals | level_7 | silesia | 1.11 | 1.11 | -0.0% | 677.14 | 746.20 | 10.2% |
  • Nearly 10% speedup for 4KB block compression, and a 40% literal compression speedup.
  • Neutral end-to-end compression speed on 128K blocks, with improvements to literal compression.
  • Very small perturbations to compressed size (since quicksort is not stable).

Comment on lines 518 to 595
} else if (bucketSize <= 128) {
HUF_insertionSort(huffNode + bucketStartIdx, bucketSize);
} else {
HUF_simpleQuickSort(huffNode + bucketStartIdx, 0, bucketSize-1);
}
Contributor
I'd recommend just using a single sort function. Generally, fast quick sorts will look at array size, and for small array sizes will fall back to insertion sort. So the quick sort could detect arrays of < K (K ~5 maybe) elements and fall back to insertion sort.

senhuang42 (Author) Jul 26, 2021

So I think it's a net positive change, but it's hard to tell, really. Experimentally, 6 elements seems like an okay cutoff point. Using insertion sort within quicksort for calls of fewer than 6 elements, we see the following (cfeccb43 being the version with your suggestion):

| Benchmark | Config | Dataset | Ratio (5accf234) | Ratio (cfeccb43) | Ratio (cfeccb43 - 5accf234) | Speed MB/s (5accf234) | Speed MB/s (cfeccb43) | Speed MB/s (cfeccb43 - 5accf234) |
|---|---|---|---|---|---|---|---|---|
| compress | level_1 | enwik7 | 2.43 | 2.43 | -0.0% | 260.90 | 259.36 | -0.6% |
| compress | level_1 | enwik7_1k | 1.75 | 1.75 | 0.0% | 115.55 | 117.16 | 1.4% |
| compress | level_1 | silesia | 2.88 | 2.88 | 0.0% | 349.45 | 347.77 | -0.5% |
| compress | level_1 | silesia_16k | 2.62 | 2.62 | -0.0% | 247.51 | 249.12 | 0.7% |
| compress | level_1 | silesia_4k | 2.31 | 2.31 | 0.0% | 204.78 | 205.35 | 0.3% |
| compress | level_3 | enwik7 | 2.79 | 2.79 | 0.0% | 144.61 | 142.33 | -1.6% |
| compress | level_3 | enwik7_1k | 1.75 | 1.75 | -0.0% | 87.10 | 87.85 | 0.9% |
| compress | level_3 | silesia | 3.18 | 3.18 | 0.0% | 190.72 | 187.68 | -1.6% |
| compress | level_3 | silesia_16k | 2.67 | 2.67 | 0.0% | 182.81 | 183.10 | 0.2% |
| compress | level_3 | silesia_4k | 2.35 | 2.35 | -0.0% | 154.32 | 154.96 | 0.4% |
| compress | level_7 | enwik7 | 3.05 | 3.05 | 0.0% | 66.42 | 66.32 | -0.1% |
| compress | level_7 | enwik7_1k | 1.79 | 1.79 | -0.0% | 52.59 | 53.06 | 0.9% |
| compress | level_7 | silesia | 3.45 | 3.45 | -0.0% | 87.01 | 86.72 | -0.3% |
| compress | level_7 | silesia_16k | 2.80 | 2.80 | -0.0% | 51.72 | 52.20 | 0.9% |
| compress | level_7 | silesia_4k | 2.42 | 2.42 | 0.0% | 62.42 | 62.49 | 0.1% |
| compress_literals | level_1 | enwik7 | 1.49 | 1.49 | -0.0% | 783.73 | 778.03 | -0.7% |
| compress_literals | level_1 | enwik7_1k | 1.46 | 1.46 | 0.0% | 264.67 | 271.91 | 2.7% |
| compress_literals | level_1 | silesia | 1.30 | 1.30 | 0.0% | 604.35 | 604.73 | 0.1% |
| compress_literals | level_1 | silesia_16k | 1.29 | 1.29 | 0.0% | 558.75 | 557.81 | -0.2% |
| compress_literals | level_1 | silesia_1k | 1.28 | 1.28 | 0.0% | 265.02 | 272.08 | 2.7% |
| compress_literals | level_1 | silesia_4k | 1.29 | 1.29 | 0.0% | 276.64 | 283.87 | 2.6% |
| compress_literals | level_3 | enwik7 | 1.34 | 1.34 | 0.0% | 640.60 | 630.37 | -1.6% |
| compress_literals | level_3 | enwik7_1k | 1.40 | 1.40 | -0.0% | 201.50 | 205.74 | 2.1% |
| compress_literals | level_3 | silesia | 1.17 | 1.17 | -0.0% | 499.56 | 501.60 | 0.4% |
| compress_literals | level_3 | silesia_16k | 1.21 | 1.21 | 0.0% | 492.30 | 492.93 | 0.1% |
| compress_literals | level_3 | silesia_1k | 1.23 | 1.23 | 0.0% | 223.39 | 228.90 | 2.5% |
| compress_literals | level_3 | silesia_4k | 1.24 | 1.24 | 0.0% | 239.28 | 244.97 | 2.4% |
| compress_literals | level_7 | enwik7 | 1.29 | 1.29 | 0.0% | 567.40 | 557.35 | -1.8% |
| compress_literals | level_7 | enwik7_1k | 1.39 | 1.39 | -0.0% | 189.78 | 193.95 | 2.2% |
| compress_literals | level_7 | silesia | 1.15 | 1.15 | -0.0% | 461.80 | 464.98 | 0.7% |
| compress_literals | level_7 | silesia_16k | 1.21 | 1.21 | -0.0% | 477.96 | 478.90 | 0.2% |
| compress_literals | level_7 | silesia_1k | 1.22 | 1.22 | 0.0% | 215.32 | 220.73 | 2.5% |
| compress_literals | level_7 | silesia_4k | 1.23 | 1.23 | 0.0% | 232.33 | 237.94 | 2.4% |

The main issue is that e2e compression on normal-sized blocks seems consistently slower by a small margin. That doesn't really make sense, since literals compression is typically still slightly faster, so it might just be mostly noise, but seeing it in 6/6 runs that compress 128K blocks is worrying.

Contributor Author

Update: 32 elements actually seems like a better breakpoint. I've compared 8, 16, and 32 element thresholds. You can see the raw data here: https://pastebin.com/raw/jWd7Gu8R

I will try and see whether some of the insertion sort optimizations in glibc's qsort() provide any additional benefit.

-U32 base;
-U32 curr;
+U16 base;
+U16 curr;
Cyan4973 (Contributor) Jul 27, 2021

note: accessing 32-bit values tends to be faster than accessing 16-bit values on modern CPU architectures.

Not sure whether the impact is large enough to be measurable, but it's easy enough to test.

senhuang42 (Author) Jul 27, 2021

Interesting, I'm actually seeing (slightly) better performance when using 16-bit values.

Cyan4973 (Contributor) Jul 27, 2021

Well, there is also a benefit: the smaller memory footprint of the associated array reduces L1 cache occupancy. So the cache benefit might outweigh the access deficit (movzwl vs movl).

In any case, it's the measurement that matters, and ensuring that the 16-bit version has better (or equivalent) speed is the right move.

senhuang42 force-pushed the huf_sort_improvement branch 4 times, most recently from dd5d6f5 to fa214e2 on July 27, 2021
senhuang42 changed the title from [RFC] Improve Huffman sorting algorithm to [HUF] Improve Huffman sorting algorithm on Jul 27, 2021
senhuang42 force-pushed the huf_sort_improvement branch 2 times, most recently from 6836e0f to 58a72b6 on July 28, 2021
static U32 HUF_getIndex(U32 const count) {
    return (count < RANK_POSITION_DISTINCT_COUNT_CUTOFF)
        ? count
        : BIT_highbit32(count+1) + RANK_POSITION_LOG_BUCKETS_BEGIN;
}
Contributor

You shouldn't need a +1 here, right? Since we know the count isn't zero.

senhuang42 (Author) Jul 28, 2021

When I did this, I noticed a 0.4-0.6% slowdown on level 1, 4KB block silesia.tar, so I left it in; it seems like maybe a code alignment issue? I guess it would be better not to leave it in just for the sake of that perturbation.

senhuang42 force-pushed the huf_sort_improvement branch 9 times, most recently from fc47d62 to 8c21c18 on August 2, 2021
@@ -2,19 +2,19 @@ Data, Config, Method,
silesia.tar, level -5, compress simple, 6738593
silesia.tar, level -3, compress simple, 6446372
silesia.tar, level -1, compress simple, 6186042
silesia.tar, level 0, compress simple, 4861425
Contributor

My understanding of this PR is that it improves the speed of the Huffman sorting stage, resulting in better compression speed, visible on small blocks.

However, regression tests show that it also (slightly) alters compression ratio results, suggesting that there is more to this PR than just improving the speed of the sorting stage.

Can you explain this effect?

Contributor Author

Now that we use a hybrid quicksort/insertion sort instead of pure insertion sort, the sorting algorithm is not stable, so symbols with the same frequency might end up in different positions than they would under a pure insertion sort. It doesn't matter for correctness (Huffman theoretically only cares about the frequencies), and the result is still deterministic.

Contributor

> is still deterministic

I think that's the important point. As long as it's deterministic, I'm fine with the very small impact.

Contributor

Since the counts are exactly the same for symbols that are shuffled, we would expect the difference in compressed size to come from the Huffman header, not the stream. I guess from weights appearing in a different order in the FSE-compressed weights.

I wonder if there is some (small) opportunity to gain compression ratio by using a heuristic to predict which symbol to prefer to give a higher weight to, to get a better FSE ratio. E.g. prefer to give higher weights to smaller symbols.

Contributor

Yeah, I also expect the difference to come from the FSE encoding of the Huffman header.

The problem here is that the succession of state values is rather chaotic, and therefore doesn't lend itself well to any kind of "cheap heuristic" prediction.
If we're left with only brute force to discover the smallest succession of states, then considering the tiny differences at stake, the effort looks outsized compared to the benefit.

senhuang42 merged commit f9b0340 into facebook:dev on Aug 4, 2021