Assembly implementation of 4X1 & 4X2 Huffman #2722

terrelln · 2021-07-07T03:25:43Z

Assembly in huf_decompress_x86_bmi2.S.

proba32.dat is 1MB of random symbols in [0, 32). Zstd compresses it as all literals and uses 4X2 to decompress.
proba128.dat is 1MB of random symbols in [0, 128). Zstd compresses it as all literals and uses 4X1 to decompress.
All files are compressed with level 1 to maximize the number of literals.

The assembly is wrapped by a C function that handles the initialization & ending conditions. Once the streams get close to the end, the C takes over and uses HUF_decompress1X*() to finish, similarly to the normal 4X1 and 4X2 loops.
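
The split between a fast inner loop and a careful finishing step can be sketched roughly as below. This is only an illustration of the control flow, not the PR's code: it copies bytes instead of decoding Huffman streams, and `demo_fast_loop`, `demo_careful_tail`, `demo_decode`, and `FAST_MARGIN` are hypothetical names standing in for the assembly loop, the HUF_decompress1X*() finish, the C wrapper, and the "close to the end" threshold.

```c
#include <stddef.h>

/* Hypothetical safety margin: the fast loop only runs while at least this
 * many bytes remain, so it never needs per-byte bounds checks. */
enum { FAST_MARGIN = 8 };

/* Stand-in for the assembly loop: handles 4 bytes per iteration and stops
 * once fewer than FAST_MARGIN bytes remain. Returns bytes processed. */
static size_t demo_fast_loop(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;
    while (n - i >= FAST_MARGIN) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
        i += 4;
    }
    return i;
}

/* Stand-in for the C finishing step: one byte at a time, fully checked. */
static size_t demo_careful_tail(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        dst[i] = src[i];
    }
    return i;
}

/* Stand-in for the C wrapper: run the fast loop, then let the careful
 * code finish whatever is left near the end of the input. */
size_t demo_decode(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t done = demo_fast_loop(dst, src, n);
    done += demo_careful_tail(dst + done, src + done, n - done);
    return done;
}
```

The point of the structure is that the hot loop can drop bounds checks because the wrapper guarantees it stops before the end, while the slower code handles the final bytes.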

File	gcc-11	gcc-11 asm	gcc-11 speedup	clang-12	clang-12 asm	clang-12 speedup
proba32.dat	1746	2357	+35%	1449	2354	+62%
proba128.dat	944	1318	+39%	906	1320	+45%
silesia.tar	1159	1244	+7.3%	1131	1246	+10.1%
dickens	1064	1160	+9.0%	1036	1169	+12.8%
mozilla	1000	1064	+6.4%	988	1071	+8.4%
mr	1264	1344	+6.3%	1203	1348	+12.0%
nci	1786	1855	+3.8%	1771	1819	+2.7%
ooffice	876	989	+12.9%	814	995	+22.2%
osdb	1209	1302	+7.6%	1216	1300	+6.9%
reymont	1073	1124	+4.7%	1069	1134	+6.1%
samba	1439	1529	+6.2%	1418	1525	+7.5%
sao	862	1017	+18.0%	868	1017	+17.1%
webster	1113	1183	+6.3%	1088	1189	+9.3%
xml	1796	1851	+3.1%	1753	1826	+4.2%
x-ray	1042	1280	+22.8%	877	1276	+45.5%
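
The speedup columns appear to be the relative gain of the assembly build over the plain build with the same compiler. A minimal sketch of that arithmetic, assuming this interpretation (the helper name `demo_speedup_percent` is hypothetical):

```c
/* Relative speedup, in percent, of `fast` over `base`.
 * Both values must be in the same units. */
double demo_speedup_percent(double base, double fast)
{
    return (fast / base - 1.0) * 100.0;
}
```

For the proba32.dat row with gcc-11, (2357 / 1746 - 1) * 100 is about 35, which matches the +35% shown.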

Cyan4973 · 2021-07-07T12:30:00Z

Great investigation @terrelln !
This is a promising starting point !

terrelln · 2021-08-04T21:23:18Z

In the latest work:

I've sped up HUF_readDTableX2() significantly (benchmarks to come)
I sped up the 4X2 assembly a good amount
I sped up the 4X1 assembly a tiny bit
I updated HUF_selectDecoder() table with new benchmark results

I updated the perf numbers in the PR, and here is the delta from the first version:

GCC Delta:

File	gcc-11 speedup
proba32.dat	+10%
proba128.dat	+4.5%
dickens	+4.1%
mozilla	+1.3%
mr	+0.47%
nci	+3.4%
ooffice	+3.6%
osdb	-0.1%
reymont	+3.1%
samba	+4.2%
sao	+0.5%
webster	+3.4%
xml	+2.1%
x-ray	+2.3%

I still have a bunch of cleanup to do.

terrelln · 2021-09-17T18:50:21Z

I've extracted the Makefile updates into PR #2783.

They're still present in this PR, but will disappear once the other PR is merged.

Cyan4973 · 2021-09-20T18:18:17Z

build/single_file_libs/zstd-in.c

@@ -43,6 +43,8 @@
 #define ZSTD_MULTITHREAD
 #endif
 #define ZSTD_TRACE 0
+/* TODO: Can't amalgamate ASM function */


Yeah, not "critical", but still an important follow-up.

terrelln · 2021-09-20

Yeah, I'm not sure what to do about that. We may be able to get the amalgamation process to rewrite the .S file as inline assembly.

I'm going to be opening an issue for follow ups, and I'll make sure to mention this.

Cyan4973 · 2021-09-20T21:30:36Z

I was able to verify the performance claims.
Decompression speed is indeed highly improved when there are a lot of literals,
with clang benefiting more because it starts from a slower baseline.

I believe my only concern about this PR is the additional build complexity. While it's correctly dealt with for the scope of this PR, I believe it introduces some questions :

Is the current framework suitable for extensions ? Meaning if additional assembly code is added to zstd code base, will it be easy to add, or will it require additional tweaking of the build system ?
Should the build framework for assembly code receive additional efforts in the near future (beyond this PR) ?

terrelln · 2021-09-20T21:47:03Z

> Is the current framework suitable for extensions ? Meaning if additional assembly code is added to zstd code base, will it be easy to add, or will it require additional tweaking of the build system ?

Yeah, assembly should now be correctly handled everywhere in the build system where assembly is supported (i.e. not Visual Studio). The Makefile currently only searches for .S files in lib/decompress, but that can easily be extended as needed.
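
As a rough illustration of gating assembly on toolchain support, a source file could choose between the assembly and C paths with a preprocessor check along these lines. The names `DEMO_DISABLE_ASM`, `DEMO_USE_ASM`, and `demo_decoder_name` are hypothetical and not taken from zstd, and the exact conditions zstd uses may differ.

```c
/* Hypothetical opt-out switch and feature macro (not zstd's real names).
 * The assembly path is selected only for GCC/Clang-compatible compilers
 * targeting x86-64, so other toolchains such as MSVC fall back to C. */
#if defined(__GNUC__) && defined(__x86_64__) && !defined(DEMO_DISABLE_ASM)
#  define DEMO_USE_ASM 1
#else
#  define DEMO_USE_ASM 0
#endif

/* Reports which implementation this build would use. */
const char *demo_decoder_name(void)
{
#if DEMO_USE_ASM
    return "asm";
#else
    return "c";
#endif
}
```

Keeping the decision in one macro means every caller and every build system sees the same answer, and a build can opt out by defining the disable switch.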

> Should the build framework for assembly code receive additional efforts in the near future (beyond this PR) ?

I've opened a follow-up issue #2789 that tracks this. Basically we should support assembly in Visual Studio eventually, and enable assembly for the amalgamated build.

Cyan4973 · 2021-09-22T22:42:51Z

Note: for some reason, the AppVeyor CI tests seem to take significantly longer to run after this PR (22-23 min => 31-33 min).

terrelln · 2021-09-22T22:45:37Z

> Note: for some reason, the AppVeyor CI tests seem to take significantly longer to run after this PR (22-23 min => 31-33 min).

That's super odd, I wouldn't expect that. Maybe we can put up a PR that backs out this commit, and see if AppVeyor gets faster? To rule out coincidence.

Cyan4973 · 2021-09-22T22:50:34Z

> That's super odd, I wouldn't expect that. Maybe we can put up a PR that backs out this commit, and see if AppVeyor gets faster? To rule out coincidence.

Good idea !

pps83 · 2024-04-02T00:28:58Z

@terrelln FYI, this asm code can be used with yasm on VS.

First of all, yasm needs some fixes (I made these PRs: yasm#271, yasm#272). Then, yasm has to be built with CPP_PROG=cl.exe (CPP_PROG is a Makefile/CMake variable that defaults to cpp).
This yasm binary can be installed using VSYASM; however, their yasm.xml needs to be patched to add support for the -rcpp option, which enables the C preprocessor.

All this can be done with my fork of VSYASM, it has all the changes required and uses my builds that have cl.exe enabled.

In my zstd fork I added a VS2022 solution; two more PRs are then required:

facebook-github-bot added the CLA Signed label Jul 7, 2021

terrelln force-pushed the huf-asm branch from 3c3a166 to ef1df58 Compare August 4, 2021 20:59

terrelln force-pushed the huf-asm branch from ef1df58 to 2a55c20 Compare September 2, 2021 00:45

felixhandte added the optimization label Sep 7, 2021

terrelln force-pushed the huf-asm branch 2 times, most recently from 574b552 to 160398d Compare September 15, 2021 21:18

terrelln force-pushed the huf-asm branch 2 times, most recently from 11b05fe to 2304a5d Compare September 17, 2021 18:44

terrelln force-pushed the huf-asm branch 2 times, most recently from a7ae571 to aef8fe4 Compare September 17, 2021 22:26

Cyan4973 reviewed Sep 20, 2021

terrelln force-pushed the huf-asm branch from aef8fe4 to 19536b3 Compare September 20, 2021 19:56

Cyan4973 approved these changes Sep 20, 2021

terrelln force-pushed the huf-asm branch 2 times, most recently from e7acd71 to 0ee7703 Compare September 20, 2021 21:37

Huffman ASM

a5f2c45

terrelln force-pushed the huf-asm branch from 0ee7703 to a5f2c45 Compare September 20, 2021 21:47

terrelln changed the title ~~[RFC] Assembly implementation of 4X1 & 4X2 Huffman~~ Assembly implementation of 4X1 & 4X2 Huffman Sep 20, 2021

terrelln merged commit 8385355 into facebook:dev Sep 21, 2021

SpringMT mentioned this pull request Dec 30, 2021

symbol lookup error after upgrade to 1.5.1 SpringMT/zstd-ruby#37

Closed

JunHe77 mentioned this pull request Jun 7, 2022

Add fast huf_dec with generic C and tuned aarch64 assembly #3155

Closed
