Assembly implementation of 4X1 & 4X2 Huffman #2722

Merged 1 commit on Sep 21, 2021

@terrelln commented Jul 7, 2021

Assembly in huf_decompress_x86_bmi2.S.

  • proba32.dat is 1MB of random symbols in [0, 32). Zstd compresses as all literals and uses 4X2 to decompress.
  • proba128.dat is 1MB of random symbols in [0, 128). Zstd compresses as all literals and uses 4X1 to decompress.
  • All files are compressed at level 1 to maximize the number of literals.
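
Synthetic inputs like the two described above could be produced along these lines. This is a minimal sketch, not the generator used for the PR: the function name `fill_random_symbols`, the xorshift32 PRNG, and the seed handling are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the PR): fills buf with pseudo-random
 * symbols in [0, alphabet_size). A xorshift32 PRNG keeps the output
 * reproducible for a given seed. */
void fill_random_symbols(uint8_t *buf, size_t size,
                         unsigned alphabet_size, uint32_t seed)
{
    assert(alphabet_size >= 1 && alphabet_size <= 256);
    uint32_t state = seed ? seed : 1u;  /* xorshift state must be non-zero */
    for (size_t i = 0; i < size; ++i) {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        /* Take the upper bits and reduce them to the alphabet range;
         * the result is approximately uniform. */
        buf[i] = (uint8_t)((state >> 16) % alphabet_size);
    }
}
```

Calling it with `size = 1 << 20` and `alphabet_size = 32` or `128` would give buffers shaped like proba32.dat and proba128.dat, although the byte values would differ from the files actually benchmarked.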

The assembly is wrapped by a C function that handles the initialization and ending conditions. Once the streams get close to the end, the C code takes over and uses HUF_decompress1X*() to finish, similarly to the normal 4X1 and 4X2 loops.
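
The split between a fast kernel and a careful finish can be illustrated with a toy example. The sketch below is not the zstd code: `decode_4_streams`, the byte-wise table lookup standing in for Huffman decoding, and the `UNROLL` factor are assumptions chosen only to show the shape of the pattern, where a fast loop runs while every stream has enough input left and a simpler per-stream loop finishes the rest.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_STREAMS 4
#define UNROLL 4  /* symbols handled per stream per fast-loop iteration */

/* Hypothetical illustration: "decodes" four streams by table lookup. */
void decode_4_streams(uint8_t *const dst[NUM_STREAMS],
                      const uint8_t *const src[NUM_STREAMS],
                      const size_t len[NUM_STREAMS],
                      const uint8_t table[256])
{
    /* Initialization: the fast loop may only run as far as the
     * shortest stream allows. */
    size_t min_len = len[0];
    for (int s = 1; s < NUM_STREAMS; ++s)
        if (len[s] < min_len) min_len = len[s];

    /* Fast loop: no per-symbol bounds checks. In the PR this role is
     * played by the assembly kernel. */
    size_t pos = 0;
    while (min_len - pos >= UNROLL) {
        for (int s = 0; s < NUM_STREAMS; ++s)
            for (int u = 0; u < UNROLL; ++u)
                dst[s][pos + u] = table[src[s][pos + u]];
        pos += UNROLL;
    }

    /* Ending condition: finish each stream on its own with full
     * bounds checks, analogous to handing off to a 1X decoder. */
    for (int s = 0; s < NUM_STREAMS; ++s)
        for (size_t i = pos; i < len[s]; ++i)
            dst[s][i] = table[src[s][i]];
}
```

Keeping the bounds checks out of the hot loop is what makes a hand-written kernel practical: the kernel only has to be correct while all streams are far from their ends.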

| File | gcc-11 | gcc-11 asm | gcc-11 speedup | clang-12 | clang-12 asm | clang-12 speedup |
|---|---|---|---|---|---|---|
| proba32.dat | 1746 | 2357 | +35% | 1449 | 2354 | +62% |
| proba128.dat | 944 | 1318 | +39% | 906 | 1320 | +45% |
| silesia.tar | 1159 | 1244 | +7.3% | 1131 | 1246 | +10.1% |
| dickens | 1064 | 1160 | +9.0% | 1036 | 1169 | +12.8% |
| mozilla | 1000 | 1064 | +6.4% | 988 | 1071 | +8.4% |
| mr | 1264 | 1344 | +6.3% | 1203 | 1348 | +12.0% |
| nci | 1786 | 1855 | +3.8% | 1771 | 1819 | +2.7% |
| ooffice | 876 | 989 | +12.9% | 814 | 995 | +22.2% |
| osdb | 1209 | 1302 | +7.6% | 1216 | 1300 | +6.9% |
| reymont | 1073 | 1124 | +4.7% | 1069 | 1134 | +6.1% |
| samba | 1439 | 1529 | +6.2% | 1418 | 1525 | +7.5% |
| sao | 862 | 1017 | +18.0% | 868 | 1017 | +17.1% |
| webster | 1113 | 1183 | +6.3% | 1088 | 1189 | +9.3% |
| xml | 1796 | 1851 | +3.1% | 1753 | 1826 | +4.2% |
| x-ray | 1042 | 1280 | +22.8% | 877 | 1276 | +45.5% |
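
The speedup columns are consistent with the ratio of the asm figure to the baseline figure, minus one. A small sketch of that arithmetic follows; the helper name `speedup_percent` is an assumption, and the comment does not state the unit of the raw figures, so the function treats them as plain throughput numbers.

```c
#include <assert.h>

/* Hypothetical helper: relative speedup in percent,
 * computed as (improved / baseline - 1) * 100. */
double speedup_percent(double baseline, double improved)
{
    assert(baseline > 0.0);
    return (improved / baseline - 1.0) * 100.0;
}
```

For the proba32.dat row, `speedup_percent(1746, 2357)` is roughly 35 and `speedup_percent(1449, 2354)` is roughly 62, matching the reported +35% and +62%.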

Cyan4973 commented Jul 7, 2021

Great investigation, @terrelln!
This is a promising starting point!

terrelln commented Aug 4, 2021

In the latest work:

  • I've sped up HUF_readDTableX2() significantly (benchmarks to come)
  • I sped up the 4X2 assembly a good amount
  • I sped up the 4X1 assembly a tiny bit
  • I updated the HUF_selectDecoder() table with new benchmark results
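
The comment does not show how the selection table is used. As a rough illustration of the general idea of choosing between two decoders from benchmark-derived costs, here is a minimal sketch. The type `DecoderCost`, the function `select_decoder`, and the cost model (a fixed table-build cost plus a cost per 256 output bytes) are assumptions for illustration and are not taken from the PR.

```c
#include <stddef.h>

/* Hypothetical cost model for one decoder: a fixed cost to build its
 * decoding table plus a cost per 256 bytes of decoded output. */
typedef struct {
    unsigned table_cost;
    unsigned cost_per_256;
} DecoderCost;

/* Returns 0 to pick the first decoder and 1 to pick the second,
 * whichever has the lower estimated total cost for dst_size bytes.
 * Ties go to the first decoder. */
int select_decoder(DecoderCost first, DecoderCost second, size_t dst_size)
{
    size_t const units = dst_size >> 8;  /* number of 256-byte units */
    size_t const t_first =
        first.table_cost + (size_t)first.cost_per_256 * units;
    size_t const t_second =
        second.table_cost + (size_t)second.cost_per_256 * units;
    return t_second < t_first;
}
```

With made-up costs such as `{300, 100}` for a decoder with a cheap table and `{1200, 60}` for one with an expensive table but faster decoding, a 1 KB output selects the first and a 1 MB output selects the second. Faster kernels change the measured per-unit costs, which would be a reason to refresh such a table after benchmarking.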

I updated the perf numbers in the PR, and here is the delta from the first version:

GCC Delta:

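| File | gcc-11 speedup |
|---|---|
| proba32.dat | +10% |
| proba128.dat | +4.5% |
| dickens | +4.1% |
| mozilla | +1.3% |
| mr | +0.47% |
| nci | +3.4% |
| ooffice | +3.6% |
| osdb | -0.1% |
| reymont | +3.1% |
| samba | +4.2% |
| sao | +0.5% |
| webster | +3.4% |
| xml | +2.1% |
| x-ray | +2.3% |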

I still have a bunch of cleanup to do.

@terrelln
Contributor Author

I've extracted out the Makefile updates into PR #2783.

They're still present in this PR, but will disappear once the other PR is merged.

@terrelln force-pushed the huf-asm branch 2 times, most recently from a7ae571 to aef8fe4, on September 17, 2021 22:26
@@ -43,6 +43,8 @@
#define ZSTD_MULTITHREAD
#endif
#define ZSTD_TRACE 0
/* TODO: Can't amalgamate ASM function */
Contributor

Yeah, not "critical", but still an important follow-up

Contributor Author

Yeah, I'm not sure what to do about that. We may be able to get the amalgamation process to re-write the .S file as inline assembly.
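
For readers unfamiliar with the distinction, inline assembly lives inside a C function, so it survives amalgamation into a single C file, whereas a standalone .S file has to be assembled separately. The sketch below shows only the GCC/Clang extended-asm mechanism on a trivial operation. The function `add_u64` is hypothetical and unrelated to the Huffman kernels, and nothing here reflects how such a rewrite would actually be done for zstd.

```c
#include <stdint.h>

/* Hypothetical example of inline assembly embedded in C. */
uint64_t add_u64(uint64_t a, uint64_t b)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* Extended asm, AT&T syntax: addq src, dst.
     * "+r"(a): a is a read-write register operand (%0).
     * "r"(b):  b is a read-only register operand (%1).
     * "cc":    the instruction modifies the flags register. */
    __asm__("addq %1, %0" : "+r"(a) : "r"(b) : "cc");
    return a;
#else
    /* Portable fallback for other compilers and architectures. */
    return a + b;
#endif
}
```

The preprocessor guard matters because extended asm is a compiler extension and the instruction is specific to x86-64.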

I'm going to be opening an issue for follow ups, and I'll make sure to mention this.

@Cyan4973
Contributor

I was able to verify the performance claims.
Decompression speed is indeed greatly improved when there are a lot of literals,
with clang benefiting more because it starts from a slower baseline.

My only concern about this PR is the additional build complexity. While it is handled correctly within the scope of this PR, it raises some questions:

  • Is the current framework suitable for extensions? That is, if additional assembly code is added to the zstd code base, will it be easy to add, or will it require additional tweaking of the build system?
  • Should the build framework for assembly code receive additional effort in the near future (beyond this PR)?

@terrelln force-pushed the huf-asm branch 2 times, most recently from e7acd71 to 0ee7703, on September 20, 2021 21:37
@terrelln
Contributor Author

> Is the current framework suitable for extensions? That is, if additional assembly code is added to the zstd code base, will it be easy to add, or will it require additional tweaking of the build system?

Yeah, assembly should now be handled correctly everywhere in the build system where assembly is supported (so not in Visual Studio, for example). The Makefile currently only searches for .S files in lib/decompress, but that can easily be extended as needed.

> Should the build framework for assembly code receive additional effort in the near future (beyond this PR)?

I've opened a follow-up issue, #2789, that tracks this. Basically, we should eventually support assembly in Visual Studio and enable assembly for the amalgamated build.

@terrelln changed the title from "[RFC] Assembly implementation of 4X1 & 4X2 Huffman" to "Assembly implementation of 4X1 & 4X2 Huffman" on Sep 20, 2021
@terrelln merged commit 8385355 into facebook:dev on Sep 21, 2021
@Cyan4973
Contributor

Note: for some reason, the AppVeyor CI tests seem to take significantly longer to run after this PR (22-23 min => 31-33 min).

@terrelln
Contributor Author

> Note: for some reason, the AppVeyor CI tests seem to take significantly longer to run after this PR (22-23 min => 31-33 min).

That's super odd, I wouldn't expect that. Maybe we can put up a PR that backs out this commit and see if AppVeyor gets faster, to rule out coincidence?

@Cyan4973
Contributor

> That's super odd, I wouldn't expect that. Maybe we can put up a PR that backs out this commit and see if AppVeyor gets faster, to rule out coincidence?

Good idea!

pps83 commented Apr 2, 2024

@terrelln FYI, this asm code can be used with yasm on VS.

First of all, yasm needs some fixes (I made these PRs: yasm#271, yasm#272). Then, yasm has to be built with CPP_PROG=cl.exe (CPP_PROG is a makefile/cmake variable that defaults to cpp).
This yasm binary can be installed using VSYASM; however, its yasm.xml needs to be patched to add support for the -rcpp option, which enables the C preprocessor.

All of this can be done with my fork of VSYASM, which has all the required changes and uses my builds that have cl.exe enabled.

In my zstd fork I added a VS2022 solution; two more PRs are then required:
