Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SSE/Neon path for MSVC x86 and ARM #2680

Closed
wants to merge 5 commits into from
Closed

Conversation

cwoffenden
Copy link
Contributor

@cwoffenden cwoffenden commented May 21, 2021

This is taking what #2653 started and extending it to x86 and MS ARM64 targets. To do this I fake the __SSE2__ or __ARM_NEON defines for MSVC (this was preferable to having the longer tests everywhere else) and change the signature for ZSTD_Vec256_cmpMask8 (more of later).

First some benchmarks! This is x86 without the SSE2 path, on a 3990X (with 127 idle cores!):

C:\Volumes\Data\Work\Native\Zstd\build\VS2010>bin\Win32_Release\zstd.exe -b5e12 silesia.tar
 5#silesia.tar       : 211971584 ->  63810797 (3.322),  43.6 MB/s , 413.7 MB/s
 6#silesia.tar       : 211971584 ->  62984414 (3.365),  42.3 MB/s , 425.3 MB/s
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  31.1 MB/s , 454.4 MB/s
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  25.8 MB/s , 469.3 MB/s
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  22.7 MB/s , 475.6 MB/s
10#silesia.tar       : 211971584 ->  59301912 (3.574),  19.7 MB/s , 475.6 MB/s
11#silesia.tar       : 211971584 ->  59159449 (3.583),  15.0 MB/s , 475.9 MB/s
12#silesia.tar       : 211971584 ->  58648764 (3.614),  11.0 MB/s , 485.2 MB/s

And this is with the SSE2 path enabled:

C:\Volumes\Data\Work\Native\Zstd\build\VS2010>bin\Win32_Release\zstd.exe -b5e12 silesia.tar
 5#silesia.tar       : 211971584 ->  63810797 (3.322),  52.6 MB/s , 424.5 MB/s
 6#silesia.tar       : 211971584 ->  62984414 (3.365),  50.7 MB/s , 436.3 MB/s
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  40.3 MB/s , 466.6 MB/s
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  33.5 MB/s , 482.3 MB/s
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  28.9 MB/s , 487.9 MB/s
10#silesia.tar       : 211971584 ->  59301912 (3.574),  27.0 MB/s , 486.4 MB/s
11#silesia.tar       : 211971584 ->  59159449 (3.583),  19.7 MB/s , 487.0 MB/s
12#silesia.tar       : 211971584 ->  58648764 (3.614),  17.0 MB/s , 495.9 MB/s

I took the best of five runs, and we see a 20-50% improvement. For this to work I needed to change ZSTD_Vec256_cmpMask8 to a pointer of the 256-bit type (since on 32-bit systems, depending on the version of MSVC, tested with 2010-2019, it errors with formal parameter with requested alignment of 16 won't be aligned). I worried this would affect performance by not making best use of the wider SSE registers, but after many runs comparing the x64 version with or without the change, the result was the pointer variant was always slightly faster (there was variance in the numbers but on a generally good run the pointer always bested the pass-by-value). I suspect this wouldn't be the case with a real 256-bit type.

The same run on 3990X as x64, for comparison:

C:\Volumes\Data\Work\Native\Zstd\build\VS2010>bin\x64_Release\zstd.exe -b5e12 silesia.tar
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 107.5 MB/s , 600.5 MB/s
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 101.7 MB/s , 616.4 MB/s
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  72.2 MB/s , 655.0 MB/s
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  56.8 MB/s , 674.8 MB/s
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  47.1 MB/s , 683.6 MB/s
10#silesia.tar       : 211971584 ->  59301912 (3.574),  45.2 MB/s , 681.1 MB/s
11#silesia.tar       : 211971584 ->  59159449 (3.583),  41.9 MB/s , 681.7 MB/s
12#silesia.tar       : 211971584 ->  58648764 (3.614),  32.8 MB/s , 691.6 MB/s

Since I had one on my desk I also threw this at a Surface Pro X with ARM64. Here's the before running the fallback path:

C:\Users\carl\OneDrive\Documents\Zstd>zstd-fallback.exe -b5e12 silesia.tar
 5#silesia.tar       : 211972608 ->  63811033 (3.322),  52.5 MB/s , 593.4 MB/s
 6#silesia.tar       : 211972608 ->  62984688 (3.365),  50.7 MB/s , 602.5 MB/s
 7#silesia.tar       : 211972608 ->  61489289 (3.447),  35.0 MB/s , 646.3 MB/s
 8#silesia.tar       : 211972608 ->  60918998 (3.480),  27.2 MB/s , 664.9 MB/s
 9#silesia.tar       : 211972608 ->  59934838 (3.537),  18.6 MB/s , 660.1 MB/s
10#silesia.tar       : 211972608 ->  59302036 (3.574),  14.4 MB/s , 621.5 MB/s
11#silesia.tar       : 211972608 ->  59159575 (3.583),  12.7 MB/s , 627.2 MB/s
12#silesia.tar       : 211972608 ->  58648894 (3.614), 10.18 MB/s , 621.9 MB/s

And here's after with the Neon path:

C:\Users\carl\OneDrive\Documents\Zstd>zstd-neon.exe -b5e12 silesia.tar
 5#silesia.tar       : 211972608 ->  63811033 (3.322),  60.6 MB/s , 595.2 MB/s
 6#silesia.tar       : 211972608 ->  62984688 (3.365),  58.2 MB/s , 602.8 MB/s
 7#silesia.tar       : 211972608 ->  61489289 (3.447),  41.9 MB/s , 650.7 MB/s
 8#silesia.tar       : 211972608 ->  60918998 (3.480),  33.9 MB/s , 669.2 MB/s
 9#silesia.tar       : 211972608 ->  59934838 (3.537),  22.4 MB/s , 656.5 MB/s
10#silesia.tar       : 211972608 ->  59302036 (3.574),  15.8 MB/s , 632.7 MB/s
11#silesia.tar       : 211972608 ->  59159575 (3.583),  13.4 MB/s , 619.5 MB/s
12#silesia.tar       : 211972608 ->  58648894 (3.614),  11.2 MB/s , 626.5 MB/s

Around a 10% improvement.

I also ran the same benchmark on other x86 and x64 hardware with the same result. I haven't as of yet run this on Apple ARM hardware with Clang for comparison, but I will, and then update this PR.

The fake defines I'm not 100% happy with, but it's no different (IMO) to faking __has_builtin() and others. But suggestions welcome.

@cwoffenden
Copy link
Contributor Author

Testing this PR (zstd-sse-pr) against dev (zstd-dev) building with Clang 12.0.0 on an Intel MacBook Pro I see these results, alternating runs between the two builds:

carl@dunkel Shared % /Volumes/Data/Work/Native/zstd-dev -b5e12 silesia.tar   
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 144.0 MB/s ,1051.6 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 136.0 MB/s ,1042.1 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  94.1 MB/s ,1110.6 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  77.7 MB/s ,1159.3 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  59.4 MB/s ,1181.6 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  51.8 MB/s ,1121.5 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  47.1 MB/s ,1152.9 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  36.6 MB/s ,1195.2 MB/s 
carl@dunkel Shared % /Volumes/Data/Work/Native/zstd-sse-pr -b5e12 silesia.tar
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 144.6 MB/s ,1028.0 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 137.6 MB/s ,1055.6 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  94.2 MB/s ,1141.4 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  76.5 MB/s ,1151.0 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  61.3 MB/s ,1200.6 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  51.7 MB/s ,1105.2 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  47.1 MB/s ,1176.3 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  36.6 MB/s ,1193.4 MB/s 
carl@dunkel Shared % /Volumes/Data/Work/Native/zstd-dev -b5e12 silesia.tar   
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 144.2 MB/s ,1019.4 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 137.7 MB/s ,1092.1 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  93.3 MB/s ,1129.9 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  77.7 MB/s ,1159.1 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  59.7 MB/s ,1167.2 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  51.6 MB/s ,1168.8 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  47.3 MB/s ,1101.2 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  36.7 MB/s ,1116.0 MB/s 
carl@dunkel Shared % /Volumes/Data/Work/Native/zstd-sse-pr -b5e12 silesia.tar
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 146.0 MB/s ,1028.8 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 138.0 MB/s ,1083.5 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  91.8 MB/s ,1131.5 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  77.9 MB/s ,1192.2 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  61.2 MB/s ,1166.0 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  51.4 MB/s ,1119.5 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  46.8 MB/s ,1131.4 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  36.6 MB/s ,1163.5 MB/s 

TL;DR: the change to ZSTD_Vec256_cmpMask8() doesn't appear to have any adverse effects.

@cwoffenden
Copy link
Contributor Author

And here's the same test on an ARM Mac. The runs on this M1 Mini had little deviation:

carl@m1 Native % lipo -i zstd-dev 
Non-fat file: zstd-dev is architecture: arm64
carl@m1 Native % ./zstd-dev -b5e12 silesia.tar 
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 184.6 MB/s ,1287.7 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 177.1 MB/s ,1324.3 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447), 124.6 MB/s ,1409.6 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  98.3 MB/s ,1446.9 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  83.0 MB/s ,1481.3 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  81.3 MB/s ,1493.2 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  75.3 MB/s ,1497.0 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  58.3 MB/s ,1528.5 MB/s
carl@m1 Native % lipo -i zstd-sse-pr 
Non-fat file: zstd-sse-pr is architecture: arm64
carl@m1 Native % ./zstd-sse-pr -b5e12 silesia.tar 
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 184.6 MB/s ,1288.5 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 177.1 MB/s ,1324.6 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447), 124.6 MB/s ,1409.5 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  98.3 MB/s ,1447.5 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  83.0 MB/s ,1481.8 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  81.3 MB/s ,1494.5 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  75.3 MB/s ,1497.7 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  58.2 MB/s ,1529.0 MB/s 

The only really interesting takeaway is how much the M1 trounces the 3990X in this test (though the Threadripper here is under-clocked at the moment).

@aqrit
Copy link
Contributor

aqrit commented May 24, 2021

the fake defines I'm not 100% happy with, [...] suggestions welcome

Maybe define our own ?
ZSTD_ARCH_X86_SSE2
ZSTD_ARCH_ARM_NEON

or something like that in compiler.h?

#if !defined(ZSTD_NO_INTRINSICS)
#  if defined(_M_AMD64) || (defined (_M_IX86) && defined(_M_IX86_FP) && (_M_IX86_FP >= 2))
#    define ZSTD_ARCH_X86_SSE2
#  endif
#  if defined(__SSE2__)
#    define ZSTD_ARCH_X86_SSE2
#  endif
#  if defined(__ARM_NEON)
#    define ZSTD_ARCH_ARM_NEON
#  endif
#
#
#  if defined(ZSTD_ARCH_X86_SSE2)
#    include <emmintrin.h>
#  elif defined(ZSTD_ARCH_ARM_NEON)
#    include <arm_neon.h>
#  endif
#endif

example needs MSVC NEON support ... does MSVC need arm64_neon.h ?

@cwoffenden
Copy link
Contributor Author

does MSVC need arm64_neon.h ?

@aqrit On MSVC 64-bit ARM arm_neon.h includes arm64_neon.h.

@cwoffenden
Copy link
Contributor Author

Closing since #2681 makes this redundant.

@cwoffenden cwoffenden closed this Jun 4, 2021
@cwoffenden cwoffenden deleted the sse-x86 branch June 7, 2021 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants