SSE/Neon path for MSVC x86 and ARM #2680

cwoffenden · 2021-05-21T16:35:57Z

This is taking what #2653 started and extending it to x86 and MS ARM64 targets. To do this I fake the __SSE2__ or __ARM_NEON defines for MSVC (this was preferable to having the longer tests everywhere else) and change the signature for ZSTD_Vec256_cmpMask8 (more of later).

First some benchmarks! This is x86 without the SSE2 path, on a 3990X (with 127 idle cores!):

C:\Volumes\Data\Work\Native\Zstd\build\VS2010>bin\Win32_Release\zstd.exe -b5e12 silesia.tar
 5#silesia.tar       : 211971584 ->  63810797 (3.322),  43.6 MB/s , 413.7 MB/s
 6#silesia.tar       : 211971584 ->  62984414 (3.365),  42.3 MB/s , 425.3 MB/s
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  31.1 MB/s , 454.4 MB/s
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  25.8 MB/s , 469.3 MB/s
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  22.7 MB/s , 475.6 MB/s
10#silesia.tar       : 211971584 ->  59301912 (3.574),  19.7 MB/s , 475.6 MB/s
11#silesia.tar       : 211971584 ->  59159449 (3.583),  15.0 MB/s , 475.9 MB/s
12#silesia.tar       : 211971584 ->  58648764 (3.614),  11.0 MB/s , 485.2 MB/s

And this is with the SSE2 path enabled:

C:\Volumes\Data\Work\Native\Zstd\build\VS2010>bin\Win32_Release\zstd.exe -b5e12 silesia.tar
 5#silesia.tar       : 211971584 ->  63810797 (3.322),  52.6 MB/s , 424.5 MB/s
 6#silesia.tar       : 211971584 ->  62984414 (3.365),  50.7 MB/s , 436.3 MB/s
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  40.3 MB/s , 466.6 MB/s
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  33.5 MB/s , 482.3 MB/s
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  28.9 MB/s , 487.9 MB/s
10#silesia.tar       : 211971584 ->  59301912 (3.574),  27.0 MB/s , 486.4 MB/s
11#silesia.tar       : 211971584 ->  59159449 (3.583),  19.7 MB/s , 487.0 MB/s
12#silesia.tar       : 211971584 ->  58648764 (3.614),  17.0 MB/s , 495.9 MB/s

I took the best of five runs, and we see a 20-50% improvement. For this to work I needed to change ZSTD_Vec256_cmpMask8 to a pointer of the 256-bit type (since on 32-bit systems, depending on the version of MSVC, tested with 2010-2019, it errors with formal parameter with requested alignment of 16 won't be aligned). I worried this would affect performance by not making best use of the wider SSE registers, but after many runs comparing the x64 version with or without the change, the result was the pointer variant was always slightly faster (there was variance in the numbers but on a generally good run the pointer always bested the pass-by-value). I suspect this wouldn't be the case with a real 256-bit type.

The same run on 3990X as x64, for comparison:

C:\Volumes\Data\Work\Native\Zstd\build\VS2010>bin\x64_Release\zstd.exe -b5e12 silesia.tar
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 107.5 MB/s , 600.5 MB/s
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 101.7 MB/s , 616.4 MB/s
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  72.2 MB/s , 655.0 MB/s
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  56.8 MB/s , 674.8 MB/s
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  47.1 MB/s , 683.6 MB/s
10#silesia.tar       : 211971584 ->  59301912 (3.574),  45.2 MB/s , 681.1 MB/s
11#silesia.tar       : 211971584 ->  59159449 (3.583),  41.9 MB/s , 681.7 MB/s
12#silesia.tar       : 211971584 ->  58648764 (3.614),  32.8 MB/s , 691.6 MB/s

Since I had one on my desk I also threw this at a Surface Pro X with ARM64. Here's the before running the fallback path:

C:\Users\carl\OneDrive\Documents\Zstd>zstd-fallback.exe -b5e12 silesia.tar
 5#silesia.tar       : 211972608 ->  63811033 (3.322),  52.5 MB/s , 593.4 MB/s
 6#silesia.tar       : 211972608 ->  62984688 (3.365),  50.7 MB/s , 602.5 MB/s
 7#silesia.tar       : 211972608 ->  61489289 (3.447),  35.0 MB/s , 646.3 MB/s
 8#silesia.tar       : 211972608 ->  60918998 (3.480),  27.2 MB/s , 664.9 MB/s
 9#silesia.tar       : 211972608 ->  59934838 (3.537),  18.6 MB/s , 660.1 MB/s
10#silesia.tar       : 211972608 ->  59302036 (3.574),  14.4 MB/s , 621.5 MB/s
11#silesia.tar       : 211972608 ->  59159575 (3.583),  12.7 MB/s , 627.2 MB/s
12#silesia.tar       : 211972608 ->  58648894 (3.614), 10.18 MB/s , 621.9 MB/s

And here's after with the Neon path:

C:\Users\carl\OneDrive\Documents\Zstd>zstd-neon.exe -b5e12 silesia.tar
 5#silesia.tar       : 211972608 ->  63811033 (3.322),  60.6 MB/s , 595.2 MB/s
 6#silesia.tar       : 211972608 ->  62984688 (3.365),  58.2 MB/s , 602.8 MB/s
 7#silesia.tar       : 211972608 ->  61489289 (3.447),  41.9 MB/s , 650.7 MB/s
 8#silesia.tar       : 211972608 ->  60918998 (3.480),  33.9 MB/s , 669.2 MB/s
 9#silesia.tar       : 211972608 ->  59934838 (3.537),  22.4 MB/s , 656.5 MB/s
10#silesia.tar       : 211972608 ->  59302036 (3.574),  15.8 MB/s , 632.7 MB/s
11#silesia.tar       : 211972608 ->  59159575 (3.583),  13.4 MB/s , 619.5 MB/s
12#silesia.tar       : 211972608 ->  58648894 (3.614),  11.2 MB/s , 626.5 MB/s

Around a 10% improvement.

I also ran the same benchmark on other x86 and x64 hardware with the same result. I haven't as of yet run this on Apple ARM hardware with Clang for comparison, but I will, and then update this PR.

The fake defines I'm not 100% happy with, but it's no different (IMO) to faking __has_builtin() and others. But suggestions welcome.

cwoffenden · 2021-05-23T16:21:52Z

Testing this PR (zstd-sse-pr) against dev (zstd-dev) building with Clang 12.0.0 on an Intel MacBook Pro I see these results, alternating runs between the two builds:

carl@dunkel Shared % /Volumes/Data/Work/Native/zstd-dev -b5e12 silesia.tar   
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 144.0 MB/s ,1051.6 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 136.0 MB/s ,1042.1 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  94.1 MB/s ,1110.6 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  77.7 MB/s ,1159.3 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  59.4 MB/s ,1181.6 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  51.8 MB/s ,1121.5 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  47.1 MB/s ,1152.9 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  36.6 MB/s ,1195.2 MB/s 
carl@dunkel Shared % /Volumes/Data/Work/Native/zstd-sse-pr -b5e12 silesia.tar
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 144.6 MB/s ,1028.0 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 137.6 MB/s ,1055.6 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  94.2 MB/s ,1141.4 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  76.5 MB/s ,1151.0 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  61.3 MB/s ,1200.6 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  51.7 MB/s ,1105.2 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  47.1 MB/s ,1176.3 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  36.6 MB/s ,1193.4 MB/s 
carl@dunkel Shared % /Volumes/Data/Work/Native/zstd-dev -b5e12 silesia.tar   
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 144.2 MB/s ,1019.4 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 137.7 MB/s ,1092.1 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  93.3 MB/s ,1129.9 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  77.7 MB/s ,1159.1 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  59.7 MB/s ,1167.2 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  51.6 MB/s ,1168.8 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  47.3 MB/s ,1101.2 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  36.7 MB/s ,1116.0 MB/s 
carl@dunkel Shared % /Volumes/Data/Work/Native/zstd-sse-pr -b5e12 silesia.tar
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 146.0 MB/s ,1028.8 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 138.0 MB/s ,1083.5 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  91.8 MB/s ,1131.5 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  77.9 MB/s ,1192.2 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  61.2 MB/s ,1166.0 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  51.4 MB/s ,1119.5 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  46.8 MB/s ,1131.4 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  36.6 MB/s ,1163.5 MB/s

TL;DR: the change to ZSTD_Vec256_cmpMask8() doesn't appear to have any adverse effects.

cwoffenden · 2021-05-24T06:47:03Z

And here's the same test on an ARM Mac. The runs on this M1 Mini had little deviation:

carl@m1 Native % lipo -i zstd-dev 
Non-fat file: zstd-dev is architecture: arm64
carl@m1 Native % ./zstd-dev -b5e12 silesia.tar 
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 184.6 MB/s ,1287.7 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 177.1 MB/s ,1324.3 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447), 124.6 MB/s ,1409.6 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  98.3 MB/s ,1446.9 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  83.0 MB/s ,1481.3 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  81.3 MB/s ,1493.2 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  75.3 MB/s ,1497.0 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  58.3 MB/s ,1528.5 MB/s
carl@m1 Native % lipo -i zstd-sse-pr 
Non-fat file: zstd-sse-pr is architecture: arm64
carl@m1 Native % ./zstd-sse-pr -b5e12 silesia.tar 
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 184.6 MB/s ,1288.5 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 177.1 MB/s ,1324.6 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447), 124.6 MB/s ,1409.5 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  98.3 MB/s ,1447.5 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  83.0 MB/s ,1481.8 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  81.3 MB/s ,1494.5 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  75.3 MB/s ,1497.7 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  58.2 MB/s ,1529.0 MB/s

The only really interesting takeaway is how much the M1 trounces the 3990X in this test (though the Threadripper here is under-clocked at the moment).

aqrit · 2021-05-24T21:13:43Z

the fake defines I'm not 100% happy with, [...] suggestions welcome

Maybe define our own ?
ZSTD_ARCH_X86_SSE2
ZSTD_ARCH_ARM_NEON

or something like that in compiler.h?

#if !defined(ZSTD_NO_INTRINSICS)
#  if defined(_M_AMD64) || (defined (_M_IX86) && defined(_M_IX86_FP) && (_M_IX86_FP >= 2))
#    define ZSTD_ARCH_X86_SSE2
#  endif
#  if defined(__SSE2__)
#    define ZSTD_ARCH_X86_SSE2
#  endif
#  if defined(__ARM_NEON)
#    define ZSTD_ARCH_ARM_NEON
#  endif
#
#
#  if defined(ZSTD_ARCH_X86_SSE2)
#    include <emmintrin.h>
#  elif defined(ZSTD_ARCH_ARM_NEON)
#    include <arm_neon.h>
#  endif
#endif

example needs MSVC NEON support ... does MSVC need arm64_neon.h ?

cwoffenden · 2021-05-25T12:52:38Z

does MSVC need arm64_neon.h ?

@aqrit On MSVC 64-bit ARM arm_neon.h includes arm64_neon.h.

cwoffenden · 2021-06-04T10:58:35Z

Closing since #2681 makes this redundant.

cwoffenden added 3 commits May 21, 2021 14:04

Enable SSE for x86 (as well as x64)

f1f0790

Expanded to cover MSVC ARM64

c5a4998

Minor docs

42003ff

facebook-github-bot added the CLA Signed label May 21, 2021

cwoffenden added 2 commits May 28, 2021 10:20

Merge remote-tracking branch 'upstream/dev' into sse-x86

f78e32f

ZSTD prefixed SSE and Neon macros

cfc8a61

cwoffenden closed this Jun 4, 2021

cwoffenden deleted the sse-x86 branch June 7, 2021 21:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SSE/Neon path for MSVC x86 and ARM #2680

SSE/Neon path for MSVC x86 and ARM #2680

cwoffenden commented May 21, 2021 •

edited

Loading

cwoffenden commented May 23, 2021

cwoffenden commented May 24, 2021

aqrit commented May 24, 2021

cwoffenden commented May 25, 2021

cwoffenden commented Jun 4, 2021

SSE/Neon path for MSVC x86 and ARM #2680

SSE/Neon path for MSVC x86 and ARM #2680

Conversation

cwoffenden commented May 21, 2021 • edited Loading

cwoffenden commented May 23, 2021

cwoffenden commented May 24, 2021

aqrit commented May 24, 2021

cwoffenden commented May 25, 2021

cwoffenden commented Jun 4, 2021

cwoffenden commented May 21, 2021 •

edited

Loading