
Flatten ZSTD_row_getMatchMask #2681

Merged (12 commits merged into facebook:dev on Jun 9, 2021)
Conversation

@aqrit (Contributor) commented on May 23, 2021

  • Remove the SIMD abstraction layer.
  • Add big endian support.
  • Align hashTags within tagRow to a 16-byte boundary.
  • Switch SSE2 to use aligned reads.
  • Optimize the scalar path using SWAR (see the sketch after this list).
  • Optimize the NEON path for n == 32.
  • Work around a minor clang issue for NEON (https://bugs.llvm.org/show_bug.cgi?id=49577).
  • Improve endian detection.
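For reference, a minimal sketch of the SWAR idea behind the scalar path (illustrative only; the name and exact formulation are not taken from the patch): broadcast the tag byte across a 64-bit word, XOR it against 8 packed tags so that matching bytes become zero, then turn each zero byte into a 0x80 marker without any cross-byte carries.

#include <stdint.h>
#include <string.h>

uint64_t SWAR_i8x8_MatchMask(const unsigned char* tags, unsigned char tag) {
    const uint64_t k7f = UINT64_C(0x7F7F7F7F7F7F7F7F);
    uint64_t chunk, x, t;
    memcpy(&chunk, tags, sizeof(chunk));               /* load 8 packed tag bytes */
    x = chunk ^ (UINT64_C(0x0101010101010101) * tag);  /* matching bytes become 0x00 */
    t = (x & k7f) + k7f;                               /* high bit stays clear only if the low 7 bits were 0 */
    return ~(t | x | k7f);                             /* 0x80 in every matching byte, 0x00 elsewhere */
}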

Benchmark:

tl;dr: compression speed of the scalar path increases by about 22%.
The scalar path is now only about 4% slower than the SSE2 path.

patch scalar:
 5#silesia.tar       : 211957760 ->  63789770 (3.323), 145.7 MB/s ,1152.4 MB/s 
 6#silesia.tar       : 211957760 ->  62963574 (3.366), 139.3 MB/s ,1180.3 MB/s 
 7#silesia.tar       : 211957760 ->  61467735 (3.448),  95.3 MB/s ,1258.3 MB/s 
 8#silesia.tar       : 211957760 ->  60900152 (3.480),  76.3 MB/s ,1289.6 MB/s 
 9#silesia.tar       : 211957760 ->  59914334 (3.538),  61.7 MB/s ,1303.8 MB/s 
10#silesia.tar       : 211957760 ->  59282317 (3.575),  55.1 MB/s ,1297.1 MB/s 
11#silesia.tar       : 211957760 ->  59140003 (3.584),  51.0 MB/s ,1296.1 MB/s 
12#silesia.tar       : 211957760 ->  58629417 (3.615),  38.6 MB/s ,1315.1 MB/s 
dev scalar:
 5#silesia.tar       : 211957760 ->  63789770 (3.323), 118.6 MB/s ,1154.1 MB/s 
 6#silesia.tar       : 211957760 ->  62963574 (3.366), 114.7 MB/s ,1179.7 MB/s 
 7#silesia.tar       : 211957760 ->  61467735 (3.448),  76.8 MB/s ,1257.8 MB/s 
 8#silesia.tar       : 211957760 ->  60900152 (3.480),  61.1 MB/s ,1289.0 MB/s 
 9#silesia.tar       : 211957760 ->  59914334 (3.538),  51.6 MB/s ,1303.2 MB/s 
10#silesia.tar       : 211957760 ->  59282317 (3.575),  47.3 MB/s ,1296.3 MB/s 
11#silesia.tar       : 211957760 ->  59140003 (3.584),  43.6 MB/s ,1294.8 MB/s 
12#silesia.tar       : 211957760 ->  58629417 (3.615),  31.7 MB/s ,1315.6 MB/s 

patch SSE2:
 5#silesia.tar       : 211957760 ->  63789770 (3.323), 151.3 MB/s ,1148.2 MB/s 
 6#silesia.tar       : 211957760 ->  62963574 (3.366), 143.4 MB/s ,1169.3 MB/s 
 7#silesia.tar       : 211957760 ->  61467735 (3.448), 100.8 MB/s ,1247.6 MB/s 
 8#silesia.tar       : 211957760 ->  60900152 (3.480),  81.5 MB/s ,1279.4 MB/s 
 9#silesia.tar       : 211957760 ->  59914334 (3.538),  65.5 MB/s ,1292.1 MB/s 
10#silesia.tar       : 211957760 ->  59282317 (3.575),  57.2 MB/s ,1284.2 MB/s 
11#silesia.tar       : 211957760 ->  59140003 (3.584),  52.3 MB/s ,1282.9 MB/s 
12#silesia.tar       : 211957760 ->  58629417 (3.615),  39.7 MB/s ,1301.6 MB/s
dev SSE2:
 5#silesia.tar       : 211957760 ->  63789770 (3.323), 151.1 MB/s ,1148.5 MB/s 
 6#silesia.tar       : 211957760 ->  62963574 (3.366), 143.8 MB/s ,1171.8 MB/s 
 7#silesia.tar       : 211957760 ->  61467735 (3.448),  98.2 MB/s ,1250.2 MB/s 
 8#silesia.tar       : 211957760 ->  60900152 (3.480),  79.3 MB/s ,1281.9 MB/s 
 9#silesia.tar       : 211957760 ->  59914334 (3.538),  63.1 MB/s ,1294.3 MB/s 
10#silesia.tar       : 211957760 ->  59282317 (3.575),  56.8 MB/s ,1286.5 MB/s 
11#silesia.tar       : 211957760 ->  59140003 (3.584),  51.9 MB/s ,1285.7 MB/s 
12#silesia.tar       : 211957760 ->  58629417 (3.615),  40.2 MB/s ,1304.4 MB/s 

Benchmarked on my desktop i7-8700K using gcc 9.3.0.

@aqrit (Contributor, Author) commented on May 24, 2021

AFAICT, the test failures are all:

error: cast from 'const BYTE *' (aka 'const unsigned char *') to 'const __m128i *' increases required alignment from 1 to 16 [-Werror,-Wcast-align] const __m128i chunk = _mm_load_si128((const __m128i*)&src[0]);

which leaves me mystified as to how any SIMD code ever passed these tests...
Local tests with gcc -Wcast-align=strict error out on the unaligned loads and typedefs.

Do we need (or want) __attribute__((may_alias)) or __attribute__((aligned(16)))? Reportedly there are issues on MSVC as well.
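For illustration, a minimal sketch of the typedef-with-attributes approach (a common -Wcast-align workaround, not necessarily what this PR ends up using; the typedef name is hypothetical): casting to a pointer type whose declared alignment is 1 means the cast no longer increases the required alignment, so the warning disappears, while may_alias keeps strict-aliasing analysis honest.

#include <emmintrin.h>

typedef __m128i xmm_loose __attribute__((aligned(1), may_alias)); /* hypothetical name */

static __m128i load16(const unsigned char* src) {
    return *(const xmm_loose*)src;  /* compiles to an unaligned 16-byte load; no -Wcast-align */
}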

@senhuang42 (Contributor) left a comment

Thanks for these contributions!! These NEON and scalar optimizations look great - I've left some comments that mostly have to do with style/code structure. I'll get around to benchmarking the scalar-mode improvements at some point as well.

aqrit added 7 commits on May 26, 2021

Commit note: gcc presents a fun little catch-22 here. The result of pmovmskb has to be cast to uint32_t to avoid a zero-extension, but must be uint16_t to get gcc to generate a rotate instruction.
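A minimal sketch of the pattern being described (illustrative names, not the code that landed in zstd_lazy.c): keep the pmovmskb result in a 16-bit variable so gcc recognizes the rotate idiom and emits a single rotate instruction, and only widen to 32 bits afterwards.

#include <emmintrin.h>

static unsigned short rotRight16(unsigned short v, unsigned n) {
    /* standard rotate idiom that gcc/clang turn into a single rotate instruction */
    return (unsigned short)((v >> (n & 15)) | (v << ((0u - n) & 15)));
}

static unsigned matchMask16(const unsigned char* tags, unsigned char tag, unsigned head) {
    const __m128i chunk = _mm_loadu_si128((const __m128i*)(const void*)tags);
    const __m128i eq    = _mm_cmpeq_epi8(chunk, _mm_set1_epi8((char)tag));
    const unsigned short mask = (unsigned short)_mm_movemask_epi8(eq); /* 16-bit for the rotate */
    return rotRight16(mask, head);                                     /* widened to unsigned on return */
}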
@aqrit (Contributor, Author) commented on Jun 1, 2021

Amusingly, NEON gets cheaper when rowEntries == 64

uint64_t NEON_i8x64_MatchMask (const uint8_t* ptr, uint8_t match_byte) {
    uint8x16x4_t src = vld4q_u8(ptr);            /* de-interleaving load: val[k] holds bytes k, k+4, k+8, ... */
    uint8x16_t dup = vdupq_n_u8(match_byte);
    uint8x16_t cmp0 = vceqq_u8(src.val[0], dup); /* each compare yields 0xFF on match, 0x00 otherwise */
    uint8x16_t cmp1 = vceqq_u8(src.val[1], dup);
    uint8x16_t cmp2 = vceqq_u8(src.val[2], dup);
    uint8x16_t cmp3 = vceqq_u8(src.val[3], dup);

    /* shift-right-and-insert steps pack the four compare results into the top nibble of each byte */
    uint8x16_t t0 = vsriq_n_u8(cmp1, cmp0, 1);
    uint8x16_t t1 = vsriq_n_u8(cmp3, cmp2, 1);
    uint8x16_t t2 = vsriq_n_u8(t1, t0, 2);
    uint8x16_t t3 = vsriq_n_u8(t2, t2, 4);
    /* narrowing shift collapses each 16-bit lane to 8 bits, producing the final 64-bit mask */
    uint8x8_t t4 = vshrn_n_u16(vreinterpretq_u16_u8(t3), 4);
    return vget_lane_u64(vreinterpret_u64_u8(t4), 0);
}

@senhuang42 (Contributor) replied:

That's actually good to know - we're going to add support for 64 row entries as well.

Commit note: better work-around for the (bogus) "unary minus on unsigned" warning.
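For illustration only (the generic pattern, not necessarily the exact change in that commit): the warning in question is typically MSVC's C4146, "unary minus operator applied to unsigned type", and the usual workaround is to spell the negation as a subtraction from zero, which is bit-identical for unsigned operands.

#include <stdint.h>

static uint32_t negate_u32(uint32_t x) {
    return 0u - x;   /* same value as -x for unsigned types, but avoids the warning */
}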
@senhuang42 (Contributor) left a comment

This looks good to me! @terrelln do you have any thoughts?

@senhuang42 merged commit dd4f6aa into facebook:dev on Jun 9, 2021
@senhuang42 (Contributor) commented:

Thanks @aqrit for this great contribution 🚀

@Cyan4973 (Contributor) commented on Jun 9, 2021

Excellent work @aqrit !

Referenced code in lib/compress/zstd_lazy.c (NEON path):

const U16 hi = (U16)vgetq_lane_u8(t3, 8);
const U16 lo = (U16)vgetq_lane_u8(t3, 0);
return ZSTD_rotateRight_U16((hi << 8) | lo, head);
} else { /* rowEntries == 32 */
@Cyan4973 (Contributor) commented on Dec 21, 2021 (a post-merge review comment on the snippet above):

@aqrit:
This NEON code must be designed specifically for each rowEntries value.
Hence, we have an implementation for 16, for 32, and then for 64.
If we would like to introduce 128 or even 256, it's necessary to write another implementation.

Ignoring the issue of generating the final mask (whose width directly depends on the number of entries),
does it seem possible to rewrite the main test loop in a way which would be generic,
and scale naturally with rowEntries (provided it's a power of 2)?

For an example of what I mean by "an implementation which can scale with rowEntries",
please have a look at https://github.com/facebook/zstd/blob/dev/lib/compress/zstd_lazy.c#L992 ,
which uses SSE2.
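For context, a sketch of the kind of pattern the linked SSE2 code follows (paraphrased, not the verbatim zstd source): the row is processed 16 bytes at a time, each chunk produces a 16-bit movemask, and the per-chunk masks are concatenated, so the same loop handles 16, 32, or 64 entries.

#include <emmintrin.h>
#include <stdint.h>

static uint64_t sse2_matchMask(const unsigned char* tags, unsigned char tag, unsigned rowEntries) {
    const __m128i dup = _mm_set1_epi8((char)tag);
    uint64_t mask = 0;
    unsigned i;
    for (i = 0; i < rowEntries / 16; ++i) {   /* rowEntries is a power of 2, >= 16 */
        const __m128i chunk = _mm_loadu_si128((const __m128i*)(const void*)(tags + 16 * i));
        const __m128i eq    = _mm_cmpeq_epi8(chunk, dup);
        mask |= (uint64_t)(unsigned)_mm_movemask_epi8(eq) << (16 * i);
    }
    return mask;
}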

@aqrit (Contributor, Author) replied:

128/256 would just loop the 64-byte path (see the sketch after this reply).

I've done no benchmarking, but it seems really painful to emulate pmovmskb using ARMv7-A NEON.
AArch64 added the SIMD instruction addv, which might be useful for simplifying all this.

On AArch64, is the ARMv7-A path actually faster than the scalar fallback?
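A minimal sketch of that idea (hypothetical helper, not code from the PR): for 128 or 256 entries, call the 64-entry routine from the earlier comment once per 64-byte chunk and collect the resulting masks.

#include <stddef.h>
#include <stdint.h>

/* uses NEON_i8x64_MatchMask() from the earlier comment */
static void matchMask_large(const uint8_t* tags, uint8_t tag,
                            size_t rowEntries /* 128 or 256 */, uint64_t* maskOut) {
    size_t chunk;
    for (chunk = 0; chunk < rowEntries / 64; ++chunk) {
        maskOut[chunk] = NEON_i8x64_MatchMask(tags + 64 * chunk, tag);
    }
}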
