
Flatten ZSTD_row_getMatchMask #2681

Merged (12 commits merged into facebook:dev on Jun 9, 2021)
Conversation

@aqrit (Contributor) commented on May 23, 2021

  • Remove the SIMD abstraction layer.
  • Add big endian support.
  • Align hashTags within tagRow to a 16-byte boundary.
  • Switch SSE2 to use aligned reads.
  • Optimize the scalar path using SWAR (see the sketch after this list).
  • Optimize the NEON path for n == 32.
  • Work around a minor clang issue for NEON (https://bugs.llvm.org/show_bug.cgi?id=49577).
  • Improve endian detection.
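For reference, a minimal sketch of the SWAR idea behind the scalar path (illustrative only; the name and exact formulation are not taken from the patch): broadcast the tag byte across a 64-bit word, XOR it against 8 packed tags so that matching bytes become zero, then turn each zero byte into a 0x80 marker without any cross-byte carries.

#include <stdint.h>
#include <string.h>

uint64_t SWAR_i8x8_MatchMask(const unsigned char* tags, unsigned char tag) {
    const uint64_t k7f = UINT64_C(0x7F7F7F7F7F7F7F7F);
    uint64_t chunk, x, t;
    memcpy(&chunk, tags, sizeof(chunk));               /* load 8 packed tag bytes */
    x = chunk ^ (UINT64_C(0x0101010101010101) * tag);  /* matching bytes become 0x00 */
    t = (x & k7f) + k7f;                               /* high bit stays clear only if the low 7 bits were 0 */
    return ~(t | x | k7f);                             /* 0x80 in every matching byte, 0x00 elsewhere */
}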

Benchmark:

tl;dr: compression speed of the scalar path increases by about 22%.
The scalar path is now only about 4% slower than the SSE2 path.

patch scalar:
 5#silesia.tar       : 211957760 ->  63789770 (3.323), 145.7 MB/s ,1152.4 MB/s 
 6#silesia.tar       : 211957760 ->  62963574 (3.366), 139.3 MB/s ,1180.3 MB/s 
 7#silesia.tar       : 211957760 ->  61467735 (3.448),  95.3 MB/s ,1258.3 MB/s 
 8#silesia.tar       : 211957760 ->  60900152 (3.480),  76.3 MB/s ,1289.6 MB/s 
 9#silesia.tar       : 211957760 ->  59914334 (3.538),  61.7 MB/s ,1303.8 MB/s 
10#silesia.tar       : 211957760 ->  59282317 (3.575),  55.1 MB/s ,1297.1 MB/s 
11#silesia.tar       : 211957760 ->  59140003 (3.584),  51.0 MB/s ,1296.1 MB/s 
12#silesia.tar       : 211957760 ->  58629417 (3.615),  38.6 MB/s ,1315.1 MB/s 
dev scalar:
 5#silesia.tar       : 211957760 ->  63789770 (3.323), 118.6 MB/s ,1154.1 MB/s 
 6#silesia.tar       : 211957760 ->  62963574 (3.366), 114.7 MB/s ,1179.7 MB/s 
 7#silesia.tar       : 211957760 ->  61467735 (3.448),  76.8 MB/s ,1257.8 MB/s 
 8#silesia.tar       : 211957760 ->  60900152 (3.480),  61.1 MB/s ,1289.0 MB/s 
 9#silesia.tar       : 211957760 ->  59914334 (3.538),  51.6 MB/s ,1303.2 MB/s 
10#silesia.tar       : 211957760 ->  59282317 (3.575),  47.3 MB/s ,1296.3 MB/s 
11#silesia.tar       : 211957760 ->  59140003 (3.584),  43.6 MB/s ,1294.8 MB/s 
12#silesia.tar       : 211957760 ->  58629417 (3.615),  31.7 MB/s ,1315.6 MB/s 

patch SSE2:
 5#silesia.tar       : 211957760 ->  63789770 (3.323), 151.3 MB/s ,1148.2 MB/s 
 6#silesia.tar       : 211957760 ->  62963574 (3.366), 143.4 MB/s ,1169.3 MB/s 
 7#silesia.tar       : 211957760 ->  61467735 (3.448), 100.8 MB/s ,1247.6 MB/s 
 8#silesia.tar       : 211957760 ->  60900152 (3.480),  81.5 MB/s ,1279.4 MB/s 
 9#silesia.tar       : 211957760 ->  59914334 (3.538),  65.5 MB/s ,1292.1 MB/s 
10#silesia.tar       : 211957760 ->  59282317 (3.575),  57.2 MB/s ,1284.2 MB/s 
11#silesia.tar       : 211957760 ->  59140003 (3.584),  52.3 MB/s ,1282.9 MB/s 
12#silesia.tar       : 211957760 ->  58629417 (3.615),  39.7 MB/s ,1301.6 MB/s
dev SSE2:
 5#silesia.tar       : 211957760 ->  63789770 (3.323), 151.1 MB/s ,1148.5 MB/s 
 6#silesia.tar       : 211957760 ->  62963574 (3.366), 143.8 MB/s ,1171.8 MB/s 
 7#silesia.tar       : 211957760 ->  61467735 (3.448),  98.2 MB/s ,1250.2 MB/s 
 8#silesia.tar       : 211957760 ->  60900152 (3.480),  79.3 MB/s ,1281.9 MB/s 
 9#silesia.tar       : 211957760 ->  59914334 (3.538),  63.1 MB/s ,1294.3 MB/s 
10#silesia.tar       : 211957760 ->  59282317 (3.575),  56.8 MB/s ,1286.5 MB/s 
11#silesia.tar       : 211957760 ->  59140003 (3.584),  51.9 MB/s ,1285.7 MB/s 
12#silesia.tar       : 211957760 ->  58629417 (3.615),  40.2 MB/s ,1304.4 MB/s 

Benchmarked on my desktop i7-8700K using gcc 9.3.0.

@aqrit (Contributor, Author) commented on May 24, 2021

AFAICT, the test failures are all:

error: cast from 'const BYTE *' (aka 'const unsigned char *') to 'const __m128i *' increases required alignment from 1 to 16 [-Werror,-Wcast-align] const __m128i chunk = _mm_load_si128((const __m128i*)&src[0]);

which leaves me mystified as to how any SIMD code ever passed these tests...
Local tests with gcc -Wcast-align=strict error out on the unaligned loads and typedefs.

Do we need (or want) __attribute__((may_alias)) or __attribute__((aligned(16)))? Reportedly there are issues on MSVC as well.
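For illustration, a minimal sketch of the typedef-with-attributes approach (a common -Wcast-align workaround, not necessarily what this PR ends up using; the typedef name is hypothetical): casting to a pointer type whose declared alignment is 1 means the cast no longer increases the required alignment, so the warning disappears, while may_alias keeps strict-aliasing analysis honest.

#include <emmintrin.h>

typedef __m128i xmm_loose __attribute__((aligned(1), may_alias)); /* hypothetical name */

static __m128i load16(const unsigned char* src) {
    return *(const xmm_loose*)src;  /* compiles to an unaligned 16-byte load; no -Wcast-align */
}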

@senhuang42 (Contributor) left a comment

Thanks for these contributions!! These NEON and scalar optimizations look great - I've left some comments that mostly have to do with style/code structure. I'll get around to benchmarking the scalar-mode improvements at some point as well.

aqrit added 7 commits on May 26, 2021

Commit note: gcc presents a fun little catch-22 here. The result of pmovmskb has to be cast to uint32_t to avoid a zero-extension, but must be uint16_t to get gcc to generate a rotate instruction.
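A minimal sketch of the pattern being described (illustrative names, not the code that landed in zstd_lazy.c): keep the pmovmskb result in a 16-bit variable so gcc recognizes the rotate idiom and emits a single rotate instruction, and only widen to 32 bits afterwards.

#include <emmintrin.h>

static unsigned short rotRight16(unsigned short v, unsigned n) {
    /* standard rotate idiom that gcc/clang turn into a single rotate instruction */
    return (unsigned short)((v >> (n & 15)) | (v << ((0u - n) & 15)));
}

static unsigned matchMask16(const unsigned char* tags, unsigned char tag, unsigned head) {
    const __m128i chunk = _mm_loadu_si128((const __m128i*)(const void*)tags);
    const __m128i eq    = _mm_cmpeq_epi8(chunk, _mm_set1_epi8((char)tag));
    const unsigned short mask = (unsigned short)_mm_movemask_epi8(eq); /* 16-bit for the rotate */
    return rotRight16(mask, head);                                     /* widened to unsigned on return */
}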
@aqrit (Contributor, Author) commented on Jun 1, 2021

Amusingly, NEON gets cheaper when rowEntries == 64

uint64_t NEON_i8x64_MatchMask (const uint8_t* ptr, uint8_t match_byte) {
    uint8x16x4_t src = vld4q_u8(ptr);            /* de-interleaving load: val[k] holds bytes k, k+4, k+8, ... */
    uint8x16_t dup = vdupq_n_u8(match_byte);
    uint8x16_t cmp0 = vceqq_u8(src.val[0], dup); /* each compare yields 0xFF on match, 0x00 otherwise */
    uint8x16_t cmp1 = vceqq_u8(src.val[1], dup);
    uint8x16_t cmp2 = vceqq_u8(src.val[2], dup);
    uint8x16_t cmp3 = vceqq_u8(src.val[3], dup);

    /* shift-right-and-insert steps pack the four compare results into the top nibble of each byte */
    uint8x16_t t0 = vsriq_n_u8(cmp1, cmp0, 1);
    uint8x16_t t1 = vsriq_n_u8(cmp3, cmp2, 1);
    uint8x16_t t2 = vsriq_n_u8(t1, t0, 2);
    uint8x16_t t3 = vsriq_n_u8(t2, t2, 4);
    /* narrowing shift collapses each 16-bit lane to 8 bits, producing the final 64-bit mask */
    uint8x8_t t4 = vshrn_n_u16(vreinterpretq_u16_u8(t3), 4);
    return vget_lane_u64(vreinterpret_u64_u8(t4), 0);
}

@senhuang42 (Contributor) replied:

That's actually good to know - we're going to add support for 64 row entries as well.

Commit note: better work-around for the (bogus) "unary minus on unsigned" warning.
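For illustration only (the generic pattern, not necessarily the exact change in that commit): the warning in question is typically MSVC's C4146, "unary minus operator applied to unsigned type", and the usual workaround is to spell the negation as a subtraction from zero, which is bit-identical for unsigned operands.

#include <stdint.h>

static uint32_t negate_u32(uint32_t x) {
    return 0u - x;   /* same value as -x for unsigned types, but avoids the warning */
}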
@senhuang42 (Contributor) left a comment

This looks good to me! @terrelln do you have any thoughts?

@senhuang42 merged commit dd4f6aa into facebook:dev on Jun 9, 2021
@senhuang42 (Contributor) commented:

Thanks @aqrit for this great contribution 🚀

@Cyan4973 (Contributor) commented on Jun 9, 2021

Excellent work @aqrit !

Referenced code in lib/compress/zstd_lazy.c (NEON path):

const U16 hi = (U16)vgetq_lane_u8(t3, 8);
const U16 lo = (U16)vgetq_lane_u8(t3, 0);
return ZSTD_rotateRight_U16((hi << 8) | lo, head);
} else { /* rowEntries == 32 */
@Cyan4973 (Contributor) commented on Dec 21, 2021 (a post-merge review comment on the snippet above):

@aqrit:
This NEON code must be designed specifically for each rowEntries value.
Hence, we have an implementation for 16, for 32, and then for 64.
If we would like to introduce 128 or even 256, it's necessary to write another implementation.

Ignoring the issue of generating the final mask (whose width directly depends on the number of entries),
does it seem possible to rewrite the main test loop in a way which would be generic,
and scale naturally with rowEntries (provided it's a power of 2)?

For an example of what I mean by "an implementation which can scale with rowEntries",
please have a look at https://github.com/facebook/zstd/blob/dev/lib/compress/zstd_lazy.c#L992 ,
which uses SSE2.
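For context, a sketch of the kind of pattern the linked SSE2 code follows (paraphrased, not the verbatim zstd source): the row is processed 16 bytes at a time, each chunk produces a 16-bit movemask, and the per-chunk masks are concatenated, so the same loop handles 16, 32, or 64 entries.

#include <emmintrin.h>
#include <stdint.h>

static uint64_t sse2_matchMask(const unsigned char* tags, unsigned char tag, unsigned rowEntries) {
    const __m128i dup = _mm_set1_epi8((char)tag);
    uint64_t mask = 0;
    unsigned i;
    for (i = 0; i < rowEntries / 16; ++i) {   /* rowEntries is a power of 2, >= 16 */
        const __m128i chunk = _mm_loadu_si128((const __m128i*)(const void*)(tags + 16 * i));
        const __m128i eq    = _mm_cmpeq_epi8(chunk, dup);
        mask |= (uint64_t)(unsigned)_mm_movemask_epi8(eq) << (16 * i);
    }
    return mask;
}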

@aqrit (Contributor, Author) replied:

128/256 would just loop the 64-byte path (see the sketch after this reply).

I've done no benchmarking, but it seems really painful to emulate pmovmskb using ARMv7-A NEON.
AArch64 added the SIMD instruction addv, which might be useful for simplifying all this.

On AArch64, is the ARMv7-A path actually faster than the scalar fallback?
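A minimal sketch of that idea (hypothetical helper, not code from the PR): for 128 or 256 entries, call the 64-entry routine from the earlier comment once per 64-byte chunk and collect the resulting masks.

#include <stddef.h>
#include <stdint.h>

/* uses NEON_i8x64_MatchMask() from the earlier comment */
static void matchMask_large(const uint8_t* tags, uint8_t tag,
                            size_t rowEntries /* 128 or 256 */, uint64_t* maskOut) {
    size_t chunk;
    for (chunk = 0; chunk < rowEntries / 64; ++chunk) {
        maskOut[chunk] = NEON_i8x64_MatchMask(tags + 64 * chunk, tag);
    }
}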
