Implement a few missing things using SSE or NEON intrinsics. #1338

ibogosavljevic · 2023-04-25T14:10:48Z

I have the following code:

int32_t MulInt(int32_t out, int32_t a, int32_t b) {
    return static_cast<int32_t>((static_cast<int64_t>(a[i]) * static_cast<int64_t>(b[i])) >> 16);
}

I tried to implement it through Highway alone, but it doesn't work. Highway currently doesn't support multiplication of int32_t to produce int64_t efficiently. Many architectures will multiply two int32_t to produce int64_t without explicit casting, and I want to take advantage of that.

I already have implementations for SSE and NEON, but I want to make them available to Highway. I want to have Highway function
V MulInt(V v2, V v2) for Vec128<int32_t> for SSE and NEON. Is there any official guide on how to supplement Highway with custom build functions?

The text was updated successfully, but these errors were encountered:

johnplatts · 2023-04-25T19:10:54Z

Here is how the vector version of the MulInt operation above can be implemented in Highway for an int32_t vector:

template<class V, HWY_IF_UI32_D(DFromV<V>),
         HWY_IF_V_SIZE_LE_V(V, HWY_MAX_BYTES / 2)>
HWY_INLINE V MulInt(V a, V b) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  const Rebind<uint64_t, decltype(d)> du64;

#if HWY_TARGET == HWY_SCALAR
  const auto a2 = a;
  const auto b2 = b;
#else
  const Twice<decltype(d)> dt;
  const auto a_u64 = PromoteTo(du64, BitCast(du, a));
  const auto b_u64 = PromoteTo(du64, BitCast(du, b));
#if HWY_IS_BIG_ENDIAN
  const auto a2 = BitCast(dt, ShiftLeft<32>(a_u64));
  const auto b2 = BitCast(dt, ShiftLeft<32>(b_u64));
#else
  const auto a2 = BitCast(dt, a_u64);
  const auto b2 = BitCast(dt, b_u64);
#endif  // HWY_IS_BIG_ENDIAN
#endif  // HWY_TARGET != HWY_SCALAR

  return BitCast(d, TruncateTo(du,
    ShiftRight<16>(BitCast(du64, MulEven(a2, b2)))));
}

#if HWY_TARGET != HWY_SCALAR
template<class V, HWY_IF_UI32_D(DFromV<V>),
         HWY_IF_V_SIZE_V(V, HWY_MAX_BYTES)>
HWY_INLINE V MulInt(V a, V b) {
  const DFromV<decltype(a)> d;
  const Half<decltype(d)> dh;
  const RebindToUnsigned<decltype(d)> du;
  const RebindToUnsigned<decltype(dh)> dh_u;
  const Repartition<uint64_t, decltype(d)> du64;

  const auto a1_u64 = PromoteTo(du64, BitCast(dh_u, LowerHalf(dh, a)));
  const auto b1_u64 = PromoteTo(du64, BitCast(dh_u, LowerHalf(dh, b)));
  const auto a2_u64 = PromoteTo(du64, BitCast(dh_u, UpperHalf(dh, a)));
  const auto b2_u64 = PromoteTo(du64, BitCast(dh_u, UpperHalf(dh, b)));

#if HWY_IS_BIG_ENDIAN
  const auto a1 = BitCast(d, ShiftLeft<32>(a1_u64));
  const auto b1 = BitCast(d, ShiftLeft<32>(b1_u64));
  const auto a2 = BitCast(d, ShiftLeft<32>(a2_u64));
  const auto b2 = BitCast(d, ShiftLeft<32>(b2_u64));
#else
  const auto a1 = BitCast(d, a1_u64);
  const auto b1 = BitCast(d, b1_u64);
  const auto a2 = BitCast(d, a2_u64);
  const auto b2 = BitCast(d, b2_u64);
#endif  // HWY_IS_BIG_ENDIAN

  const auto p1 = ShiftRight<16>(BitCast(du64, MulEven(a1, b1)));
  const auto p2 = ShiftRight<16>(BitCast(du64, MulEven(a2, b2)));
  return BitCast(d, OrderedTruncate2To(du, p1, p2));
}
#endif  // HWY_TARGET != HWY_SCALAR

ibogosavljevic · 2023-04-25T21:21:41Z

Wow, thanks a lot! I am simply not that fluent in highway to be able to write such a thing. The documentation is good, but I was still missing a few key pieces.

ibogosavljevic · 2023-04-26T07:37:24Z

I see the following compilation issue:

In file included from main.cpp:9:
./main-highway.cpp:55:21: error: use of undeclared identifier 'OrderedTruncate2To'
  return BitCast(d, OrderedTruncate2To(du, p1, p2));
                    ^
./main-highway.cpp:71:24: note: in instantiation of function template specialization 'hwy::N_SSSE3::MulInt<hwy::N_SSSE3::Vec128<int, 4>, nullptr, nullptr>' requested here
        auto res = hn::MulInt(a_val, b_val);
                       ^

Do you know what might be the problem?

jan-wassenberg · 2023-04-26T07:57:33Z

Hi @ibogosavljevic , it's always interesting to hear how the documentation works for new users and what can be improved. I'm curious what the missing pieces were?

@johnplatts nice implementation!

For the compile issue, it's likely that you are using an older release of Highway, we added that intrinsic 2 weeks after the last release. You could use Compiler Explorer to test this, copying your code into this environment. We'll do another release soon, perhaps a week or two. Or until then, you could consider using the latest Git version?

johnplatts · 2023-04-26T13:29:12Z

Hi @ibogosavljevic , it's always interesting to hear how the documentation works for new users and what can be improved. I'm curious what the missing pieces were?

@johnplatts nice implementation!

For the compile issue, it's likely that you are using an older release of Highway, we added that intrinsic 2 weeks after the last release. You could use Compiler Explorer to test this, copying your code into this environment. We'll do another release soon, perhaps a week or two. Or until then, you could consider using the latest Git version?

Here is a link to a Compiler Explorer snippet that includes the above MulInt implementation, and the snippet does compile successfully for the HWY_SSE4/HWY_NEON_WITHOUT_AES/HWY_RVV/HWY_SVE/HWY_PPC8/HWY_SCALAR targets:
https://godbolt.org/z/sK1zEv4xb

ibogosavljevic · 2023-04-26T13:59:40Z

Hi John! Yes, it works for me as well with the newest version of Highway. Thank you a lot. I have a question: I am developing a 1-day vectorization workshop, which covers Neon, AVX and possibly Highway. In the upcoming weeks, I am going to experiment a lot with all three frameworks, and also build repositories of code. I can share the insight about Highway with you, in order to improve it. This means filing issues when documentation is not clear, or reporting bugs if native implementations are significantly faster than Highway. Let me know if this is what you want. Ivica

…

On Wed, Apr 26, 2023 at 3:29 PM John Platts ***@***.***> wrote: Hi @ibogosavljevic <https://github.com/ibogosavljevic> , it's always interesting to hear how the documentation works for new users and what can be improved. I'm curious what the missing pieces were? @johnplatts <https://github.com/johnplatts> nice implementation! For the compile issue, it's likely that you are using an older release of Highway, we added that intrinsic 2 weeks after the last release. You could use Compiler Explorer to test this, copying your code into this environment <https://gcc.godbolt.org/z/zP7MYe9Yf>. We'll do another release soon, perhaps a week or two. Or until then, you could consider using the latest Git version? Here is a link to a Compiler Explorer snippet that includes the above MulInt implementation, and the snippet does compile successfully for the HWY_SSE4/HWY_NEON_WITHOUT_AES/HWY_RVV/HWY_SVE/HWY_PPC8/HWY_SCALAR targets: https://godbolt.org/z/sK1zEv4xb — Reply to this email directly, view it on GitHub <#1338 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACAJBLYLTI5FTV74YL3PIKTXDEPLFANCNFSM6AAAAAAXLCESC4> . You are receiving this because you were mentioned.Message ID: ***@***.***>

jan-wassenberg · 2023-04-26T16:23:29Z

Glad to hear it works.

I can share the insight about Highway with you, in order to improve it.
This means filing issues when documentation is not clear, or reporting bugs
if native implementations are significantly faster than Highway. Let me
know if this is what you want.

This sounds wonderful. Please don't hesitate to raise issues.
If it would be helpful, I am also happy to advise on a draft of your workshop.
(We have internal slides that I'd like to open-source but it will take some time, possibly too long for your timeframe.)

It would be surprising if a native implementation were much faster, but if that happens we are open to adding new ops to bridge any gaps.

ibogosavljevic · 2023-05-05T20:02:13Z

I want to add more information about the performance of MulInt, as suggested by @johnplatts .

I compared the performance of highway vs the performance of automatically compiler vectorized code:

void convert_for_vector(int32_t* out, int32_t* a, int32_t* b, int n) {
    const hn::ScalableTag<int32_t> d;
    #pragma clang loop unroll(disable)
    for (int i = 0; i < n; i += hn::Lanes(d)) {
        auto a_val = hn::Load(d, a + i); 
        auto b_val = hn::Load(d, b + i);

        auto res = hn::MulInt<16>(a_val, b_val);
        Store(res, d, out + i);
    }
}

VS

void convert_for(int32_t* out, int32_t* a, int32_t* b, int n) {
    #pragma clang loop unroll(disable)
    for (int i = 0; i < n; ++i) {
        out[i] = static_cast<int32_t>((static_cast<int64_t>(a[i]) * static_cast<int64_t>(b[i])) >> 16);
    }
}

Compiler's version is faster than highway, 0.56s vs 0.85s.

Everything compiled with -mavx2 -O3 on newest clang.

Compiler's version assembly:

jan-wassenberg · 2023-05-08T09:11:39Z

Thanks for sharing! Sometimes the compiler has interesting tricks, so let's have a look.

First, I notice that the codegen is very similar: https://gcc.godbolt.org/z/8cvrcqbbW
But the difference may be because we're missing -maes, without which Highway doesn't use the AVX2 target.
(There is a documented set of flags for x86 targets here: https://gcc.godbolt.org/z/rGnjMevKG)

Was -maes indeed the difference? Seems that should bring us up to parity. And we can probably do better yet:
The vpmuldq = MulEven already does an int32->int64 cast, so both the Highway code and compiler's autovectorization could skip vpmovsxdq. But it's unclear how to do that with the compiler because we actually do require the cast at the language level, otherwise we get a 32-bit mul.

Here's a quick sketch of Highway code without the cast:

Repartition<uint64_t, decltype(d)> du64;
auto evens = ShiftRight<16>(MulEven(a, b));
auto a_64 = BitCast(du64, a);
auto b_64 = BitCast(du64, b);
auto odds = ShiftRight<16>(MulEven(ShiftRight<32>(a_64), ShiftRight<32>(b_64)));

If you didn't care about order that's likely going to be quicker (just store evens then odds), but if you do: we could InterleaveLower(evens, odds) but that will require further fixup because interleaving is per 128-bit block. Maybe we should add an op for that: either an InterleaveLower variant that does not stop at 128-bit boundaries, or a fixup op for afterwards to transform the results of InterleaveLower. @johnplatts , what do you think?

johnplatts · 2023-05-14T11:32:00Z

A more optimal implementation of the set-before-first operation for masks is possible on SSE4/AVX2/AVX3/RVV/PPC10.

Here is how the SetBeforeFirst operation could be implemented for masks for 128-bit or smaller vectors on SSE4/AVX2:

template<class T, size_t N>
HWY_API Mask128<T, N> SetBeforeFirst(Mask128<T, N> m) {
  const Simd<T, N, 0> d;
  const RebindToUnsigned<decltype(d)> du;
  using TU = TFromD<decltype(du)>;
  return RebindMask(d, Mask128<TU, N>{
    _mm_cmpistrm(RebindMask(du, Not(m)).raw, _mm_setzero_si128(), 0x58)});
}

Here is how the SetBeforeFirst operation could be implemented for masks on AVX3:

template<class T, size_t N>
HWY_API Mask128<T, N> SetBeforeFirst(Mask128<T, N> m) {
  using RawMask = decltype(MaskFromVec(VFromD<decltype(d)>()).raw);
  return Mask128<T, N>{static_cast<RawMask>(~(m.raw | (-m.raw)))};
}
template<class T>
HWY_API Mask256<T> SetBeforeFirst(Mask256<T> m) {
  using RawMask = decltype(MaskFromVec(VFromD<decltype(d)>()).raw);
  return Mask256<T>{static_cast<RawMask>(~(m.raw | (-m.raw)))};
}
template<class T>
HWY_API Mask512<T> SetBeforeFirst(Mask512<T> m) {
  using RawMask = decltype(MaskFromVec(VFromD<decltype(d)>()).raw);
  return Mask512<T>{static_cast<RawMask>(~(m.raw | (-m.raw)))};
}

Here is how the SetBeforeFirst operation could be implemented for masks on PPC10:

namespace detail {

template<class D, HWY_IF_T_SIZE_D(D, 1)>
HWY_INLINE MFromD<D> VsxGenMaskFromMaskBits(D d, uint64_t bits) {
  const RebindToUnsigned<decltype(d)> du;
  using VU = VFromD<decltype(du)>;
  return BitCast(d, VU{vec_genbm(bits)});
}

template<class D, HWY_IF_T_SIZE_D(D, 2)>
HWY_INLINE MFromD<D> VsxGenMaskFromMaskBits(D d, uint64_t bits) {
  const RebindToUnsigned<decltype(d)> du;
  using VU = VFromD<decltype(du)>;
  return BitCast(d, VU{vec_genhm(bits)});
}

template<class D, HWY_IF_T_SIZE_D(D, 4)>
HWY_INLINE MFromD<D> VsxGenMaskFromMaskBits(D d, uint64_t bits) {
  const RebindToUnsigned<decltype(d)> du;
  using VU = VFromD<decltype(du)>;
  return BitCast(d, VU{vec_genwm(bits)});
}

template<class D, HWY_IF_T_SIZE_D(D, 8)>
HWY_INLINE MFromD<D> VsxGenMaskFromMaskBits(D d, uint64_t bits) {
  const RebindToUnsigned<decltype(d)> du;
  using VU = VFromD<decltype(du)>;
  return BitCast(d, VU{vec_gendm(bits)});
}

}  // namespace detail

template<class T, size_t N>
HWY_API Mask128<T, N> SetBeforeFirst(Mask128<T, N> m) {
  const Simd<T, N, 0> d;
  const RebindToUnsigned<decltype(d)> du;
  using VU = VFromD<decltype(du)>;

  const auto v_mask = BitCast(du, VecFromMask(d, m));
  const auto mask_bits = static_cast<uint64_t>(vec_extractm(v_mask.raw));
#if HWY_IS_LITTLE_ENDIAN
  const auto first_set_mask_bit = mask_bits & (-mask_bits);
  const auto result_mask_bits = first_set_mask_bit - 1;
#else
  const auto bit_after_last_set_mask_bit =
    (uint64_t{1} << (63 - Num0BitsAboveMS1Bit_Nonzero64((mask_bits << 1) | 1)));
  const auto result_mask_bits = static_cast<uint64_t>(-bit_after_last_set_mask_bit);
#endif

  return detail::VsxGenMaskFromMaskBits(result_mask_bits);
}

Here is how the SetBeforeFirst operation could be implemented for vectors on RVV (which is simply a wrapper for the __riscv_vmsbf_m_b1, __riscv_vmsbf_m_b2, __riscv_vmsbf_m_b4, __riscv_vmsbf_m_b8, __riscv_vmsbf_m_b16, __riscv_vmsbf_m_b32, and __riscv_vmsbf_m_b64 intrinsics):

HWY_RVV_FOREACH_B(HWY_RVV_RETM_ARGM, SetBeforeFirst, sbf)

jan-wassenberg · 2023-05-15T09:17:29Z

@johnplatts sounds like you are proposing a new SetBeforeFirst op, unrelated to this particular MulInt code?
That sounds potentially useful, would welcome a pull request with your code.
But perhaps we should replace cmpistrim because it is latency 10 on SKX? I suppose a lookup table indexed by BitsFromMask would also be an option.

johnplatts · 2023-05-16T11:23:07Z

@johnplatts sounds like you are proposing a new SetBeforeFirst op, unrelated to this particular MulInt code? That sounds potentially useful, would welcome a pull request with your code. But perhaps we should replace cmpistrim because it is latency 10 on SKX? I suppose a lookup table indexed by BitsFromMask would also be an option.

There are some string operations that can be implemented in a more efficient manner on SSE4/AVX2/AVX3 (using the _mm_cmpistri, _mm_cmpistrm, _mm_cmpestri, and _mm_cmpestrm), RVV (using the __riscv_vmsbf_m_b intrinsics), SVE (using svbrkb_b_z), and PPC10 (using the vec_stril intrinsic).

Here is how the ZeroPastNullTerminator operation (which zeroes out all lanes past the null terminator) can be implemented for U8 vectors using native SIMD intrinsics on SSE4.2/AVX3/RVV/SVE/PPC10:

// SSE4.2
__m128i ZeroPastNullTerminator(__m128i v) {
  __m128i mask = _mm_cmpistrm(v, _mm_setzero_si128(), 0x58);
  return _mm_and_si128(v, mask);
}

// AVX3
__m128i ZeroPastNullTerminator(__m128i v) {
  const __mmask16 mask = _mm_cmpeq_epi8_mask(v, _mm_setzero_si128());
  return _mm_maskz_mov_epi8(static_cast<__mmask16>(mask | (-mask)), v);
}

// RVV
vuint8m1_t ZeroPastNullTerminator(vuint8m1_t v) {
  const size_t vl = __riscv_vsetvlmax_e8m1();
  const vbool8_t eq_to_zero_mask =
    __riscv_vmseq_vx_u8m1_b8(v, uint8_t{0}, vl);
  const vbool8_t valid_chars_mask =
    __riscv_vmsbf_m_b8(eq_to_zero_mask, vl);
  return __riscv_vmerge_vvm_u8m1(__riscv_vmv_v_x_u8m1(0, vl), v,
                                 valid_chars_mask, vl);
}

// SVE
svuint8_t ZeroPastNullTerminator(svuint8_t v) {
  const svbool_t pg = svptrue_b8();
  const svuint8_t v_zero = svdup_n_u8(uint8_t{0});
  const svbool_t eq_to_zero_mask =
    svcmpeq_u8(pg, v, v_zero);
  const svbool_t valid_chars_mask =
    svbrkb_b_z(pg, eq_to_zero_mask);
  return svsel_u8(valid_chars_mask, v, v_zero);
}

// PPC10
__vector unsigned char ZeroPastNullTerminator(__vector unsigned char v) {
  return vec_stril(v);
}

johnplatts · 2023-05-16T12:26:21Z

The SSE4.2 PCMPISTRI, PCMPISTRM, PCMPESTRI, and PCMPESTRM can do the following operations using a single instruction:

Equal any (equivalent to the following):

template<class V>
MFromD<DFromV<V>> EqualAny(V a, V b, size_t b_len) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  using TU = TFromD<decltype(du)>;
  auto m = MaskFromVec(Zero(d));
  for(size_t i = 0; i < b_len; i++) {
    const auto idx = IndicesFromVec(d, Set(du, static_cast<TU>(i)));
    m = Or(m, Eq(a, TableLookupLanes(b, idx)));
  }
  return m;
}

WithinRanges (equivalent to the following):

template<class V>
MFromD<DFromV<V>> WithinRanges(V a, V b, size_t b_len) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  using TU = TFromD<decltype(du)>;
  auto m = MaskFromVec(Zero(d));
  const auto b_pair_len = b_len & static_cast<size_t>(-2);
  for(size_t i = 0; i < b_pair_len; i += 2) {
    const auto idx1 = IndicesFromVec(d, Set(du, static_cast<TU>(i)));
    const auto idx2 = IndicesFromVec(d, Set(du, static_cast<TU>(i + 1)));
    const auto within_range =
      And(Ge(a, TableLookupLanes(b, idx1)), Le(a, TableLookupLanes(b, idx2)));
    m = Or(m, within_range);
  }
  return m;
}

EqualOrdered (equivalent to the following):

template<class V>
MFromD<DFromV<V>> EqualOrdered(V a, V b, size_t b_len) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  const RebindToSigned<decltype(d)> di;
  using TU = TFromD<decltype(du)>;

  const auto all_ones = BitCast(d, Set(di, -1));
  if(b_len <= 0)
    return MaskFromVec(all_ones);

  const auto iota0 = Iota(du, TU{0});
  auto m = Eq(a, Broadcast<0>(b));
  for(size_t i = 1; i < b_len; i++) {
    const auto b_shuf_idx = Set(du, static_cast<TU>(i));
    const auto mask_shuf_idx = Add(iota0, static_cast<TU>(i));
    const auto m_i = VecFromMask(b, Eq(a, TableLookupLanes(b, b_shuf_idx)));
    m = Or(m, MaskFromVec(TwoTablesLookupLanes(m_i, all_ones, mask_shuf_idx)));
  }
  return m;
}

jan-wassenberg · 2023-05-17T11:50:05Z

I like your ZeroPastNullTerminator idea. It's not clear how this could look in AVX2, though? 32 lanes is too much for one lookup table. And NEON would also struggle, its BitsFromMask is quite expensive.

That's a wider concern about the SSE4 string instructions - they are not very performance-portable. Even Intel hasn't carried them forward to the >128 bit instruction sets. But your emulations look reasonable. Seems it would be better to use those than not vectorize. Would you like to create a pull request with those operations? Perhaps we can put them in hwy/contrib/algo/string-inl.h or similar?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement a few missing things using SSE or NEON intrinsics. #1338

Implement a few missing things using SSE or NEON intrinsics. #1338

ibogosavljevic commented Apr 25, 2023 •

edited

Loading

johnplatts commented Apr 25, 2023

ibogosavljevic commented Apr 25, 2023

ibogosavljevic commented Apr 26, 2023

jan-wassenberg commented Apr 26, 2023

johnplatts commented Apr 26, 2023

ibogosavljevic commented Apr 26, 2023 via email

jan-wassenberg commented Apr 26, 2023

ibogosavljevic commented May 5, 2023

jan-wassenberg commented May 8, 2023

johnplatts commented May 14, 2023 •

edited

Loading

jan-wassenberg commented May 15, 2023

johnplatts commented May 16, 2023

johnplatts commented May 16, 2023

jan-wassenberg commented May 17, 2023

Implement a few missing things using SSE or NEON intrinsics. #1338

Implement a few missing things using SSE or NEON intrinsics. #1338

Comments

ibogosavljevic commented Apr 25, 2023 • edited Loading

johnplatts commented Apr 25, 2023

ibogosavljevic commented Apr 25, 2023

ibogosavljevic commented Apr 26, 2023

jan-wassenberg commented Apr 26, 2023

johnplatts commented Apr 26, 2023

ibogosavljevic commented Apr 26, 2023 via email

jan-wassenberg commented Apr 26, 2023

ibogosavljevic commented May 5, 2023

jan-wassenberg commented May 8, 2023

johnplatts commented May 14, 2023 • edited Loading

jan-wassenberg commented May 15, 2023

johnplatts commented May 16, 2023

johnplatts commented May 16, 2023

jan-wassenberg commented May 17, 2023

ibogosavljevic commented Apr 25, 2023 •

edited

Loading

johnplatts commented May 14, 2023 •

edited

Loading