Support for complex arithmetics #2047

Ryo-not-rio · 2024-04-03T11:10:43Z

Hi,

I would like to propose the addition of complex arithmetic instructions to Highway. This would allow us to take advantage of the SVE complex arithmetic instructions (svcadd, svcmla and svcdot), improving the performance of complex arithmetic on Arm. I imagine the difficulty would be the need to implement and maintain equivalent functions for x86 and NEON, where these instructions do not exist natively.

jan-wassenberg · 2024-04-03T14:16:19Z

We are happy to maintain contributed functions. Assuming only SVE supports these instructions natively, it is actually pretty easy to implement a fallback for other platforms because it can be done just once, without repeating for each platform, by putting it in generic_ops-inl.h.

One general principle is that we want the code to be reasonably efficient on all platforms. I wonder whether it would be better, if we did not have the SVE instructions, to organize complex numbers into two regs re and im, rather than in odd/even lanes of one vector?

Let's imagine an app willing to have a special case for SVE, and a second codepath (with separate re and im registers) for other platforms. Would this be faster than if we always used the odd/even layout for complex numbers? If so, it sounds like an #if might be a better fit; if not, then a single function with either an SVE or an emulated implementation sounds reasonable.
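To make the two layouts concrete, here is a minimal scalar C++ sketch of a complex multiply in each one. The names MulInterleaved, MulSplit and SplitComplex are hypothetical and only serve this illustration; they are not Highway ops. The loops stand in for vector lanes and make no claim about actual performance on any target.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Interleaved (odd/even) layout: {re0, im0, re1, im1, ...} in one array.
// Each output lane reads both even and odd input lanes, which is what
// forces shuffles (or CMLA-style instructions) in a SIMD version.
std::vector<float> MulInterleaved(const std::vector<float>& a,
                                  const std::vector<float>& b) {
  assert(a.size() == b.size() && a.size() % 2 == 0);
  std::vector<float> out(a.size());
  for (std::size_t i = 0; i < a.size(); i += 2) {
    out[i] = a[i] * b[i] - a[i + 1] * b[i + 1];      // real part
    out[i + 1] = a[i] * b[i + 1] + a[i + 1] * b[i];  // imaginary part
  }
  return out;
}

// Split layout: real and imaginary parts live in separate arrays
// (hypothetical helper type for this sketch).
struct SplitComplex {
  std::vector<float> re, im;
};

// Every lane i only touches lane i of the inputs, so a SIMD version
// needs plain multiplies and adds, with no lane crossing.
SplitComplex MulSplit(const SplitComplex& a, const SplitComplex& b) {
  assert(a.re.size() == a.im.size() && b.re.size() == b.im.size());
  assert(a.re.size() == b.re.size());
  const std::size_t n = a.re.size();
  SplitComplex out{std::vector<float>(n), std::vector<float>(n)};
  for (std::size_t i = 0; i < n; ++i) {
    out.re[i] = a.re[i] * b.re[i] - a.im[i] * b.im[i];
    out.im[i] = a.re[i] * b.im[i] + a.im[i] * b.re[i];
  }
  return out;
}
```

For (1+2i)(5+6i) and (3+4i)(7+8i), both versions produce -7+16i and -11+52i; only the memory layout of the result differs.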

Ryo-not-rio · 2024-04-03T14:59:02Z

I see your point. We did find that de-interleaving the complex numbers first was faster with Highway on NEON and SVE, though I'm not sure about x86. Even so, it would be nice to be able to access the SVE instructions from Highway, since they seem to perform significantly better. Either way, it sounds like the x86 side needs further investigation.

johnplatts · 2024-04-03T16:41:42Z

> Hi,
>
> I would like to propose the addition of complex arithmetic instructions to Highway. This would allow us to take advantage of the SVE complex arithmetic instructions (svcadd, svcmla and svcdot), improving the performance of complex arithmetic on Arm. I imagine the difficulty would be the need to implement and maintain equivalent functions for x86 and NEON, where these instructions do not exist natively.

F32 AddSub(a, b) is equivalent to SVE svcadd_f32_x(svptrue_b32(), a, Reverse2(d, b), 90) and to the SSE3 intrinsic _mm_addsub_ps(a.raw, b.raw) on x86.

The F16/F32/F64 AddSub op should be re-implemented using svcadd on SVE targets as svcadd is more efficient than the default AddSub implementation in generic_ops-inl.h on SVE targets.
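The AddSub/svcadd equivalence can be checked with a small scalar model. This is only a sketch: the functions below are hypothetical scalar stand-ins that reuse the op names from this thread, they are not the Highway implementations, and this Reverse2 omits the tag argument d that the real op takes. ComplexAddRot90 follows the Arm definition of a complex add with 90 degree rotation, i.e. a + i*b on interleaved (re, im) pairs.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

// Scalar model of AddSub: subtract in even lanes, add in odd lanes.
Vec AddSub(const Vec& a, const Vec& b) {
  assert(a.size() == b.size());
  Vec out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
  }
  return out;
}

// Scalar model of Reverse2: swap each pair of adjacent lanes.
Vec Reverse2(const Vec& v) {
  assert(v.size() % 2 == 0);
  Vec out(v);
  for (std::size_t i = 0; i < out.size(); i += 2) {
    std::swap(out[i], out[i + 1]);
  }
  return out;
}

// Scalar model of a complex add with 90 degree rotation (a + i*b):
// real lane: a.re - b.im, imaginary lane: a.im + b.re.
Vec ComplexAddRot90(const Vec& a, const Vec& b) {
  assert(a.size() == b.size() && a.size() % 2 == 0);
  Vec out(a.size());
  for (std::size_t i = 0; i < a.size(); i += 2) {
    out[i] = a[i] - b[i + 1];
    out[i + 1] = a[i + 1] + b[i];
  }
  return out;
}

// True if AddSub(a, b) equals the rotated add applied to Reverse2(b).
bool AddSubMatchesRot90(const Vec& a, const Vec& b) {
  return AddSub(a, b) == ComplexAddRot90(a, Reverse2(b));
}
```

With a = {10, 20, 30, 40} and b = {1, 2, 3, 4}, both sides evaluate to {9, 22, 27, 44}.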

F16/F32/F64 MulAddSub(a, b, c) should be re-implemented as MulAdd(a, b, AddSub(Set(DFromV<decltype(c)>(), -0.0), c)) on SVE targets (which allows the MulAddSub to be carried out using a svcadd op followed by a svmad op).
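A scalar sketch of the MulAddSub identity, under the same assumptions: the functions are hypothetical scalar stand-ins named after the ops, not Highway code, and MulAdd here is an unfused multiply followed by an add, whereas hardware FMA rounds only once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Scalar model of AddSub: subtract in even lanes, add in odd lanes.
Vec AddSub(const Vec& a, const Vec& b) {
  assert(a.size() == b.size());
  Vec out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
  }
  return out;
}

// Scalar model of MulAdd: a * b + c per lane (unfused in this sketch).
Vec MulAdd(const Vec& a, const Vec& b, const Vec& c) {
  assert(a.size() == b.size() && b.size() == c.size());
  Vec out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] * b[i] + c[i];
  return out;
}

// Scalar model of MulAddSub: a * b - c in even lanes, a * b + c in odd lanes.
Vec MulAddSub(const Vec& a, const Vec& b, const Vec& c) {
  assert(a.size() == b.size() && b.size() == c.size());
  Vec out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    const float prod = a[i] * b[i];
    out[i] = (i % 2 == 0) ? prod - c[i] : prod + c[i];
  }
  return out;
}

// The rewrite from the comment above: AddSub(-0.0, c) yields -c in even
// lanes and +c in odd lanes, so one MulAdd finishes the job.
Vec MulAddSubViaAddSub(const Vec& a, const Vec& b, const Vec& c) {
  const Vec neg_zero(c.size(), -0.0f);
  return MulAdd(a, b, AddSub(neg_zero, c));
}
```

Using -0.0 rather than +0.0 as the first AddSub operand keeps the sign of zero intact: -0.0 + c equals c for every c, including c = -0.0, and -0.0 - c equals -c.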

SVE svcmla_f32_x(svptrue_b32(), a, b, c, 0) is equivalent to MulAdd(DupEven(b), c, a).

SVE svcmla_f32_x(svptrue_b32(), a, b, c, 90) is equivalent to NegMulAdd(DupOdd(b), Reverse2(d, AddSub(Set(DFromV<decltype(b)>(), -0.0), c)), a).

SVE svcmla_f32_x(svptrue_b32(), a, b, c, 180) is equivalent to NegMulAdd(DupEven(b), c, a).

SVE svcmla_f32_x(svptrue_b32(), a, b, c, 270) is equivalent to MulAdd(DupOdd(b), Reverse2(d, AddSub(Set(DFromV<decltype(b)>(), -0.0), c)), a).
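For reference, the four rotations can be modelled in scalar C++ as below. This is a sketch based on the Arm description of the FCMLA instruction, with all vectors holding interleaved (re, im) pairs. CmlaModel and FullComplexMulAdd are hypothetical names for this illustration; they are neither ACLE intrinsics nor Highway ops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Scalar model of a complex multiply-add with rotation. acc is the
// accumulator (first vector argument of svcmla), b and c are the two
// multiplicands. rot must be 0, 90, 180 or 270.
Vec CmlaModel(const Vec& acc, const Vec& b, const Vec& c, int rot) {
  assert(acc.size() == b.size() && b.size() == c.size());
  assert(acc.size() % 2 == 0);
  assert(rot == 0 || rot == 90 || rot == 180 || rot == 270);
  Vec out(acc);
  for (std::size_t i = 0; i < acc.size(); i += 2) {
    const float b_re = b[i], b_im = b[i + 1];
    const float c_re = c[i], c_im = c[i + 1];
    switch (rot) {
      case 0:  // uses the real part of b
        out[i] += b_re * c_re;
        out[i + 1] += b_re * c_im;
        break;
      case 90:  // uses the imaginary part of b
        out[i] -= b_im * c_im;
        out[i + 1] += b_im * c_re;
        break;
      case 180:  // negation of the rot 0 products
        out[i] -= b_re * c_re;
        out[i + 1] -= b_re * c_im;
        break;
      default:  // 270: negation of the rot 90 products
        out[i] += b_im * c_im;
        out[i + 1] -= b_im * c_re;
        break;
    }
  }
  return out;
}

// acc + b * c as a full complex product: rot 0 adds the terms that use
// b.re, rot 90 adds the terms that use b.im.
Vec FullComplexMulAdd(const Vec& acc, const Vec& b, const Vec& c) {
  return CmlaModel(CmlaModel(acc, b, c, 0), b, c, 90);
}
```

With acc = {0, 0, 1, 1}, b = {1, 2, 3, 4} and c = {5, 6, 7, 8}, FullComplexMulAdd returns {-7, 16, -10, 53}, i.e. (1+2i)(5+6i) and (1+i) + (3+4i)(7+8i). Pairing rot 0 with rot 90 is the usual way to obtain a full complex multiply-accumulate from these instructions.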

jan-wassenberg · 2024-04-03T18:47:50Z

Thanks @johnplatts for pointing out that we can already target svcadd with existing (Mul)AddSub.
@Ryo-not-rio, how close does that get us to what you had in mind?

johnplatts · 2024-04-03T19:19:50Z

> Thanks @johnplatts for pointing out that we can already target svcadd with existing (Mul)AddSub.

I have re-implemented AddSub and MulAddSub on SVE using svcadd in pull request #2054.

Ryo-not-rio · 2024-04-04T16:42:26Z

It's good to know that svcadd is already being used in Highway!
I think we're still missing a direct link to the svcmla instructions. Even when there are equivalent ways of writing things in Highway, we've seen a performance hit due to the extra instructions required.
For example, svcmla_f32_m(pg, acc0, vec_a, vec_b, 90) requires an extra reverse instruction when written with existing Highway ops.

jan-wassenberg · 2024-04-09T10:24:16Z

hm. It seems that the CMLA instruction is 'exotic' in the sense that other ISAs do not provide such an instruction. Do you have any suggestion on how we could handle that without performance cliffs in one ISA?

johnplatts · 2024-04-12T00:08:41Z

> hm. It seems that the CMLA instruction is 'exotic' in the sense that other ISAs do not provide such an instruction. Do you have any suggestion on how we could handle that without performance cliffs in one ISA?

Here is a link to a generic implementation of the ComplexAddRot90/270 ops (equivalent to SVE svcadd_*_x) and ComplexMulAdd[Rot90/180/270] (equivalent to SVE svcmla_*_x): https://godbolt.org/z/1zn949a5f

There are also NEON vcaddq_rot90/270_f16/f32/f64 intrinsics (equivalent to SVE svcadd_*_x) and vcmlaq[_rot90/180/270]_f16/f32/f64 intrinsics (equivalent to SVE svcmla_*_x), available with the complex number extension (FEAT_FCMA, which provides the FCADD and FCMLA instructions) on Armv8.3 or later.

The generic implementation of the ComplexAdd/ComplexMulAdd ops linked above is efficient on most SIMD targets, including SSSE3/SSE4/AVX2/AVX3/NEON.

SSSE3/SSE4/AVX2/AVX3 have AddSub instructions for F32/F64 vectors that are 32 bytes or smaller, which helps improve the performance of the ComplexAdd/ComplexMulAdd ops.

jan-wassenberg · 2024-04-12T06:29:09Z

Thanks, those implementations look good to me! Are we proposing to add those as new ops, with single-instruction implementations for SVE?

That seems fine provided we are confident that apps would want to use those ops as defined. One remaining concern I have (because I am not familiar with complex arithmetic): are there perhaps other equivalent ways of implementing the desired formulas that would be more efficient than these generic implementations when run on non-SVE targets?
