Include vectorized rotate instructions #5

stoklund · 2017-04-20T23:38:27Z

@AndrewScheidecker mentioned in his review of #1 the possibility of including vectorized rotate instructions to match the existing scalar instructions. They would have these signatures:

i8x16.rotl(x: v128, n: i32) -> v128
i16x8.rotl(x: v128, n: i32) -> v128
i32x4.rotl(x: v128, n: i32) -> v128
i64x2.rotl(x: v128, n: i32) -> v128
i8x16.rotr(x: v128, n: i32) -> v128
i16x8.rotr(x: v128, n: i32) -> v128
i32x4.rotr(x: v128, n: i32) -> v128
i64x2.rotr(x: v128, n: i32) -> v128

The semantics would be to rotate the lanes independently by the scalar n. This can be expressed in terms of the vectorized shift operators:

«T».rotl(x, n) = v128.or(«T».shl(x, n), «T».shr_u(x, -n))
«T».rotr(x, n) = v128.or(«T».shl(x, -n), «T».shr_u(x, n))
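
As a rough illustration of these identities, here is a minimal scalar model in Python. It is a sketch, not a specification: a vector is modeled as a plain list of unsigned lane values, and it assumes that the shift operators take their count modulo the lane width, which is what makes the `-n` count meaningful. The helper names (`shl`, `shr_u`, `v128_or`, `rotl`, `rotr`) and the explicit `bits` parameter are chosen here for illustration only.

```python
# Scalar model of lane-wise rotates built from lane-wise shifts.
# Assumption: shift counts are reduced modulo the lane width in bits.

def shl(lanes, n, bits):
    """Shift every lane left by n (mod bits), discarding overflow."""
    n %= bits
    mask = (1 << bits) - 1
    return [(x << n) & mask for x in lanes]

def shr_u(lanes, n, bits):
    """Logical (unsigned) shift of every lane right by n (mod bits)."""
    n %= bits
    return [x >> n for x in lanes]

def v128_or(a, b):
    """Bitwise or of two vectors with the same lane layout."""
    return [x | y for x, y in zip(a, b)]

def rotl(lanes, n, bits):
    # «T».rotl(x, n) = v128.or(«T».shl(x, n), «T».shr_u(x, -n))
    return v128_or(shl(lanes, n, bits), shr_u(lanes, -n, bits))

def rotr(lanes, n, bits):
    # «T».rotr(x, n) = v128.or(«T».shl(x, -n), «T».shr_u(x, n))
    return v128_or(shl(lanes, -n, bits), shr_u(lanes, n, bits))
```

Under the modulo assumption, a count of 0 makes both shifts the identity, so the `or` returns `x` unchanged, and `rotr` undoes `rotl` for the same `n`.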

Questions:

Are vectorized rotate instructions available in SIMD instruction sets we care about?
Are there plausible applications for vectorized rotates?

AndrewScheidecker · 2017-04-24T11:39:11Z

> Are vectorized rotate instructions available in SIMD instruction sets we care about?

SSE+AVX does not have vectorized rotates.

AMD's XOP extension does have vectorized rotates, but it looks like it's deprecated as their new processors no longer support it.

> Are there plausible applications for vectorized rotates?

Hashing is what I had in mind.

It looks like Blake2b is designed around the lack of a rotate in SSE (see section 2.2 in blake2.pdf). However, it does still have two 64x2 rotates by 63 bits every round, which it implements as a xor, a shift, and an add. Given that this is intended to be faster than the naive lowering of a rotation to a xor and two shifts, it probably wouldn't use a rotation operator. The other "rotations" it uses are by constant multiples of 8, so they can be implemented as swizzles.
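
To make the two tricks above concrete, here is a small Python sketch on a single 64-bit lane. It is an illustration under assumptions rather than Blake2b's actual code: the function names are made up here, and a real SIMD implementation would apply the same idea to both lanes of a 64x2 vector at once, using a byte shuffle for the swizzle case.

```python
MASK64 = (1 << 64) - 1

def rotr64(x, n):
    """Reference rotate right of a 64-bit value, for comparison."""
    n %= 64
    return ((x >> n) | (x << (64 - n))) & MASK64

def rotr63_xor_shift_add(x):
    # Rotating right by 63 is the same as rotating left by 1.
    # x + x equals x << 1 (mod 2**64), and x >> 63 is the old top bit.
    # The two terms share no set bits, so xor behaves like or here.
    return (x >> 63) ^ ((x + x) & MASK64)

def rotr_by_bytes(x, n):
    # For n a multiple of 8, a rotate only permutes whole bytes,
    # which is what a swizzle (byte shuffle) does in one step.
    assert n % 8 == 0
    k = (n // 8) % 8
    b = x.to_bytes(8, "little")
    return int.from_bytes(b[k:] + b[:k], "little")
```

Both helpers agree with `rotr64` on any 64-bit input, for example `rotr63_xor_shift_add(x) == rotr64(x, 63)` and `rotr_by_bytes(x, 24) == rotr64(x, 24)`.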

Both SHA2 and SHA3 use rotates. However, x86 and ARM already have special instructions for SHA2, and SHA3 is designed for efficient hardware implementation. Maybe the right approach there is to eventually add higher-level SHA2 and SHA3 instructions that can leverage whatever hardware support there may be (or at least an efficient native software implementation).

stoklund · 2017-05-25T17:07:07Z

ARM does not have vectorized rotates either. Intel includes them in AVX-512.

arunetm · 2018-12-17T22:31:31Z

Closing this.

There is a lack of instruction support for an efficient implementation, and workarounds exist for the potential use cases.

arunetm closed this as completed Dec 17, 2018
