This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Include vectorized rotate instructions #5

Closed
stoklund opened this issue Apr 20, 2017 · 3 comments

Comments

@stoklund
Contributor

@AndrewScheidecker mentioned in his review of #1 the possibility of including vectorized rotate instructions to match the existing scalar instructions. They would have these signatures:

  • i8x16.rotl(x: v128, n: i32) -> v128
  • i16x8.rotl(x: v128, n: i32) -> v128
  • i32x4.rotl(x: v128, n: i32) -> v128
  • i64x2.rotl(x: v128, n: i32) -> v128
  • i8x16.rotr(x: v128, n: i32) -> v128
  • i16x8.rotr(x: v128, n: i32) -> v128
  • i32x4.rotr(x: v128, n: i32) -> v128
  • i64x2.rotr(x: v128, n: i32) -> v128

The semantics would be to rotate the lanes independently by the scalar n. This can be expressed in terms of the vectorized shift operators:

«T».rotl(x, n) = v128.or(«T».shl(x, n), «T».shr_u(x, -n))
«T».rotr(x, n) = v128.or(«T».shl(x, -n), «T».shr_u(x, n))
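
The identity above can be checked with a scalar model of the lanes. This is only a sketch: the helper names `lanes_rotl` and `lanes_rotr` are made up for illustration, a vector is modelled as a plain list of unsigned lane values, and it assumes that the vector shifts take their count modulo the lane width (as the scalar WebAssembly shifts do), which is what gives `-n` a meaning.

```python
def lanes_rotl(lanes, n, bits):
    """Rotate each lane left by n using only shl, shr_u and or."""
    mask = (1 << bits) - 1
    left = n % bits       # shl count, reduced modulo the lane width (assumption)
    right = (-n) % bits   # the "-n" of the formula, reduced the same way
    return [((x << left) & mask) | ((x & mask) >> right) for x in lanes]


def lanes_rotr(lanes, n, bits):
    """Rotate each lane right by n using only shl, shr_u and or."""
    mask = (1 << bits) - 1
    left = (-n) % bits
    right = n % bits
    return [((x << left) & mask) | ((x & mask) >> right) for x in lanes]
```

With `n` a multiple of the lane width both shift counts reduce to zero, so the result is `x | x`, that is, `x` unchanged.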

Questions:

  • Are vectorized rotate instructions available in SIMD instruction sets we care about?
  • Are there plausible applications for vectorized rotates?
@AndrewScheidecker
Contributor

Are vectorized rotate instructions available in SIMD instruction sets we care about?

SSE+AVX does not have vectorized rotates.

AMD's XOP extension does have vectorized rotates, but it looks like it's deprecated as their new processors no longer support it.

Are there plausible applications for vectorized rotates?

Hashing is what I had in mind.

It looks like Blake2b is designed around the lack of a rotate in SSE (see section 2.2 in blake2.pdf). However, it still has two 64x2 rotates by 63 bits every round, which it implements as a xor, a shift, and an add. Given that this is intended to be faster than the naive lowering of a rotation to a xor and two shifts, it probably wouldn't use a rotation operator. The other "rotations" it uses are by constant multiples of 8, so they can be implemented as swizzles.
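
The two tricks mentioned here can be sketched on a single 64-bit lane. This is an illustrative scalar model, not the Blake2b source: the function names are hypothetical, and the byte permutation assumes little-endian byte order within the lane. Rotating right by 63 is the same as rotating left by 1, and `x + x` stands in for `x << 1`, which is where the xor, shift, and add come from.

```python
MASK64 = (1 << 64) - 1


def rotr64_naive(x, n):
    """Generic lowering: two shifts combined with a xor."""
    n %= 64
    if n == 0:
        return x & MASK64
    return ((x >> n) ^ (x << (64 - n))) & MASK64


def rotr64_by_63(x):
    """Rotate right by 63 as one shift, one add and one xor."""
    return ((x >> 63) ^ (x + x)) & MASK64


def rotr64_by_bytes(x, k):
    """Rotate right by 8*k bits by permuting bytes, as a swizzle would."""
    b = (x & MASK64).to_bytes(8, "little")
    k %= 8
    return int.from_bytes(b[k:] + b[:k], "little")
```

For any 64-bit `x`, `rotr64_by_63(x)` agrees with `rotr64_naive(x, 63)`, and `rotr64_by_bytes(x, 3)` agrees with `rotr64_naive(x, 24)`.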

Both SHA2 and SHA3 use rotates. However, x86 and ARM already have special instructions for SHA2, and SHA3 is designed for efficient hardware implementation. Maybe the right approach there is to eventually add higher-level SHA2 and SHA3 instructions that can leverage whatever hardware support there may be (or at least an efficient native software implementation).

@stoklund
Contributor Author

ARM does not have vectorized rotates either. Intel includes them in AVX-512.

@arunetm
Collaborator

arunetm commented Dec 17, 2018

Closing this.

Instruction support for an efficient implementation is lacking, and workarounds exist for the potential use cases.

@arunetm closed this as completed Dec 17, 2018