[AArch64] does not use rev32/rev64 instructions, resulting in redundant shift operations #130469


Closed
k-arrows opened this issue Mar 9, 2025 · 2 comments · Fixed by #136707

@k-arrows

k-arrows commented Mar 9, 2025

Here is a test case from the GCC testsuite: https://godbolt.org/z/jzdcsfxx4

```c
typedef char __attribute__ ((vector_size (16))) v16qi;
typedef unsigned short __attribute__ ((vector_size (16))) v8hi;
typedef unsigned int __attribute__ ((vector_size (16))) v4si;
typedef unsigned long long __attribute__ ((vector_size (16))) v2di;
typedef unsigned short __attribute__ ((vector_size (8))) v4hi;
typedef unsigned int __attribute__ ((vector_size (8))) v2si;
 
v2di
G1 (v2di r)
{
  return (r >> 32) | (r << 32);
}
 
v4si
G2 (v4si r)
{
  return (r >> 16) | (r << 16);
}
 
v8hi
G3 (v8hi r)
{
  return (r >> 8) | (r << 8);
}
 
v2si
G4 (v2si r)
{
  return (r >> 16) | (r << 16);
}
 
v4hi
G5 (v4hi r)
{
  return (r >> 8) | (r << 8);
}
```

GCC uses rev32 or rev64 to complete each of these operations in a single instruction; LLVM instead emits a shl/usra pair plus a register move for G1, G2, and G4 (see the old output below).
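
For reference, rotating each lane by half its width is the same as reversing the sub-elements within that lane, which is exactly what the REV instructions do. Below is a minimal sketch of the equivalent operations written with ACLE NEON intrinsics from <arm_neon.h>; the function names are illustrative and not part of the test case:

```c
#include <arm_neon.h>

/* G1: rotating each 64-bit lane by 32 swaps its two 32-bit halves.  */
uint64x2_t g1_intrin (uint64x2_t r)
{
  return vreinterpretq_u64_u32 (vrev64q_u32 (vreinterpretq_u32_u64 (r)));
}

/* G2: rotating each 32-bit lane by 16 swaps its two 16-bit halves.  */
uint32x4_t g2_intrin (uint32x4_t r)
{
  return vreinterpretq_u32_u16 (vrev32q_u16 (vreinterpretq_u16_u32 (r)));
}

/* G3: rotating each 16-bit lane by 8 swaps its two bytes.  */
uint16x8_t g3_intrin (uint16x8_t r)
{
  return vreinterpretq_u16_u8 (vrev16q_u8 (vreinterpretq_u8_u16 (r)));
}

/* G4: 64-bit vector variant of G2.  */
uint32x2_t g4_intrin (uint32x2_t r)
{
  return vreinterpret_u32_u16 (vrev32_u16 (vreinterpret_u16_u32 (r)));
}

/* G5: 64-bit vector variant of G3.  */
uint16x4_t g5_intrin (uint16x4_t r)
{
  return vreinterpret_u16_u8 (vrev16_u8 (vreinterpret_u8_u16 (r)));
}
```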

@llvmbot
Member

llvmbot commented Mar 9, 2025

@llvm/issue-subscribers-backend-aarch64


@jyli0116
Contributor

Hi, I'm looking into this right now. Could I please be assigned to the issue?

IanWood1 pushed a commit to IanWood1/llvm-project that referenced this issue May 6, 2025
Fixes llvm#130469 

Now uses REV32/REV64 instructions to complete the operation.

New Output:
```
G1:
        rev64   v0.4s, v0.4s
        ret
G2:
        rev32   v0.8h, v0.8h
        ret
G3:
        rev16   v0.16b, v0.16b
        ret
G4:
        rev32   v0.4h, v0.4h
        ret
G5:
        rev16   v0.8b, v0.8b
        ret
```

Old Output:

```
G1:
        shl     v1.2d, v0.2d, #32
        usra    v1.2d, v0.2d, #32
        mov     v0.16b, v1.16b
        ret

G2:
        shl     v1.4s, v0.4s, #16
        usra    v1.4s, v0.4s, #16
        mov     v0.16b, v1.16b
        ret

G3:
        rev16   v0.16b, v0.16b
        ret

G4:
        shl     v1.2s, v0.2s, #16
        usra    v1.2s, v0.2s, #16
        fmov    d0, d1
        ret

G5:
        rev16   v0.8b, v0.8b
        ret
```
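
As a quick sanity check that the new lowering preserves semantics: the rotate in G1 must swap the 32-bit halves of each 64-bit lane, which is what rev64 v0.4s does lane-wise. A small, hypothetical test driver (not part of the patch), reusing the test case's vector typedef:

```c
#include <stdio.h>

typedef unsigned long long __attribute__ ((vector_size (16))) v2di;

static v2di G1 (v2di r) { return (r >> 32) | (r << 32); }

int main (void)
{
  v2di in = { 0x0011223344556677ULL, 0x8899aabbccddeeffULL };
  v2di out = G1 (in);
  /* Rotating each 64-bit lane by 32 swaps its two 32-bit halves.  */
  v2di expect = { 0x4455667700112233ULL, 0xccddeeff8899aabbULL };
  int ok = out[0] == expect[0] && out[1] == expect[1];
  printf ("%s\n", ok ? "ok" : "mismatch");
  return !ok;
}
```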