Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

target amdgcn-amd-amdhsa: cttz in BB after amdgcn.ballot leads to "Cannot select SETCC..." #89332

Closed
MattPD opened this issue Apr 18, 2024 · 16 comments
Labels
backend:AMDGPU llvm:SelectionDAG SelectionDAGISel as well question A question, not bug report. Check out https://llvm.org/docs/GettingInvolved.html instead!

Comments

@MattPD
Copy link
Member

MattPD commented Apr 18, 2024

The following LLVM IR:

; reduced.ll
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-p9:192:256:256:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9"
target triple = "amdgcn-amd-amdhsa"

define ptr addrspace(1) @__ockl_dm_alloc(i1 %0) {
__ockl_wfany_i32.exit:
  %1 = tail call i32 @llvm.amdgcn.ballot.i32(i1 %0)
  br label %2

2:                                                ; preds = %__ockl_wfany_i32.exit
  %3 = tail call i32 @llvm.cttz.i32(i32 %1, i1 false)
  ret ptr addrspace(1) null
}

; Function Attrs: convergent nocallback nofree nounwind willreturn memory(none)
declare i32 @llvm.amdgcn.ballot.i32(i1) #0

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.cttz.i32(i32, i1 immarg) #1

attributes #0 = { convergent nocallback nofree nounwind willreturn memory(none) }
attributes #1 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }

Compiled with:

llc -O0 reduced.ll

yields ICE:

LLVM ERROR: Cannot select: 0x8fc9e20: i32 = SETCC 0x8fc9a30, Constant:i32<0>, setne:ch
  0x8fc9a30: i32 = and # D:1 0x8fc9800, Constant:i32<1>
    0x8fc9800: i32,ch = CopyFromReg # D:1 0x8f54110, Register:i32 %8
      0x8fc9790: i32 = Register %8
    0x8fc99c0: i32 = Constant<1>
  0x8fc9d40: i32 = Constant<0>
In function: __ockl_dm_alloc
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /opt/compiler-explorer/clang-trunk/bin/llc -o /app/output.s -x86-asm-syntax=intel -O0 <source>
1.	Running pass 'CallGraph Pass Manager' on module '<source>'.
2.	Running pass 'AMDGPU DAG->DAG Pattern Instruction Selection' on function '@__ockl_dm_alloc'
. . .

Worth noting that the ICE occurs at -O0, https://llc.godbolt.org/z/xhG4qrYWv as well as at -O1, https://llc.godbolt.org/z/WeMvT17bb.

However, it does not occur at -O2, https://llc.godbolt.org/z/WeMvT17bb

We can see that the basic block (BB) __ockl_wfany_i32.exit is just terminated by an unconditional branch to the successor BB 2. If I pretend I'm SimplifyCFG and fold these BBs myself then the ICE doesn't occur at -O0 or -O1, either, https://llc.godbolt.org/z/qTxa444sK

This is reduced LLVM IR (using llvm-reduce). For context, the original LLVM IR has been produced from the following (adding -v -save-temps)

$ cat > my_test.cpp
int main() {
  int* ptr;
  #pragma omp target
  {
    ptr = new int();
  }
}

$ CC -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a my_test.cpp
LLVM ERROR: Cannot select: t26: i32 = SETCC t25, Constant:i32<0>, setne:ch
  t25: i32 = zero_extend # D:1 t2
    t2: i1,ch = CopyFromReg # D:1 t0, Register:i1 %454
      t1: i1 = Register %454
  t8: i32 = Constant<0>
In function: __ockl_dm_alloc
. . .

The original llc invocation from -v was llc -O0 -mtriple=amdgcn-amd-amdhsa -disable-promote-alloca-to-lds -mcpu=gfx90a -amdgpu-dump-hsa-metadata although only -O0 suffices to trigger the ICE (with the IR itself already containing target triple = "amdgcn-amd-amdhsa"); cf. the Compiler Explorer links.

@llvmbot
Copy link
Member

llvmbot commented Apr 18, 2024

@llvm/issue-subscribers-backend-amdgpu

Author: Matt (MattPD)

The following LLVM IR:
; reduced.ll
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-p9:192:256:256:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9"
target triple = "amdgcn-amd-amdhsa"

define ptr addrspace(1) @<!-- -->__ockl_dm_alloc(i1 %0) {
__ockl_wfany_i32.exit:
  %1 = tail call i32 @<!-- -->llvm.amdgcn.ballot.i32(i1 %0)
  br label %2

2:                                                ; preds = %__ockl_wfany_i32.exit
  %3 = tail call i32 @<!-- -->llvm.cttz.i32(i32 %1, i1 false)
  ret ptr addrspace(1) null
}

; Function Attrs: convergent nocallback nofree nounwind willreturn memory(none)
declare i32 @<!-- -->llvm.amdgcn.ballot.i32(i1) #<!-- -->0

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @<!-- -->llvm.cttz.i32(i32, i1 immarg) #<!-- -->1

attributes #<!-- -->0 = { convergent nocallback nofree nounwind willreturn memory(none) }
attributes #<!-- -->1 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }

Compiled with:

llc -O0 reduced.ll

yields ICE:

LLVM ERROR: Cannot select: 0x8fc9e20: i32 = SETCC 0x8fc9a30, Constant:i32&lt;0&gt;, setne:ch
  0x8fc9a30: i32 = and # D:1 0x8fc9800, Constant:i32&lt;1&gt;
    0x8fc9800: i32,ch = CopyFromReg # D:1 0x8f54110, Register:i32 %8
      0x8fc9790: i32 = Register %8
    0x8fc99c0: i32 = Constant&lt;1&gt;
  0x8fc9d40: i32 = Constant&lt;0&gt;
In function: __ockl_dm_alloc
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /opt/compiler-explorer/clang-trunk/bin/llc -o /app/output.s -x86-asm-syntax=intel -O0 &lt;source&gt;
1.	Running pass 'CallGraph Pass Manager' on module '&lt;source&gt;'.
2.	Running pass 'AMDGPU DAG-&gt;DAG Pattern Instruction Selection' on function '@<!-- -->__ockl_dm_alloc'
. . .

Worth noting that the ICE occurs at -O0, https://llc.godbolt.org/z/xhG4qrYWv as well as at -O1, https://llc.godbolt.org/z/WeMvT17bb.

However, it does not occur at -O2, https://llc.godbolt.org/z/WeMvT17bb

We can see that the basic block (BB) __ockl_wfany_i32.exit is just terminated by an unconditional branch to the successor BB 2. If I pretend I'm SimplifyCFG and fold these BBs myself then the ICE doesn't occur at -O0 or -O1, either, https://llc.godbolt.org/z/qTxa444sK

This is reduced LLVM IR (using llvm-reduce). For context, the original LLVM IR has been produced from the following (adding -v -save-temps)

$ cat &gt; my_test.cpp
int main() {
  int* ptr;
  #pragma omp target
  {
    ptr = new int();
  }
}

$ CC -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a my_test.cpp
LLVM ERROR: Cannot select: t26: i32 = SETCC t25, Constant:i32&lt;0&gt;, setne:ch
  t25: i32 = zero_extend # D:1 t2
    t2: i1,ch = CopyFromReg # D:1 t0, Register:i1 %454
      t1: i1 = Register %454
  t8: i32 = Constant&lt;0&gt;
In function: __ockl_dm_alloc
. . .

The original llc invocation from -v was llc -O0 -mtriple=amdgcn-amd-amdhsa -disable-promote-alloca-to-lds -mcpu=gfx90a -amdgpu-dump-hsa-metadata although only -O0 suffices to trigger the ICE (with the IR itself already containing target triple = "amdgcn-amd-amdhsa"); cf. the Compiler Explorer links.

@arsenm
Copy link
Contributor

arsenm commented Apr 19, 2024

This is trying to codegen wave32 IR on a wave64 target. While we currently handle the wave64 ballots on wave32, I don't think we try to handle the inverse case. This should be trying to use ballot.i64

@MattPD
Copy link
Member Author

MattPD commented Apr 19, 2024

FWIW, the original code comes from __ockl_dm_alloc in OCKL, https://github.com/ROCm/llvm-project/blob/d2475ed3def82814842231486c0080df1dc57a86/amd/device-libs/ockl/src/dm.cl#L967-L978

However, the actual used code is LLVM bitcode (which OCKL ships as).

To illustrate the issue in a less reduced context:
https://llc.godbolt.org/z/E57xcfsET

@__oclc_wavefrontsize64 = linkonce_odr protected local_unnamed_addr addrspace(4) constant i8 1, align 1

define ptr addrspace(1) @__ockl_dm_alloc(i1 %0) {
  %wavefrontsize64 = load i8, ptr addrspace(4) @__oclc_wavefrontsize64, align 1
  %wavefrontsize64_zero = icmp eq i8 %wavefrontsize64, 0
  br i1 %wavefrontsize64_zero, label %ballot32, label %ballot64
ballot64:
  %ballot64result = tail call i64 @llvm.amdgcn.ballot.i64(i1 %0)
  %193 = icmp eq i64 %ballot64result, 0
  br label %cttz64
ballot32:
  %ballot32result = tail call i32 @llvm.amdgcn.ballot.i32(i1 %0)
  %202 = icmp eq i32 %ballot32result, 0
  br label %cttz32
cttz64:
  %cttz64result = tail call i64 @llvm.cttz.i64(i64 %ballot64result, i1 false)
  br label %bbexit
cttz32:                                                
  %cttz32result = tail call i32 @llvm.cttz.i32(i32 %ballot32result, i1 false)
  br label %bbexit
bbexit:                                                
  ret ptr addrspace(1) null
}

Thus even if __oclc_wavefrontsize64 is known at compile-time to be 1 the code (LLVM IR) contains both forms of the llvm.amdgcn.ballot intrinsic. I presume what may explain that the ICE happens at -O0 is that higher optimization levels run the SimplifyCFG pass and the (effectively dead) wave32 IR gets cleaned up. Naturally, we cannot run SimplifyCFG at -O0 thus the IR that is going to reach the backend is going to contain both forms, which results in the cannot-select ICE.

@arsenm
Copy link
Contributor

arsenm commented Apr 24, 2024

Yes, the approach OCKL uses for the wave size is fragile this way. Really we should be treating the wavesize as a wholly incompatible ABI. These two implementations should not coexist in the same function, so this is primarily an OCKL bug

@arsenm
Copy link
Contributor

arsenm commented Apr 24, 2024

In this particular case, I think ockl can just pretend wave32 doesn't exist and only use the wave64 builtins

@MattPD
Copy link
Member Author

MattPD commented Apr 24, 2024

Thanks! OK, I'm going to see if I can pass this on to the OCKL folks from my end.

@arsenm
Copy link
Contributor

arsenm commented Apr 25, 2024

I've posted the workaround to ockl. Something like #86957 is needed to improve the diagnostic

@arsenm arsenm closed this as not planned Won't fix, can't repro, duplicate, stale Apr 25, 2024
@EugeneZelenko EugeneZelenko added the question A question, not bug report. Check out https://llvm.org/docs/GettingInvolved.html instead! label Apr 25, 2024
searlmc1 pushed a commit to ROCm/llvm-project that referenced this issue Apr 26, 2024
Wave32 and wave64 paths cannot really co-exist in the same function
or callgraph. They need to be treated as a hard ABI incompatibility.
We cannot handle the wave32 operation on wave64, but we can and do
handle the wave64 operation on wave32. Given the current linking scheme,
the most expedient fix for this not working is to pretend wave32 does
not exist and just use the wave64 ballot. The optimizer will fold
the 64-bit ballot intrinsic to the 32-bit one when it sees a 32-bit
target.

This was reported broken in
llvm#89332

Change-Id: If897f631066e257e18d8adf574cd17d3f9278ca7
searlmc1 pushed a commit to ROCm/llvm-project that referenced this issue May 2, 2024
Wave32 and wave64 paths cannot really co-exist in the same function
or callgraph. They need to be treated as a hard ABI incompatibility.
We cannot handle the wave32 operation on wave64, but we can and do
handle the wave64 operation on wave32. Given the current linking scheme,
the most expedient fix for this not working is to pretend wave32 does
not exist and just use the wave64 ballot. The optimizer will fold
the 64-bit ballot intrinsic to the 32-bit one when it sees a 32-bit
target.

This was reported broken in
llvm#89332

Change-Id: If897f631066e257e18d8adf574cd17d3f9278ca7
rocm-ci pushed a commit to ROCm/llvm-project that referenced this issue Jun 4, 2024
Wave32 and wave64 paths cannot really co-exist in the same function
or callgraph. They need to be treated as a hard ABI incompatibility.
We cannot handle the wave32 operation on wave64, but we can and do
handle the wave64 operation on wave32. Given the current linking scheme,
the most expedient fix for this not working is to pretend wave32 does
not exist and just use the wave64 ballot. The optimizer will fold
the 64-bit ballot intrinsic to the 32-bit one when it sees a 32-bit
target.

This was reported broken in
llvm#89332

Change-Id: If897f631066e257e18d8adf574cd17d3f9278ca7
@pxl-th
Copy link

pxl-th commented Jul 28, 2024

@arsenm, what optimization pass should fold 64 bit to 32 bit?

I'm having a similar issue with ROCm 6.1.2 which includes ROCm@96b2ba3.
I'm using Julia 1.11 (which uses LLVM 16) and AMDGPU.jl package for the AMDGPU programming.

Example kernel:

function ker!(X, Y)
    idx = workitemIdx().x
    Y[1] = AMDGPU.Device.wfany(X[idx])
    return
end

Where AMDGPU.Device.wfany is just a llvmcall to __ockl_wfany_i32.

With opaque pointers enabled, the above kernel results in ICE:

  LLVM error: Cannot select: 0x55180e0: i64 = SETCC 0x2d64380, Constant:i32<0>, setne:ch, /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/wavefront.jl:50 @[ /home/pxlth/.julia/dev/AMDGPU/test/device/wavefront.jl:78 ]
    0x2d64380: i32,ch,glue = CopyFromReg # D:1 0x307a9c0, Register:i32 %67, 0x307a9c0:1, /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/wavefront.jl:50 @[ /home/pxlth/.julia/dev/AMDGPU/test/device/wavefront.jl:78 ]
      0x568b0b0: i32 = Register %67
      0x307a9c0: ch,glue = inlineasm # D:1 0x568a7f0, TargetExternalSymbol:i64'', MDNode:ch<0x5112ad8>, TargetConstant:i64<33>, TargetConstant:i32<1769482>, Register:i32 %67, TargetConstant:i32<-2147483639>, Register:i32 %68, 0x568a7f0:1, /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/wavefront.jl:50 @[ /home/pxlth/.julia/dev/AMDGPU/test/device/wavefront.jl:78 ]
        0x568afd0: i64 = TargetExternalSymbol''
        0x307a870: i64 = TargetConstant<33>
        0x568b190: i32 = TargetConstant<1769482>
        0x568b0b0: i32 = Register %67
        0x2d64cb0: i32 = TargetConstant<-2147483639>
        0x568a710: i32 = Register %68
        0x568a7f0: ch,glue = CopyToReg # D:1 0x568ab70:1, Register:i32 %68, 0x568ab70, /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/wavefront.jl:50 @[ /home/pxlth/.julia/dev/AMDGPU/test/device/wavefront.jl:78 ]
          0x568a710: i32 = Register %68
          0x568ab70: i32,ch = load<(load (s32) from %ir.13, !tbaa !131, addrspace 1)> # D:1 0x4472960, 0x568aa90, undef:i64, /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38 @[ none:0 @[ none:0 @[ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl:85 @[ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl:84 @[ /home/pxlth/.julia/dev/AMDGPU/test/device/wavefront.jl:78 ] ] ] ] ]
            0x568aa90: i64 = add # D:1 0x5615ab0, 0x2d647e0, /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38 @[ none:0 @[ none:0 @[ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl:85 @[ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl:84 @[ /home/pxlth/.julia/dev/AMDGPU/test/device/wavefront.jl:78 ] ] ] ] ]
              0x5615ab0: i64,ch = CopyFromReg 0x5615b20:1, Register:i64 %14
                0x5615c70: i64 = Register %14
              0x2d647e0: i64 = zero_extend # D:1 0x568a8d0, /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38 @[ none:0 @[ none:0 @[ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl:85 @[ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl:84 @[ /home/pxlth/.julia/dev/AMDGPU/test/device/wavefront.jl:78 ] ] ] ] ]
                0x568a8d0: i32 = shl # D:1 0x568a550, Constant:i32<2>, /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38 @[ none:0 @[ none:0 @[ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl:85 @[ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl:84 @[ /home/pxlth/.julia/dev/AMDGPU/test/device/wavefront.jl:78 ] ] ] ] ]
                  0x568a550: i32 = AssertZext # D:1 0x568a390, ValueType:ch:i10, /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38 @[ none:0 @[ none:0 @[ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl:85 @[ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl:84 @[ /home/pxlth/.julia/dev/AMDGPU/test/device/wavefront.jl:78 ] ] ] ] ]
                    0x568a390: i32,ch = CopyFromReg # D:1 0x4472960, Register:i32 %16, /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38 @[ none:0 @[ none:0 @[ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl:85 @[ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl:84 @[ /home/pxlth/.julia/dev/AMDGPU/test/device/wavefront.jl:78 ] ] ] ] ]
  
                  0x2d64700: i32 = Constant<2>
            0x5517f20: i64 = undef
    0x2d64460: i32 = Constant<0>
  In function: _Z11bool_kernel14ROCDeviceArrayI5Int32Li1ELi1EES_I4BoolLi1ELi1EE

Looking at the unoptimized & optimized IR we can see that 64 bit ballot is not folded:

  • unoptimized:
; Function Attrs: alwaysinline convergent norecurse nounwind
define internal fastcc zeroext i1 @__ockl_wfany_i32(i32 noundef %0) unnamed_addr #5 {
  %2 = tail call i32 asm sideeffect "", "=v,0"(i32 %0) #11, !srcloc !215
  %3 = icmp ne i32 %2, 0
  %4 = tail call i64 @llvm.amdgcn.ballot.i64(i1 %3)
  %5 = icmp ne i64 %4, 0
  ret i1 %5
}
  • optimized:
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/wavefront.jl:50 within `#wfany`
   %11 = call i32 asm sideeffect "", "=v,0"(i32 %10) #4, !dbg !132, !srcloc !135
   %12 = icmp ne i32 %11, 0, !dbg !132
   %13 = call i64 @llvm.amdgcn.ballot.i64(i1 %12), !dbg !132
; └

However, if I disable opaque pointers (LLVM 16 still supports typed pointers), then it is folded:

  • optimized:
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/wavefront.jl:50 within `#wfany`
   %12 = call i32 asm sideeffect "", "=v,0"(i32 %11) #4, !dbg !131, !srcloc !134
   %13 = call i32 @llvm.amdgcn.icmp.i32.i32(i32 noundef %12, i32 noundef 0, i32 noundef 33) #5, !dbg !131
; └

Alternatively, if I fool the compiler and set +wavefrontsize64 GCN subtarget (I'm using 7900XTX which is 32), then there is no ICE and the kernel executes normally, but the instruction is also not folded (and I don't think lying about wavefrontsize is a good idea).

@arsenm
Copy link
Contributor

arsenm commented Jul 29, 2024

@arsenm, what optimization pass should fold 64 bit to 32 bit?

InstCombine performs the wave64->wave32 fold, but that can only hide the issue. It's not a mandatory transformation.

07c5920 is the patch that permits selecting the wave64 version on wave32

GCN subtarget (I'm using 7900XTX which is 32), then there is no ICE and the kernel executes normally, but the instruction is also not folded (and I don't think lying about wavefrontsize is a good idea).

It supports wave32 and wave64. It's not wrong to use wave64 on gfx10/11/12, they support both. 32 is just the native and default wavesize. You do have to consistently use the same wavesize in a callgraph, they cannot be intermixed.

@pxl-th
Copy link

pxl-th commented Jul 29, 2024

07c5920 is the patch that permits selecting the wave64 version on wave32

I see it is for LLVM 17+. We'd have to wait for Julia 1.12 for that (1.11 itself is still in beta).

It supports wave32 and wave64. It's not wrong to use wave64 on gfx10/11/12, they support both.

I do see wrong results with wavefrontsize64 enabled on 7900xtx.
Here's a simple kernel that counts number of workitems that pass 1s.
Where sync_workgroup_count function is a llvmcall to __ockl_wgred_add_i32.

julia> using AMDGPU

julia> function ker!(x)
           i = workitemIdx().x
           x[i] = AMDGPU.Device.sync_workgroup_count(Cint(1))
           return
       end
ker! (generic function with 1 method)

I then launch it and pass array of length 33.

julia> x = ROCArray(zeros(Cint, (1, 33)));

julia> @roc groupsize=length(x) gridsize=1 ker!(x);

For wavefrontsize32 it gives correct result:

julia> x
1×33 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33  33

But for wavefrontsize64 the result is wrong:

julia> x
1×33 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1

I set wavefrontsize both as a GCNSubtarget feature and as a OCLC option library.

Optimized LLVM IR for the kernel:

  • wavefrontsize32:
LLVM IR
; ModuleID = 'start'
source_filename = "start"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:10:11:12:13"
target triple = "amdgcn-amd-amdhsa"

@__scratch_lds = internal addrspace(3) global [32 x i64] undef, align 8

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.x() #0

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.y() #0

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.z() #0

; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare align 4 i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr() #1

; Function Attrs: alwaysinline nounwind readnone willreturn
declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32) #2

; Function Attrs: alwaysinline convergent nounwind willreturn
declare void @llvm.amdgcn.s.barrier() #3

; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workgroup.id.y() #1

; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workgroup.id.x() #1

; Function Attrs: alwaysinline convergent nounwind readnone willreturn
declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32 immarg, i32 immarg, i32 immarg, i1 immarg) #4

; Function Attrs: alwaysinline convergent nounwind readnone willreturn
declare i32 @llvm.amdgcn.permlanex16(i32, i32, i32, i32, i1 immarg, i1 immarg) #4

; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workgroup.id.z() #1

;  @ REPL[15]:1 within `ker!`
define amdgpu_kernel void @_Z4ker_14ROCDeviceArrayI5Int32Li2ELi1EE({ i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 } %state, { [2 x i64], i8 addrspace(1)*, i64 } %0) local_unnamed_addr #5 !dbg !39 {
conversion:
;  @ REPL[15]:3 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17 within `sync_workgroup_count`
   %1 = call i32 @llvm.amdgcn.workgroup.id.z(), !dbg !43
   %2 = call i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr(), !dbg !43
   %3 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 8, !dbg !43
   %4 = bitcast i8 addrspace(4)* %3 to i32 addrspace(4)*, !dbg !43
   %5 = load i32, i32 addrspace(4)* %4, align 4, !dbg !43, !tbaa !47
   %6 = icmp ult i32 %1, %5, !dbg !43
   %7 = select i1 %6, i64 16, i64 22, !dbg !43
   %8 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 %7, !dbg !43
   %9 = bitcast i8 addrspace(4)* %8 to i16 addrspace(4)*, !dbg !43
   %10 = load i16, i16 addrspace(4)* %9, align 2, !dbg !43, !tbaa !51
   %11 = zext i16 %10 to i32, !dbg !43
   %12 = call i32 @llvm.amdgcn.workgroup.id.y(), !dbg !43
   %13 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 4, !dbg !43
   %14 = bitcast i8 addrspace(4)* %13 to i32 addrspace(4)*, !dbg !43
   %15 = load i32, i32 addrspace(4)* %14, align 4, !dbg !43, !tbaa !47
   %16 = icmp ult i32 %12, %15, !dbg !43
   %17 = select i1 %16, i64 14, i64 20, !dbg !43
   %18 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 %17, !dbg !43
   %19 = bitcast i8 addrspace(4)* %18 to i16 addrspace(4)*, !dbg !43
   %20 = load i16, i16 addrspace(4)* %19, align 2, !dbg !43, !tbaa !51
   %21 = zext i16 %20 to i32, !dbg !43
   %22 = call i32 @llvm.amdgcn.workgroup.id.x(), !dbg !43
   %23 = bitcast i8 addrspace(4)* %2 to i32 addrspace(4)*, !dbg !43
   %24 = load i32, i32 addrspace(4)* %23, align 4, !dbg !43, !tbaa !47
   %25 = icmp ult i32 %22, %24, !dbg !43
   %26 = select i1 %25, i64 12, i64 18, !dbg !43
   %27 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 %26, !dbg !43
   %28 = bitcast i8 addrspace(4)* %27 to i16 addrspace(4)*, !dbg !43
   %29 = load i16, i16 addrspace(4)* %28, align 2, !dbg !43, !tbaa !51
   %30 = zext i16 %29 to i32, !dbg !43
   %31 = shl nuw nsw i32 %21, 8, !dbg !43
   %32 = mul i32 %31, %30, !dbg !43
   %33 = ashr exact i32 %32, 8, !dbg !43
   %34 = mul nsw i32 %33, %11, !dbg !43
   %35 = add i32 %34, 31, !dbg !43
   %36 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 1, i32 257, i32 15, i32 15, i1 true), !dbg !43
   %37 = add nsw i32 %36, 1, !dbg !43
   %38 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %37, i32 258, i32 15, i32 15, i1 true), !dbg !43
   %39 = add nsw i32 %37, %38, !dbg !43
   %40 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %39, i32 260, i32 15, i32 15, i1 true), !dbg !43
   %41 = add nsw i32 %39, %40, !dbg !43
   %42 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %41, i32 264, i32 15, i32 15, i1 true), !dbg !43
   %43 = add nsw i32 %41, %42, !dbg !43
   %44 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %43, i32 336, i32 15, i32 15, i1 true), !dbg !43
   %45 = call i32 @llvm.amdgcn.permlanex16(i32 undef, i32 %44, i32 0, i32 0, i1 false, i1 true), !dbg !43
   %46 = add nsw i32 %45, %44, !dbg !43
   %.mask = and i32 %35, -32, !dbg !43
   %47 = icmp eq i32 %.mask, 32, !dbg !43
   br i1 %47, label %conversion.__ockl_wgred_add_i32.exit_crit_edge, label %48, !dbg !43

conversion.__ockl_wgred_add_i32.exit_crit_edge:   ; preds = %conversion
; └
;  @ REPL[15]:2 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:122 within `workitemIdx`
; │┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:87 within `workitemIdx_x`
; ││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `_index`
; │││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
      %.pre = call i32 @llvm.amdgcn.workitem.id.x(), !dbg !53, !range !66
; └└└└
;  @ REPL[15]:3 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17 within `sync_workgroup_count`
   br label %__ockl_wgred_add_i32.exit, !dbg !43

48:                                               ; preds = %conversion
   %49 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0), !dbg !43
   %50 = call i32 @llvm.amdgcn.workitem.id.z(), !dbg !43, !range !67
   %51 = mul nuw nsw i32 %50, %21, !dbg !43
   %52 = call i32 @llvm.amdgcn.workitem.id.y(), !dbg !43, !range !67
   %53 = add nuw nsw i32 %51, %52, !dbg !43
   %54 = mul i32 %53, %30, !dbg !43
   %55 = call i32 @llvm.amdgcn.workitem.id.x(), !dbg !43, !range !67
   %56 = add i32 %54, %55, !dbg !43
   %57 = lshr i32 %56, 5, !dbg !43
   %58 = icmp eq i32 %49, 0, !dbg !43
   %59 = or i32 %57, %49, !dbg !43
   %60 = icmp eq i32 %59, 0, !dbg !43
   br i1 %60, label %61, label %62, !dbg !43

61:                                               ; preds = %48
   store atomic i32 %46, i32 addrspace(3)* bitcast ([32 x i64] addrspace(3)* @__scratch_lds to i32 addrspace(3)*) syncscope("workgroup-one-as") monotonic, align 8, !dbg !43
   br label %62, !dbg !43

62:                                               ; preds = %61, %48
   fence syncscope("workgroup") release, !dbg !43
   call void @llvm.amdgcn.s.barrier(), !dbg !43
   fence syncscope("workgroup") acquire, !dbg !43
   %63 = icmp ugt i32 %56, 31, !dbg !43
   %64 = and i1 %58, %63, !dbg !43
   br i1 %64, label %65, label %67, !dbg !43

65:                                               ; preds = %62
   %66 = atomicrmw add i32 addrspace(3)* bitcast ([32 x i64] addrspace(3)* @__scratch_lds to i32 addrspace(3)*), i32 %46 syncscope("workgroup-one-as") monotonic, align 4, !dbg !43
   br label %67, !dbg !43

67:                                               ; preds = %65, %62
   fence syncscope("workgroup") release, !dbg !43
   call void @llvm.amdgcn.s.barrier(), !dbg !43
   fence syncscope("workgroup") acquire, !dbg !43
   %68 = load atomic i32, i32 addrspace(3)* bitcast ([32 x i64] addrspace(3)* @__scratch_lds to i32 addrspace(3)*) syncscope("workgroup-one-as") monotonic, align 8, !dbg !43
   fence syncscope("workgroup") release, !dbg !43
   call void @llvm.amdgcn.s.barrier(), !dbg !43
   fence syncscope("workgroup") acquire, !dbg !43
   br label %__ockl_wgred_add_i32.exit, !dbg !43

__ockl_wgred_add_i32.exit:                        ; preds = %67, %conversion.__ockl_wgred_add_i32.exit_crit_edge
; └
;  @ REPL[15]:2 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:122 within `workitemIdx`
; │┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:87 within `workitemIdx_x`
; ││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `_index`
; │││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
      %.pre-phi = phi i32 [ %.pre, %conversion.__ockl_wgred_add_i32.exit_crit_edge ], [ %55, %67 ], !dbg !53
; └└└└
;  @ REPL[15]:3 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17 within `sync_workgroup_count`
   %69 = phi i32 [ %46, %conversion.__ockl_wgred_add_i32.exit_crit_edge ], [ %68, %67 ], !dbg !43
   %.fca.1.extract = extractvalue { [2 x i64], i8 addrspace(1)*, i64 } %0, 1
; └
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl:90 within `#setindex!`
; │┌ @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl:88 within `unsafe_store!`
; ││┌ @ none within `pointerset`
; │││┌ @ none within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
      %70 = bitcast i8 addrspace(1)* %.fca.1.extract to i32 addrspace(1)*, !dbg !68
      %71 = zext i32 %.pre-phi to i64, !dbg !68
      %72 = getelementptr inbounds i32, i32 addrspace(1)* %70, i64 %71, !dbg !68
      store i32 %69, i32 addrspace(1)* %72, align 4, !dbg !68, !tbaa !80
; └└└└
;  @ REPL[15]:4 within `ker!`
  ret void, !dbg !83
}

attributes #0 = { nounwind readnone speculatable willreturn "amdgpu-unsafe-fp-atomics"="true" }
attributes #1 = { alwaysinline nounwind readnone speculatable willreturn }
attributes #2 = { alwaysinline nounwind readnone willreturn }
attributes #3 = { alwaysinline convergent nounwind willreturn }
attributes #4 = { alwaysinline convergent nounwind readnone willreturn }
attributes #5 = { "amdgpu-unsafe-fp-atomics"="true" }

!llvm.module.flags = !{!0, !1, !2, !3}
!llvm.dbg.cu = !{!4, !6, !7, !8, !9, !10, !11, !12, !13, !14, !15, !16, !17, !18, !19, !20, !21, !22, !23, !24, !25, !26, !27, !28, !29, !30, !31, !32, !33, !34, !35}
!opencl.ocl.version = !{!36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36}
!llvm.ident = !{!37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37}
!julia.kernel = !{!38}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{i32 1, !"wchar_size", i32 4}
!3 = !{i32 7, !"PIC Level", i32 1}
!4 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!5 = !DIFile(filename: "julia", directory: ".")
!6 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!7 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!8 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!9 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!10 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!11 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!12 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!13 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!14 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!15 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!16 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!17 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!18 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!19 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!20 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!21 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!22 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!23 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!24 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!25 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!26 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!27 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!28 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!29 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!30 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!31 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!32 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!33 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!34 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!35 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!36 = !{i32 2, i32 0}
!37 = !{!"clang version 15.0.0 (/cache/yggdrasil/downloads/clones/llvm-project.git-974efd367bc513231526d317489c66cb27727ef3caa41108e3819c131a8acf57 f3d695fc2985a8dfdd5f4219d351fdeac3038867)"}
!38 = !{void ({ i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 }, { [2 x i64], i8 addrspace(1)*, i64 })* @_Z4ker_14ROCDeviceArrayI5Int32Li2ELi1EE}
!39 = distinct !DISubprogram(name: "ker!", linkageName: "julia_ker!_4401", scope: null, file: !40, line: 1, type: !41, scopeLine: 1, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!40 = !DIFile(filename: "REPL[15]", directory: ".")
!41 = !DISubroutineType(types: !42)
!42 = !{}
!43 = !DILocation(line: 17, scope: !44, inlinedAt: !46)
!44 = distinct !DISubprogram(name: "sync_workgroup_count;", linkageName: "sync_workgroup_count", scope: !45, file: !45, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!45 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl", directory: ".")
!46 = !DILocation(line: 3, scope: !39)
!47 = !{!48, !48, i64 0}
!48 = !{!"int", !49, i64 0}
!49 = !{!"omnipotent char", !50, i64 0}
!50 = !{!"Simple C/C++ TBAA"}
!51 = !{!52, !52, i64 0}
!52 = !{!"short", !49, i64 0}
!53 = !DILocation(line: 38, scope: !54, inlinedAt: !56)
!54 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !55, file: !55, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!55 = !DIFile(filename: "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl", directory: ".")
!56 = !DILocation(line: 3, scope: !57, inlinedAt: !59)
!57 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !58, file: !58, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!58 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl", directory: ".")
!59 = !DILocation(line: 3, scope: !60, inlinedAt: !61)
!60 = distinct !DISubprogram(name: "_index;", linkageName: "_index", scope: !58, file: !58, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!61 = !DILocation(line: 87, scope: !62, inlinedAt: !63)
!62 = distinct !DISubprogram(name: "workitemIdx_x;", linkageName: "workitemIdx_x", scope: !58, file: !58, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!63 = !DILocation(line: 122, scope: !64, inlinedAt: !65)
!64 = distinct !DISubprogram(name: "workitemIdx;", linkageName: "workitemIdx", scope: !58, file: !58, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!65 = !DILocation(line: 2, scope: !39)
!66 = !{i32 0, i32 1023}
!67 = !{i32 0, i32 1024}
!68 = !DILocation(line: 38, scope: !54, inlinedAt: !69)
!69 = !DILocation(line: 0, scope: !70, inlinedAt: !72)
!70 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !71, file: !71, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!71 = !DIFile(filename: "none", directory: ".")
!72 = !DILocation(line: 0, scope: !73, inlinedAt: !74)
!73 = distinct !DISubprogram(name: "pointerset;", linkageName: "pointerset", scope: !71, file: !71, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!74 = !DILocation(line: 88, scope: !75, inlinedAt: !77)
!75 = distinct !DISubprogram(name: "unsafe_store!;", linkageName: "unsafe_store!", scope: !76, file: !76, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!76 = !DIFile(filename: "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl", directory: ".")
!77 = !DILocation(line: 90, scope: !78, inlinedAt: !46)
!78 = distinct !DISubprogram(name: "#setindex!;", linkageName: "#setindex!", scope: !79, file: !79, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!79 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl", directory: ".")
!80 = !{!81, !81, i64 0, i64 0}
!81 = !{!"custom_tbaa_addrspace(1)", !82, i64 0}
!82 = !{!"custom_tbaa"}
!83 = !DILocation(line: 4, scope: !39)
  • wavefrontsize64:
LLVM IR
; ModuleID = 'start'
source_filename = "start"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:10:11:12:13"
target triple = "amdgcn-amd-amdhsa"

@__scratch_lds = internal addrspace(3) global [32 x i64] undef, align 8

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.x() #0

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.y() #0

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.z() #0

; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare align 4 i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr() #1

; Function Attrs: alwaysinline nounwind readnone willreturn
declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32) #2

; Function Attrs: alwaysinline nounwind readnone willreturn
declare i32 @llvm.amdgcn.mbcnt.hi(i32, i32) #2

; Function Attrs: alwaysinline convergent nounwind willreturn
declare void @llvm.amdgcn.s.barrier() #3

; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workgroup.id.y() #1

; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workgroup.id.x() #1

; Function Attrs: alwaysinline nounwind readonly
declare i32 @llvm.read_register.i32(metadata) #4

; Function Attrs: alwaysinline convergent nounwind readnone willreturn
declare i32 @llvm.amdgcn.readlane(i32, i32) #5

; Function Attrs: alwaysinline convergent nounwind readnone willreturn
declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32 immarg, i32 immarg, i32 immarg, i1 immarg) #5

; Function Attrs: alwaysinline convergent nounwind readnone willreturn
declare i32 @llvm.amdgcn.permlanex16(i32, i32, i32, i32, i1 immarg, i1 immarg) #5

; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workgroup.id.z() #1

;  @ REPL[15]:1 within `ker!`
define amdgpu_kernel void @_Z4ker_14ROCDeviceArrayI5Int32Li2ELi1EE({ i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 } %state, { [2 x i64], i8 addrspace(1)*, i64 } %0) local_unnamed_addr #6 !dbg !39 {
conversion:
;  @ REPL[15]:3 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17 within `sync_workgroup_count`
   %1 = call i32 @llvm.amdgcn.workgroup.id.z(), !dbg !43
   %2 = call i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr(), !dbg !43
   %3 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 8, !dbg !43
   %4 = bitcast i8 addrspace(4)* %3 to i32 addrspace(4)*, !dbg !43
   %5 = load i32, i32 addrspace(4)* %4, align 4, !dbg !43, !tbaa !47
   %6 = icmp ult i32 %1, %5, !dbg !43
   %7 = select i1 %6, i64 16, i64 22, !dbg !43
   %8 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 %7, !dbg !43
   %9 = bitcast i8 addrspace(4)* %8 to i16 addrspace(4)*, !dbg !43
   %10 = load i16, i16 addrspace(4)* %9, align 2, !dbg !43, !tbaa !51
   %11 = zext i16 %10 to i32, !dbg !43
   %12 = call i32 @llvm.amdgcn.workgroup.id.y(), !dbg !43
   %13 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 4, !dbg !43
   %14 = bitcast i8 addrspace(4)* %13 to i32 addrspace(4)*, !dbg !43
   %15 = load i32, i32 addrspace(4)* %14, align 4, !dbg !43, !tbaa !47
   %16 = icmp ult i32 %12, %15, !dbg !43
   %17 = select i1 %16, i64 14, i64 20, !dbg !43
   %18 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 %17, !dbg !43
   %19 = bitcast i8 addrspace(4)* %18 to i16 addrspace(4)*, !dbg !43
   %20 = load i16, i16 addrspace(4)* %19, align 2, !dbg !43, !tbaa !51
   %21 = zext i16 %20 to i32, !dbg !43
   %22 = call i32 @llvm.amdgcn.workgroup.id.x(), !dbg !43
   %23 = bitcast i8 addrspace(4)* %2 to i32 addrspace(4)*, !dbg !43
   %24 = load i32, i32 addrspace(4)* %23, align 4, !dbg !43, !tbaa !47
   %25 = icmp ult i32 %22, %24, !dbg !43
   %26 = select i1 %25, i64 12, i64 18, !dbg !43
   %27 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 %26, !dbg !43
   %28 = bitcast i8 addrspace(4)* %27 to i16 addrspace(4)*, !dbg !43
   %29 = load i16, i16 addrspace(4)* %28, align 2, !dbg !43, !tbaa !51
   %30 = zext i16 %29 to i32, !dbg !43
   %31 = shl nuw nsw i32 %21, 8, !dbg !43
   %32 = mul i32 %31, %30, !dbg !43
   %33 = ashr exact i32 %32, 8, !dbg !43
   %34 = mul nsw i32 %33, %11, !dbg !43
   %35 = add i32 %34, 63, !dbg !43
   %36 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 1, i32 257, i32 15, i32 15, i1 true), !dbg !43
   %37 = add nsw i32 %36, 1, !dbg !43
   %38 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %37, i32 258, i32 15, i32 15, i1 true), !dbg !43
   %39 = add nsw i32 %37, %38, !dbg !43
   %40 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %39, i32 260, i32 15, i32 15, i1 true), !dbg !43
   %41 = add nsw i32 %39, %40, !dbg !43
   %42 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %41, i32 264, i32 15, i32 15, i1 true), !dbg !43
   %43 = add nsw i32 %41, %42, !dbg !43
   %44 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %43, i32 336, i32 15, i32 15, i1 true), !dbg !43
   %45 = call i32 @llvm.amdgcn.permlanex16(i32 undef, i32 %44, i32 0, i32 0, i1 false, i1 true), !dbg !43
   %46 = add nsw i32 %45, %44, !dbg !43
   %47 = call i32 @llvm.amdgcn.readlane(i32 %46, i32 32), !dbg !43
   %48 = call i32 @llvm.read_register.i32(metadata !53) #7, !dbg !43
   %49 = and i32 %48, 1, !dbg !43
   %50 = icmp eq i32 %49, 0, !dbg !43
   %51 = select i1 %50, i32 0, i32 %47, !dbg !43
   %52 = call i32 @llvm.amdgcn.readlane(i32 %46, i32 0), !dbg !43
   %53 = add nsw i32 %51, %52, !dbg !43
   %.mask = and i32 %35, -64, !dbg !43
   %54 = icmp eq i32 %.mask, 64, !dbg !43
   br i1 %54, label %conversion.__ockl_wgred_add_i32.exit_crit_edge, label %55, !dbg !43

conversion.__ockl_wgred_add_i32.exit_crit_edge:   ; preds = %conversion
; └
;  @ REPL[15]:2 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:122 within `workitemIdx`
; │┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:87 within `workitemIdx_x`
; ││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `_index`
; │││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
      %.pre = call i32 @llvm.amdgcn.workitem.id.x(), !dbg !54, !range !67
; └└└└
;  @ REPL[15]:3 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17 within `sync_workgroup_count`
   br label %__ockl_wgred_add_i32.exit, !dbg !43

55:                                               ; preds = %conversion
   %56 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0), !dbg !43
   %57 = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %56), !dbg !43
   %58 = call i32 @llvm.amdgcn.workitem.id.z(), !dbg !43, !range !68
   %59 = mul nuw nsw i32 %58, %21, !dbg !43
   %60 = call i32 @llvm.amdgcn.workitem.id.y(), !dbg !43, !range !68
   %61 = add nuw nsw i32 %59, %60, !dbg !43
   %62 = mul i32 %61, %30, !dbg !43
   %63 = call i32 @llvm.amdgcn.workitem.id.x(), !dbg !43, !range !68
   %64 = add i32 %62, %63, !dbg !43
   %65 = lshr i32 %64, 6, !dbg !43
   %66 = icmp eq i32 %57, 0, !dbg !43
   %67 = or i32 %65, %57, !dbg !43
   %68 = icmp eq i32 %67, 0, !dbg !43
   br i1 %68, label %69, label %70, !dbg !43

69:                                               ; preds = %55
   store atomic i32 %53, i32 addrspace(3)* bitcast ([32 x i64] addrspace(3)* @__scratch_lds to i32 addrspace(3)*) syncscope("workgroup-one-as") monotonic, align 8, !dbg !43
   br label %70, !dbg !43

70:                                               ; preds = %69, %55
   fence syncscope("workgroup") release, !dbg !43
   call void @llvm.amdgcn.s.barrier(), !dbg !43
   fence syncscope("workgroup") acquire, !dbg !43
   %71 = icmp ugt i32 %64, 63, !dbg !43
   %72 = and i1 %66, %71, !dbg !43
   br i1 %72, label %73, label %75, !dbg !43

73:                                               ; preds = %70
   %74 = atomicrmw add i32 addrspace(3)* bitcast ([32 x i64] addrspace(3)* @__scratch_lds to i32 addrspace(3)*), i32 %53 syncscope("workgroup-one-as") monotonic, align 4, !dbg !43
   br label %75, !dbg !43

75:                                               ; preds = %73, %70
   fence syncscope("workgroup") release, !dbg !43
   call void @llvm.amdgcn.s.barrier(), !dbg !43
   fence syncscope("workgroup") acquire, !dbg !43
   %76 = load atomic i32, i32 addrspace(3)* bitcast ([32 x i64] addrspace(3)* @__scratch_lds to i32 addrspace(3)*) syncscope("workgroup-one-as") monotonic, align 8, !dbg !43
   fence syncscope("workgroup") release, !dbg !43
   call void @llvm.amdgcn.s.barrier(), !dbg !43
   fence syncscope("workgroup") acquire, !dbg !43
   br label %__ockl_wgred_add_i32.exit, !dbg !43

__ockl_wgred_add_i32.exit:                        ; preds = %75, %conversion.__ockl_wgred_add_i32.exit_crit_edge
; └
;  @ REPL[15]:2 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:122 within `workitemIdx`
; │┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:87 within `workitemIdx_x`
; ││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `_index`
; │││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
      %.pre-phi = phi i32 [ %.pre, %conversion.__ockl_wgred_add_i32.exit_crit_edge ], [ %63, %75 ], !dbg !54
; └└└└
;  @ REPL[15]:3 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17 within `sync_workgroup_count`
   %77 = phi i32 [ %53, %conversion.__ockl_wgred_add_i32.exit_crit_edge ], [ %76, %75 ], !dbg !43
   %.fca.1.extract = extractvalue { [2 x i64], i8 addrspace(1)*, i64 } %0, 1
; └
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl:90 within `#setindex!`
; │┌ @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl:88 within `unsafe_store!`
; ││┌ @ none within `pointerset`
; │││┌ @ none within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
      %78 = bitcast i8 addrspace(1)* %.fca.1.extract to i32 addrspace(1)*, !dbg !69
      %79 = zext i32 %.pre-phi to i64, !dbg !69
      %80 = getelementptr inbounds i32, i32 addrspace(1)* %78, i64 %79, !dbg !69
      store i32 %77, i32 addrspace(1)* %80, align 4, !dbg !69, !tbaa !81
; └└└└
;  @ REPL[15]:4 within `ker!`
  ret void, !dbg !84
}

attributes #0 = { nounwind readnone speculatable willreturn "amdgpu-unsafe-fp-atomics"="true" }
attributes #1 = { alwaysinline nounwind readnone speculatable willreturn }
attributes #2 = { alwaysinline nounwind readnone willreturn }
attributes #3 = { alwaysinline convergent nounwind willreturn }
attributes #4 = { alwaysinline nounwind readonly }
attributes #5 = { alwaysinline convergent nounwind readnone willreturn }
attributes #6 = { "amdgpu-unsafe-fp-atomics"="true" }
attributes #7 = { convergent }

!llvm.module.flags = !{!0, !1, !2, !3}
!llvm.dbg.cu = !{!4, !6, !7, !8, !9, !10, !11, !12, !13, !14, !15, !16, !17, !18, !19, !20, !21, !22, !23, !24, !25, !26, !27, !28, !29, !30, !31, !32, !33, !34, !35}
!opencl.ocl.version = !{!36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36}
!llvm.ident = !{!37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37}
!julia.kernel = !{!38}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{i32 1, !"wchar_size", i32 4}
!3 = !{i32 7, !"PIC Level", i32 1}
!4 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!5 = !DIFile(filename: "julia", directory: ".")
!6 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!7 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!8 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!9 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!10 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!11 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!12 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!13 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!14 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!15 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!16 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!17 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!18 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!19 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!20 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!21 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!22 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!23 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!24 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!25 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!26 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!27 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!28 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!29 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!30 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!31 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!32 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!33 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!34 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!35 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!36 = !{i32 2, i32 0}
!37 = !{!"clang version 15.0.0 (/cache/yggdrasil/downloads/clones/llvm-project.git-974efd367bc513231526d317489c66cb27727ef3caa41108e3819c131a8acf57 f3d695fc2985a8dfdd5f4219d351fdeac3038867)"}
!38 = !{void ({ i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 }, { [2 x i64], i8 addrspace(1)*, i64 })* @_Z4ker_14ROCDeviceArrayI5Int32Li2ELi1EE}
!39 = distinct !DISubprogram(name: "ker!", linkageName: "julia_ker!_4495", scope: null, file: !40, line: 1, type: !41, scopeLine: 1, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!40 = !DIFile(filename: "REPL[15]", directory: ".")
!41 = !DISubroutineType(types: !42)
!42 = !{}
!43 = !DILocation(line: 17, scope: !44, inlinedAt: !46)
!44 = distinct !DISubprogram(name: "sync_workgroup_count;", linkageName: "sync_workgroup_count", scope: !45, file: !45, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!45 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl", directory: ".")
!46 = !DILocation(line: 3, scope: !39)
!47 = !{!48, !48, i64 0}
!48 = !{!"int", !49, i64 0}
!49 = !{!"omnipotent char", !50, i64 0}
!50 = !{!"Simple C/C++ TBAA"}
!51 = !{!52, !52, i64 0}
!52 = !{!"short", !49, i64 0}
!53 = !{!"exec_hi"}
!54 = !DILocation(line: 38, scope: !55, inlinedAt: !57)
!55 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !56, file: !56, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!56 = !DIFile(filename: "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl", directory: ".")
!57 = !DILocation(line: 3, scope: !58, inlinedAt: !60)
!58 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !59, file: !59, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!59 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl", directory: ".")
!60 = !DILocation(line: 3, scope: !61, inlinedAt: !62)
!61 = distinct !DISubprogram(name: "_index;", linkageName: "_index", scope: !59, file: !59, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!62 = !DILocation(line: 87, scope: !63, inlinedAt: !64)
!63 = distinct !DISubprogram(name: "workitemIdx_x;", linkageName: "workitemIdx_x", scope: !59, file: !59, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!64 = !DILocation(line: 122, scope: !65, inlinedAt: !66)
!65 = distinct !DISubprogram(name: "workitemIdx;", linkageName: "workitemIdx", scope: !59, file: !59, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!66 = !DILocation(line: 2, scope: !39)
!67 = !{i32 0, i32 1023}
!68 = !{i32 0, i32 1024}
!69 = !DILocation(line: 38, scope: !55, inlinedAt: !70)
!70 = !DILocation(line: 0, scope: !71, inlinedAt: !73)
!71 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !72, file: !72, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!72 = !DIFile(filename: "none", directory: ".")
!73 = !DILocation(line: 0, scope: !74, inlinedAt: !75)
!74 = distinct !DISubprogram(name: "pointerset;", linkageName: "pointerset", scope: !72, file: !72, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!75 = !DILocation(line: 88, scope: !76, inlinedAt: !78)
!76 = distinct !DISubprogram(name: "unsafe_store!;", linkageName: "unsafe_store!", scope: !77, file: !77, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!77 = !DIFile(filename: "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl", directory: ".")
!78 = !DILocation(line: 90, scope: !79, inlinedAt: !46)
!79 = distinct !DISubprogram(name: "#setindex!;", linkageName: "#setindex!", scope: !80, file: !80, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!80 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl", directory: ".")
!81 = !{!82, !82, i64 0, i64 0}
!82 = !{!"custom_tbaa_addrspace(1)", !83, i64 0}
!83 = !{!"custom_tbaa"}
!84 = !DILocation(line: 4, scope: !39)

@arsenm
Copy link
Contributor

arsenm commented Jul 29, 2024

The wave64 IR posted here looks a bit suspicious to me, like you linked the library incorrectly. For example, why is it using read_register on exec_hi? Also, you didn't provide codegen invocation and the wavesize and target-cpu aren't set in target-features in the IR attributes

@pxl-th
Copy link

pxl-th commented Jul 29, 2024

Ah I set features a bit incorrectly, fixing it gives correct results on wave64.

Also, you didn't provide codegen invocation and the wavesize and target-cpu aren't set in target-features in the IR attributes

We set target cpu when creating TargetMachine during compilation.
.asm dump does contain this information:

ASM
	.text
	.amdgcn_target "amdgcn-amd-amdhsa--gfx1100"
	.globl	_Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE ; -- Begin function _Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE
	.p2align	8
	.type	_Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE,@function
_Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE: ; @_Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE
.Lfunc_begin0:
	.file	1 "." "/home/pxlth/.julia/dev/AMDGPU/src/t.jl"
	.loc	1 3 0                           ; /home/pxlth/.julia/dev/AMDGPU/src/t.jl:3:0
	.cfi_sections .debug_frame
	.cfi_startproc
; %bb.0:                                ; %conversion
	.file	2 "." "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl"
	.loc	2 17 0 prologue_end             ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17:0
	s_clause 0x1
	s_load_b32 s4, s[0:1], 0x78
	s_load_b64 s[2:3], s[0:1], 0x70
	s_add_u32 s8, s0, 0x70
	s_addc_u32 s9, s1, 0
	v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v3, 1
	s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
	v_mov_b32_dpp v3, v3 row_shl:1 row_mask:0xf bank_mask:0xf bound_ctrl:1
	v_add_nc_u32_e32 v3, 1, v3
	s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(SALU_CYCLE_1)
	v_add_nc_u32_dpp v3, v3, v3 row_shl:2 row_mask:0xf bank_mask:0xf bound_ctrl:1
	s_waitcnt lgkmcnt(0)
	s_cmp_lt_u32 s15, s4
	s_cselect_b32 s4, 16, 22
	s_add_u32 s4, s8, s4
	s_addc_u32 s5, s9, 0
	s_cmp_lt_u32 s14, s3
	v_add_nc_u32_dpp v3, v3, v3 row_shl:4 row_mask:0xf bank_mask:0xf bound_ctrl:1
	s_cselect_b32 s3, 14, 20
	s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
	s_add_u32 s6, s8, s3
	s_addc_u32 s7, s9, 0
	s_cmp_lt_u32 s13, s2
	v_add_nc_u32_dpp v3, v3, v3 row_shl:8 row_mask:0xf bank_mask:0xf bound_ctrl:1
	s_cselect_b32 s2, 12, 18
	s_add_u32 s2, s8, s2
	s_addc_u32 s3, s9, 0
	s_clause 0x2
	global_load_u16 v1, v2, s[6:7]
	global_load_u16 v5, v2, s[2:3]
	global_load_u16 v2, v2, s[4:5]
	v_mov_b32_dpp v3, v3 row_share:0 row_mask:0xf bank_mask:0xf bound_ctrl:1
	s_waitcnt vmcnt(1)
	v_mul_lo_u32 v4, v1, v5
	s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
	v_bfe_i32 v4, v4, 0, 24
	s_waitcnt vmcnt(0)
	v_mul_lo_u32 v2, v4, v2
	v_permlanex16_b32 v4, v3, 0, 0 op_sel:[0,1]
	s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_3)
	v_add_nc_u32_e32 v4, v4, v3
	v_add_nc_u32_e32 v2, 31, v2
	s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
	v_and_b32_e32 v2, 0xffffffe0, v2
	v_cmp_ne_u32_e32 vcc_lo, 32, v2
	s_cbranch_vccz .LBB0_6
; %bb.1:
	.loc	2 0 0 is_stmt 0                 ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:0:0
	v_bfe_u32 v2, v0, 10, 10
	v_bfe_u32 v3, v0, 20, 10
	.loc	2 17 0                          ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17:0
	s_mov_b32 s2, exec_lo
	s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
	v_mad_u32_u24 v6, v3, v1, v2
	v_and_b32_e32 v1, 0x3ff, v0
	v_mad_u64_u32 v[2:3], null, v6, v5, v[1:2]
	v_mbcnt_lo_u32_b32 v3, -1, 0
	s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_3)
	v_mbcnt_hi_u32_b32 v3, -1, v3
	v_lshrrev_b32_e32 v5, 5, v2
	s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
	v_or_b32_e32 v5, v5, v3
	v_cmpx_eq_u32_e32 0, v5
	s_cbranch_execz .LBB0_3
; %bb.2:
	.loc	2 0 0                           ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:0:0
	v_mov_b32_e32 v5, 0
	.loc	2 17 0                          ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17:0
	ds_store_b32 v5, v4
.LBB0_3:
	.loc	2 0 0                           ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:0:0
	s_or_b32 exec_lo, exec_lo, s2
	.loc	2 17 0                          ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17:0
	v_cmp_eq_u32_e32 vcc_lo, 0, v3
	v_cmp_lt_u32_e64 s2, 31, v2
	s_mov_b32 s3, 0
	s_waitcnt vmcnt(0) lgkmcnt(0)
	s_waitcnt_vscnt null, 0x0
	s_barrier
	s_waitcnt vmcnt(0) lgkmcnt(0)
	s_waitcnt_vscnt null, 0x0
	buffer_gl0_inv
	s_and_b32 s4, vcc_lo, s2
	s_delay_alu instid0(SALU_CYCLE_1)
	s_and_saveexec_b32 s2, s4
	s_cbranch_execz .LBB0_5
; %bb.4:
	.loc	2 0 0                           ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:0:0
	v_mov_b32_e32 v2, 0
	.loc	2 17 0                          ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17:0
	ds_add_u32 v2, v4
.LBB0_5:
	.loc	2 0 0                           ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:0:0
	s_or_b32 exec_lo, exec_lo, s2
	v_mov_b32_e32 v2, 0
	.loc	2 17 0                          ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17:0
	s_waitcnt vmcnt(0) lgkmcnt(0)
	s_waitcnt_vscnt null, 0x0
	s_barrier
	s_waitcnt vmcnt(0) lgkmcnt(0)
	s_waitcnt_vscnt null, 0x0
	buffer_gl0_inv
	ds_load_b32 v3, v2
	s_waitcnt vmcnt(0) lgkmcnt(0)
	s_waitcnt_vscnt null, 0x0
	s_barrier
	s_waitcnt vmcnt(0) lgkmcnt(0)
	s_waitcnt_vscnt null, 0x0
	buffer_gl0_inv
	s_and_b32 vcc_lo, exec_lo, s3
	s_cbranch_vccz .LBB0_8
	s_branch .LBB0_7
.LBB0_6:
                                        ; implicit-def: $vgpr3
                                        ; implicit-def: $vgpr1
	.loc	2 0 0                           ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:0:0
	s_cbranch_execz .LBB0_8
.LBB0_7:                                ; %conversion.__ockl_wgred_add_i32.exit_crit_edge
	v_and_b32_e32 v1, 0x3ff, v0
	v_mov_b32_e32 v3, v4
.LBB0_8:                                ; %__ockl_wgred_add_i32.exit
	.loc	2 17 0                          ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17:0
	s_load_b64 s[0:1], s[0:1], 0x60
.Ltmp0:
	.file	3 "." "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl"
	.loc	3 38 0 is_stmt 1                ; /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38:0
	v_mov_b32_e32 v2, 0
	s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
	v_lshlrev_b64 v[0:1], 2, v[1:2]
	s_waitcnt lgkmcnt(0)
	v_add_co_u32 v0, vcc_lo, s0, v0
	s_delay_alu instid0(VALU_DEP_2)
	v_add_co_ci_u32_e32 v1, vcc_lo, s1, v1, vcc_lo
	global_store_b32 v[0:1], v3, off
	s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
.Ltmp1:
	.loc	1 6 0                           ; /home/pxlth/.julia/dev/AMDGPU/src/t.jl:6:0
	s_endpgm
.Ltmp2:
	.section	.rodata,#alloc
	.p2align	6, 0x0
	.amdhsa_kernel _Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE
		.amdhsa_group_segment_fixed_size 256
		.amdhsa_private_segment_fixed_size 0
		.amdhsa_kernarg_size 168
		.amdhsa_user_sgpr_count 13
		.amdhsa_user_sgpr_dispatch_ptr 0
		.amdhsa_user_sgpr_queue_ptr 0
		.amdhsa_user_sgpr_kernarg_segment_ptr 1
		.amdhsa_user_sgpr_dispatch_id 0
		.amdhsa_user_sgpr_private_segment_size 0
		.amdhsa_wavefront_size32 1
		.amdhsa_enable_private_segment 0
		.amdhsa_system_sgpr_workgroup_id_x 1
		.amdhsa_system_sgpr_workgroup_id_y 1
		.amdhsa_system_sgpr_workgroup_id_z 1
		.amdhsa_system_sgpr_workgroup_info 0
		.amdhsa_system_vgpr_workitem_id 2
		.amdhsa_next_free_vgpr 7
		.amdhsa_next_free_sgpr 16
		.amdhsa_float_round_mode_32 0
		.amdhsa_float_round_mode_16_64 0
		.amdhsa_float_denorm_mode_32 3
		.amdhsa_float_denorm_mode_16_64 3
		.amdhsa_dx10_clamp 1
		.amdhsa_ieee_mode 1
		.amdhsa_fp16_overflow 0
		.amdhsa_workgroup_processor_mode 1
		.amdhsa_memory_ordered 1
		.amdhsa_forward_progress 0
		.amdhsa_shared_vgpr_count 0
		.amdhsa_exception_fp_ieee_invalid_op 0
		.amdhsa_exception_fp_denorm_src 0
		.amdhsa_exception_fp_ieee_div_zero 0
		.amdhsa_exception_fp_ieee_overflow 0
		.amdhsa_exception_fp_ieee_underflow 0
		.amdhsa_exception_fp_ieee_inexact 0
		.amdhsa_exception_int_div_zero 0
	.end_amdhsa_kernel
	.text
.Lfunc_end0:
	.size	_Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE, .Lfunc_end0-_Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE
	.cfi_endproc
	.file	4 "." "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl"
	.file	5 "." "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl"
	.file	6 "." "none"
                                        ; -- End function
	.section	.AMDGPU.csdata
; Kernel info:
; codeLenInByte = 604
; NumSgprs: 18
; NumVgprs: 7
; ScratchSize: 0
; MemoryBound: 0
; FloatMode: 240
; IeeeMode: 1
; LDSByteSize: 256 bytes/workgroup (compile time only)
; SGPRBlocks: 2
; VGPRBlocks: 0
; NumSGPRsForWavesPerEU: 18
; NumVGPRsForWavesPerEU: 7
; Occupancy: 16
; WaveLimiterHint : 0
; COMPUTE_PGM_RSRC2:SCRATCH_EN: 0
; COMPUTE_PGM_RSRC2:USER_SGPR: 13
; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0
; COMPUTE_PGM_RSRC2:TGID_X_EN: 1
; COMPUTE_PGM_RSRC2:TGID_Y_EN: 1
; COMPUTE_PGM_RSRC2:TGID_Z_EN: 1
; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 2
	.text
	.p2alignl 7, 3214868480
	.fill 96, 4, 3214868480
	.section	.debug_abbrev
	.byte	1                               ; Abbreviation Code
	.byte	17                              ; DW_TAG_compile_unit
	.byte	1                               ; DW_CHILDREN_yes
	.byte	37                              ; DW_AT_producer
	.byte	14                              ; DW_FORM_strp
	.byte	19                              ; DW_AT_language
	.byte	5                               ; DW_FORM_data2
	.byte	3                               ; DW_AT_name
	.byte	14                              ; DW_FORM_strp
	.byte	16                              ; DW_AT_stmt_list
	.byte	23                              ; DW_FORM_sec_offset
	.byte	27                              ; DW_AT_comp_dir
	.byte	14                              ; DW_FORM_strp
	.byte	17                              ; DW_AT_low_pc
	.byte	1                               ; DW_FORM_addr
	.byte	18                              ; DW_AT_high_pc
	.byte	6                               ; DW_FORM_data4
	.byte	0                               ; EOM(1)
	.byte	0                               ; EOM(2)
	.byte	2                               ; Abbreviation Code
	.byte	46                              ; DW_TAG_subprogram
	.byte	0                               ; DW_CHILDREN_no
	.byte	3                               ; DW_AT_name
	.byte	14                              ; DW_FORM_strp
	.byte	32                              ; DW_AT_inline
	.byte	11                              ; DW_FORM_data1
	.byte	0                               ; EOM(1)
	.byte	0                               ; EOM(2)
	.byte	3                               ; Abbreviation Code
	.byte	46                              ; DW_TAG_subprogram
	.byte	1                               ; DW_CHILDREN_yes
	.byte	17                              ; DW_AT_low_pc
	.byte	1                               ; DW_FORM_addr
	.byte	18                              ; DW_AT_high_pc
	.byte	6                               ; DW_FORM_data4
	.byte	3                               ; DW_AT_name
	.byte	14                              ; DW_FORM_strp
	.byte	0                               ; EOM(1)
	.byte	0                               ; EOM(2)
	.byte	4                               ; Abbreviation Code
	.byte	29                              ; DW_TAG_inlined_subroutine
	.byte	0                               ; DW_CHILDREN_no
	.byte	49                              ; DW_AT_abstract_origin
	.byte	19                              ; DW_FORM_ref4
	.byte	17                              ; DW_AT_low_pc
	.byte	1                               ; DW_FORM_addr
	.byte	18                              ; DW_AT_high_pc
	.byte	6                               ; DW_FORM_data4
	.byte	88                              ; DW_AT_call_file
	.byte	11                              ; DW_FORM_data1
	.byte	89                              ; DW_AT_call_line
	.byte	11                              ; DW_FORM_data1
	.byte	0                               ; EOM(1)
	.byte	0                               ; EOM(2)
	.byte	5                               ; Abbreviation Code
	.byte	29                              ; DW_TAG_inlined_subroutine
	.byte	1                               ; DW_CHILDREN_yes
	.byte	49                              ; DW_AT_abstract_origin
	.byte	19                              ; DW_FORM_ref4
	.byte	17                              ; DW_AT_low_pc
	.byte	1                               ; DW_FORM_addr
	.byte	18                              ; DW_AT_high_pc
	.byte	6                               ; DW_FORM_data4
	.byte	88                              ; DW_AT_call_file
	.byte	11                              ; DW_FORM_data1
	.byte	89                              ; DW_AT_call_line
	.byte	11                              ; DW_FORM_data1
	.byte	0                               ; EOM(1)
	.byte	0                               ; EOM(2)
	.byte	0                               ; EOM(3)
	.section	.debug_info
.Lcu_begin0:
	.long	.Ldebug_info_end0-.Ldebug_info_start0 ; Length of Unit
.Ldebug_info_start0:
	.short	4                               ; DWARF version number
	.long	.debug_abbrev                   ; Offset Into Abbrev. Section
	.byte	8                               ; Address Size (in bytes)
	.byte	1                               ; Abbrev [1] 0xb:0xcc DW_TAG_compile_unit
	.long	.Linfo_string0                  ; DW_AT_producer
	.short	31                              ; DW_AT_language
	.long	.Linfo_string0                  ; DW_AT_name
	.long	.Lline_table_start0             ; DW_AT_stmt_list
	.long	.Linfo_string1                  ; DW_AT_comp_dir
	.quad	.Lfunc_begin0                   ; DW_AT_low_pc
	.long	.Lfunc_end0-.Lfunc_begin0       ; DW_AT_high_pc
	.byte	2                               ; Abbrev [2] 0x2a:0x6 DW_TAG_subprogram
	.long	.Linfo_string2                  ; DW_AT_name
	.byte	1                               ; DW_AT_inline
	.byte	2                               ; Abbrev [2] 0x30:0x6 DW_TAG_subprogram
	.long	.Linfo_string3                  ; DW_AT_name
	.byte	1                               ; DW_AT_inline
	.byte	2                               ; Abbrev [2] 0x36:0x6 DW_TAG_subprogram
	.long	.Linfo_string3                  ; DW_AT_name
	.byte	1                               ; DW_AT_inline
	.byte	2                               ; Abbrev [2] 0x3c:0x6 DW_TAG_subprogram
	.long	.Linfo_string4                  ; DW_AT_name
	.byte	1                               ; DW_AT_inline
	.byte	2                               ; Abbrev [2] 0x42:0x6 DW_TAG_subprogram
	.long	.Linfo_string5                  ; DW_AT_name
	.byte	1                               ; DW_AT_inline
	.byte	2                               ; Abbrev [2] 0x48:0x6 DW_TAG_subprogram
	.long	.Linfo_string6                  ; DW_AT_name
	.byte	1                               ; DW_AT_inline
	.byte	3                               ; Abbrev [3] 0x4e:0x88 DW_TAG_subprogram
	.quad	.Lfunc_begin0                   ; DW_AT_low_pc
	.long	.Lfunc_end0-.Lfunc_begin0       ; DW_AT_high_pc
	.long	.Linfo_string7                  ; DW_AT_name
	.byte	4                               ; Abbrev [4] 0x5f:0x13 DW_TAG_inlined_subroutine
	.long	42                              ; DW_AT_abstract_origin
	.quad	.Lfunc_begin0                   ; DW_AT_low_pc
	.long	.Ltmp0-.Lfunc_begin0            ; DW_AT_high_pc
	.byte	1                               ; DW_AT_call_file
	.byte	5                               ; DW_AT_call_line
	.byte	5                               ; Abbrev [5] 0x72:0x63 DW_TAG_inlined_subroutine
	.long	72                              ; DW_AT_abstract_origin
	.quad	.Ltmp0                          ; DW_AT_low_pc
	.long	.Ltmp1-.Ltmp0                   ; DW_AT_high_pc
	.byte	1                               ; DW_AT_call_file
	.byte	5                               ; DW_AT_call_line
	.byte	5                               ; Abbrev [5] 0x85:0x4f DW_TAG_inlined_subroutine
	.long	66                              ; DW_AT_abstract_origin
	.quad	.Ltmp0                          ; DW_AT_low_pc
	.long	.Ltmp1-.Ltmp0                   ; DW_AT_high_pc
	.byte	4                               ; DW_AT_call_file
	.byte	90                              ; DW_AT_call_line
	.byte	5                               ; Abbrev [5] 0x98:0x3b DW_TAG_inlined_subroutine
	.long	60                              ; DW_AT_abstract_origin
	.quad	.Ltmp0                          ; DW_AT_low_pc
	.long	.Ltmp1-.Ltmp0                   ; DW_AT_high_pc
	.byte	5                               ; DW_AT_call_file
	.byte	88                              ; DW_AT_call_line
	.byte	5                               ; Abbrev [5] 0xab:0x27 DW_TAG_inlined_subroutine
	.long	54                              ; DW_AT_abstract_origin
	.quad	.Ltmp0                          ; DW_AT_low_pc
	.long	.Ltmp1-.Ltmp0                   ; DW_AT_high_pc
	.byte	6                               ; DW_AT_call_file
	.byte	0                               ; DW_AT_call_line
	.byte	4                               ; Abbrev [4] 0xbe:0x13 DW_TAG_inlined_subroutine
	.long	48                              ; DW_AT_abstract_origin
	.quad	.Ltmp0                          ; DW_AT_low_pc
	.long	.Ltmp1-.Ltmp0                   ; DW_AT_high_pc
	.byte	6                               ; DW_AT_call_file
	.byte	0                               ; DW_AT_call_line
	.byte	0                               ; End Of Children Mark
	.byte	0                               ; End Of Children Mark
	.byte	0                               ; End Of Children Mark
	.byte	0                               ; End Of Children Mark
	.byte	0                               ; End Of Children Mark
	.byte	0                               ; End Of Children Mark
.Ldebug_info_end0:
	.section	.debug_str,"MS",@progbits,1
.Linfo_string0:
	.asciz	"julia"                         ; string offset=0
.Linfo_string1:
	.asciz	"."                             ; string offset=6
.Linfo_string2:
	.asciz	"sync_workgroup_count;"         ; string offset=8
.Linfo_string3:
	.asciz	"macro expansion;"              ; string offset=30
.Linfo_string4:
	.asciz	"pointerset;"                   ; string offset=47
.Linfo_string5:
	.asciz	"unsafe_store!;"                ; string offset=59
.Linfo_string6:
	.asciz	"#setindex!;"                   ; string offset=74
.Linfo_string7:
	.asciz	"ker!"                          ; string offset=86
	.ident	"AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
	.ident	"AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
	.ident	"AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
	.ident	"AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
	.ident	"AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
	.ident	"AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
	.ident	"AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
	.ident	"AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
	.section	".note.GNU-stack"
	.amdgpu_metadata
---
amdhsa.kernels:
  - .args:
      - .name:           state
        .offset:         0
        .size:           88
        .value_kind:     by_value
      - .offset:         88
        .size:           24
        .value_kind:     by_value
      - .offset:         112
        .size:           8
        .value_kind:     hidden_global_offset_x
      - .offset:         120
        .size:           8
        .value_kind:     hidden_global_offset_y
      - .offset:         128
        .size:           8
        .value_kind:     hidden_global_offset_z
      - .offset:         136
        .size:           8
        .value_kind:     hidden_none
      - .offset:         144
        .size:           8
        .value_kind:     hidden_none
      - .offset:         152
        .size:           8
        .value_kind:     hidden_none
      - .offset:         160
        .size:           8
        .value_kind:     hidden_none
    .group_segment_fixed_size: 256
    .kernarg_segment_align: 8
    .kernarg_segment_size: 168
    .language:       OpenCL C
    .language_version:
      - 2
      - 0
    .max_flat_workgroup_size: 1024
    .name:           _Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE
    .private_segment_fixed_size: 0
    .sgpr_count:     18
    .sgpr_spill_count: 0
    .symbol:         _Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE.kd
    .vgpr_count:     7
    .vgpr_spill_count: 0
    .wavefront_size: 32
amdhsa.target:   amdgcn-amd-amdhsa--gfx1100
amdhsa.version:
  - 1
  - 1
...

	.end_amdgpu_metadata
	.section	.debug_line
.Lline_table_start0:

A bit unrelated to this, but for the equivalent C++/HIP code I don't observe this information (wavesize, target-cpu) in LLVM IR as well.

#include <hip/hip_runtime.h>
#include <iostream>

__global__ void ker(int* x) {
    int i = hipThreadIdx_x;
    x[i] = __syncthreads_count(1);
}

int main() {
    int N = 33;
    size_t size = N * sizeof(int);

    int* x = new int[N];
    for (int i = 0; i < N; ++i)
        x[i] = 0;

    int* dx;
    hipMalloc(&dx, size);
    hipMemcpy(dx, x, size, hipMemcpyHostToDevice);

    int threadsPerBlock = 33;
    int blocksPerGrid = 1;
    ker<<<blocksPerGrid, threadsPerBlock>>>(dx);
    hipMemcpy(x, dx, size, hipMemcpyDeviceToHost);

    for (int i = 0; i < N; ++i)
        std::cout << " " << x[i];
    std::cout << std::endl;

    hipFree(dx);
    delete[] x;

    return 0;
}

Dumping with hipcc -emit-llvm -S --offload-device-only main.cpp.

LLVM IR
; ModuleID = 'main.cpp'
source_filename = "main.cpp"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8"
target triple = "amdgcn-amd-amdhsa"

@__scratch_lds = internal addrspace(3) global [32 x i64] poison, align 8

; Function Attrs: convergent mustprogress nofree norecurse nounwind willreturn
define protected amdgpu_kernel void @_Z3kerPi(ptr addrspace(1) nocapture writeonly %0) local_unnamed_addr #0 {
  %2 = tail call i32 @llvm.amdgcn.workitem.id.x(), !range !7, !noundef !8
  %3 = tail call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
  %4 = getelementptr inbounds i16, ptr addrspace(4) %3, i64 8
  %5 = load i16, ptr addrspace(4) %4, align 4, !tbaa !9
  %6 = zext i16 %5 to i32
  %7 = getelementptr inbounds i16, ptr addrspace(4) %3, i64 7
  %8 = load i16, ptr addrspace(4) %7, align 2, !tbaa !9
  %9 = zext i16 %8 to i32
  %10 = getelementptr inbounds i16, ptr addrspace(4) %3, i64 6
  %11 = load i16, ptr addrspace(4) %10, align 4, !tbaa !9
  %12 = zext i16 %11 to i32
  %13 = shl nuw nsw i32 %9, 8
  %14 = mul i32 %13, %12
  %15 = ashr exact i32 %14, 8
  %16 = mul nsw i32 %15, %6
  %17 = add i32 %16, 31
  %18 = tail call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 1, i32 257, i32 15, i32 15, i1 true)
  %19 = add nsw i32 %18, 1
  %20 = tail call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %19, i32 258, i32 15, i32 15, i1 true)
  %21 = add nsw i32 %19, %20
  %22 = tail call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %21, i32 260, i32 15, i32 15, i1 true)
  %23 = add nsw i32 %21, %22
  %24 = tail call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %23, i32 264, i32 15, i32 15, i1 true)
  %25 = add nsw i32 %23, %24
  %26 = tail call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %25, i32 336, i32 15, i32 15, i1 true)
  %27 = tail call i32 @llvm.amdgcn.permlanex16(i32 undef, i32 %26, i32 0, i32 0, i1 false, i1 true)
  %28 = add nsw i32 %27, %26
  %29 = and i32 %17, -32
  %30 = icmp eq i32 %29, 32
  br i1 %30, label %51, label %31

31:                                               ; preds = %1
  %32 = tail call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %33 = tail call i32 @llvm.amdgcn.workitem.id.z(), !range !7, !noundef !8
  %34 = mul nuw nsw i32 %33, %9
  %35 = tail call i32 @llvm.amdgcn.workitem.id.y(), !range !7, !noundef !8
  %36 = add nuw nsw i32 %34, %35
  %37 = mul i32 %36, %12
  %38 = add i32 %37, %2
  %39 = lshr i32 %38, 5
  %40 = icmp eq i32 %32, 0
  %41 = or i32 %32, %39
  %42 = icmp eq i32 %41, 0
  br i1 %42, label %43, label %44

43:                                               ; preds = %31
  store atomic i32 %28, ptr addrspace(3) @__scratch_lds syncscope("workgroup-one-as") monotonic, align 8
  br label %44

44:                                               ; preds = %43, %31
  fence syncscope("workgroup") release
  tail call void @llvm.amdgcn.s.barrier()
  fence syncscope("workgroup") acquire
  %45 = icmp ugt i32 %38, 31
  %46 = and i1 %45, %40
  br i1 %46, label %47, label %49

47:                                               ; preds = %44
  %48 = atomicrmw add ptr addrspace(3) @__scratch_lds, i32 %28 syncscope("workgroup-one-as") monotonic, align 4
  br label %49

49:                                               ; preds = %47, %44
  fence syncscope("workgroup") release
  tail call void @llvm.amdgcn.s.barrier()
  fence syncscope("workgroup") acquire
  %50 = load atomic i32, ptr addrspace(3) @__scratch_lds syncscope("workgroup-one-as") monotonic, align 8
  fence syncscope("workgroup") release
  tail call void @llvm.amdgcn.s.barrier()
  fence syncscope("workgroup") acquire
  br label %51

51:                                               ; preds = %1, %49
  %52 = phi i32 [ %50, %49 ], [ %28, %1 ]
  %53 = zext i32 %2 to i64
  %54 = getelementptr inbounds i32, ptr addrspace(1) %0, i64 %53
  store i32 %52, ptr addrspace(1) %54, align 4, !tbaa !13
  ret void
}

; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.amdgcn.workitem.id.x() #1

; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.amdgcn.workitem.id.y() #1

; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.amdgcn.workitem.id.z() #1

; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare align 4 ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr() #1

; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(none)
declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32) #2

; Function Attrs: convergent mustprogress nocallback nofree nounwind willreturn
declare void @llvm.amdgcn.s.barrier() #3

; Function Attrs: convergent mustprogress nocallback nofree nounwind willreturn memory(none)
declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32 immarg, i32 immarg, i32 immarg, i1 immarg) #4

; Function Attrs: convergent mustprogress nocallback nofree nounwind willreturn memory(none)
declare i32 @llvm.amdgcn.permlanex16(i32, i32, i32, i32, i1 immarg, i1 immarg) #4

attributes #0 = { convergent mustprogress nofree norecurse nounwind willreturn "amdgpu-flat-work-group-size"="1,1024" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx1100" "target-features"="+16-bit-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot10-insts,+dot5-insts,+dot7-insts,+dot8-insts,+dot9-insts,+dpp,+gfx10-3-insts,+gfx10-insts,+gfx11-insts,+gfx8-insts,+gfx9-insts,+wavefrontsize32" "uniform-work-group-size"="true" }
attributes #1 = { mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none) }
attributes #2 = { mustprogress nocallback nofree nosync nounwind willreturn memory(none) }
attributes #3 = { convergent mustprogress nocallback nofree nounwind willreturn }
attributes #4 = { convergent mustprogress nocallback nofree nounwind willreturn memory(none) }

!llvm.module.flags = !{!0, !1, !2, !3, !4}
!opencl.ocl.version = !{!5}
!llvm.ident = !{!6}

!0 = !{i32 4, !"amdgpu_hostcall", i32 1}
!1 = !{i32 1, !"amdgpu_code_object_version", i32 500}
!2 = !{i32 1, !"amdgpu_printf_kind", !"hostcall"}
!3 = !{i32 1, !"wchar_size", i32 4}
!4 = !{i32 8, !"PIC Level", i32 2}
!5 = !{i32 2, i32 0}
!6 = !{!"AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"}
!7 = !{i32 0, i32 1024}
!8 = !{}
!9 = !{!10, !10, i64 0}
!10 = !{!"short", !11, i64 0}
!11 = !{!"omnipotent char", !12, i64 0}
!12 = !{!"Simple C/C++ TBAA"}
!13 = !{!14, !14, i64 0}
!14 = !{!"int", !15, i64 0}
!15 = !{!"omnipotent char", !16, i64 0}
!16 = !{!"Simple C++ TBAA"}

Also if I try compiling for wave64 it errors:

$ hipcc -mwavefrontsize64 main.cpp 
In file included from main.cpp:1:
/opt/rocm-6.1.2/include/hip/hip_runtime.h:41:2: error: HIP is not supported on the specified GPU ARCH with wavefront size 64
   41 | #error HIP is not supported on the specified GPU ARCH with wavefront size 64
      |  ^
main.cpp:18:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
   18 |     hipMalloc(&dx, size);
      |     ^~~~~~~~~ ~~~~~~~~~
main.cpp:19:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
   19 |     hipMemcpy(dx, x, size, hipMemcpyHostToDevice);
      |     ^~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
main.cpp:24:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
   24 |     hipMemcpy(x, dx, size, hipMemcpyDeviceToHost);
      |     ^~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
main.cpp:30:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
   30 |     hipFree(dx);
      |     ^~~~~~~ ~~
4 warnings and 1 error generated when compiling for gfx1100.

@pxl-th
Copy link

pxl-th commented Jul 29, 2024

You do have to consistently use the same wavesize in a callgraph, they cannot be intermixed.

That's within a single kernel, right?

@arsenm
Copy link
Contributor

arsenm commented Jul 29, 2024

A bit unrelated to this, but for the equivalent C++/HIP code I don't observe this information (wavesize, target-cpu) in LLVM IR as well.

It's right here:

"target-cpu"="gfx1100" "target-features"="+16-bit-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot10-insts,+dot5-insts,+dot7-insts,+dot8-insts,+dot9-insts,+dpp,+gfx10-3-insts,+gfx10-insts,+gfx11-insts,+gfx8-insts,+gfx9-insts,+wavefrontsize32"

Also if I try compiling for wave64 it errors:

Yes, hip currently doesn't officially support wave64

That's within a single kernel, right?

Typically the same thing, the kernel is the entry point to the callgraph. Kernels with different wave sizes can't call the same functions.

@pxl-th
Copy link

pxl-th commented Jul 29, 2024

Thanks for the replies! Appreciate it!

It's right here:

🤦‍♂️ didn't notice.

I do now set IR attributes. And above example with sync_workgroup_count now does work correctly.

However, some other started failing.
Here's example with bpermute which is llvm.amdgcn.ds.bpermute intrinsic
and the kernel shifts values in the array left by 1 item.

julia> using AMDGPU

julia> AMDGPU.device() # wave64 is set
┌────┬────────────────────┬──────────┬───────────┬────────────┐
│ Id │               Name │ GCN arch │ Wavefront │     Memory │
├────┼────────────────────┼──────────┼───────────┼────────────┤
│  1 │ Radeon RX 7900 XTX │  gfx1100 │        6423.984 GiB │
└────┴────────────────────┴──────────┴───────────┴────────────┘

julia> function ker_shfl!(x)
           i::Cint = AMDGPU.Device.activelane()
           ws = AMDGPU.Device.wavefrontsize()
           # compute next lane address.
           addr = ((i + 1) % ws) * 4 # VGPRs are 4-byte wide.
           # julia array indexing starts from 1.
           @inbounds x[i + 1] = AMDGPU.Device.bpermute(addr, i)
           return
       end
ker_shfl! (generic function with 1 method)

julia> x = ROCArray(zeros(Cint, 1, 64));

julia> @roc groupsize=length(x) ker_shfl!(x);

julia> x
1×64 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 0, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 32

Here 32nd item should be 32 and 64th item 0, but it still acts as if it is using wave32.

ws = AMDGPU.Device.wavefrontsize() returns 64 and we can verify that:

julia> function ws_ker!(x)
           i::Cint = AMDGPU.Device.activelane()
           @inbounds x[i + 1] = AMDGPU.Device.wavefrontsize()
           return
       end
ws_ker! (generic function with 1 method)

julia> @roc groupsize=length(x) ws_ker!(x);

julia> x
1×64 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64    64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64  64
LLVM IR for the `ker_shfl!` kernel
; ModuleID = 'start'
source_filename = "start"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:10:11:12:13"
target triple = "amdgcn-amd-amdhsa"

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.amdgcn.wavefrontsize() #0

; Function Attrs: convergent nocallback nofree nounwind willreturn memory(none)
declare i32 @llvm.amdgcn.ds.bpermute(i32, i32) #1

; Function Attrs: alwaysinline nocallback nofree nosync nounwind willreturn memory(read)
declare i32 @llvm.read_register.i32(metadata) #2

; Function Attrs: alwaysinline nocallback nofree nosync nounwind willreturn memory(none)
declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32) #3

; Function Attrs: alwaysinline nocallback nofree nosync nounwind willreturn memory(none)
declare i32 @llvm.amdgcn.mbcnt.hi(i32, i32) #3

; Function Attrs: cold nocallback nofree noreturn nounwind
declare void @llvm.amdgcn.endpgm() #4

;  @ /home/pxlth/.julia/dev/AMDGPU/src/t.jl:3 within `ker_shfl!`
define amdgpu_kernel void @_Z9ker_shfl_14ROCDeviceArrayI5Int32Li1ELi1EE({ i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 } %state, { [1 x i64], ptr addrspace(1), i64 } %0) local_unnamed_addr #5 !dbg !41 {
conversion:
;  @ /home/pxlth/.julia/dev/AMDGPU/src/t.jl:4 within `ker_shfl!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/wavefront.jl:108 within `activelane`
   %1 = call i32 @llvm.read_register.i32(metadata !45) #6, !dbg !46
   %2 = call i32 @llvm.read_register.i32(metadata !50) #6, !dbg !46
   %3 = call i32 @llvm.amdgcn.mbcnt.lo(i32 %2, i32 0), !dbg !46
   %4 = call i32 @llvm.amdgcn.mbcnt.hi(i32 %1, i32 %3), !dbg !46
; └
; ┌ @ number.jl:7 within `convert`
; │┌ @ boot.jl:893 within `Int32`
; ││┌ @ boot.jl:807 within `toInt32`
; │││┌ @ boot.jl:758 within `check_sign_bit`
; ││││┌ @ boot.jl:743 within `is_top_bit_set`
       %.not = icmp sgt i32 %4, -1, !dbg !51
; ││││└
      br i1 %.not, label %L9, label %L6, !dbg !54

L6:                                               ; preds = %conversion
; ││││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/quirks.jl:8 within `#throw_inexacterror`
; │││││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:113 within `signal_exception`
; ││││││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:11 within `exception_flag`
; │││││││┌ @ none within `kernel_state`
; ││││││││┌ @ none within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
           %state.i.fca.0.extract.i = extractvalue { i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 } %state, 0, !dbg !63
; ││││││└└└
; ││││││┌ @ pointer.jl:180 within `unsafe_store!` @ pointer.jl:180
         %5 = inttoptr i64 %state.i.fca.0.extract.i to ptr, !dbg !80
         store i32 1, ptr %5, align 1, !dbg !80
; ││││││└
; ││││││ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:115 within `signal_exception`
; ││││││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/execution_control.jl:52 within `endpgm`
         call void @llvm.amdgcn.endpgm(), !dbg !84
; ││││││└
; ││││││ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:116 within `signal_exception`
        unreachable, !dbg !88

L9:                                               ; preds = %conversion
; └└└└└└
;  @ /home/pxlth/.julia/dev/AMDGPU/src/t.jl:5 within `ker_shfl!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/wavefront.jl:84 within `wavefrontsize`
   %6 = call i32 @llvm.amdgcn.wavefrontsize(), !dbg !89
; └
;  @ /home/pxlth/.julia/dev/AMDGPU/src/t.jl:6 within `ker_shfl!`
; ┌ @ int.jl:232 within `rem` @ int.jl:298
   %.not19 = icmp eq i32 %6, 0, !dbg !92
   br i1 %.not19, label %fail, label %pass, !dbg !92

fail:                                             ; preds = %L9
; │┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:113 within `signal_exception`
; ││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:11 within `exception_flag`
; │││┌ @ none within `kernel_state`
; ││││┌ @ none within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
       %state.i.fca.0.extract.i6 = extractvalue { i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 } %state, 0, !dbg !97
; ││└└└
; ││┌ @ pointer.jl:180 within `unsafe_store!` @ pointer.jl:180
     %7 = inttoptr i64 %state.i.fca.0.extract.i6 to ptr, !dbg !103
     store i32 1, ptr %7, align 1, !dbg !103
; ││└
; ││ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:115 within `signal_exception`
; ││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/execution_control.jl:52 within `endpgm`
     call void @llvm.amdgcn.endpgm(), !dbg !105
; ││└
; ││ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:116 within `signal_exception`
    unreachable, !dbg !107

pass:                                             ; preds = %L9
; └└
; ┌ @ int.jl:87 within `+`
   %8 = add nuw i32 %4, 1, !dbg !108
; └
; ┌ @ int.jl:232 within `rem`
; │┌ @ int.jl:188 within `abs`
; ││┌ @ int.jl:142 within `flipsign`
     %9 = icmp slt i32 %8, 0, !dbg !110
     %10 = xor i32 %4, -1, !dbg !110
     %11 = select i1 %9, i32 %10, i32 %8, !dbg !110
     %.fca.1.extract = extractvalue { [1 x i64], ptr addrspace(1), i64 } %0, 1
; │└└
; │ @ int.jl:232 within `rem` @ int.jl:298
   %12 = urem i32 %11, %6, !dbg !92
; │ @ int.jl:232 within `rem`
; │┌ @ int.jl:142 within `flipsign`
    %13 = ashr i32 %8, 31, !dbg !114
    %14 = add i32 %12, %13, !dbg !114
    %15 = xor i32 %14, %13, !dbg !114
; └└
; ┌ @ int.jl:88 within `*`
   %16 = shl i32 %15, 2, !dbg !115
; └
;  @ /home/pxlth/.julia/dev/AMDGPU/src/t.jl:7 within `ker_shfl!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/wavefront.jl:180 within `bpermute`
   %17 = call i32 @llvm.amdgcn.ds.bpermute(i32 %16, i32 %4), !dbg !117
; └
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl:90 within `#setindex!`
; │┌ @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl:88 within `unsafe_store!`
; ││┌ @ none within `pointerset`
; │││┌ @ none within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
      %18 = zext i32 %4 to i64, !dbg !120
      %19 = getelementptr inbounds i32, ptr addrspace(1) %.fca.1.extract, i64 %18, !dbg !120
      store i32 %17, ptr addrspace(1) %19, align 4, !dbg !120, !tbaa !132
; └└└└
;  @ /home/pxlth/.julia/dev/AMDGPU/src/t.jl:8 within `ker_shfl!`
  ret void, !dbg !135
}

attributes #0 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
attributes #1 = { convergent nocallback nofree nounwind willreturn memory(none) }
attributes #2 = { alwaysinline nocallback nofree nosync nounwind willreturn memory(read) }
attributes #3 = { alwaysinline nocallback nofree nosync nounwind willreturn memory(none) }
attributes #4 = { cold nocallback nofree noreturn nounwind }
attributes #5 = { "amdgpu-unsafe-fp-atomics"="true" "target-cpu"="gfx1100" "target-features"="-wavefrontsize32,+wavefrontsize64" }
attributes #6 = { convergent }

!llvm.module.flags = !{!0, !1, !2, !3, !4}
!llvm.dbg.cu = !{!5, !7, !8, !9, !10, !11, !12, !13, !14, !15, !16, !17, !18, !19, !20, !21, !22, !23, !24, !25, !26, !27, !28, !29, !30, !31, !32, !33, !34, !35, !36, !37}
!opencl.ocl.version = !{!38, !38, !38, !38, !38, !38, !38, !38, !38}
!llvm.ident = !{!39, !39, !39, !39, !39, !39, !39, !39, !39}
!julia.kernel = !{!40}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{i32 4, !"amdgpu_hostcall", i32 1}
!3 = !{i32 1, !"wchar_size", i32 4}
!4 = !{i32 8, !"PIC Level", i32 0}
!5 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!6 = !DIFile(filename: "julia", directory: ".")
!7 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!8 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!9 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!10 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!11 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!12 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!13 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!14 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!15 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!16 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!17 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!18 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!19 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!20 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!21 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!22 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!23 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!24 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!25 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!26 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!27 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!28 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!29 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!30 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!31 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!32 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!33 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!34 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!35 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!36 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!37 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!38 = !{i32 2, i32 0}
!39 = !{!"AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"}
!40 = !{ptr @_Z9ker_shfl_14ROCDeviceArrayI5Int32Li1ELi1EE}
!41 = distinct !DISubprogram(name: "ker_shfl!", linkageName: "julia_ker_shfl!_13627", scope: null, file: !42, line: 3, type: !43, scopeLine: 3, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!42 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/t.jl", directory: ".")
!43 = !DISubroutineType(types: !44)
!44 = !{}
!45 = !{!"exec_hi"}
!46 = !DILocation(line: 108, scope: !47, inlinedAt: !49)
!47 = distinct !DISubprogram(name: "activelane;", linkageName: "activelane", scope: !48, file: !48, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!48 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/wavefront.jl", directory: ".")
!49 = !DILocation(line: 4, scope: !41)
!50 = !{!"exec_lo"}
!51 = !DILocation(line: 743, scope: !52, inlinedAt: !54)
!52 = distinct !DISubprogram(name: "is_top_bit_set;", linkageName: "is_top_bit_set", scope: !53, file: !53, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!53 = !DIFile(filename: "boot.jl", directory: ".")
!54 = !DILocation(line: 758, scope: !55, inlinedAt: !56)
!55 = distinct !DISubprogram(name: "check_sign_bit;", linkageName: "check_sign_bit", scope: !53, file: !53, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!56 = !DILocation(line: 807, scope: !57, inlinedAt: !58)
!57 = distinct !DISubprogram(name: "toInt32;", linkageName: "toInt32", scope: !53, file: !53, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!58 = !DILocation(line: 893, scope: !59, inlinedAt: !60)
!59 = distinct !DISubprogram(name: "Int32;", linkageName: "Int32", scope: !53, file: !53, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!60 = !DILocation(line: 7, scope: !61, inlinedAt: !49)
!61 = distinct !DISubprogram(name: "convert;", linkageName: "convert", scope: !62, file: !62, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!62 = !DIFile(filename: "number.jl", directory: ".")
!63 = !DILocation(line: 38, scope: !64, inlinedAt: !66)
!64 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !65, file: !65, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !20, retainedNodes: !44)
!65 = !DIFile(filename: "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl", directory: ".")
!66 = distinct !DILocation(line: 0, scope: !67, inlinedAt: !69)
!67 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !68, file: !68, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !20, retainedNodes: !44)
!68 = !DIFile(filename: "none", directory: ".")
!69 = distinct !DILocation(line: 0, scope: !70, inlinedAt: !71)
!70 = distinct !DISubprogram(name: "kernel_state;", linkageName: "kernel_state", scope: !68, file: !68, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !20, retainedNodes: !44)
!71 = distinct !DILocation(line: 11, scope: !72, inlinedAt: !74)
!72 = distinct !DISubprogram(name: "exception_flag;", linkageName: "exception_flag", scope: !73, file: !73, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !20, retainedNodes: !44)
!73 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl", directory: ".")
!74 = distinct !DILocation(line: 113, scope: !75, inlinedAt: !76)
!75 = distinct !DISubprogram(name: "signal_exception", linkageName: "julia_signal_exception_15167", scope: null, file: !73, line: 112, type: !43, scopeLine: 112, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !20, retainedNodes: !44)
!76 = distinct !DILocation(line: 8, scope: !77, inlinedAt: !79)
!77 = distinct !DISubprogram(name: "#throw_inexacterror", linkageName: "julia_#throw_inexacterror_13659", scope: null, file: !78, line: 40, type: !43, scopeLine: 40, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !5, retainedNodes: !44)
!78 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/quirks.jl", directory: ".")
!79 = distinct !DILocation(line: 758, scope: !55, inlinedAt: !56)
!80 = !DILocation(line: 180, scope: !81, inlinedAt: !83)
!81 = distinct !DISubprogram(name: "unsafe_store!;", linkageName: "unsafe_store!", scope: !82, file: !82, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !20, retainedNodes: !44)
!82 = !DIFile(filename: "pointer.jl", directory: ".")
!83 = distinct !DILocation(line: 180, scope: !81, inlinedAt: !74)
!84 = !DILocation(line: 52, scope: !85, inlinedAt: !87)
!85 = distinct !DISubprogram(name: "endpgm;", linkageName: "endpgm", scope: !86, file: !86, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !20, retainedNodes: !44)
!86 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/execution_control.jl", directory: ".")
!87 = distinct !DILocation(line: 115, scope: !75, inlinedAt: !76)
!88 = !DILocation(line: 116, scope: !75, inlinedAt: !76)
!89 = !DILocation(line: 84, scope: !90, inlinedAt: !91)
!90 = distinct !DISubprogram(name: "wavefrontsize;", linkageName: "wavefrontsize", scope: !48, file: !48, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!91 = !DILocation(line: 5, scope: !41)
!92 = !DILocation(line: 298, scope: !93, inlinedAt: !95)
!93 = distinct !DISubprogram(name: "rem;", linkageName: "rem", scope: !94, file: !94, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!94 = !DIFile(filename: "int.jl", directory: ".")
!95 = !DILocation(line: 232, scope: !93, inlinedAt: !96)
!96 = !DILocation(line: 6, scope: !41)
!97 = !DILocation(line: 38, scope: !64, inlinedAt: !98)
!98 = distinct !DILocation(line: 0, scope: !67, inlinedAt: !99)
!99 = distinct !DILocation(line: 0, scope: !70, inlinedAt: !100)
!100 = distinct !DILocation(line: 11, scope: !72, inlinedAt: !101)
!101 = distinct !DILocation(line: 113, scope: !75, inlinedAt: !102)
!102 = distinct !DILocation(line: 298, scope: !93, inlinedAt: !95)
!103 = !DILocation(line: 180, scope: !81, inlinedAt: !104)
!104 = distinct !DILocation(line: 180, scope: !81, inlinedAt: !101)
!105 = !DILocation(line: 52, scope: !85, inlinedAt: !106)
!106 = distinct !DILocation(line: 115, scope: !75, inlinedAt: !102)
!107 = !DILocation(line: 116, scope: !75, inlinedAt: !102)
!108 = !DILocation(line: 87, scope: !109, inlinedAt: !96)
!109 = distinct !DISubprogram(name: "+;", linkageName: "+", scope: !94, file: !94, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!110 = !DILocation(line: 142, scope: !111, inlinedAt: !112)
!111 = distinct !DISubprogram(name: "flipsign;", linkageName: "flipsign", scope: !94, file: !94, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!112 = !DILocation(line: 188, scope: !113, inlinedAt: !95)
!113 = distinct !DISubprogram(name: "abs;", linkageName: "abs", scope: !94, file: !94, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!114 = !DILocation(line: 142, scope: !111, inlinedAt: !95)
!115 = !DILocation(line: 88, scope: !116, inlinedAt: !96)
!116 = distinct !DISubprogram(name: "*;", linkageName: "*", scope: !94, file: !94, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!117 = !DILocation(line: 180, scope: !118, inlinedAt: !119)
!118 = distinct !DISubprogram(name: "bpermute;", linkageName: "bpermute", scope: !48, file: !48, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!119 = !DILocation(line: 7, scope: !41)
!120 = !DILocation(line: 38, scope: !121, inlinedAt: !122)
!121 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !65, file: !65, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!122 = !DILocation(line: 0, scope: !123, inlinedAt: !124)
!123 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !68, file: !68, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!124 = !DILocation(line: 0, scope: !125, inlinedAt: !126)
!125 = distinct !DISubprogram(name: "pointerset;", linkageName: "pointerset", scope: !68, file: !68, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!126 = !DILocation(line: 88, scope: !127, inlinedAt: !129)
!127 = distinct !DISubprogram(name: "unsafe_store!;", linkageName: "unsafe_store!", scope: !128, file: !128, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!128 = !DIFile(filename: "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl", directory: ".")
!129 = !DILocation(line: 90, scope: !130, inlinedAt: !119)
!130 = distinct !DISubprogram(name: "#setindex!;", linkageName: "#setindex!", scope: !131, file: !131, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!131 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl", directory: ".")
!132 = !{!133, !133, i64 0, i64 0}
!133 = !{!"custom_tbaa_addrspace(1)", !134, i64 0}
!134 = !{!"custom_tbaa"}
!135 = !DILocation(line: 8, scope: !41)

@arsenm
Copy link
Contributor

arsenm commented Jul 30, 2024

However, some other started failing.
Here's example with bpermute which is llvm.amdgcn.ds.bpermute intrinsic
and the kernel shifts values in the array left by 1 item.

Here 32nd item should be 32 and 64th item 0, but it still acts as if it is using wave32.

Correct. bpermute cannot access the other half of the wave in wave64. The ISA manual states:
"Note that in wave64 mode the permute operates only across 32 lanes at a time of each half of a wave64. In other words, it executes as if were two independent wave32’s. Each half-wave can use indices in the range 0-31 to reference lanes in that same half-wave."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:AMDGPU llvm:SelectionDAG SelectionDAGISel as well question A question, not bug report. Check out https://llvm.org/docs/GettingInvolved.html instead!
Projects
None yet
Development

No branches or pull requests

5 participants