-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
target amdgcn-amd-amdhsa: cttz in BB after amdgcn.ballot leads to "Cannot select SETCC..." #89332
Comments
@llvm/issue-subscribers-backend-amdgpu Author: Matt (MattPD)
The following LLVM IR:
; reduced.ll
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-p9:192:256:256:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9"
target triple = "amdgcn-amd-amdhsa"
define ptr addrspace(1) @<!-- -->__ockl_dm_alloc(i1 %0) {
__ockl_wfany_i32.exit:
%1 = tail call i32 @<!-- -->llvm.amdgcn.ballot.i32(i1 %0)
br label %2
2: ; preds = %__ockl_wfany_i32.exit
%3 = tail call i32 @<!-- -->llvm.cttz.i32(i32 %1, i1 false)
ret ptr addrspace(1) null
}
; Function Attrs: convergent nocallback nofree nounwind willreturn memory(none)
declare i32 @<!-- -->llvm.amdgcn.ballot.i32(i1) #<!-- -->0
; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @<!-- -->llvm.cttz.i32(i32, i1 immarg) #<!-- -->1
attributes #<!-- -->0 = { convergent nocallback nofree nounwind willreturn memory(none) }
attributes #<!-- -->1 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) } Compiled with: llc -O0 reduced.ll yields ICE:
Worth noting that the ICE occurs at -O0, https://llc.godbolt.org/z/xhG4qrYWv as well as at -O1, https://llc.godbolt.org/z/WeMvT17bb. However, it does not occur at -O2, https://llc.godbolt.org/z/WeMvT17bb We can see that the basic block (BB) This is reduced LLVM IR (using llvm-reduce). For context, the original LLVM IR has been produced from the following (adding
The original |
This is trying to codegen wave32 IR on a wave64 target. While we currently handle the wave64 ballots on wave32, I don't think we try to handle the inverse case. This should be trying to use ballot.i64 |
FWIW, the original code comes from However, the actual used code is LLVM bitcode (which OCKL ships as). To illustrate the issue in a less reduced context: @__oclc_wavefrontsize64 = linkonce_odr protected local_unnamed_addr addrspace(4) constant i8 1, align 1
define ptr addrspace(1) @__ockl_dm_alloc(i1 %0) {
%wavefrontsize64 = load i8, ptr addrspace(4) @__oclc_wavefrontsize64, align 1
%wavefrontsize64_zero = icmp eq i8 %wavefrontsize64, 0
br i1 %wavefrontsize64_zero, label %ballot32, label %ballot64
ballot64:
%ballot64result = tail call i64 @llvm.amdgcn.ballot.i64(i1 %0)
%193 = icmp eq i64 %ballot64result, 0
br label %cttz64
ballot32:
%ballot32result = tail call i32 @llvm.amdgcn.ballot.i32(i1 %0)
%202 = icmp eq i32 %ballot32result, 0
br label %cttz32
cttz64:
%cttz64result = tail call i64 @llvm.cttz.i64(i64 %ballot64result, i1 false)
br label %bbexit
cttz32:
%cttz32result = tail call i32 @llvm.cttz.i32(i32 %ballot32result, i1 false)
br label %bbexit
bbexit:
ret ptr addrspace(1) null
} Thus even if |
Yes, the approach OCKL uses for the wave size is fragile this way. Really we should be treating the wavesize as a wholly incompatible ABI. These two implementations should not coexist in the same function, so this is primarily an OCKL bug |
In this particular case, I think ockl can just pretend wave32 doesn't exist and only use the wave64 builtins |
Thanks! OK, I'm going to see if I can pass this on to the OCKL folks from my end. |
I've posted the workaround to ockl. Something like #86957 is needed to improve the diagnostic |
Wave32 and wave64 paths cannot really co-exist in the same function or callgraph. They need to be treated as a hard ABI incompatibility. We cannot handle the wave32 operation on wave64, but we can and do handle the wave64 operation on wave32. Given the current linking scheme, the most expedient fix for this not working is to pretend wave32 does not exist and just use the wave64 ballot. The optimizer will fold the 64-bit ballot intrinsic to the 32-bit one when it sees a 32-bit target. This was reported broken in llvm#89332 Change-Id: If897f631066e257e18d8adf574cd17d3f9278ca7
Wave32 and wave64 paths cannot really co-exist in the same function or callgraph. They need to be treated as a hard ABI incompatibility. We cannot handle the wave32 operation on wave64, but we can and do handle the wave64 operation on wave32. Given the current linking scheme, the most expedient fix for this not working is to pretend wave32 does not exist and just use the wave64 ballot. The optimizer will fold the 64-bit ballot intrinsic to the 32-bit one when it sees a 32-bit target. This was reported broken in llvm#89332 Change-Id: If897f631066e257e18d8adf574cd17d3f9278ca7
Wave32 and wave64 paths cannot really co-exist in the same function or callgraph. They need to be treated as a hard ABI incompatibility. We cannot handle the wave32 operation on wave64, but we can and do handle the wave64 operation on wave32. Given the current linking scheme, the most expedient fix for this not working is to pretend wave32 does not exist and just use the wave64 ballot. The optimizer will fold the 64-bit ballot intrinsic to the 32-bit one when it sees a 32-bit target. This was reported broken in llvm#89332 Change-Id: If897f631066e257e18d8adf574cd17d3f9278ca7
@arsenm, what optimization pass should fold 64 bit to 32 bit? I'm having a similar issue with ROCm 6.1.2 which includes ROCm@96b2ba3. Example kernel: function ker!(X, Y)
idx = workitemIdx().x
Y[1] = AMDGPU.Device.wfany(X[idx])
return
end Where With opaque pointers enabled, the above kernel results in ICE:
Looking at the unoptimized & optimized IR we can see that 64 bit ballot is not folded:
However, if I disable opaque pointers (LLVM 16 still supports typed pointers), then it is folded:
Alternatively, if I fool the compiler and set |
InstCombine performs the wave64->wave32 fold, but that can only hide the issue. It's not a mandatory transformation. 07c5920 is the patch that permits selecting the wave64 version on wave32
It supports wave32 and wave64. It's not wrong to use wave64 on gfx10/11/12, they support both. 32 is just the native and default wavesize. You do have to consistently use the same wavesize in a callgraph, they cannot be intermixed. |
I see it is for LLVM 17+. We'd have to wait for Julia 1.12 for that (1.11 itself is still in beta).
I do see wrong results with julia> using AMDGPU
julia> function ker!(x)
i = workitemIdx().x
x[i] = AMDGPU.Device.sync_workgroup_count(Cint(1))
return
end
ker! (generic function with 1 method) I then launch it and pass array of length julia> x = ROCArray(zeros(Cint, (1, 33)));
julia> @roc groupsize=length(x) gridsize=1 ker!(x); For julia> x
1×33 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 But for julia> x
1×33 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 I set Optimized LLVM IR for the kernel:
LLVM IR; ModuleID = 'start'
source_filename = "start"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:10:11:12:13"
target triple = "amdgcn-amd-amdhsa"
@__scratch_lds = internal addrspace(3) global [32 x i64] undef, align 8
; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.x() #0
; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.y() #0
; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.z() #0
; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare align 4 i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr() #1
; Function Attrs: alwaysinline nounwind readnone willreturn
declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32) #2
; Function Attrs: alwaysinline convergent nounwind willreturn
declare void @llvm.amdgcn.s.barrier() #3
; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workgroup.id.y() #1
; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workgroup.id.x() #1
; Function Attrs: alwaysinline convergent nounwind readnone willreturn
declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32 immarg, i32 immarg, i32 immarg, i1 immarg) #4
; Function Attrs: alwaysinline convergent nounwind readnone willreturn
declare i32 @llvm.amdgcn.permlanex16(i32, i32, i32, i32, i1 immarg, i1 immarg) #4
; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workgroup.id.z() #1
; @ REPL[15]:1 within `ker!`
define amdgpu_kernel void @_Z4ker_14ROCDeviceArrayI5Int32Li2ELi1EE({ i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 } %state, { [2 x i64], i8 addrspace(1)*, i64 } %0) local_unnamed_addr #5 !dbg !39 {
conversion:
; @ REPL[15]:3 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17 within `sync_workgroup_count`
%1 = call i32 @llvm.amdgcn.workgroup.id.z(), !dbg !43
%2 = call i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr(), !dbg !43
%3 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 8, !dbg !43
%4 = bitcast i8 addrspace(4)* %3 to i32 addrspace(4)*, !dbg !43
%5 = load i32, i32 addrspace(4)* %4, align 4, !dbg !43, !tbaa !47
%6 = icmp ult i32 %1, %5, !dbg !43
%7 = select i1 %6, i64 16, i64 22, !dbg !43
%8 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 %7, !dbg !43
%9 = bitcast i8 addrspace(4)* %8 to i16 addrspace(4)*, !dbg !43
%10 = load i16, i16 addrspace(4)* %9, align 2, !dbg !43, !tbaa !51
%11 = zext i16 %10 to i32, !dbg !43
%12 = call i32 @llvm.amdgcn.workgroup.id.y(), !dbg !43
%13 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 4, !dbg !43
%14 = bitcast i8 addrspace(4)* %13 to i32 addrspace(4)*, !dbg !43
%15 = load i32, i32 addrspace(4)* %14, align 4, !dbg !43, !tbaa !47
%16 = icmp ult i32 %12, %15, !dbg !43
%17 = select i1 %16, i64 14, i64 20, !dbg !43
%18 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 %17, !dbg !43
%19 = bitcast i8 addrspace(4)* %18 to i16 addrspace(4)*, !dbg !43
%20 = load i16, i16 addrspace(4)* %19, align 2, !dbg !43, !tbaa !51
%21 = zext i16 %20 to i32, !dbg !43
%22 = call i32 @llvm.amdgcn.workgroup.id.x(), !dbg !43
%23 = bitcast i8 addrspace(4)* %2 to i32 addrspace(4)*, !dbg !43
%24 = load i32, i32 addrspace(4)* %23, align 4, !dbg !43, !tbaa !47
%25 = icmp ult i32 %22, %24, !dbg !43
%26 = select i1 %25, i64 12, i64 18, !dbg !43
%27 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 %26, !dbg !43
%28 = bitcast i8 addrspace(4)* %27 to i16 addrspace(4)*, !dbg !43
%29 = load i16, i16 addrspace(4)* %28, align 2, !dbg !43, !tbaa !51
%30 = zext i16 %29 to i32, !dbg !43
%31 = shl nuw nsw i32 %21, 8, !dbg !43
%32 = mul i32 %31, %30, !dbg !43
%33 = ashr exact i32 %32, 8, !dbg !43
%34 = mul nsw i32 %33, %11, !dbg !43
%35 = add i32 %34, 31, !dbg !43
%36 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 1, i32 257, i32 15, i32 15, i1 true), !dbg !43
%37 = add nsw i32 %36, 1, !dbg !43
%38 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %37, i32 258, i32 15, i32 15, i1 true), !dbg !43
%39 = add nsw i32 %37, %38, !dbg !43
%40 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %39, i32 260, i32 15, i32 15, i1 true), !dbg !43
%41 = add nsw i32 %39, %40, !dbg !43
%42 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %41, i32 264, i32 15, i32 15, i1 true), !dbg !43
%43 = add nsw i32 %41, %42, !dbg !43
%44 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %43, i32 336, i32 15, i32 15, i1 true), !dbg !43
%45 = call i32 @llvm.amdgcn.permlanex16(i32 undef, i32 %44, i32 0, i32 0, i1 false, i1 true), !dbg !43
%46 = add nsw i32 %45, %44, !dbg !43
%.mask = and i32 %35, -32, !dbg !43
%47 = icmp eq i32 %.mask, 32, !dbg !43
br i1 %47, label %conversion.__ockl_wgred_add_i32.exit_crit_edge, label %48, !dbg !43
conversion.__ockl_wgred_add_i32.exit_crit_edge: ; preds = %conversion
; └
; @ REPL[15]:2 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:122 within `workitemIdx`
; │┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:87 within `workitemIdx_x`
; ││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `_index`
; │││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
%.pre = call i32 @llvm.amdgcn.workitem.id.x(), !dbg !53, !range !66
; └└└└
; @ REPL[15]:3 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17 within `sync_workgroup_count`
br label %__ockl_wgred_add_i32.exit, !dbg !43
48: ; preds = %conversion
%49 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0), !dbg !43
%50 = call i32 @llvm.amdgcn.workitem.id.z(), !dbg !43, !range !67
%51 = mul nuw nsw i32 %50, %21, !dbg !43
%52 = call i32 @llvm.amdgcn.workitem.id.y(), !dbg !43, !range !67
%53 = add nuw nsw i32 %51, %52, !dbg !43
%54 = mul i32 %53, %30, !dbg !43
%55 = call i32 @llvm.amdgcn.workitem.id.x(), !dbg !43, !range !67
%56 = add i32 %54, %55, !dbg !43
%57 = lshr i32 %56, 5, !dbg !43
%58 = icmp eq i32 %49, 0, !dbg !43
%59 = or i32 %57, %49, !dbg !43
%60 = icmp eq i32 %59, 0, !dbg !43
br i1 %60, label %61, label %62, !dbg !43
61: ; preds = %48
store atomic i32 %46, i32 addrspace(3)* bitcast ([32 x i64] addrspace(3)* @__scratch_lds to i32 addrspace(3)*) syncscope("workgroup-one-as") monotonic, align 8, !dbg !43
br label %62, !dbg !43
62: ; preds = %61, %48
fence syncscope("workgroup") release, !dbg !43
call void @llvm.amdgcn.s.barrier(), !dbg !43
fence syncscope("workgroup") acquire, !dbg !43
%63 = icmp ugt i32 %56, 31, !dbg !43
%64 = and i1 %58, %63, !dbg !43
br i1 %64, label %65, label %67, !dbg !43
65: ; preds = %62
%66 = atomicrmw add i32 addrspace(3)* bitcast ([32 x i64] addrspace(3)* @__scratch_lds to i32 addrspace(3)*), i32 %46 syncscope("workgroup-one-as") monotonic, align 4, !dbg !43
br label %67, !dbg !43
67: ; preds = %65, %62
fence syncscope("workgroup") release, !dbg !43
call void @llvm.amdgcn.s.barrier(), !dbg !43
fence syncscope("workgroup") acquire, !dbg !43
%68 = load atomic i32, i32 addrspace(3)* bitcast ([32 x i64] addrspace(3)* @__scratch_lds to i32 addrspace(3)*) syncscope("workgroup-one-as") monotonic, align 8, !dbg !43
fence syncscope("workgroup") release, !dbg !43
call void @llvm.amdgcn.s.barrier(), !dbg !43
fence syncscope("workgroup") acquire, !dbg !43
br label %__ockl_wgred_add_i32.exit, !dbg !43
__ockl_wgred_add_i32.exit: ; preds = %67, %conversion.__ockl_wgred_add_i32.exit_crit_edge
; └
; @ REPL[15]:2 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:122 within `workitemIdx`
; │┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:87 within `workitemIdx_x`
; ││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `_index`
; │││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
%.pre-phi = phi i32 [ %.pre, %conversion.__ockl_wgred_add_i32.exit_crit_edge ], [ %55, %67 ], !dbg !53
; └└└└
; @ REPL[15]:3 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17 within `sync_workgroup_count`
%69 = phi i32 [ %46, %conversion.__ockl_wgred_add_i32.exit_crit_edge ], [ %68, %67 ], !dbg !43
%.fca.1.extract = extractvalue { [2 x i64], i8 addrspace(1)*, i64 } %0, 1
; └
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl:90 within `#setindex!`
; │┌ @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl:88 within `unsafe_store!`
; ││┌ @ none within `pointerset`
; │││┌ @ none within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
%70 = bitcast i8 addrspace(1)* %.fca.1.extract to i32 addrspace(1)*, !dbg !68
%71 = zext i32 %.pre-phi to i64, !dbg !68
%72 = getelementptr inbounds i32, i32 addrspace(1)* %70, i64 %71, !dbg !68
store i32 %69, i32 addrspace(1)* %72, align 4, !dbg !68, !tbaa !80
; └└└└
; @ REPL[15]:4 within `ker!`
ret void, !dbg !83
}
attributes #0 = { nounwind readnone speculatable willreturn "amdgpu-unsafe-fp-atomics"="true" }
attributes #1 = { alwaysinline nounwind readnone speculatable willreturn }
attributes #2 = { alwaysinline nounwind readnone willreturn }
attributes #3 = { alwaysinline convergent nounwind willreturn }
attributes #4 = { alwaysinline convergent nounwind readnone willreturn }
attributes #5 = { "amdgpu-unsafe-fp-atomics"="true" }
!llvm.module.flags = !{!0, !1, !2, !3}
!llvm.dbg.cu = !{!4, !6, !7, !8, !9, !10, !11, !12, !13, !14, !15, !16, !17, !18, !19, !20, !21, !22, !23, !24, !25, !26, !27, !28, !29, !30, !31, !32, !33, !34, !35}
!opencl.ocl.version = !{!36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36}
!llvm.ident = !{!37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37}
!julia.kernel = !{!38}
!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{i32 1, !"wchar_size", i32 4}
!3 = !{i32 7, !"PIC Level", i32 1}
!4 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!5 = !DIFile(filename: "julia", directory: ".")
!6 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!7 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!8 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!9 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!10 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!11 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!12 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!13 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!14 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!15 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!16 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!17 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!18 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!19 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!20 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!21 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!22 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!23 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!24 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!25 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!26 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!27 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!28 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!29 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!30 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!31 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!32 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!33 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!34 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!35 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!36 = !{i32 2, i32 0}
!37 = !{!"clang version 15.0.0 (/cache/yggdrasil/downloads/clones/llvm-project.git-974efd367bc513231526d317489c66cb27727ef3caa41108e3819c131a8acf57 f3d695fc2985a8dfdd5f4219d351fdeac3038867)"}
!38 = !{void ({ i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 }, { [2 x i64], i8 addrspace(1)*, i64 })* @_Z4ker_14ROCDeviceArrayI5Int32Li2ELi1EE}
!39 = distinct !DISubprogram(name: "ker!", linkageName: "julia_ker!_4401", scope: null, file: !40, line: 1, type: !41, scopeLine: 1, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!40 = !DIFile(filename: "REPL[15]", directory: ".")
!41 = !DISubroutineType(types: !42)
!42 = !{}
!43 = !DILocation(line: 17, scope: !44, inlinedAt: !46)
!44 = distinct !DISubprogram(name: "sync_workgroup_count;", linkageName: "sync_workgroup_count", scope: !45, file: !45, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!45 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl", directory: ".")
!46 = !DILocation(line: 3, scope: !39)
!47 = !{!48, !48, i64 0}
!48 = !{!"int", !49, i64 0}
!49 = !{!"omnipotent char", !50, i64 0}
!50 = !{!"Simple C/C++ TBAA"}
!51 = !{!52, !52, i64 0}
!52 = !{!"short", !49, i64 0}
!53 = !DILocation(line: 38, scope: !54, inlinedAt: !56)
!54 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !55, file: !55, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!55 = !DIFile(filename: "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl", directory: ".")
!56 = !DILocation(line: 3, scope: !57, inlinedAt: !59)
!57 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !58, file: !58, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!58 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl", directory: ".")
!59 = !DILocation(line: 3, scope: !60, inlinedAt: !61)
!60 = distinct !DISubprogram(name: "_index;", linkageName: "_index", scope: !58, file: !58, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!61 = !DILocation(line: 87, scope: !62, inlinedAt: !63)
!62 = distinct !DISubprogram(name: "workitemIdx_x;", linkageName: "workitemIdx_x", scope: !58, file: !58, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!63 = !DILocation(line: 122, scope: !64, inlinedAt: !65)
!64 = distinct !DISubprogram(name: "workitemIdx;", linkageName: "workitemIdx", scope: !58, file: !58, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!65 = !DILocation(line: 2, scope: !39)
!66 = !{i32 0, i32 1023}
!67 = !{i32 0, i32 1024}
!68 = !DILocation(line: 38, scope: !54, inlinedAt: !69)
!69 = !DILocation(line: 0, scope: !70, inlinedAt: !72)
!70 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !71, file: !71, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!71 = !DIFile(filename: "none", directory: ".")
!72 = !DILocation(line: 0, scope: !73, inlinedAt: !74)
!73 = distinct !DISubprogram(name: "pointerset;", linkageName: "pointerset", scope: !71, file: !71, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!74 = !DILocation(line: 88, scope: !75, inlinedAt: !77)
!75 = distinct !DISubprogram(name: "unsafe_store!;", linkageName: "unsafe_store!", scope: !76, file: !76, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!76 = !DIFile(filename: "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl", directory: ".")
!77 = !DILocation(line: 90, scope: !78, inlinedAt: !46)
!78 = distinct !DISubprogram(name: "#setindex!;", linkageName: "#setindex!", scope: !79, file: !79, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !42)
!79 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl", directory: ".")
!80 = !{!81, !81, i64 0, i64 0}
!81 = !{!"custom_tbaa_addrspace(1)", !82, i64 0}
!82 = !{!"custom_tbaa"}
!83 = !DILocation(line: 4, scope: !39)
LLVM IR; ModuleID = 'start'
source_filename = "start"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:10:11:12:13"
target triple = "amdgcn-amd-amdhsa"
@__scratch_lds = internal addrspace(3) global [32 x i64] undef, align 8
; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.x() #0
; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.y() #0
; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.z() #0
; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare align 4 i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr() #1
; Function Attrs: alwaysinline nounwind readnone willreturn
declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32) #2
; Function Attrs: alwaysinline nounwind readnone willreturn
declare i32 @llvm.amdgcn.mbcnt.hi(i32, i32) #2
; Function Attrs: alwaysinline convergent nounwind willreturn
declare void @llvm.amdgcn.s.barrier() #3
; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workgroup.id.y() #1
; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workgroup.id.x() #1
; Function Attrs: alwaysinline nounwind readonly
declare i32 @llvm.read_register.i32(metadata) #4
; Function Attrs: alwaysinline convergent nounwind readnone willreturn
declare i32 @llvm.amdgcn.readlane(i32, i32) #5
; Function Attrs: alwaysinline convergent nounwind readnone willreturn
declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32 immarg, i32 immarg, i32 immarg, i1 immarg) #5
; Function Attrs: alwaysinline convergent nounwind readnone willreturn
declare i32 @llvm.amdgcn.permlanex16(i32, i32, i32, i32, i1 immarg, i1 immarg) #5
; Function Attrs: alwaysinline nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workgroup.id.z() #1
; @ REPL[15]:1 within `ker!`
define amdgpu_kernel void @_Z4ker_14ROCDeviceArrayI5Int32Li2ELi1EE({ i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 } %state, { [2 x i64], i8 addrspace(1)*, i64 } %0) local_unnamed_addr #6 !dbg !39 {
conversion:
; @ REPL[15]:3 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17 within `sync_workgroup_count`
%1 = call i32 @llvm.amdgcn.workgroup.id.z(), !dbg !43
%2 = call i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr(), !dbg !43
%3 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 8, !dbg !43
%4 = bitcast i8 addrspace(4)* %3 to i32 addrspace(4)*, !dbg !43
%5 = load i32, i32 addrspace(4)* %4, align 4, !dbg !43, !tbaa !47
%6 = icmp ult i32 %1, %5, !dbg !43
%7 = select i1 %6, i64 16, i64 22, !dbg !43
%8 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 %7, !dbg !43
%9 = bitcast i8 addrspace(4)* %8 to i16 addrspace(4)*, !dbg !43
%10 = load i16, i16 addrspace(4)* %9, align 2, !dbg !43, !tbaa !51
%11 = zext i16 %10 to i32, !dbg !43
%12 = call i32 @llvm.amdgcn.workgroup.id.y(), !dbg !43
%13 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 4, !dbg !43
%14 = bitcast i8 addrspace(4)* %13 to i32 addrspace(4)*, !dbg !43
%15 = load i32, i32 addrspace(4)* %14, align 4, !dbg !43, !tbaa !47
%16 = icmp ult i32 %12, %15, !dbg !43
%17 = select i1 %16, i64 14, i64 20, !dbg !43
%18 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 %17, !dbg !43
%19 = bitcast i8 addrspace(4)* %18 to i16 addrspace(4)*, !dbg !43
%20 = load i16, i16 addrspace(4)* %19, align 2, !dbg !43, !tbaa !51
%21 = zext i16 %20 to i32, !dbg !43
%22 = call i32 @llvm.amdgcn.workgroup.id.x(), !dbg !43
%23 = bitcast i8 addrspace(4)* %2 to i32 addrspace(4)*, !dbg !43
%24 = load i32, i32 addrspace(4)* %23, align 4, !dbg !43, !tbaa !47
%25 = icmp ult i32 %22, %24, !dbg !43
%26 = select i1 %25, i64 12, i64 18, !dbg !43
%27 = getelementptr inbounds i8, i8 addrspace(4)* %2, i64 %26, !dbg !43
%28 = bitcast i8 addrspace(4)* %27 to i16 addrspace(4)*, !dbg !43
%29 = load i16, i16 addrspace(4)* %28, align 2, !dbg !43, !tbaa !51
%30 = zext i16 %29 to i32, !dbg !43
%31 = shl nuw nsw i32 %21, 8, !dbg !43
%32 = mul i32 %31, %30, !dbg !43
%33 = ashr exact i32 %32, 8, !dbg !43
%34 = mul nsw i32 %33, %11, !dbg !43
%35 = add i32 %34, 63, !dbg !43
%36 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 1, i32 257, i32 15, i32 15, i1 true), !dbg !43
%37 = add nsw i32 %36, 1, !dbg !43
%38 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %37, i32 258, i32 15, i32 15, i1 true), !dbg !43
%39 = add nsw i32 %37, %38, !dbg !43
%40 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %39, i32 260, i32 15, i32 15, i1 true), !dbg !43
%41 = add nsw i32 %39, %40, !dbg !43
%42 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %41, i32 264, i32 15, i32 15, i1 true), !dbg !43
%43 = add nsw i32 %41, %42, !dbg !43
%44 = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %43, i32 336, i32 15, i32 15, i1 true), !dbg !43
%45 = call i32 @llvm.amdgcn.permlanex16(i32 undef, i32 %44, i32 0, i32 0, i1 false, i1 true), !dbg !43
%46 = add nsw i32 %45, %44, !dbg !43
%47 = call i32 @llvm.amdgcn.readlane(i32 %46, i32 32), !dbg !43
%48 = call i32 @llvm.read_register.i32(metadata !53) #7, !dbg !43
%49 = and i32 %48, 1, !dbg !43
%50 = icmp eq i32 %49, 0, !dbg !43
%51 = select i1 %50, i32 0, i32 %47, !dbg !43
%52 = call i32 @llvm.amdgcn.readlane(i32 %46, i32 0), !dbg !43
%53 = add nsw i32 %51, %52, !dbg !43
%.mask = and i32 %35, -64, !dbg !43
%54 = icmp eq i32 %.mask, 64, !dbg !43
br i1 %54, label %conversion.__ockl_wgred_add_i32.exit_crit_edge, label %55, !dbg !43
conversion.__ockl_wgred_add_i32.exit_crit_edge: ; preds = %conversion
; └
; @ REPL[15]:2 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:122 within `workitemIdx`
; │┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:87 within `workitemIdx_x`
; ││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `_index`
; │││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
%.pre = call i32 @llvm.amdgcn.workitem.id.x(), !dbg !54, !range !67
; └└└└
; @ REPL[15]:3 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17 within `sync_workgroup_count`
br label %__ockl_wgred_add_i32.exit, !dbg !43
55: ; preds = %conversion
%56 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0), !dbg !43
%57 = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %56), !dbg !43
%58 = call i32 @llvm.amdgcn.workitem.id.z(), !dbg !43, !range !68
%59 = mul nuw nsw i32 %58, %21, !dbg !43
%60 = call i32 @llvm.amdgcn.workitem.id.y(), !dbg !43, !range !68
%61 = add nuw nsw i32 %59, %60, !dbg !43
%62 = mul i32 %61, %30, !dbg !43
%63 = call i32 @llvm.amdgcn.workitem.id.x(), !dbg !43, !range !68
%64 = add i32 %62, %63, !dbg !43
%65 = lshr i32 %64, 6, !dbg !43
%66 = icmp eq i32 %57, 0, !dbg !43
%67 = or i32 %65, %57, !dbg !43
%68 = icmp eq i32 %67, 0, !dbg !43
br i1 %68, label %69, label %70, !dbg !43
69: ; preds = %55
store atomic i32 %53, i32 addrspace(3)* bitcast ([32 x i64] addrspace(3)* @__scratch_lds to i32 addrspace(3)*) syncscope("workgroup-one-as") monotonic, align 8, !dbg !43
br label %70, !dbg !43
70: ; preds = %69, %55
fence syncscope("workgroup") release, !dbg !43
call void @llvm.amdgcn.s.barrier(), !dbg !43
fence syncscope("workgroup") acquire, !dbg !43
%71 = icmp ugt i32 %64, 63, !dbg !43
%72 = and i1 %66, %71, !dbg !43
br i1 %72, label %73, label %75, !dbg !43
73: ; preds = %70
%74 = atomicrmw add i32 addrspace(3)* bitcast ([32 x i64] addrspace(3)* @__scratch_lds to i32 addrspace(3)*), i32 %53 syncscope("workgroup-one-as") monotonic, align 4, !dbg !43
br label %75, !dbg !43
75: ; preds = %73, %70
fence syncscope("workgroup") release, !dbg !43
call void @llvm.amdgcn.s.barrier(), !dbg !43
fence syncscope("workgroup") acquire, !dbg !43
%76 = load atomic i32, i32 addrspace(3)* bitcast ([32 x i64] addrspace(3)* @__scratch_lds to i32 addrspace(3)*) syncscope("workgroup-one-as") monotonic, align 8, !dbg !43
fence syncscope("workgroup") release, !dbg !43
call void @llvm.amdgcn.s.barrier(), !dbg !43
fence syncscope("workgroup") acquire, !dbg !43
br label %__ockl_wgred_add_i32.exit, !dbg !43
__ockl_wgred_add_i32.exit: ; preds = %75, %conversion.__ockl_wgred_add_i32.exit_crit_edge
; └
; @ REPL[15]:2 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:122 within `workitemIdx`
; │┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:87 within `workitemIdx_x`
; ││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `_index`
; │││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl:3 within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
%.pre-phi = phi i32 [ %.pre, %conversion.__ockl_wgred_add_i32.exit_crit_edge ], [ %63, %75 ], !dbg !54
; └└└└
; @ REPL[15]:3 within `ker!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17 within `sync_workgroup_count`
%77 = phi i32 [ %53, %conversion.__ockl_wgred_add_i32.exit_crit_edge ], [ %76, %75 ], !dbg !43
%.fca.1.extract = extractvalue { [2 x i64], i8 addrspace(1)*, i64 } %0, 1
; └
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl:90 within `#setindex!`
; │┌ @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl:88 within `unsafe_store!`
; ││┌ @ none within `pointerset`
; │││┌ @ none within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
%78 = bitcast i8 addrspace(1)* %.fca.1.extract to i32 addrspace(1)*, !dbg !69
%79 = zext i32 %.pre-phi to i64, !dbg !69
%80 = getelementptr inbounds i32, i32 addrspace(1)* %78, i64 %79, !dbg !69
store i32 %77, i32 addrspace(1)* %80, align 4, !dbg !69, !tbaa !81
; └└└└
; @ REPL[15]:4 within `ker!`
ret void, !dbg !84
}
attributes #0 = { nounwind readnone speculatable willreturn "amdgpu-unsafe-fp-atomics"="true" }
attributes #1 = { alwaysinline nounwind readnone speculatable willreturn }
attributes #2 = { alwaysinline nounwind readnone willreturn }
attributes #3 = { alwaysinline convergent nounwind willreturn }
attributes #4 = { alwaysinline nounwind readonly }
attributes #5 = { alwaysinline convergent nounwind readnone willreturn }
attributes #6 = { "amdgpu-unsafe-fp-atomics"="true" }
attributes #7 = { convergent }
!llvm.module.flags = !{!0, !1, !2, !3}
!llvm.dbg.cu = !{!4, !6, !7, !8, !9, !10, !11, !12, !13, !14, !15, !16, !17, !18, !19, !20, !21, !22, !23, !24, !25, !26, !27, !28, !29, !30, !31, !32, !33, !34, !35}
!opencl.ocl.version = !{!36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36, !36}
!llvm.ident = !{!37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37, !37}
!julia.kernel = !{!38}
!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{i32 1, !"wchar_size", i32 4}
!3 = !{i32 7, !"PIC Level", i32 1}
!4 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!5 = !DIFile(filename: "julia", directory: ".")
!6 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!7 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!8 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!9 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!10 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!11 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!12 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!13 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!14 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!15 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!16 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!17 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!18 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!19 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!20 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!21 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!22 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!23 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!24 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!25 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!26 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!27 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!28 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!29 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!30 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!31 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!32 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!33 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!34 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!35 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !5, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!36 = !{i32 2, i32 0}
!37 = !{!"clang version 15.0.0 (/cache/yggdrasil/downloads/clones/llvm-project.git-974efd367bc513231526d317489c66cb27727ef3caa41108e3819c131a8acf57 f3d695fc2985a8dfdd5f4219d351fdeac3038867)"}
!38 = !{void ({ i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 }, { [2 x i64], i8 addrspace(1)*, i64 })* @_Z4ker_14ROCDeviceArrayI5Int32Li2ELi1EE}
!39 = distinct !DISubprogram(name: "ker!", linkageName: "julia_ker!_4495", scope: null, file: !40, line: 1, type: !41, scopeLine: 1, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!40 = !DIFile(filename: "REPL[15]", directory: ".")
!41 = !DISubroutineType(types: !42)
!42 = !{}
!43 = !DILocation(line: 17, scope: !44, inlinedAt: !46)
!44 = distinct !DISubprogram(name: "sync_workgroup_count;", linkageName: "sync_workgroup_count", scope: !45, file: !45, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!45 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl", directory: ".")
!46 = !DILocation(line: 3, scope: !39)
!47 = !{!48, !48, i64 0}
!48 = !{!"int", !49, i64 0}
!49 = !{!"omnipotent char", !50, i64 0}
!50 = !{!"Simple C/C++ TBAA"}
!51 = !{!52, !52, i64 0}
!52 = !{!"short", !49, i64 0}
!53 = !{!"exec_hi"}
!54 = !DILocation(line: 38, scope: !55, inlinedAt: !57)
!55 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !56, file: !56, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!56 = !DIFile(filename: "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl", directory: ".")
!57 = !DILocation(line: 3, scope: !58, inlinedAt: !60)
!58 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !59, file: !59, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!59 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/indexing.jl", directory: ".")
!60 = !DILocation(line: 3, scope: !61, inlinedAt: !62)
!61 = distinct !DISubprogram(name: "_index;", linkageName: "_index", scope: !59, file: !59, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!62 = !DILocation(line: 87, scope: !63, inlinedAt: !64)
!63 = distinct !DISubprogram(name: "workitemIdx_x;", linkageName: "workitemIdx_x", scope: !59, file: !59, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!64 = !DILocation(line: 122, scope: !65, inlinedAt: !66)
!65 = distinct !DISubprogram(name: "workitemIdx;", linkageName: "workitemIdx", scope: !59, file: !59, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!66 = !DILocation(line: 2, scope: !39)
!67 = !{i32 0, i32 1023}
!68 = !{i32 0, i32 1024}
!69 = !DILocation(line: 38, scope: !55, inlinedAt: !70)
!70 = !DILocation(line: 0, scope: !71, inlinedAt: !73)
!71 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !72, file: !72, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!72 = !DIFile(filename: "none", directory: ".")
!73 = !DILocation(line: 0, scope: !74, inlinedAt: !75)
!74 = distinct !DISubprogram(name: "pointerset;", linkageName: "pointerset", scope: !72, file: !72, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!75 = !DILocation(line: 88, scope: !76, inlinedAt: !78)
!76 = distinct !DISubprogram(name: "unsafe_store!;", linkageName: "unsafe_store!", scope: !77, file: !77, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!77 = !DIFile(filename: "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl", directory: ".")
!78 = !DILocation(line: 90, scope: !79, inlinedAt: !46)
!79 = distinct !DISubprogram(name: "#setindex!;", linkageName: "#setindex!", scope: !80, file: !80, type: !41, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !6, retainedNodes: !42)
!80 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl", directory: ".")
!81 = !{!82, !82, i64 0, i64 0}
!82 = !{!"custom_tbaa_addrspace(1)", !83, i64 0}
!83 = !{!"custom_tbaa"}
!84 = !DILocation(line: 4, scope: !39) |
The wave64 IR posted here looks a bit suspicious to me, like you linked the library incorrectly. For example, why is it using read_register on exec_hi? Also, you didn't provide codegen invocation and the wavesize and target-cpu aren't set in target-features in the IR attributes |
Ah I set features a bit incorrectly, fixing it gives correct results on wave64.
We set target cpu when creating ASM .text
.amdgcn_target "amdgcn-amd-amdhsa--gfx1100"
.globl _Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE ; -- Begin function _Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE
.p2align 8
.type _Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE,@function
_Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE: ; @_Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE
.Lfunc_begin0:
.file 1 "." "/home/pxlth/.julia/dev/AMDGPU/src/t.jl"
.loc 1 3 0 ; /home/pxlth/.julia/dev/AMDGPU/src/t.jl:3:0
.cfi_sections .debug_frame
.cfi_startproc
; %bb.0: ; %conversion
.file 2 "." "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl"
.loc 2 17 0 prologue_end ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17:0
s_clause 0x1
s_load_b32 s4, s[0:1], 0x78
s_load_b64 s[2:3], s[0:1], 0x70
s_add_u32 s8, s0, 0x70
s_addc_u32 s9, s1, 0
v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v3, 1
s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
v_mov_b32_dpp v3, v3 row_shl:1 row_mask:0xf bank_mask:0xf bound_ctrl:1
v_add_nc_u32_e32 v3, 1, v3
s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(SALU_CYCLE_1)
v_add_nc_u32_dpp v3, v3, v3 row_shl:2 row_mask:0xf bank_mask:0xf bound_ctrl:1
s_waitcnt lgkmcnt(0)
s_cmp_lt_u32 s15, s4
s_cselect_b32 s4, 16, 22
s_add_u32 s4, s8, s4
s_addc_u32 s5, s9, 0
s_cmp_lt_u32 s14, s3
v_add_nc_u32_dpp v3, v3, v3 row_shl:4 row_mask:0xf bank_mask:0xf bound_ctrl:1
s_cselect_b32 s3, 14, 20
s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
s_add_u32 s6, s8, s3
s_addc_u32 s7, s9, 0
s_cmp_lt_u32 s13, s2
v_add_nc_u32_dpp v3, v3, v3 row_shl:8 row_mask:0xf bank_mask:0xf bound_ctrl:1
s_cselect_b32 s2, 12, 18
s_add_u32 s2, s8, s2
s_addc_u32 s3, s9, 0
s_clause 0x2
global_load_u16 v1, v2, s[6:7]
global_load_u16 v5, v2, s[2:3]
global_load_u16 v2, v2, s[4:5]
v_mov_b32_dpp v3, v3 row_share:0 row_mask:0xf bank_mask:0xf bound_ctrl:1
s_waitcnt vmcnt(1)
v_mul_lo_u32 v4, v1, v5
s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
v_bfe_i32 v4, v4, 0, 24
s_waitcnt vmcnt(0)
v_mul_lo_u32 v2, v4, v2
v_permlanex16_b32 v4, v3, 0, 0 op_sel:[0,1]
s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_3)
v_add_nc_u32_e32 v4, v4, v3
v_add_nc_u32_e32 v2, 31, v2
s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
v_and_b32_e32 v2, 0xffffffe0, v2
v_cmp_ne_u32_e32 vcc_lo, 32, v2
s_cbranch_vccz .LBB0_6
; %bb.1:
.loc 2 0 0 is_stmt 0 ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:0:0
v_bfe_u32 v2, v0, 10, 10
v_bfe_u32 v3, v0, 20, 10
.loc 2 17 0 ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17:0
s_mov_b32 s2, exec_lo
s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
v_mad_u32_u24 v6, v3, v1, v2
v_and_b32_e32 v1, 0x3ff, v0
v_mad_u64_u32 v[2:3], null, v6, v5, v[1:2]
v_mbcnt_lo_u32_b32 v3, -1, 0
s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_3)
v_mbcnt_hi_u32_b32 v3, -1, v3
v_lshrrev_b32_e32 v5, 5, v2
s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
v_or_b32_e32 v5, v5, v3
v_cmpx_eq_u32_e32 0, v5
s_cbranch_execz .LBB0_3
; %bb.2:
.loc 2 0 0 ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:0:0
v_mov_b32_e32 v5, 0
.loc 2 17 0 ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17:0
ds_store_b32 v5, v4
.LBB0_3:
.loc 2 0 0 ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:0:0
s_or_b32 exec_lo, exec_lo, s2
.loc 2 17 0 ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17:0
v_cmp_eq_u32_e32 vcc_lo, 0, v3
v_cmp_lt_u32_e64 s2, 31, v2
s_mov_b32 s3, 0
s_waitcnt vmcnt(0) lgkmcnt(0)
s_waitcnt_vscnt null, 0x0
s_barrier
s_waitcnt vmcnt(0) lgkmcnt(0)
s_waitcnt_vscnt null, 0x0
buffer_gl0_inv
s_and_b32 s4, vcc_lo, s2
s_delay_alu instid0(SALU_CYCLE_1)
s_and_saveexec_b32 s2, s4
s_cbranch_execz .LBB0_5
; %bb.4:
.loc 2 0 0 ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:0:0
v_mov_b32_e32 v2, 0
.loc 2 17 0 ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17:0
ds_add_u32 v2, v4
.LBB0_5:
.loc 2 0 0 ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:0:0
s_or_b32 exec_lo, exec_lo, s2
v_mov_b32_e32 v2, 0
.loc 2 17 0 ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17:0
s_waitcnt vmcnt(0) lgkmcnt(0)
s_waitcnt_vscnt null, 0x0
s_barrier
s_waitcnt vmcnt(0) lgkmcnt(0)
s_waitcnt_vscnt null, 0x0
buffer_gl0_inv
ds_load_b32 v3, v2
s_waitcnt vmcnt(0) lgkmcnt(0)
s_waitcnt_vscnt null, 0x0
s_barrier
s_waitcnt vmcnt(0) lgkmcnt(0)
s_waitcnt_vscnt null, 0x0
buffer_gl0_inv
s_and_b32 vcc_lo, exec_lo, s3
s_cbranch_vccz .LBB0_8
s_branch .LBB0_7
.LBB0_6:
; implicit-def: $vgpr3
; implicit-def: $vgpr1
.loc 2 0 0 ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:0:0
s_cbranch_execz .LBB0_8
.LBB0_7: ; %conversion.__ockl_wgred_add_i32.exit_crit_edge
v_and_b32_e32 v1, 0x3ff, v0
v_mov_b32_e32 v3, v4
.LBB0_8: ; %__ockl_wgred_add_i32.exit
.loc 2 17 0 ; /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/synchronization.jl:17:0
s_load_b64 s[0:1], s[0:1], 0x60
.Ltmp0:
.file 3 "." "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl"
.loc 3 38 0 is_stmt 1 ; /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38:0
v_mov_b32_e32 v2, 0
s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
v_lshlrev_b64 v[0:1], 2, v[1:2]
s_waitcnt lgkmcnt(0)
v_add_co_u32 v0, vcc_lo, s0, v0
s_delay_alu instid0(VALU_DEP_2)
v_add_co_ci_u32_e32 v1, vcc_lo, s1, v1, vcc_lo
global_store_b32 v[0:1], v3, off
s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
.Ltmp1:
.loc 1 6 0 ; /home/pxlth/.julia/dev/AMDGPU/src/t.jl:6:0
s_endpgm
.Ltmp2:
.section .rodata,#alloc
.p2align 6, 0x0
.amdhsa_kernel _Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE
.amdhsa_group_segment_fixed_size 256
.amdhsa_private_segment_fixed_size 0
.amdhsa_kernarg_size 168
.amdhsa_user_sgpr_count 13
.amdhsa_user_sgpr_dispatch_ptr 0
.amdhsa_user_sgpr_queue_ptr 0
.amdhsa_user_sgpr_kernarg_segment_ptr 1
.amdhsa_user_sgpr_dispatch_id 0
.amdhsa_user_sgpr_private_segment_size 0
.amdhsa_wavefront_size32 1
.amdhsa_enable_private_segment 0
.amdhsa_system_sgpr_workgroup_id_x 1
.amdhsa_system_sgpr_workgroup_id_y 1
.amdhsa_system_sgpr_workgroup_id_z 1
.amdhsa_system_sgpr_workgroup_info 0
.amdhsa_system_vgpr_workitem_id 2
.amdhsa_next_free_vgpr 7
.amdhsa_next_free_sgpr 16
.amdhsa_float_round_mode_32 0
.amdhsa_float_round_mode_16_64 0
.amdhsa_float_denorm_mode_32 3
.amdhsa_float_denorm_mode_16_64 3
.amdhsa_dx10_clamp 1
.amdhsa_ieee_mode 1
.amdhsa_fp16_overflow 0
.amdhsa_workgroup_processor_mode 1
.amdhsa_memory_ordered 1
.amdhsa_forward_progress 0
.amdhsa_shared_vgpr_count 0
.amdhsa_exception_fp_ieee_invalid_op 0
.amdhsa_exception_fp_denorm_src 0
.amdhsa_exception_fp_ieee_div_zero 0
.amdhsa_exception_fp_ieee_overflow 0
.amdhsa_exception_fp_ieee_underflow 0
.amdhsa_exception_fp_ieee_inexact 0
.amdhsa_exception_int_div_zero 0
.end_amdhsa_kernel
.text
.Lfunc_end0:
.size _Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE, .Lfunc_end0-_Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE
.cfi_endproc
.file 4 "." "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl"
.file 5 "." "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl"
.file 6 "." "none"
; -- End function
.section .AMDGPU.csdata
; Kernel info:
; codeLenInByte = 604
; NumSgprs: 18
; NumVgprs: 7
; ScratchSize: 0
; MemoryBound: 0
; FloatMode: 240
; IeeeMode: 1
; LDSByteSize: 256 bytes/workgroup (compile time only)
; SGPRBlocks: 2
; VGPRBlocks: 0
; NumSGPRsForWavesPerEU: 18
; NumVGPRsForWavesPerEU: 7
; Occupancy: 16
; WaveLimiterHint : 0
; COMPUTE_PGM_RSRC2:SCRATCH_EN: 0
; COMPUTE_PGM_RSRC2:USER_SGPR: 13
; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0
; COMPUTE_PGM_RSRC2:TGID_X_EN: 1
; COMPUTE_PGM_RSRC2:TGID_Y_EN: 1
; COMPUTE_PGM_RSRC2:TGID_Z_EN: 1
; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 2
.text
.p2alignl 7, 3214868480
.fill 96, 4, 3214868480
.section .debug_abbrev
.byte 1 ; Abbreviation Code
.byte 17 ; DW_TAG_compile_unit
.byte 1 ; DW_CHILDREN_yes
.byte 37 ; DW_AT_producer
.byte 14 ; DW_FORM_strp
.byte 19 ; DW_AT_language
.byte 5 ; DW_FORM_data2
.byte 3 ; DW_AT_name
.byte 14 ; DW_FORM_strp
.byte 16 ; DW_AT_stmt_list
.byte 23 ; DW_FORM_sec_offset
.byte 27 ; DW_AT_comp_dir
.byte 14 ; DW_FORM_strp
.byte 17 ; DW_AT_low_pc
.byte 1 ; DW_FORM_addr
.byte 18 ; DW_AT_high_pc
.byte 6 ; DW_FORM_data4
.byte 0 ; EOM(1)
.byte 0 ; EOM(2)
.byte 2 ; Abbreviation Code
.byte 46 ; DW_TAG_subprogram
.byte 0 ; DW_CHILDREN_no
.byte 3 ; DW_AT_name
.byte 14 ; DW_FORM_strp
.byte 32 ; DW_AT_inline
.byte 11 ; DW_FORM_data1
.byte 0 ; EOM(1)
.byte 0 ; EOM(2)
.byte 3 ; Abbreviation Code
.byte 46 ; DW_TAG_subprogram
.byte 1 ; DW_CHILDREN_yes
.byte 17 ; DW_AT_low_pc
.byte 1 ; DW_FORM_addr
.byte 18 ; DW_AT_high_pc
.byte 6 ; DW_FORM_data4
.byte 3 ; DW_AT_name
.byte 14 ; DW_FORM_strp
.byte 0 ; EOM(1)
.byte 0 ; EOM(2)
.byte 4 ; Abbreviation Code
.byte 29 ; DW_TAG_inlined_subroutine
.byte 0 ; DW_CHILDREN_no
.byte 49 ; DW_AT_abstract_origin
.byte 19 ; DW_FORM_ref4
.byte 17 ; DW_AT_low_pc
.byte 1 ; DW_FORM_addr
.byte 18 ; DW_AT_high_pc
.byte 6 ; DW_FORM_data4
.byte 88 ; DW_AT_call_file
.byte 11 ; DW_FORM_data1
.byte 89 ; DW_AT_call_line
.byte 11 ; DW_FORM_data1
.byte 0 ; EOM(1)
.byte 0 ; EOM(2)
.byte 5 ; Abbreviation Code
.byte 29 ; DW_TAG_inlined_subroutine
.byte 1 ; DW_CHILDREN_yes
.byte 49 ; DW_AT_abstract_origin
.byte 19 ; DW_FORM_ref4
.byte 17 ; DW_AT_low_pc
.byte 1 ; DW_FORM_addr
.byte 18 ; DW_AT_high_pc
.byte 6 ; DW_FORM_data4
.byte 88 ; DW_AT_call_file
.byte 11 ; DW_FORM_data1
.byte 89 ; DW_AT_call_line
.byte 11 ; DW_FORM_data1
.byte 0 ; EOM(1)
.byte 0 ; EOM(2)
.byte 0 ; EOM(3)
.section .debug_info
.Lcu_begin0:
.long .Ldebug_info_end0-.Ldebug_info_start0 ; Length of Unit
.Ldebug_info_start0:
.short 4 ; DWARF version number
.long .debug_abbrev ; Offset Into Abbrev. Section
.byte 8 ; Address Size (in bytes)
.byte 1 ; Abbrev [1] 0xb:0xcc DW_TAG_compile_unit
.long .Linfo_string0 ; DW_AT_producer
.short 31 ; DW_AT_language
.long .Linfo_string0 ; DW_AT_name
.long .Lline_table_start0 ; DW_AT_stmt_list
.long .Linfo_string1 ; DW_AT_comp_dir
.quad .Lfunc_begin0 ; DW_AT_low_pc
.long .Lfunc_end0-.Lfunc_begin0 ; DW_AT_high_pc
.byte 2 ; Abbrev [2] 0x2a:0x6 DW_TAG_subprogram
.long .Linfo_string2 ; DW_AT_name
.byte 1 ; DW_AT_inline
.byte 2 ; Abbrev [2] 0x30:0x6 DW_TAG_subprogram
.long .Linfo_string3 ; DW_AT_name
.byte 1 ; DW_AT_inline
.byte 2 ; Abbrev [2] 0x36:0x6 DW_TAG_subprogram
.long .Linfo_string3 ; DW_AT_name
.byte 1 ; DW_AT_inline
.byte 2 ; Abbrev [2] 0x3c:0x6 DW_TAG_subprogram
.long .Linfo_string4 ; DW_AT_name
.byte 1 ; DW_AT_inline
.byte 2 ; Abbrev [2] 0x42:0x6 DW_TAG_subprogram
.long .Linfo_string5 ; DW_AT_name
.byte 1 ; DW_AT_inline
.byte 2 ; Abbrev [2] 0x48:0x6 DW_TAG_subprogram
.long .Linfo_string6 ; DW_AT_name
.byte 1 ; DW_AT_inline
.byte 3 ; Abbrev [3] 0x4e:0x88 DW_TAG_subprogram
.quad .Lfunc_begin0 ; DW_AT_low_pc
.long .Lfunc_end0-.Lfunc_begin0 ; DW_AT_high_pc
.long .Linfo_string7 ; DW_AT_name
.byte 4 ; Abbrev [4] 0x5f:0x13 DW_TAG_inlined_subroutine
.long 42 ; DW_AT_abstract_origin
.quad .Lfunc_begin0 ; DW_AT_low_pc
.long .Ltmp0-.Lfunc_begin0 ; DW_AT_high_pc
.byte 1 ; DW_AT_call_file
.byte 5 ; DW_AT_call_line
.byte 5 ; Abbrev [5] 0x72:0x63 DW_TAG_inlined_subroutine
.long 72 ; DW_AT_abstract_origin
.quad .Ltmp0 ; DW_AT_low_pc
.long .Ltmp1-.Ltmp0 ; DW_AT_high_pc
.byte 1 ; DW_AT_call_file
.byte 5 ; DW_AT_call_line
.byte 5 ; Abbrev [5] 0x85:0x4f DW_TAG_inlined_subroutine
.long 66 ; DW_AT_abstract_origin
.quad .Ltmp0 ; DW_AT_low_pc
.long .Ltmp1-.Ltmp0 ; DW_AT_high_pc
.byte 4 ; DW_AT_call_file
.byte 90 ; DW_AT_call_line
.byte 5 ; Abbrev [5] 0x98:0x3b DW_TAG_inlined_subroutine
.long 60 ; DW_AT_abstract_origin
.quad .Ltmp0 ; DW_AT_low_pc
.long .Ltmp1-.Ltmp0 ; DW_AT_high_pc
.byte 5 ; DW_AT_call_file
.byte 88 ; DW_AT_call_line
.byte 5 ; Abbrev [5] 0xab:0x27 DW_TAG_inlined_subroutine
.long 54 ; DW_AT_abstract_origin
.quad .Ltmp0 ; DW_AT_low_pc
.long .Ltmp1-.Ltmp0 ; DW_AT_high_pc
.byte 6 ; DW_AT_call_file
.byte 0 ; DW_AT_call_line
.byte 4 ; Abbrev [4] 0xbe:0x13 DW_TAG_inlined_subroutine
.long 48 ; DW_AT_abstract_origin
.quad .Ltmp0 ; DW_AT_low_pc
.long .Ltmp1-.Ltmp0 ; DW_AT_high_pc
.byte 6 ; DW_AT_call_file
.byte 0 ; DW_AT_call_line
.byte 0 ; End Of Children Mark
.byte 0 ; End Of Children Mark
.byte 0 ; End Of Children Mark
.byte 0 ; End Of Children Mark
.byte 0 ; End Of Children Mark
.byte 0 ; End Of Children Mark
.Ldebug_info_end0:
.section .debug_str,"MS",@progbits,1
.Linfo_string0:
.asciz "julia" ; string offset=0
.Linfo_string1:
.asciz "." ; string offset=6
.Linfo_string2:
.asciz "sync_workgroup_count;" ; string offset=8
.Linfo_string3:
.asciz "macro expansion;" ; string offset=30
.Linfo_string4:
.asciz "pointerset;" ; string offset=47
.Linfo_string5:
.asciz "unsafe_store!;" ; string offset=59
.Linfo_string6:
.asciz "#setindex!;" ; string offset=74
.Linfo_string7:
.asciz "ker!" ; string offset=86
.ident "AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
.ident "AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
.ident "AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
.ident "AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
.ident "AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
.ident "AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
.ident "AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
.ident "AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"
.section ".note.GNU-stack"
.amdgpu_metadata
---
amdhsa.kernels:
- .args:
- .name: state
.offset: 0
.size: 88
.value_kind: by_value
- .offset: 88
.size: 24
.value_kind: by_value
- .offset: 112
.size: 8
.value_kind: hidden_global_offset_x
- .offset: 120
.size: 8
.value_kind: hidden_global_offset_y
- .offset: 128
.size: 8
.value_kind: hidden_global_offset_z
- .offset: 136
.size: 8
.value_kind: hidden_none
- .offset: 144
.size: 8
.value_kind: hidden_none
- .offset: 152
.size: 8
.value_kind: hidden_none
- .offset: 160
.size: 8
.value_kind: hidden_none
.group_segment_fixed_size: 256
.kernarg_segment_align: 8
.kernarg_segment_size: 168
.language: OpenCL C
.language_version:
- 2
- 0
.max_flat_workgroup_size: 1024
.name: _Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE
.private_segment_fixed_size: 0
.sgpr_count: 18
.sgpr_spill_count: 0
.symbol: _Z4ker_14ROCDeviceArrayI5Int32Li1ELi1EE.kd
.vgpr_count: 7
.vgpr_spill_count: 0
.wavefront_size: 32
amdhsa.target: amdgcn-amd-amdhsa--gfx1100
amdhsa.version:
- 1
- 1
...
.end_amdgpu_metadata
.section .debug_line
.Lline_table_start0: A bit unrelated to this, but for the equivalent C++/HIP code I don't observe this information (wavesize, target-cpu) in LLVM IR as well. #include <hip/hip_runtime.h>
#include <iostream>
__global__ void ker(int* x) {
int i = hipThreadIdx_x;
x[i] = __syncthreads_count(1);
}
int main() {
int N = 33;
size_t size = N * sizeof(int);
int* x = new int[N];
for (int i = 0; i < N; ++i)
x[i] = 0;
int* dx;
hipMalloc(&dx, size);
hipMemcpy(dx, x, size, hipMemcpyHostToDevice);
int threadsPerBlock = 33;
int blocksPerGrid = 1;
ker<<<blocksPerGrid, threadsPerBlock>>>(dx);
hipMemcpy(x, dx, size, hipMemcpyDeviceToHost);
for (int i = 0; i < N; ++i)
std::cout << " " << x[i];
std::cout << std::endl;
hipFree(dx);
delete[] x;
return 0;
} Dumping with LLVM IR; ModuleID = 'main.cpp'
source_filename = "main.cpp"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8"
target triple = "amdgcn-amd-amdhsa"
@__scratch_lds = internal addrspace(3) global [32 x i64] poison, align 8
; Function Attrs: convergent mustprogress nofree norecurse nounwind willreturn
define protected amdgpu_kernel void @_Z3kerPi(ptr addrspace(1) nocapture writeonly %0) local_unnamed_addr #0 {
%2 = tail call i32 @llvm.amdgcn.workitem.id.x(), !range !7, !noundef !8
%3 = tail call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
%4 = getelementptr inbounds i16, ptr addrspace(4) %3, i64 8
%5 = load i16, ptr addrspace(4) %4, align 4, !tbaa !9
%6 = zext i16 %5 to i32
%7 = getelementptr inbounds i16, ptr addrspace(4) %3, i64 7
%8 = load i16, ptr addrspace(4) %7, align 2, !tbaa !9
%9 = zext i16 %8 to i32
%10 = getelementptr inbounds i16, ptr addrspace(4) %3, i64 6
%11 = load i16, ptr addrspace(4) %10, align 4, !tbaa !9
%12 = zext i16 %11 to i32
%13 = shl nuw nsw i32 %9, 8
%14 = mul i32 %13, %12
%15 = ashr exact i32 %14, 8
%16 = mul nsw i32 %15, %6
%17 = add i32 %16, 31
%18 = tail call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 1, i32 257, i32 15, i32 15, i1 true)
%19 = add nsw i32 %18, 1
%20 = tail call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %19, i32 258, i32 15, i32 15, i1 true)
%21 = add nsw i32 %19, %20
%22 = tail call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %21, i32 260, i32 15, i32 15, i1 true)
%23 = add nsw i32 %21, %22
%24 = tail call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %23, i32 264, i32 15, i32 15, i1 true)
%25 = add nsw i32 %23, %24
%26 = tail call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %25, i32 336, i32 15, i32 15, i1 true)
%27 = tail call i32 @llvm.amdgcn.permlanex16(i32 undef, i32 %26, i32 0, i32 0, i1 false, i1 true)
%28 = add nsw i32 %27, %26
%29 = and i32 %17, -32
%30 = icmp eq i32 %29, 32
br i1 %30, label %51, label %31
31: ; preds = %1
%32 = tail call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
%33 = tail call i32 @llvm.amdgcn.workitem.id.z(), !range !7, !noundef !8
%34 = mul nuw nsw i32 %33, %9
%35 = tail call i32 @llvm.amdgcn.workitem.id.y(), !range !7, !noundef !8
%36 = add nuw nsw i32 %34, %35
%37 = mul i32 %36, %12
%38 = add i32 %37, %2
%39 = lshr i32 %38, 5
%40 = icmp eq i32 %32, 0
%41 = or i32 %32, %39
%42 = icmp eq i32 %41, 0
br i1 %42, label %43, label %44
43: ; preds = %31
store atomic i32 %28, ptr addrspace(3) @__scratch_lds syncscope("workgroup-one-as") monotonic, align 8
br label %44
44: ; preds = %43, %31
fence syncscope("workgroup") release
tail call void @llvm.amdgcn.s.barrier()
fence syncscope("workgroup") acquire
%45 = icmp ugt i32 %38, 31
%46 = and i1 %45, %40
br i1 %46, label %47, label %49
47: ; preds = %44
%48 = atomicrmw add ptr addrspace(3) @__scratch_lds, i32 %28 syncscope("workgroup-one-as") monotonic, align 4
br label %49
49: ; preds = %47, %44
fence syncscope("workgroup") release
tail call void @llvm.amdgcn.s.barrier()
fence syncscope("workgroup") acquire
%50 = load atomic i32, ptr addrspace(3) @__scratch_lds syncscope("workgroup-one-as") monotonic, align 8
fence syncscope("workgroup") release
tail call void @llvm.amdgcn.s.barrier()
fence syncscope("workgroup") acquire
br label %51
51: ; preds = %1, %49
%52 = phi i32 [ %50, %49 ], [ %28, %1 ]
%53 = zext i32 %2 to i64
%54 = getelementptr inbounds i32, ptr addrspace(1) %0, i64 %53
store i32 %52, ptr addrspace(1) %54, align 4, !tbaa !13
ret void
}
; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.amdgcn.workitem.id.x() #1
; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.amdgcn.workitem.id.y() #1
; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.amdgcn.workitem.id.z() #1
; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare align 4 ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr() #1
; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(none)
declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32) #2
; Function Attrs: convergent mustprogress nocallback nofree nounwind willreturn
declare void @llvm.amdgcn.s.barrier() #3
; Function Attrs: convergent mustprogress nocallback nofree nounwind willreturn memory(none)
declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32 immarg, i32 immarg, i32 immarg, i1 immarg) #4
; Function Attrs: convergent mustprogress nocallback nofree nounwind willreturn memory(none)
declare i32 @llvm.amdgcn.permlanex16(i32, i32, i32, i32, i1 immarg, i1 immarg) #4
attributes #0 = { convergent mustprogress nofree norecurse nounwind willreturn "amdgpu-flat-work-group-size"="1,1024" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx1100" "target-features"="+16-bit-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot10-insts,+dot5-insts,+dot7-insts,+dot8-insts,+dot9-insts,+dpp,+gfx10-3-insts,+gfx10-insts,+gfx11-insts,+gfx8-insts,+gfx9-insts,+wavefrontsize32" "uniform-work-group-size"="true" }
attributes #1 = { mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none) }
attributes #2 = { mustprogress nocallback nofree nosync nounwind willreturn memory(none) }
attributes #3 = { convergent mustprogress nocallback nofree nounwind willreturn }
attributes #4 = { convergent mustprogress nocallback nofree nounwind willreturn memory(none) }
!llvm.module.flags = !{!0, !1, !2, !3, !4}
!opencl.ocl.version = !{!5}
!llvm.ident = !{!6}
!0 = !{i32 4, !"amdgpu_hostcall", i32 1}
!1 = !{i32 1, !"amdgpu_code_object_version", i32 500}
!2 = !{i32 1, !"amdgpu_printf_kind", !"hostcall"}
!3 = !{i32 1, !"wchar_size", i32 4}
!4 = !{i32 8, !"PIC Level", i32 2}
!5 = !{i32 2, i32 0}
!6 = !{!"AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"}
!7 = !{i32 0, i32 1024}
!8 = !{}
!9 = !{!10, !10, i64 0}
!10 = !{!"short", !11, i64 0}
!11 = !{!"omnipotent char", !12, i64 0}
!12 = !{!"Simple C/C++ TBAA"}
!13 = !{!14, !14, i64 0}
!14 = !{!"int", !15, i64 0}
!15 = !{!"omnipotent char", !16, i64 0}
!16 = !{!"Simple C++ TBAA"} Also if I try compiling for wave64 it errors:
|
That's within a single kernel, right? |
It's right here:
Yes, hip currently doesn't officially support wave64
Typically the same thing, the kernel is the entry point to the callgraph. Kernels with different wave sizes can't call the same functions. |
Thanks for the replies! Appreciate it!
🤦♂️ didn't notice. I do now set IR attributes. And above example with However, some other started failing. julia> using AMDGPU
julia> AMDGPU.device() # wave64 is set
┌────┬────────────────────┬──────────┬───────────┬────────────┐
│ Id │ Name │ GCN arch │ Wavefront │ Memory │
├────┼────────────────────┼──────────┼───────────┼────────────┤
│ 1 │ Radeon RX 7900 XTX │ gfx1100 │ 64 │ 23.984 GiB │
└────┴────────────────────┴──────────┴───────────┴────────────┘
julia> function ker_shfl!(x)
i::Cint = AMDGPU.Device.activelane()
ws = AMDGPU.Device.wavefrontsize()
# compute next lane address.
addr = ((i + 1) % ws) * 4 # VGPRs are 4-byte wide.
# julia array indexing starts from 1.
@inbounds x[i + 1] = AMDGPU.Device.bpermute(addr, i)
return
end
ker_shfl! (generic function with 1 method)
julia> x = ROCArray(zeros(Cint, 1, 64));
julia> @roc groupsize=length(x) ker_shfl!(x);
julia> x
1×64 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 0, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 32 Here 32nd item should be
julia> function ws_ker!(x)
i::Cint = AMDGPU.Device.activelane()
@inbounds x[i + 1] = AMDGPU.Device.wavefrontsize()
return
end
ws_ker! (generic function with 1 method)
julia> @roc groupsize=length(x) ws_ker!(x);
julia> x
1×64 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 … 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 LLVM IR for the `ker_shfl!` kernel; ModuleID = 'start'
source_filename = "start"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:10:11:12:13"
target triple = "amdgcn-amd-amdhsa"
; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.amdgcn.wavefrontsize() #0
; Function Attrs: convergent nocallback nofree nounwind willreturn memory(none)
declare i32 @llvm.amdgcn.ds.bpermute(i32, i32) #1
; Function Attrs: alwaysinline nocallback nofree nosync nounwind willreturn memory(read)
declare i32 @llvm.read_register.i32(metadata) #2
; Function Attrs: alwaysinline nocallback nofree nosync nounwind willreturn memory(none)
declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32) #3
; Function Attrs: alwaysinline nocallback nofree nosync nounwind willreturn memory(none)
declare i32 @llvm.amdgcn.mbcnt.hi(i32, i32) #3
; Function Attrs: cold nocallback nofree noreturn nounwind
declare void @llvm.amdgcn.endpgm() #4
; @ /home/pxlth/.julia/dev/AMDGPU/src/t.jl:3 within `ker_shfl!`
define amdgpu_kernel void @_Z9ker_shfl_14ROCDeviceArrayI5Int32Li1ELi1EE({ i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 } %state, { [1 x i64], ptr addrspace(1), i64 } %0) local_unnamed_addr #5 !dbg !41 {
conversion:
; @ /home/pxlth/.julia/dev/AMDGPU/src/t.jl:4 within `ker_shfl!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/wavefront.jl:108 within `activelane`
%1 = call i32 @llvm.read_register.i32(metadata !45) #6, !dbg !46
%2 = call i32 @llvm.read_register.i32(metadata !50) #6, !dbg !46
%3 = call i32 @llvm.amdgcn.mbcnt.lo(i32 %2, i32 0), !dbg !46
%4 = call i32 @llvm.amdgcn.mbcnt.hi(i32 %1, i32 %3), !dbg !46
; └
; ┌ @ number.jl:7 within `convert`
; │┌ @ boot.jl:893 within `Int32`
; ││┌ @ boot.jl:807 within `toInt32`
; │││┌ @ boot.jl:758 within `check_sign_bit`
; ││││┌ @ boot.jl:743 within `is_top_bit_set`
%.not = icmp sgt i32 %4, -1, !dbg !51
; ││││└
br i1 %.not, label %L9, label %L6, !dbg !54
L6: ; preds = %conversion
; ││││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/quirks.jl:8 within `#throw_inexacterror`
; │││││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:113 within `signal_exception`
; ││││││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:11 within `exception_flag`
; │││││││┌ @ none within `kernel_state`
; ││││││││┌ @ none within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
%state.i.fca.0.extract.i = extractvalue { i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 } %state, 0, !dbg !63
; ││││││└└└
; ││││││┌ @ pointer.jl:180 within `unsafe_store!` @ pointer.jl:180
%5 = inttoptr i64 %state.i.fca.0.extract.i to ptr, !dbg !80
store i32 1, ptr %5, align 1, !dbg !80
; ││││││└
; ││││││ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:115 within `signal_exception`
; ││││││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/execution_control.jl:52 within `endpgm`
call void @llvm.amdgcn.endpgm(), !dbg !84
; ││││││└
; ││││││ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:116 within `signal_exception`
unreachable, !dbg !88
L9: ; preds = %conversion
; └└└└└└
; @ /home/pxlth/.julia/dev/AMDGPU/src/t.jl:5 within `ker_shfl!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/wavefront.jl:84 within `wavefrontsize`
%6 = call i32 @llvm.amdgcn.wavefrontsize(), !dbg !89
; └
; @ /home/pxlth/.julia/dev/AMDGPU/src/t.jl:6 within `ker_shfl!`
; ┌ @ int.jl:232 within `rem` @ int.jl:298
%.not19 = icmp eq i32 %6, 0, !dbg !92
br i1 %.not19, label %fail, label %pass, !dbg !92
fail: ; preds = %L9
; │┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:113 within `signal_exception`
; ││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:11 within `exception_flag`
; │││┌ @ none within `kernel_state`
; ││││┌ @ none within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
%state.i.fca.0.extract.i6 = extractvalue { i64, i64, i64, i64, i64, i64, i32, i32, i64, i64, i64, i64 } %state, 0, !dbg !97
; ││└└└
; ││┌ @ pointer.jl:180 within `unsafe_store!` @ pointer.jl:180
%7 = inttoptr i64 %state.i.fca.0.extract.i6 to ptr, !dbg !103
store i32 1, ptr %7, align 1, !dbg !103
; ││└
; ││ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:115 within `signal_exception`
; ││┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/execution_control.jl:52 within `endpgm`
call void @llvm.amdgcn.endpgm(), !dbg !105
; ││└
; ││ @ /home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl:116 within `signal_exception`
unreachable, !dbg !107
pass: ; preds = %L9
; └└
; ┌ @ int.jl:87 within `+`
%8 = add nuw i32 %4, 1, !dbg !108
; └
; ┌ @ int.jl:232 within `rem`
; │┌ @ int.jl:188 within `abs`
; ││┌ @ int.jl:142 within `flipsign`
%9 = icmp slt i32 %8, 0, !dbg !110
%10 = xor i32 %4, -1, !dbg !110
%11 = select i1 %9, i32 %10, i32 %8, !dbg !110
%.fca.1.extract = extractvalue { [1 x i64], ptr addrspace(1), i64 } %0, 1
; │└└
; │ @ int.jl:232 within `rem` @ int.jl:298
%12 = urem i32 %11, %6, !dbg !92
; │ @ int.jl:232 within `rem`
; │┌ @ int.jl:142 within `flipsign`
%13 = ashr i32 %8, 31, !dbg !114
%14 = add i32 %12, %13, !dbg !114
%15 = xor i32 %14, %13, !dbg !114
; └└
; ┌ @ int.jl:88 within `*`
%16 = shl i32 %15, 2, !dbg !115
; └
; @ /home/pxlth/.julia/dev/AMDGPU/src/t.jl:7 within `ker_shfl!`
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/wavefront.jl:180 within `bpermute`
%17 = call i32 @llvm.amdgcn.ds.bpermute(i32 %16, i32 %4), !dbg !117
; └
; ┌ @ /home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl:90 within `#setindex!`
; │┌ @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl:88 within `unsafe_store!`
; ││┌ @ none within `pointerset`
; │││┌ @ none within `macro expansion` @ /home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl:38
%18 = zext i32 %4 to i64, !dbg !120
%19 = getelementptr inbounds i32, ptr addrspace(1) %.fca.1.extract, i64 %18, !dbg !120
store i32 %17, ptr addrspace(1) %19, align 4, !dbg !120, !tbaa !132
; └└└└
; @ /home/pxlth/.julia/dev/AMDGPU/src/t.jl:8 within `ker_shfl!`
ret void, !dbg !135
}
attributes #0 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
attributes #1 = { convergent nocallback nofree nounwind willreturn memory(none) }
attributes #2 = { alwaysinline nocallback nofree nosync nounwind willreturn memory(read) }
attributes #3 = { alwaysinline nocallback nofree nosync nounwind willreturn memory(none) }
attributes #4 = { cold nocallback nofree noreturn nounwind }
attributes #5 = { "amdgpu-unsafe-fp-atomics"="true" "target-cpu"="gfx1100" "target-features"="-wavefrontsize32,+wavefrontsize64" }
attributes #6 = { convergent }
!llvm.module.flags = !{!0, !1, !2, !3, !4}
!llvm.dbg.cu = !{!5, !7, !8, !9, !10, !11, !12, !13, !14, !15, !16, !17, !18, !19, !20, !21, !22, !23, !24, !25, !26, !27, !28, !29, !30, !31, !32, !33, !34, !35, !36, !37}
!opencl.ocl.version = !{!38, !38, !38, !38, !38, !38, !38, !38, !38}
!llvm.ident = !{!39, !39, !39, !39, !39, !39, !39, !39, !39}
!julia.kernel = !{!40}
!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{i32 4, !"amdgpu_hostcall", i32 1}
!3 = !{i32 1, !"wchar_size", i32 4}
!4 = !{i32 8, !"PIC Level", i32 0}
!5 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!6 = !DIFile(filename: "julia", directory: ".")
!7 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!8 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!9 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!10 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!11 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!12 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!13 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!14 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!15 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!16 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!17 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!18 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!19 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!20 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!21 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!22 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!23 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!24 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!25 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!26 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!27 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!28 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!29 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!30 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!31 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!32 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!33 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!34 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!35 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!36 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!37 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !6, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None)
!38 = !{i32 2, i32 0}
!39 = !{!"AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)"}
!40 = !{ptr @_Z9ker_shfl_14ROCDeviceArrayI5Int32Li1ELi1EE}
!41 = distinct !DISubprogram(name: "ker_shfl!", linkageName: "julia_ker_shfl!_13627", scope: null, file: !42, line: 3, type: !43, scopeLine: 3, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!42 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/t.jl", directory: ".")
!43 = !DISubroutineType(types: !44)
!44 = !{}
!45 = !{!"exec_hi"}
!46 = !DILocation(line: 108, scope: !47, inlinedAt: !49)
!47 = distinct !DISubprogram(name: "activelane;", linkageName: "activelane", scope: !48, file: !48, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!48 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/wavefront.jl", directory: ".")
!49 = !DILocation(line: 4, scope: !41)
!50 = !{!"exec_lo"}
!51 = !DILocation(line: 743, scope: !52, inlinedAt: !54)
!52 = distinct !DISubprogram(name: "is_top_bit_set;", linkageName: "is_top_bit_set", scope: !53, file: !53, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!53 = !DIFile(filename: "boot.jl", directory: ".")
!54 = !DILocation(line: 758, scope: !55, inlinedAt: !56)
!55 = distinct !DISubprogram(name: "check_sign_bit;", linkageName: "check_sign_bit", scope: !53, file: !53, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!56 = !DILocation(line: 807, scope: !57, inlinedAt: !58)
!57 = distinct !DISubprogram(name: "toInt32;", linkageName: "toInt32", scope: !53, file: !53, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!58 = !DILocation(line: 893, scope: !59, inlinedAt: !60)
!59 = distinct !DISubprogram(name: "Int32;", linkageName: "Int32", scope: !53, file: !53, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!60 = !DILocation(line: 7, scope: !61, inlinedAt: !49)
!61 = distinct !DISubprogram(name: "convert;", linkageName: "convert", scope: !62, file: !62, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!62 = !DIFile(filename: "number.jl", directory: ".")
!63 = !DILocation(line: 38, scope: !64, inlinedAt: !66)
!64 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !65, file: !65, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !20, retainedNodes: !44)
!65 = !DIFile(filename: "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/base.jl", directory: ".")
!66 = distinct !DILocation(line: 0, scope: !67, inlinedAt: !69)
!67 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !68, file: !68, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !20, retainedNodes: !44)
!68 = !DIFile(filename: "none", directory: ".")
!69 = distinct !DILocation(line: 0, scope: !70, inlinedAt: !71)
!70 = distinct !DISubprogram(name: "kernel_state;", linkageName: "kernel_state", scope: !68, file: !68, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !20, retainedNodes: !44)
!71 = distinct !DILocation(line: 11, scope: !72, inlinedAt: !74)
!72 = distinct !DISubprogram(name: "exception_flag;", linkageName: "exception_flag", scope: !73, file: !73, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !20, retainedNodes: !44)
!73 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/runtime.jl", directory: ".")
!74 = distinct !DILocation(line: 113, scope: !75, inlinedAt: !76)
!75 = distinct !DISubprogram(name: "signal_exception", linkageName: "julia_signal_exception_15167", scope: null, file: !73, line: 112, type: !43, scopeLine: 112, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !20, retainedNodes: !44)
!76 = distinct !DILocation(line: 8, scope: !77, inlinedAt: !79)
!77 = distinct !DISubprogram(name: "#throw_inexacterror", linkageName: "julia_#throw_inexacterror_13659", scope: null, file: !78, line: 40, type: !43, scopeLine: 40, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !5, retainedNodes: !44)
!78 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/quirks.jl", directory: ".")
!79 = distinct !DILocation(line: 758, scope: !55, inlinedAt: !56)
!80 = !DILocation(line: 180, scope: !81, inlinedAt: !83)
!81 = distinct !DISubprogram(name: "unsafe_store!;", linkageName: "unsafe_store!", scope: !82, file: !82, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !20, retainedNodes: !44)
!82 = !DIFile(filename: "pointer.jl", directory: ".")
!83 = distinct !DILocation(line: 180, scope: !81, inlinedAt: !74)
!84 = !DILocation(line: 52, scope: !85, inlinedAt: !87)
!85 = distinct !DISubprogram(name: "endpgm;", linkageName: "endpgm", scope: !86, file: !86, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !20, retainedNodes: !44)
!86 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/execution_control.jl", directory: ".")
!87 = distinct !DILocation(line: 115, scope: !75, inlinedAt: !76)
!88 = !DILocation(line: 116, scope: !75, inlinedAt: !76)
!89 = !DILocation(line: 84, scope: !90, inlinedAt: !91)
!90 = distinct !DISubprogram(name: "wavefrontsize;", linkageName: "wavefrontsize", scope: !48, file: !48, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!91 = !DILocation(line: 5, scope: !41)
!92 = !DILocation(line: 298, scope: !93, inlinedAt: !95)
!93 = distinct !DISubprogram(name: "rem;", linkageName: "rem", scope: !94, file: !94, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!94 = !DIFile(filename: "int.jl", directory: ".")
!95 = !DILocation(line: 232, scope: !93, inlinedAt: !96)
!96 = !DILocation(line: 6, scope: !41)
!97 = !DILocation(line: 38, scope: !64, inlinedAt: !98)
!98 = distinct !DILocation(line: 0, scope: !67, inlinedAt: !99)
!99 = distinct !DILocation(line: 0, scope: !70, inlinedAt: !100)
!100 = distinct !DILocation(line: 11, scope: !72, inlinedAt: !101)
!101 = distinct !DILocation(line: 113, scope: !75, inlinedAt: !102)
!102 = distinct !DILocation(line: 298, scope: !93, inlinedAt: !95)
!103 = !DILocation(line: 180, scope: !81, inlinedAt: !104)
!104 = distinct !DILocation(line: 180, scope: !81, inlinedAt: !101)
!105 = !DILocation(line: 52, scope: !85, inlinedAt: !106)
!106 = distinct !DILocation(line: 115, scope: !75, inlinedAt: !102)
!107 = !DILocation(line: 116, scope: !75, inlinedAt: !102)
!108 = !DILocation(line: 87, scope: !109, inlinedAt: !96)
!109 = distinct !DISubprogram(name: "+;", linkageName: "+", scope: !94, file: !94, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!110 = !DILocation(line: 142, scope: !111, inlinedAt: !112)
!111 = distinct !DISubprogram(name: "flipsign;", linkageName: "flipsign", scope: !94, file: !94, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!112 = !DILocation(line: 188, scope: !113, inlinedAt: !95)
!113 = distinct !DISubprogram(name: "abs;", linkageName: "abs", scope: !94, file: !94, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!114 = !DILocation(line: 142, scope: !111, inlinedAt: !95)
!115 = !DILocation(line: 88, scope: !116, inlinedAt: !96)
!116 = distinct !DISubprogram(name: "*;", linkageName: "*", scope: !94, file: !94, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!117 = !DILocation(line: 180, scope: !118, inlinedAt: !119)
!118 = distinct !DISubprogram(name: "bpermute;", linkageName: "bpermute", scope: !48, file: !48, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!119 = !DILocation(line: 7, scope: !41)
!120 = !DILocation(line: 38, scope: !121, inlinedAt: !122)
!121 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !65, file: !65, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!122 = !DILocation(line: 0, scope: !123, inlinedAt: !124)
!123 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !68, file: !68, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!124 = !DILocation(line: 0, scope: !125, inlinedAt: !126)
!125 = distinct !DISubprogram(name: "pointerset;", linkageName: "pointerset", scope: !68, file: !68, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!126 = !DILocation(line: 88, scope: !127, inlinedAt: !129)
!127 = distinct !DISubprogram(name: "unsafe_store!;", linkageName: "unsafe_store!", scope: !128, file: !128, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!128 = !DIFile(filename: "/home/pxlth/.julia/packages/LLVM/5DlHM/src/interop/pointer.jl", directory: ".")
!129 = !DILocation(line: 90, scope: !130, inlinedAt: !119)
!130 = distinct !DISubprogram(name: "#setindex!;", linkageName: "#setindex!", scope: !131, file: !131, type: !43, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !7, retainedNodes: !44)
!131 = !DIFile(filename: "/home/pxlth/.julia/dev/AMDGPU/src/device/gcn/array.jl", directory: ".")
!132 = !{!133, !133, i64 0, i64 0}
!133 = !{!"custom_tbaa_addrspace(1)", !134, i64 0}
!134 = !{!"custom_tbaa"}
!135 = !DILocation(line: 8, scope: !41) |
Correct. bpermute cannot access the other half of the wave in wave64. The ISA manual states: |
The following LLVM IR:
Compiled with:
yields ICE:
Worth noting that the ICE occurs at -O0, https://llc.godbolt.org/z/xhG4qrYWv as well as at -O1, https://llc.godbolt.org/z/WeMvT17bb.
However, it does not occur at -O2, https://llc.godbolt.org/z/WeMvT17bb
We can see that the basic block (BB)
__ockl_wfany_i32.exit
is just terminated by an unconditional branch to the successor BB2
. If I pretend I'm SimplifyCFG and fold these BBs myself then the ICE doesn't occur at -O0 or -O1, either, https://llc.godbolt.org/z/qTxa444sKThis is reduced LLVM IR (using llvm-reduce). For context, the original LLVM IR has been produced from the following (adding
-v -save-temps
)The original
llc
invocation from-v
wasllc -O0 -mtriple=amdgcn-amd-amdhsa -disable-promote-alloca-to-lds -mcpu=gfx90a -amdgpu-dump-hsa-metadata
although only-O0
suffices to trigger the ICE (with the IR itself already containingtarget triple = "amdgcn-amd-amdhsa"
); cf. the Compiler Explorer links.The text was updated successfully, but these errors were encountered: