WIP: Add an index typevar to CuDeviceArray. #1895

maleadt · 2023-05-03T14:51:16Z

This PR makes it possible to customize the index type of CuDeviceArray, which is a requirement for performing index calculations in 32 bits. It should improve performance by lowering register pressure, and because certain NVIDIA GPUs can execute fp32 and int32 ops in parallel. This has been requested by HPC people (@luraess or @omlins maybe, I don't remember, and Slack has eaten the conversation), but note that this PR is only the first step, as much of Julia's indexing logic assumes it can use machine-native integers (and Int is 64 bits on all platforms that support CUDA).

As such, this is only a test. It will need work both on CUDA.jl, to ensure that CuDeviceArray with an additional typevar is supported (i.e. this PR), and on CUDA.jl and Base, so that 32-bit indices are preserved longer than they currently are (where I'm hoping people will help).

So if you're interested in this feature, please contribute by taking your code, running it with CUDA.jl from this PR, inspecting the generated code (e.g. with Cthulhu using @device_code_warntype interactive=true, or using @device_code_llvm), finding where the 32-bit indices get widened to 64 bits, and opening PRs on relevant repositories to try and preserve the index type.

Demo of the above:

julia> typeof(d_a)
CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}

# note how the host array doesn't have an index type. maybe we should?

julia> @device_code_warntype @cuda threads=len vadd(d_a, d_b, d_c)
PTX CompilerJob of MethodInstance for vadd(::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32}) for sm_86

MethodInstance for vadd(::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32})
  from vadd(a, b, c) in Main at REPL[21]:1
Arguments
  #self#::Core.Const(vadd)
  a::CuDeviceMatrix{Float32, 1, Int32}
  b::CuDeviceMatrix{Float32, 1, Int32}
  c::CuDeviceMatrix{Float32, 1, Int32}
Locals
  val::Float32
  i::Int32
Body::Nothing
1 ─ %1  = Main.blockIdx()::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}
│   %2  = Base.getproperty(%1, :x)::Int32
│   %3  = (1 * Main.i32)::Core.Const(1)
│   %4  = (%2 - %3)::Int32
│   %5  = Main.blockDim()::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}
│   %6  = Base.getproperty(%5, :x)::Int32
│   %7  = (%4 * %6)::Int32
│   %8  = Main.threadIdx()::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}
│   %9  = Base.getproperty(%8, :x)::Int32
│         (i = %7 + %9)
│         nothing
│   %12 = Base.getindex(a, i)::Float32
│   %13 = Base.getindex(b, i)::Float32
│   %14 = (%12 + %13)::Float32
│         Base.setindex!(c, %14, i)
│         (val = %14)
│         nothing
│         val
└──       return nothing


julia> @device_code_llvm debuginfo=:none @cuda threads=len vadd(d_a, d_b, d_c)
; PTX CompilerJob of MethodInstance for vadd(::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32}) for sm_86
define ptx_kernel void @_Z4vadd13CuDeviceArrayI7Float32Li2ELi1E5Int32ES_IS0_Li2ELi1ES1_ES_IS0_Li2ELi1ES1_E([1 x i64] %state, { i8 addrspace(1)*, i32, [2 x i32], i32 } %0, { i8 addrspace(1)*, i32, [2 x i32], i32 } %1, { i8 addrspace(1)*, i32, [2 x i32], i32 } %2) local_unnamed_addr #1 {
conversion:
  %.fca.0.extract11 = extractvalue { i8 addrspace(1)*, i32, [2 x i32], i32 } %0, 0
  %.fca.0.extract1 = extractvalue { i8 addrspace(1)*, i32, [2 x i32], i32 } %1, 0
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i32, [2 x i32], i32 } %2, 0
  %3 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %4 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %5 = mul i32 %4, %3
  %6 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %7 = add i32 %5, %6
  %8 = sext i32 %7 to i64
  %9 = bitcast i8 addrspace(1)* %.fca.0.extract11 to float addrspace(1)*
  %10 = getelementptr inbounds float, float addrspace(1)* %9, i64 %8
  %11 = load float, float addrspace(1)* %10, align 4
  %12 = bitcast i8 addrspace(1)* %.fca.0.extract1 to float addrspace(1)*
  %13 = getelementptr inbounds float, float addrspace(1)* %12, i64 %8
  %14 = load float, float addrspace(1)* %13, align 4
  %15 = fadd float %11, %14
  %16 = bitcast i8 addrspace(1)* %.fca.0.extract to float addrspace(1)*
  %17 = getelementptr inbounds float, float addrspace(1)* %16, i64 %8
  store float %15, float addrspace(1)* %17, align 4
  ret void
}

Note, for example, how the Julia IR doesn't contain any 64-bit integers, yet the LLVM code still promotes the index to i64 (see the sext). Inspecting with Cthulhu reveals:

  3 ── %35 = Base.getfield(a, :ptr)::Core.LLVMPtr{Float32, 1}                                                                                                                                                                               ││╻╷╷╷     #arrayref
  │    %36 = Base.llvmcall::Core.Const(Core.Intrinsics.llvmcall)                                                                                                                                                                            │││╻╷╷      arrayref_bits
  │    %37 = Core.tuple("; ModuleID = 'llvmcall'\nsource_filename = \"llvmcall\"\n\n; Function Attrs: alwaysinline\ndefine float @entry(i8 addrspace(1)* %0, i64 %1) #0 {\nentry:\n  %2 = bitcast i8 addrspace(1)* %0 to float addrspace(1)*\n  %3 = getelementptr inbounds float, float addrspace(1)* %2, i64 %1\n  %4 = load float, float addrspace(1)* %3, align 4, !tbaa !0\n  ret float %4\n}\n\nattributes #0 = { alwaysinline }\n\n!0 = !{!1, !1, i64 0, i64 0}\n!1 = !{!\"custom_tbaa_addrspace(1)\", !2, i64 0}\n!2 = !{!\"custom_tbaa\"}\n", "entry")::Core.Const(("; ModuleID = 'llvmcall'\nsource_filename = \"llvmcall\"\n\n; Function Attrs: alwaysinline\ndefine float @entry(i8 addrspace(1)* %0, i64 %1) #0 {\nentry:\n  %2 = bitcast i8 addrspace(1)* %0 to float addrspace(1)*\n  %3 = getelementptr inbounds float, float addrspace(1)* %2, i64 %1\n  %4 = load float, float addrspace(1)* %3, align 4, !tbaa !0\n  ret float %4\n}\n\nattributes #0 = { alwaysinline }\n\n!0 = !{!1, !1, i64 0, i64 0}\n!1 = !{!\"custom_tbaa_addrspace(1)\", !2, i64 0}\n!2 = !{!\"custom_tbaa\"}\n", "entry"))
  │    %38 = Base.sub_int(%32, 1)::Int32                                                                                                                                                                                                    │││││┃│││     pointerref
  │    %39 = Core.sext_int(Core.Int64, %38)::Int64                                                                                                                                                                                          ││││││╻        macro expansion
  │    %40 = (%36)(%37, Float32, Tuple{Core.LLVMPtr{Float32, 1}, Int64}, %35, %39)::Float32                                                                                                                                                 │││││││┃        macro expansion
  └───       goto #4                                                                                                                                                                                                                        │││

This would require a change to LLVM.jl's pointerref in order to pass a 32-bit integer to getelementptr.
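
For readers without the PR checked out, the same kind of widening can be observed on the CPU. This is only an analogy, a minimal sketch assuming a 64-bit host: it uses Base's unsafe_load rather than LLVM.jl's pointerref, and load32 is a hypothetical helper name. Base's unsafe_load converts its index to Int, so the Int32 index is expected to be extended before the load:

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical CPU-side helper: load the i-th element through a raw pointer
# using an Int32 index.
load32(p::Ptr{Float32}, i::Int32) = unsafe_load(p, i)

# Capture the optimized LLVM IR as a string so it can be searched.
ir = sprint(io -> code_llvm(io, load32, (Ptr{Float32}, Int32); debuginfo=:none))
println(ir)

# Look for a sign extension of the index in the IR.
println("contains sext: ", occursin("sext", ir))

# The helper still behaves like ordinary 1-based indexing.
xs = Float32[1, 2, 3]
val = GC.@preserve xs load32(pointer(xs), Int32(2))
println(val)
```

If the printed IR contains a sext of the i32 argument to i64, the index was widened before the address computation, which is the pattern to look for in the device code as well.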

cc @jpsamaroo, I think you were also part of the Slack conversation

maleadt · 2023-05-03T17:54:12Z

Debugging breadcrumb: mapreduce fails to compile due to broadcast's getindex returning either an Int64 or Int32:

    72 ┄─ %173 = φ (#64 => %119, #71 => %119)::Int64                                                                                                                                                                        │
121 │     %174 = Base.slt_int(%173, %112)::Bool                                                                                                                                                                             │╻╷╷╷           max
    │     %175 = Core.ifelse::Core.Const(Core.ifelse)                                                                                                                                                                       ││╻              map
    │     %176 = (%175)(%174, %112, %173)::Int64                                                                                                                                                                            │││┃│             max
    │            nothing                                                                                                                                                                                                    │││
    │            nothing                                                                                                                                                                                                    ││╻              CartesianIndex
122 │     %179 = Base.getfield(As, 1, false)::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{CuDeviceVector{Int64, 1, Int32}}}                                          │╻╷             _map_getindex
    └────        goto #73                                                                                                                                                                                                   ││╻              getindex
    73 ── %181 = Base.getfield(%179, :args)::Tuple{CuDeviceVector{Int64, 1, Int32}}                                                                                                                                         │││╻╷             _broadcast_getindex
    │     %182 = Base.getfield(%181, 1, false)::CuDeviceVector{Int64, 1, Int32}                                                                                                                                             ││││╻              _getindex
    │     %183 = Base.getfield(%182, :dims)::Tuple{Int32}                                                                                                                                                                   │││││╻╷╷╷           _broadcast_getindex
    └──── %184 = Base.getfield(%183, 1, true)::Int32                                                                                                                                                                        ││││││╻              newindex
    74 ── %185 = Base.slt_int(%184, 0)::Bool                                                                                                                                                                                │││││││╻╷╷╷           axes
    │     %186 = Core.ifelse::Core.Const(Core.ifelse)                                                                                                                                                                       ││││││││╻              map
    │     %187 = (%186)(%185, 0, %184)::Int32                                                                                                                                                                               │││││││││┃││││          oneto
    │            nothing                                                                                                                                                                                                    ││││││││││┃│             OneTo
    └────        goto #75                                                                                                                                                                                                   │││││││││││┃              OneTo
    75 ──        goto #76                                                                                                                                                                                                   │││││││││││
    76 ──        goto #77                                                                                                                                                                                                   ││││││││││
    77 ──        goto #78                                                                                                                                                                                                   │││││││││
    78 ──        goto #79                                                                                                                                                                                                   ││││││││
    79 ── %194 = Core.sext_int(Core.Int64, %187)::Int64                                                                                                                                                                     ││││││││╻╷╷            length
    └──── %195 = (%194 === 1)::Bool                                                                                                                                                                                         ││││││││╻              ==
    80 ──        goto #81                                                                                                                                                                                                   ││││││││╻              getindex
    81 ──        goto #82                                                                                                                                                                                                   │││││││││
    82 ── %198 = Core.ifelse::Core.Const(Core.ifelse)                                                                                                                                                                       ││││││││╻              ifelse
    │     %199 = (%198)(%195, 1, %176)::Union{Int32, Int64}                                                                                                                                                                 │││││││││
    │     %200 = Core.tuple(%199)::Tuple{Union{Int32, Int64}}                                                                                                                                                               ││││││││
    └────        goto #83                                                                                                                                                                                                   ││││││││
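
A reduced, CPU-only sketch of the same kind of instability: an ifelse whose two branches have different integer widths cannot be inferred to a single concrete type. The function names are hypothetical and this is not the broadcast code itself; it only illustrates why the Union above appears and one possible way to avoid it, assuming that converting the index to the narrower type is acceptable:

```julia
# Two branches of different widths: the result is either an Int64 or an Int32.
mixed(keep::Bool, i::Int64, default::Int32) = ifelse(keep, i, default)

# Possible fix: convert one side so both branches share a type.
# Note that oftype(default, i) throws if `i` does not fit in an Int32.
stable(keep::Bool, i::Int64, default::Int32) = ifelse(keep, oftype(default, i), default)

mixed_rt  = only(Base.return_types(mixed,  (Bool, Int64, Int32)))
stable_rt = only(Base.return_types(stable, (Bool, Int64, Int32)))

println("mixed infers to:  ", mixed_rt)
println("stable infers to: ", stable_rt)
println(typeof(mixed(true, 5, Int32(1))), " / ", typeof(mixed(false, 5, Int32(1))))
```

Whether to narrow the Int64 side or widen the Int32 side depends on the call site; for this PR the goal would be to keep the 32-bit type.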

maleadt · 2023-05-09T18:42:01Z

I was sceptical that this change would do much without a thorough pass over all of Base, so I did a test using this PR + the LLVM.jl and GPUCompiler.jl PRs above on Broadcast (which does a whole lot of 64-bit integer stuff):

Reference:

PTX (i.e., virtual registers):
    .reg .pred 	%p<10>;
    .reg .b16 	%rs<3>;
    .reg .f32 	%f<7>;
    .reg .b32 	%r<7>;
    .reg .b64 	%rd<58>;

Effective:
    registers(kernel) = 17

This PR:

PTX:
    .reg .pred 	%p<10>;
    .reg .b16 	%rs<3>;
    .reg .f32 	%f<7>;
    .reg .b32 	%r<29>;
    .reg .b64 	%rd<37>;

Effective:
    registers(kernel) = 15

So not a spectacular reduction, but better than I expected nonetheless. KA.jl-heavy code probably would benefit much more (unless KA.jl itself assumes Int64).
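
To compare the declarations above without reading them by eye, something like the following could be used. It is a minimal sketch: ptx_reg_counts is a hypothetical helper (not part of CUDA.jl) that only parses the .reg lines of a PTX listing, and the two listings are the ones quoted in this comment.

```julia
# Count the virtual registers declared in a PTX listing, keyed by register class.
# `ptx_reg_counts` is a hypothetical helper, not part of CUDA.jl.
function ptx_reg_counts(ptx::AbstractString)
    counts = Dict{String,Int}()
    # Matches declarations such as `.reg .b64 %rd<58>;`
    for m in eachmatch(r"\.reg\s+\.(\w+)\s+%\w+<(\d+)>;", ptx)
        counts[String(m[1])] = parse(Int, m[2])
    end
    return counts
end

reference = """
.reg .pred %p<10>;
.reg .b16 %rs<3>;
.reg .f32 %f<7>;
.reg .b32 %r<7>;
.reg .b64 %rd<58>;
"""

this_pr = """
.reg .pred %p<10>;
.reg .b16 %rs<3>;
.reg .f32 %f<7>;
.reg .b32 %r<29>;
.reg .b64 %rd<37>;
"""

ref_counts = ptx_reg_counts(reference)
pr_counts  = ptx_reg_counts(this_pr)

# Only the integer classes differ between the two listings.
for class in ("b32", "b64")
    println(class, ": ", ref_counts[class], " -> ", pr_counts[class])
end
```

These are virtual registers only; the effective count after ptxas is what registers(kernel) reports.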

vchuravy · 2023-05-09T18:56:25Z

> KA.jl-heavy code probably would benefit much more (unless KA.jl itself assumes Int64)

Currently it does, but we can change that.

omlins · 2023-05-10T10:40:41Z

> I was sceptical that this change would do much without a thorough pass over all of Base

@maleadt What you did already has a drastic impact on some kernels! 👍 The following example now requires about half as many registers (down from 21 to 10):

using CUDA

function copy3D!(T2, T, Ci)
    ix = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-UInt32(1)) * blockDim().y + threadIdx().y
    iz = (blockIdx().z-UInt32(1)) * blockDim().z + threadIdx().z
    @inbounds T2[ix-1,iy-2,iz+1] = 3.4
    @inbounds T[ix-1,iy-2,iz+1] = 3.4
    return
end

T  = CUDA.zeros(Float64, 2,2,2);
T2 = CUDA.zeros(Float64, 2,2,2);
Ci = CUDA.zeros(Float64, 2,2,2);
@device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)
kernel = @cuda launch=false copy3D!(T2, T, Ci)
@show CUDA.registers(kernel);
@show CUDA.memory(kernel);

omlins · 2023-05-10T10:47:02Z

@maleadt When I remove the UInt32 casting in the above code, it introduces some i64 operations. Here is the code:

using CUDA

function copy3D!(T2, T, Ci)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
    iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
    @inbounds T2[ix-1,iy-2,iz+1] = 3.4
    @inbounds T[ix-1,iy-2,iz+1] = 3.4
    return
end

T  = CUDA.zeros(Float64, 2,2,2);
T2 = CUDA.zeros(Float64, 2,2,2);
Ci = CUDA.zeros(Float64, 2,2,2);
@device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)
kernel = @cuda launch=false copy3D!(T2, T, Ci)
@show CUDA.registers(kernel);
@show CUDA.memory(kernel);

And here is the output:

julia> using CUDA

julia> function copy3D!(T2, T, Ci)
           ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
           iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
           iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
           @inbounds T2[ix-1,iy-2,iz+1] = 3.4
           @inbounds T[ix-1,iy-2,iz+1] = 3.4
           return
       end
copy3D! (generic function with 1 method)

julia> T  = CUDA.zeros(Float64, 2,2,2);

julia> T2 = CUDA.zeros(Float64, 2,2,2);

julia> Ci = CUDA.zeros(Float64, 2,2,2);

julia> @device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)
; PTX CompilerJob of MethodInstance for copy3D!(::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}) for sm_60
define ptx_kernel void @_Z7copy3D_13CuDeviceArrayI7Float64Li3ELi1E5Int32ES_IS0_Li3ELi1ES1_ES_IS0_Li3ELi1ES1_E([1 x i64] %state, { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, { i8 addrspace(1)*, i32, [3 x i32], i32 } %2) local_unnamed_addr #1 {
conversion:
  %.fca.0.extract12 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 0
  %.fca.2.0.extract14 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 0
  %.fca.2.1.extract15 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 1
  %.fca.0.extract1 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 0
  %.fca.2.0.extract3 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 0
  %.fca.2.1.extract4 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 1
  %3 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %4 = zext i32 %3 to i64
  %5 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %6 = zext i32 %5 to i64
  %7 = mul nuw nsw i64 %6, %4
  %8 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %9 = add nuw nsw i32 %8, 1
  %10 = zext i32 %9 to i64
  %11 = add nuw nsw i64 %7, %10
  %12 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
  %13 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
  %narrow = mul nuw nsw i32 %13, %12
  %14 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
  %15 = add nuw nsw i32 %14, 1
  %narrow26 = add nuw nsw i32 %15, %narrow
  %16 = zext i32 %narrow26 to i64
  %17 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
  %18 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
  %narrow27 = mul nuw nsw i32 %18, %17
  %19 = call i32 @llvm.nvvm.read.ptx.sreg.tid.z()
  %20 = add nuw nsw i32 %19, 1
  %narrow28 = add nuw nsw i32 %20, %narrow27
  %21 = zext i32 %narrow28 to i64
  %22 = icmp sgt i32 %.fca.2.0.extract14, 0
  %23 = select i1 %22, i32 %.fca.2.0.extract14, i32 0
  %24 = icmp sgt i32 %.fca.2.1.extract15, 0
  %25 = select i1 %24, i32 %.fca.2.1.extract15, i32 0
  %26 = zext i32 %23 to i64
  %27 = zext i32 %25 to i64
  %28 = add nsw i64 %16, -3
  %29 = mul nuw nsw i64 %21, %27
  %reass.add = add nsw i64 %28, %29
  %reass.mul = mul i64 %reass.add, %26
  %30 = add nuw nsw i64 %11, 4294967295
  %31 = add i64 %30, %reass.mul
  %32 = bitcast i8 addrspace(1)* %.fca.0.extract12 to double addrspace(1)*
  %33 = trunc i64 %31 to i32
  %34 = add i32 %33, -1
  %35 = getelementptr inbounds double, double addrspace(1)* %32, i32 %34
  store double 3.400000e+00, double addrspace(1)* %35, align 8
  %36 = icmp sgt i32 %.fca.2.0.extract3, 0
  %37 = select i1 %36, i32 %.fca.2.0.extract3, i32 0
  %38 = icmp sgt i32 %.fca.2.1.extract4, 0
  %39 = select i1 %38, i32 %.fca.2.1.extract4, i32 0
  %40 = zext i32 %37 to i64
  %41 = zext i32 %39 to i64
  %42 = mul nuw nsw i64 %21, %41
  %reass.add29 = add nsw i64 %28, %42
  %reass.mul30 = mul i64 %reass.add29, %40
  %43 = add i64 %30, %reass.mul30
  %44 = bitcast i8 addrspace(1)* %.fca.0.extract1 to double addrspace(1)*
  %45 = trunc i64 %43 to i32
  %46 = add i32 %45, -1
  %47 = getelementptr inbounds double, double addrspace(1)* %44, i32 %46
  store double 3.400000e+00, double addrspace(1)* %47, align 8
  ret void
}

julia> kernel = @cuda launch=false copy3D!(T2, T, Ci)
CUDA.HostKernel{typeof(copy3D!), Tuple{CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}}}(copy3D!, CuFunction(Ptr{CUDA.CUfunc_st} @0x0000000005f1de40, CuModule(Ptr{CUDA.CUmod_st} @0x0000000005f0fac0, CuContext(0x0000000001401640, instance a86b98eac7129536))), CUDA.KernelState(Ptr{Nothing} @0x00001553b4a00000))

julia> @show CUDA.registers(kernel);
CUDA.registers(kernel) = 12

julia> @show CUDA.memory(kernel);
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0)

maleadt · 2023-05-10T12:11:48Z

> When I remove the UInt32 casting in the above code, then it introduces some i64 operations.

Well, yeah, because you're introducing Int64s. Or what did you expect?
That's the whole problem with expecting Int32s: Julia's integer literals are 64 bits wide. And that's why I asked for help auditing existing code, which instead of - 1 will have to do stuff like - one(T) (as far as the dispatch allows, because Dims is hard-coded to Int).
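
As a minimal CPU-only sketch of the difference between - 1 and - one(T) (the helper names are hypothetical, and the Int64 result assumes a 64-bit host where Int is Int64):

```julia
# Hypothetical index helpers; `i` stands in for a 32-bit thread index.
prev_literal(i::Integer) = i - 1                      # the literal 1 is an Int
prev_generic(i::T) where {T<:Integer} = i - one(T)    # stays in the index type T

i = Int32(5)
println(typeof(prev_literal(i)))   # promoted to Int (Int64 on a 64-bit host)
println(typeof(prev_generic(i)))   # Int32
```

Both versions compute the same value; only the second keeps the result in the index type of its argument.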

omlins · 2023-05-10T12:31:21Z

> Well, yeah, because you're introducing Int64s. Or what did you expect?

I originally did expect literals to be treated as Int64. However, as in the first example there are no Int64s introduced (see below), I thought you had taken care of this somehow to some extent. If there is no simple solution to that, I don't think having to cast literals is a priority problem to solve now.

Here is the LLVM code of the first example (note that there are no Int64s introduced, even though we have uncast literals in the array assignments):

julia> using CUDA

julia> function copy3D!(T2, T, Ci)
           ix = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x
           iy = (blockIdx().y-UInt32(1)) * blockDim().y + threadIdx().y
           iz = (blockIdx().z-UInt32(1)) * blockDim().z + threadIdx().z
           @inbounds T2[ix-1,iy-2,iz+1] = 3.4
           @inbounds T[ix-1,iy-2,iz+1] = 3.4
           return
       end
copy3D! (generic function with 2 methods)

julia> T  = CUDA.zeros(Float64, 2,2,2);

julia> T2 = CUDA.zeros(Float64, 2,2,2);

julia> Ci = CUDA.zeros(Float64, 2,2,2);

julia> @device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)

; PTX CompilerJob of MethodInstance for copy3D!(::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}) for sm_60
define ptx_kernel void @_Z7copy3D_13CuDeviceArrayI7Float64Li3ELi1E5Int32ES_IS0_Li3ELi1ES1_ES_IS0_Li3ELi1ES1_E([1 x i64] %state, { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, { i8 addrspace(1)*, i32, [3 x i32], i32 } %2) local_unnamed_addr #1 {
conversion:
  %.fca.0.extract12 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 0
  %.fca.2.0.extract14 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 0
  %.fca.2.1.extract15 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 1
  %.fca.0.extract1 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 0
  %.fca.2.0.extract3 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 0
  %.fca.2.1.extract4 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 1
  %3 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %4 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %5 = mul i32 %4, %3
  %6 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %7 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
  %8 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
  %9 = mul nuw nsw i32 %8, %7
  %10 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
  %11 = add nuw nsw i32 %10, 1
  %12 = add nuw nsw i32 %11, %9
  %13 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
  %14 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
  %15 = mul nuw nsw i32 %14, %13
  %16 = call i32 @llvm.nvvm.read.ptx.sreg.tid.z()
  %17 = add nuw nsw i32 %16, 1
  %18 = add nuw nsw i32 %17, %15
  %19 = icmp sgt i32 %.fca.2.0.extract14, 0
  %20 = select i1 %19, i32 %.fca.2.0.extract14, i32 0
  %21 = icmp sgt i32 %.fca.2.1.extract15, 0
  %22 = select i1 %21, i32 %.fca.2.1.extract15, i32 0
  %23 = add nsw i32 %12, -3
  %24 = mul i32 %18, %22
  %reass.add = add i32 %23, %24
  %reass.mul = mul i32 %reass.add, %20
  %25 = add i32 %5, %6
  %26 = add i32 %25, -1
  %27 = add i32 %26, %reass.mul
  %28 = bitcast i8 addrspace(1)* %.fca.0.extract12 to double addrspace(1)*
  %29 = getelementptr inbounds double, double addrspace(1)* %28, i32 %27
  store double 3.400000e+00, double addrspace(1)* %29, align 8
  %30 = icmp sgt i32 %.fca.2.0.extract3, 0
  %31 = select i1 %30, i32 %.fca.2.0.extract3, i32 0
  %32 = icmp sgt i32 %.fca.2.1.extract4, 0
  %33 = select i1 %32, i32 %.fca.2.1.extract4, i32 0
  %34 = mul i32 %18, %33
  %reass.add26 = add i32 %23, %34
  %reass.mul27 = mul i32 %reass.add26, %31
  %35 = add i32 %26, %reass.mul27
  %36 = bitcast i8 addrspace(1)* %.fca.0.extract1 to double addrspace(1)*
  %37 = getelementptr inbounds double, double addrspace(1)* %36, i32 %35
  store double 3.400000e+00, double addrspace(1)* %37, align 8
  ret void
}

julia> kernel = @cuda launch=false copy3D!(T2, T, Ci)
CUDA.HostKernel{typeof(copy3D!), Tuple{CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}}}(copy3D!, CuFunction(Ptr{CUDA.CUfunc_st} @0x000000000649b690, CuModule(Ptr{CUDA.CUmod_st} @0x0000000006523ff0, CuContext(0x00000000014542b0, instance 2d6c2e17d661c696))), CUDA.KernelState(Ptr{Nothing} @0x00001553b4a00000))

julia> @show CUDA.registers(kernel);
CUDA.registers(kernel) = 10

julia> @show CUDA.memory(kernel);
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0)

maleadt · 2023-05-10T13:13:17Z

> Here is the LLVM code of the first example (note that there are no Int64s introduced, even though we have uncast literals in the array assignments):

LLVM probably managed to optimize them away. We can't change the fundamental nature of integer literals being Int64 on 64-bit systems from the GPUCompiler side.
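
The Julia-level types in that first example can be checked on the CPU. A minimal sketch, assuming (as in the @device_code_warntype output at the top of this PR) that blockIdx().x, blockDim().x and threadIdx().x are Int32; the variables below are stand-ins with arbitrary values, not real intrinsics:

```julia
# Stand-ins for blockIdx().x, blockDim().x and threadIdx().x (assumed Int32).
bid, bdim, tid = Int32(2), Int32(32), Int32(7)

# Same expression as in copy3D!; mixing Int32 and UInt32 promotes to UInt32.
ix = (bid - UInt32(1)) * bdim + tid
println(typeof(ix))

# The uncast literal in T[ix-1, ...] still promotes at the Julia level
# (to Int64 on a 64-bit host), even if LLVM later narrows the arithmetic.
println(typeof(ix - 1))
```

So the absence of i64 in the optimized IR is a property of that particular kernel after optimization, not of the source-level types.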

omlins · 2023-05-10T14:04:13Z

> So if you're interested in this feature, please contribute by taking your code, running it with CUDA.jl from this PR, inspecting the generated code (e.g. with Cthulhu using @device_code_warntype interactive=true, or using @device_code_llvm), finding where the 32-bit indices get widened to 64 bits, and opening PRs on relevant repositories to try and preserve the index type.

@luraess, @utkinis, @albert-de-montserrat: could you please run some of your codes with CUDA.jl from this PR and the corresponding GPUCompiler and LLVM branches? In summary, the branches are the following:

  [052768ef] CUDA v4.2.0 `https://github.com/JuliaGPU/CUDA.jl.git#tb/32bit_device_array`
  [61eb1bfa] GPUCompiler v0.19.3 `https://github.com/JuliaGPU/GPUCompiler.jl.git#tb/ptx_dl_32bit`
  [929cbde3] LLVM v5.0.0 `https://github.com/maleadt/LLVM.jl.git#tb/pointerref_int32`

omlins · 2023-05-10T16:38:53Z

@maleadt: I have run some little test codes with increasing complexity (including some examples from https://github.com/omlins/julia-gpu-course/blob/main/solutions/4_datatransfer_optimisations_advanced_part2.ipynb), always casting literal integers to Int32.

I have encountered a first issue. As expected, the following small diffusion code requires fewer registers with this PR (and does not include any i64 operations). However, this leads to worse performance rather than better. Thus, something is not going as it should. Here is the code:

using CUDA
using BenchmarkTools

function diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)
    ix = (blockIdx().x-Int32(1)) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-Int32(1)) * blockDim().y + threadIdx().y
    T_ix_iy_izm1 = 0.0
    T_ix_iy_iz   = 0.0
    T_ix_iy_izp1 = T[ix,iy,Int32(1)]
    for iz = Int32(1):size(T2,3)
        T_ix_iy_izm1   = T_ix_iy_iz
        T_ix_iy_iz     = T_ix_iy_izp1
        T_ix_iy_izp1   = iz<size(T2,3) ? T[ix,iy,iz+Int32(1)] : 0.0
        if (ix>Int32(1) && ix<size(T2,1) && iy>Int32(1) && iy<size(T2,2) && iz>Int32(1) && iz<size(T2,3))
            T2[ix,iy,iz] = T_ix_iy_iz + dt*(Ci[ix,iy,iz]*(
                            - ((-lam*(T[ix+Int32(1),iy,iz] - T_ix_iy_iz)*_dx) - (-lam*(T_ix_iy_iz - T[ix-Int32(1),iy,iz])*_dx))*_dx
                            - ((-lam*(T[ix,iy+Int32(1),iz] - T_ix_iy_iz)*_dy) - (-lam*(T_ix_iy_iz - T[ix,iy-Int32(1),iz])*_dy))*_dy
                            - ((-lam*(T_ix_iy_izp1 - T_ix_iy_iz)*_dz) - (-lam*(T_ix_iy_iz - T_ix_iy_izm1)*_dz))*_dz
                            ));
        end
    end
    return
end

function diffusion3D()
# Physics
lam        = 1.0;                                        # Thermal conductivity
c0         = 2.0;                                        # Heat capacity
lx, ly, lz = 1.0, 1.0, 1.0;                              # Length of computational domain in dimension x, y and z

# Numerics
nx, ny, nz = 512, 512, 512;                              # Number of gridpoints in dimensions x, y and z
nt         = 100;                                        # Number of time steps
dx         = lx/(nx-1);                                  # Space step in x-dimension
dy         = ly/(ny-1);                                  # Space step in y-dimension
dz         = lz/(nz-1);                                  # Space step in z-dimension
_dx, _dy, _dz = 1.0/dx, 1.0/dy, 1.0/dz;

# Array initializations
T   = CUDA.zeros(Float64, nx, ny, nz);
T2  = CUDA.zeros(Float64, nx, ny, nz);
Ci  = CUDA.zeros(Float64, nx, ny, nz);

# Initial conditions
Ci .= 1/c0;                                              # 1/Heat capacity
T  .= 1.7;
T2 .= T;                                                 # Assign also T2 to get correct boundary conditions.

# GPU launch parameters
threads = (32, 8)
blocks  = (nx, ny) .÷ threads

# Time loop
dt   = 0.000001 #min(dx^2,dy^2,dz^2)/lam/maximum(Ci)/6.1;          # Time step for 3D Heat diffusion
for it = 1:nt
    if (it == 11) GC.gc(); global t_tic=time(); end      # Start measuring time.
    @cuda blocks=blocks threads=threads diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz);
    synchronize()
    T, T2 = T2, T;
end
time_s = time() - t_tic

# Performance
A_eff = (2*1+1)*1/1e9*nx*ny*nz*sizeof(eltype(T));        # Effective main memory access per iteration [GB] (Lower bound of required memory access: T has to be read and written: 2 whole-array memaccess; Ci has to be read: 1 whole-array memaccess)
t_it  = time_s/(nt-10);                                  # Execution time per iteration [s]
T_eff = A_eff/t_it;                                      # Effective memory throughput [GB/s]
println("time_s=$time_s t_it=$t_it T_eff=$T_eff");

# Performance
A_eff = (2*1+1)*1/1e9*nx*ny*nz*sizeof(eltype(T));        # Effective main memory access per iteration [GB] (Lower bound of required memory access: T has to be read and written: 2 whole-array memaccess; Ci has to be read: 1 whole-array memaccess)
t_it = @belapsed begin @cuda blocks=$blocks threads=$threads diffusion3D_step!($T2, $T, $Ci, $lam, $dt, $_dx, $_dy, $_dz); synchronize() end
println("Benchmarktools (min): t_it=$t_it T_eff=$(A_eff/t_it)");

# Resource usage
@show kernel = @cuda launch=false diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz);
@show CUDA.registers(kernel);
@show CUDA.memory(kernel);
@device_code_llvm debuginfo=:none @cuda diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz);

end

diffusion3D()

Here is the output from running it with this PR:

omlins@nid00000:~/tmpwdir/cuda_perf> julia -O3 --check-bounds=no diffusion3D_cuda_3regqueue_novis_int32.jl
time_s=1.0009851455688477 t_it=0.011122057172987197 T_eff=289.6249697244483
Benchmarktools (min): t_it=0.010862522 T_eff=296.54489740043795
kernel = #= /users/omlins/tmpwdir/cuda_perf/diffusion3D_cuda_3regqueue_novis_int32.jl:75 =# @cuda(launch = false, diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)) = CUDA.HostKernel{typeof(diffusion3D_step!), Tuple{CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, Float64, Float64, Float64, Float64, Float64}}(diffusion3D_step!, CuFunction(Ptr{CUDA.CUfunc_st} @0x0000000006e97940, CuModule(Ptr{CUDA.CUmod_st} @0x000000000777fd50, CuContext(0x0000000001488260, instance caed86fb0f770aba))), CUDA.KernelState(Ptr{Nothing} @0x00001553b4800000))
CUDA.registers(kernel) = 32
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0)
; PTX CompilerJob of MethodInstance for diffusion3D_step!(::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::Float64, ::Float64, ::Float64, ::Float64, ::Float64) for sm_60
define ptx_kernel void @_Z17diffusion3D_step_13CuDeviceArrayI7Float64Li3ELi1E5Int32ES_IS0_Li3ELi1ES1_ES_IS0_Li3ELi1ES1_ES0_S0_S0_S0_S0_([1 x i64] %state, { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, double %3, double %4, double %5, double %6, double %7) local_unnamed_addr #1 {
conversion:
  %.fca.0.extract38 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 0
  %.fca.2.0.extract40 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 0
  %.fca.2.1.extract41 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 1
  %.fca.2.2.extract42 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 2
  %.fca.0.extract4 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 0
  %.fca.2.0.extract6 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 0
  %.fca.2.1.extract7 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 1
  %.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, 2, 0
  %.fca.2.1.extract = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, 2, 1
  %8 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %9 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %10 = mul i32 %9, %8
  %11 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %12 = add i32 %10, %11
  %13 = add i32 %12, 1
  %14 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
  %15 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
  %16 = mul nuw nsw i32 %15, %14
  %17 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
  %18 = add nuw nsw i32 %16, %17
  %19 = add nuw nsw i32 %18, 1
  %20 = icmp sgt i32 %.fca.2.0.extract6, 0
  %21 = select i1 %20, i32 %.fca.2.0.extract6, i32 0
  %22 = mul i32 %18, %21
  %23 = add i32 %12, %22
  %24 = bitcast i8 addrspace(1)* %.fca.0.extract4 to double addrspace(1)*
  %25 = getelementptr inbounds double, double addrspace(1)* %24, i32 %23
  %26 = load double, double addrspace(1)* %25, align 8
  %.inv = icmp sgt i32 %.fca.2.2.extract42, 0
  %value_phi = select i1 %.inv, i32 %.fca.2.2.extract42, i32 0
  %27 = icmp slt i32 %value_phi, 1
  %28 = bitcast i8 addrspace(1)* %.fca.0.extract38 to double addrspace(1)*
  br i1 %27, label %L560, label %L133.preheader

L133.preheader:                                   ; preds = %conversion
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, 0
  %29 = icmp sgt i32 %.fca.2.1.extract7, 0
  %30 = select i1 %29, i32 %.fca.2.1.extract7, i32 0
  %31 = icmp slt i32 %13, 2
  %.not47 = icmp sge i32 %13, %.fca.2.0.extract40
  %32 = icmp eq i32 %18, 0
  %or.cond = select i1 %.not47, i1 true, i1 %32
  %.not48 = icmp sge i32 %19, %.fca.2.1.extract41
  %33 = icmp sgt i32 %.fca.2.0.extract, 0
  %34 = select i1 %33, i32 %.fca.2.0.extract, i32 0
  %35 = icmp sgt i32 %.fca.2.1.extract, 0
  %36 = select i1 %35, i32 %.fca.2.1.extract, i32 0
  %37 = bitcast i8 addrspace(1)* %.fca.0.extract to double addrspace(1)*
  %38 = fneg double %3
  %39 = add i32 %12, -1
  %40 = add nsw i32 %18, -1
  %41 = icmp sgt i32 %.fca.2.0.extract40, 0
  %42 = select i1 %41, i32 %.fca.2.0.extract40, i32 0
  %43 = icmp sgt i32 %.fca.2.1.extract41, 0
  %44 = select i1 %43, i32 %.fca.2.1.extract41, i32 0
  br i1 %31, label %L560, label %L133

L133:                                             ; preds = %L547, %L133.preheader
  %value_phi4 = phi i32 [ %101, %L547 ], [ 1, %L133.preheader ]
  %value_phi6 = phi double [ %value_phi8, %L547 ], [ %26, %L133.preheader ]
  %value_phi7 = phi double [ %value_phi6, %L547 ], [ 0.000000e+00, %L133.preheader ]
  %.not44 = icmp slt i32 %value_phi4, %.fca.2.2.extract42
  br i1 %.not44, label %L141, label %L196

L141:                                             ; preds = %L133
  %45 = mul i32 %value_phi4, %30
  %reass.add = add i32 %18, %45
  %reass.mul = mul i32 %reass.add, %21
  %46 = add i32 %12, %reass.mul
  %47 = getelementptr inbounds double, double addrspace(1)* %24, i32 %46
  %48 = load double, double addrspace(1)* %47, align 8
  br label %L196

L196:                                             ; preds = %L141, %L133
  %value_phi8 = phi double [ %48, %L141 ], [ 0.000000e+00, %L133 ]
  br i1 %or.cond, label %L547, label %L202

L202:                                             ; preds = %L196
  %49 = icmp ult i32 %value_phi4, 2
  %or.cond63 = select i1 %.not48, i1 true, i1 %49
  %.not44.not = xor i1 %.not44, true
  %brmerge = select i1 %or.cond63, i1 true, i1 %.not44.not
  br i1 %brmerge, label %L547, label %L212

L212:                                             ; preds = %L202
  %50 = add nsw i32 %value_phi4, -1
  %51 = mul i32 %50, %36
  %reass.add64 = add i32 %18, %51
  %reass.mul65 = mul i32 %reass.add64, %34
  %52 = add i32 %12, %reass.mul65
  %53 = getelementptr inbounds double, double addrspace(1)* %37, i32 %52
  %54 = load double, double addrspace(1)* %53, align 8
  %55 = mul i32 %50, %30
  %reass.add66 = add i32 %18, %55
  %reass.mul67 = mul i32 %reass.add66, %21
  %56 = add i32 %reass.mul67, %13
  %57 = getelementptr inbounds double, double addrspace(1)* %24, i32 %56
  %58 = load double, double addrspace(1)* %57, align 8
  %59 = fsub double %58, %value_phi6
  %60 = fmul double %59, %38
  %61 = fmul double %60, %5
  %62 = add i32 %39, %reass.mul67
  %63 = getelementptr inbounds double, double addrspace(1)* %24, i32 %62
  %64 = load double, double addrspace(1)* %63, align 8
  %65 = fsub double %value_phi6, %64
  %66 = fmul double %65, %38
  %67 = fmul double %66, %5
  %68 = fsub double %61, %67
  %69 = fneg double %68
  %70 = fmul double %69, %5
  %reass.add70 = add i32 %55, %19
  %reass.mul71 = mul i32 %reass.add70, %21
  %71 = add i32 %12, %reass.mul71
  %72 = getelementptr inbounds double, double addrspace(1)* %24, i32 %71
  %73 = load double, double addrspace(1)* %72, align 8
  %74 = fsub double %73, %value_phi6
  %75 = fmul double %74, %38
  %76 = fmul double %75, %6
  %reass.add72 = add i32 %40, %55
  %reass.mul73 = mul i32 %reass.add72, %21
  %77 = add i32 %12, %reass.mul73
  %78 = getelementptr inbounds double, double addrspace(1)* %24, i32 %77
  %79 = load double, double addrspace(1)* %78, align 8
  %80 = fsub double %value_phi6, %79
  %81 = fmul double %80, %38
  %82 = fmul double %81, %6
  %83 = fsub double %76, %82
  %84 = fmul double %83, %6
  %85 = fsub double %70, %84
  %86 = fsub double %value_phi8, %value_phi6
  %87 = fmul double %86, %38
  %88 = fmul double %87, %7
  %89 = fsub double %value_phi6, %value_phi7
  %90 = fmul double %89, %38
  %91 = fmul double %90, %7
  %92 = fsub double %88, %91
  %93 = fmul double %92, %7
  %94 = fsub double %85, %93
  %95 = fmul double %54, %94
  %96 = fmul double %95, %4
  %97 = fadd double %value_phi6, %96
  %98 = mul i32 %44, %50
  %reass.add74 = add i32 %18, %98
  %reass.mul75 = mul i32 %reass.add74, %42
  %99 = add i32 %12, %reass.mul75
  %100 = getelementptr inbounds double, double addrspace(1)* %28, i32 %99
  store double %97, double addrspace(1)* %100, align 8
  br label %L547

L547:                                             ; preds = %L212, %L202, %L196
  %.not62.not = icmp eq i32 %value_phi4, %value_phi
  %101 = add nuw i32 %value_phi4, 1
  br i1 %.not62.not, label %L560, label %L133

L560:                                             ; preds = %L547, %L133.preheader, %conversion
  ret void
}

... and here is the output from running it with CUDA.jl v4.2.0:

omlins@nid02027:~/tmpwdir/cuda_perf> julia -O3 --check-bounds=no diffusion3D_cuda_3regqueue_novis_int32.jl
time_s=0.7261550426483154 t_it=0.008068389362759061 T_eff=399.240211047335
Benchmarktools (min): t_it=0.007945697 T_eff=405.4050226179025
kernel = #= /users/omlins/tmpwdir/cuda_perf/diffusion3D_cuda_3regqueue_novis_int32.jl:75 =# @cuda(launch = false, diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)) = CUDA.HostKernel{typeof(diffusion3D_step!), Tuple{CuDeviceArray{Float64, 3, 1}, CuDeviceArray{Float64, 3, 1}, CuDeviceArray{Float64, 3, 1}, Float64, Float64, Float64, Float64, Float64}}(diffusion3D_step!, CuFunction(Ptr{CUDA.CUfunc_st} @0x00000000075f7cd0, CuModule(Ptr{CUDA.CUmod_st} @0x0000000005c6ed10, CuContext(0x000000000135f910, instance ea771e7429a2560b))), CUDA.KernelState(Ptr{Nothing} @0x00001553a4800000))
CUDA.registers(kernel) = 48
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0)
; PTX CompilerJob of MethodInstance for diffusion3D_step!(::CuDeviceArray{Float64, 3, 1}, ::CuDeviceArray{Float64, 3, 1}, ::CuDeviceArray{Float64, 3, 1}, ::Float64, ::Float64, ::Float64, ::Float64, ::Float64) for sm_60
define ptx_kernel void @_Z17diffusion3D_step_13CuDeviceArrayI7Float64Li3ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EES0_S0_S0_S0_S0_([1 x i64] %state, { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, double %3, double %4, double %5, double %6, double %7) local_unnamed_addr #1 {
conversion:
  %.fca.0.extract38 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 0
  %.fca.2.0.extract40 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 2, 0
  %.fca.2.1.extract41 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 2, 1
  %.fca.2.2.extract42 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 2, 2
  %.fca.0.extract4 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 0
  %.fca.2.0.extract6 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 2, 0
  %.fca.2.1.extract7 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 2, 1
  %.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, 2, 0
  %.fca.2.1.extract = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, 2, 1
  %8 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %9 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %10 = mul i32 %9, %8
  %11 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %12 = add i32 %10, %11
  %13 = add i32 %12, 1
  %14 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
  %15 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
  %16 = mul nuw nsw i32 %15, %14
  %17 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
  %18 = add nuw nsw i32 %16, %17
  %19 = add nuw nsw i32 %18, 1
  %20 = icmp sgt i64 %.fca.2.0.extract6, 0
  %21 = select i1 %20, i64 %.fca.2.0.extract6, i64 0
  %22 = sext i32 %13 to i64
  %23 = zext i32 %19 to i64
  %24 = add nsw i64 %23, -1
  %25 = add nsw i64 %22, -1
  %26 = bitcast i8 addrspace(1)* %.fca.0.extract4 to double addrspace(1)*
  %.inv = icmp sgt i64 %.fca.2.2.extract42, 0
  %value_phi = select i1 %.inv, i64 %.fca.2.2.extract42, i64 0
  %27 = icmp slt i64 %value_phi, 1
  %28 = bitcast i8 addrspace(1)* %.fca.0.extract38 to double addrspace(1)*
  br i1 %27, label %L532, label %L131.preheader

L131.preheader:                                   ; preds = %conversion
  %29 = mul i64 %24, %21
  %30 = add i64 %25, %29
  %31 = getelementptr inbounds double, double addrspace(1)* %26, i64 %30
  %32 = load double, double addrspace(1)* %31, align 8
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, 0
  %33 = icmp sgt i64 %.fca.2.1.extract7, 0
  %34 = select i1 %33, i64 %.fca.2.1.extract7, i64 0
  %35 = icmp slt i32 %13, 2
  %36 = zext i32 %13 to i64
  %.not47 = icmp sle i64 %.fca.2.0.extract40, %36
  %37 = icmp eq i32 %18, 0
  %.not48 = icmp sle i64 %.fca.2.1.extract41, %23
  %38 = icmp sgt i64 %.fca.2.0.extract, 0
  %39 = select i1 %38, i64 %.fca.2.0.extract, i64 0
  %40 = icmp sgt i64 %.fca.2.1.extract, 0
  %41 = select i1 %40, i64 %.fca.2.1.extract, i64 0
  %42 = add nsw i64 %36, -1
  %43 = bitcast i8 addrspace(1)* %.fca.0.extract to double addrspace(1)*
  %44 = fneg double %3
  %45 = add i32 %12, 2
  %46 = sext i32 %45 to i64
  %47 = add nsw i64 %46, -1
  %48 = sext i32 %12 to i64
  %49 = add nsw i64 %48, -1
  %50 = zext i32 %18 to i64
  %51 = add nsw i64 %50, -1
  %52 = icmp sgt i64 %.fca.2.0.extract40, 0
  %53 = select i1 %52, i64 %.fca.2.0.extract40, i64 0
  %54 = icmp sgt i64 %.fca.2.1.extract41, 0
  %55 = select i1 %54, i64 %.fca.2.1.extract41, i64 0
  %56 = select i1 %35, i1 true, i1 %.not47
  %brmerge = select i1 %56, i1 true, i1 %37
  br label %L131

L131:                                             ; preds = %L519, %L131.preheader
  %value_phi4 = phi i64 [ %113, %L519 ], [ 1, %L131.preheader ]
  %value_phi6 = phi double [ %value_phi8, %L519 ], [ %32, %L131.preheader ]
  %value_phi7 = phi double [ %value_phi6, %L519 ], [ 0.000000e+00, %L131.preheader ]
  %.not44 = icmp slt i64 %value_phi4, %.fca.2.2.extract42
  br i1 %.not44, label %L139, label %L187

L139:                                             ; preds = %L131
  %57 = mul i64 %value_phi4, %34
  %reass.add = add i64 %24, %57
  %reass.mul = mul i64 %reass.add, %21
  %58 = add i64 %25, %reass.mul
  %59 = getelementptr inbounds double, double addrspace(1)* %26, i64 %58
  %60 = load double, double addrspace(1)* %59, align 8
  br label %L187

L187:                                             ; preds = %L139, %L131
  %value_phi8 = phi double [ %60, %L139 ], [ 0.000000e+00, %L131 ]
  br i1 %brmerge, label %L519, label %L197

L197:                                             ; preds = %L187
  %61 = icmp ult i64 %value_phi4, 2
  %or.cond63 = select i1 %.not48, i1 true, i1 %61
  %.not44.not = xor i1 %.not44, true
  %brmerge76 = select i1 %or.cond63, i1 true, i1 %.not44.not
  br i1 %brmerge76, label %L519, label %L208

L208:                                             ; preds = %L197
  %62 = add nsw i64 %value_phi4, -1
  %63 = mul i64 %62, %41
  %reass.add64 = add i64 %24, %63
  %reass.mul65 = mul i64 %reass.add64, %39
  %64 = add i64 %42, %reass.mul65
  %65 = getelementptr inbounds double, double addrspace(1)* %43, i64 %64
  %66 = load double, double addrspace(1)* %65, align 8
  %67 = mul i64 %62, %34
  %reass.add66 = add i64 %24, %67
  %reass.mul67 = mul i64 %reass.add66, %21
  %68 = add i64 %47, %reass.mul67
  %69 = getelementptr inbounds double, double addrspace(1)* %26, i64 %68
  %70 = load double, double addrspace(1)* %69, align 8
  %71 = fsub double %70, %value_phi6
  %72 = fmul double %71, %44
  %73 = fmul double %72, %5
  %74 = add i64 %49, %reass.mul67
  %75 = getelementptr inbounds double, double addrspace(1)* %26, i64 %74
  %76 = load double, double addrspace(1)* %75, align 8
  %77 = fsub double %value_phi6, %76
  %78 = fmul double %77, %44
  %79 = fmul double %78, %5
  %80 = fsub double %73, %79
  %81 = fneg double %80
  %82 = fmul double %81, %5
  %reass.add70 = add i64 %67, %23
  %reass.mul71 = mul i64 %reass.add70, %21
  %83 = add i64 %42, %reass.mul71
  %84 = getelementptr inbounds double, double addrspace(1)* %26, i64 %83
  %85 = load double, double addrspace(1)* %84, align 8
  %86 = fsub double %85, %value_phi6
  %87 = fmul double %86, %44
  %88 = fmul double %87, %6
  %reass.add72 = add i64 %51, %67
  %reass.mul73 = mul i64 %reass.add72, %21
  %89 = add i64 %42, %reass.mul73
  %90 = getelementptr inbounds double, double addrspace(1)* %26, i64 %89
  %91 = load double, double addrspace(1)* %90, align 8
  %92 = fsub double %value_phi6, %91
  %93 = fmul double %92, %44
  %94 = fmul double %93, %6
  %95 = fsub double %88, %94
  %96 = fmul double %95, %6
  %97 = fsub double %82, %96
  %98 = fsub double %value_phi8, %value_phi6
  %99 = fmul double %98, %44
  %100 = fmul double %99, %7
  %101 = fsub double %value_phi6, %value_phi7
  %102 = fmul double %101, %44
  %103 = fmul double %102, %7
  %104 = fsub double %100, %103
  %105 = fmul double %104, %7
  %106 = fsub double %97, %105
  %107 = fmul double %66, %106
  %108 = fmul double %107, %4
  %109 = fadd double %value_phi6, %108
  %110 = mul i64 %55, %62
  %reass.add74 = add i64 %24, %110
  %reass.mul75 = mul i64 %reass.add74, %53
  %111 = add i64 %42, %reass.mul75
  %112 = getelementptr inbounds double, double addrspace(1)* %28, i64 %111
  store double %109, double addrspace(1)* %112, align 8
  br label %L519

L519:                                             ; preds = %L208, %L197, %L187
  %.not62.not = icmp eq i64 %value_phi4, %value_phi
  %113 = add nuw i64 %value_phi4, 1
  br i1 %.not62.not, label %L532, label %L131

L532:                                             ; preds = %L519, %conversion
  ret void
}
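
To put the two runs side by side, here is a minimal sketch that recomputes the effective throughput and the relative slowdown. It uses only the grid size from the script and the BenchmarkTools minimum times printed in the two logs above; the variable names `t_it_pr`, `t_it_v420` and `slowdown` are made up for this illustration.

```julia
# Grid size and element type as used in the script above.
nx, ny, nz = 512, 512, 512
bytes_per_element = sizeof(Float64)

# Lower bound on memory traffic per iteration [GB]:
# T is read, T2 is written, Ci is read => 3 whole-array accesses.
A_eff = 3 * nx * ny * nz * bytes_per_element / 1e9

# BenchmarkTools minimum times copied from the two logs above [s].
t_it_pr   = 0.010862522   # this PR (Int32 index type)
t_it_v420 = 0.007945697   # CUDA.jl v4.2.0 (Int64 index type)

# Effective memory throughput [GB/s] and relative slowdown of the PR build.
T_eff_pr   = A_eff / t_it_pr
T_eff_v420 = A_eff / t_it_v420
slowdown   = t_it_pr / t_it_v420

println("A_eff=$A_eff GB")
println("this PR:  T_eff=$T_eff_pr GB/s")
println("v4.2.0:   T_eff=$T_eff_v420 GB/s")
println("slowdown of this PR relative to v4.2.0: $slowdown")
```

So in these logs the PR build uses fewer registers (32 versus 48) yet takes roughly 1.37 times as long per iteration.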

maleadt · 2023-05-10T16:56:38Z

That's surprising. Nothing in the generated code points to an obvious performance issue; if anything, it looks better. Try running with Nsight Compute to compare kernel execution times. Maybe the reuse of i32 registers complicates ILP, and mixing both integer widths inadvertently creates more opportunity for parallelism? In any case, profiling the kernels seems necessary here.

maleadt added the enhancement, performance, and help wanted labels May 3, 2023

maleadt force-pushed the tb/32bit_device_array branch from 254bc7e to db2723f Compare May 3, 2023 15:15

luraess mentioned this pull request May 9, 2023

Thread (CPU) Float32/Float64 performance comparison on miniapp acoustic2D omlins/ParallelStencil.jl#88

Closed

This was referenced May 9, 2023

Preserve integer index type for pointerref/pointerset. maleadt/LLVM.jl#342

Merged

PTX: Default to 32-bit indexing of pointers. JuliaGPU/GPUCompiler.jl#444

Merged

maleadt mentioned this pull request May 31, 2023

PTX: Demote GEP indices to 32 bits, if possible. JuliaGPU/GPUCompiler.jl#463

Open

maleadt mentioned this pull request Jun 16, 2023

Audit uses of 32-bit indexing #1968

Open

maleadt mentioned this pull request Jul 5, 2023

Added wrapped C cuda code and runable examples TensorBFS/CuTropicalGEMM.jl#1

Merged

lcw mentioned this pull request Jul 12, 2023

Add an index typevar to GridArray HorribleSanity/Raven.jl#30

Open

maleadt mentioned this pull request Jul 28, 2023

volumerhs performance regression #2010

Closed

maleadt force-pushed the master branch from c97bc77 to d57e020 Compare September 8, 2023 20:12

maleadt force-pushed the tb/32bit_device_array branch from db2723f to 7e37a56 Compare September 11, 2023 15:25

maleadt force-pushed the master branch from 1cb1f53 to 1a1d127 Compare September 18, 2023 16:28

maleadt added 4 commits January 9, 2024 13:30

Add an index typevar to CuDeviceArray.

c906ca6

Fix reshape.

ac0f7d7

Work around type-instability in broadcast with 32-bit axes.

299c1bf

Fix sparse device arrays.

e8e7954

Fix intrinsic test.

4b07020

maleadt force-pushed the tb/32bit_device_array branch from 7e37a56 to 4b07020 Compare January 9, 2024 16:00

maleadt force-pushed the master branch from aef3298 to 4b017c6 Compare January 18, 2024 12:09

maleadt force-pushed the master branch 15 times, most recently from 5d585c4 to c850163 Compare December 20, 2024 08:18
