This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

mapreduce (sum, prod, etc.) fail in some cases when given a dims argument. #583

Closed
Sleort opened this issue Feb 3, 2020 · 9 comments · Fixed by #602
Sleort commented Feb 3, 2020

Describe the bug
mapreduce(f, op, A...; dims = dims) and friends (sum(f, A; dims = dims), prod(f, A; dims = dims)...) fail for many (but not all) functions f when a dims ≠ : argument is given.

To Reproduce
The Minimal Working Example (MWE) for this bug:

julia> x = cu(rand(3,3))
3×3 CuArray{Float32,2,Nothing}:
 0.849469  0.625782  0.38785  
 0.877458  0.295448  0.0183218
 0.285424  0.496025  0.0742507

julia> sum(abs, x, dims=1) #Okay also when f = abs2
1×3 CuArray{Float32,2,Nothing}:
 2.01235  1.41725  0.480422

julia> sum(cos, x) #This is fine when dims = :
7.8284926f0

julia> sum(cos, x, dims = 1) #Fails for f ∈ (sin, sqrt, ...) as well...
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called cos(x::T) where T<:Union{Float32, Float64} in Base.Math at special/trig.jl:100, maybe you intended to call cos(x::Float32) in CUDAnative at /home/troels/.julia/packages/CUDAnative/KWTMt/src/device/cuda/math.jl:6 instead?
│    Stacktrace:
│     [1] cos at special/trig.jl:100
│     [2] mapreducedim_kernel_parallel at /home/troels/.julia/packages/CuArrays/OiLYC/src/mapreduce.jl:20
└ @ CUDAnative ~/.julia/packages/CUDAnative/KWTMt/src/compiler/irgen.jl:111
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called cos(x::T) where T<:Union{Float32, Float64} in Base.Math at special/trig.jl:100, maybe you intended to call cos(x::Float32) in CUDAnative at /home/troels/.julia/packages/CUDAnative/KWTMt/src/device/cuda/math.jl:6 instead?
│    Stacktrace:
│     [1] cos at special/trig.jl:100
│     [2] mapreducedim_kernel_parallel at /home/troels/.julia/packages/CuArrays/OiLYC/src/mapreduce.jl:20
└ @ CUDAnative ~/.julia/packages/CUDAnative/KWTMt/src/compiler/irgen.jl:111
ERROR: LLVM error: Cannot select: 0x690cc50: i64,glue = sube Constant:i64<0>, 0x690cb80, 0x690cbe8:1
  0x6909ec8: i64 = Constant<0>
  0x690cb80: i64 = add 0x6909d28, 0x690cb18
    0x6909d28: i64 = add 0x690a4e0, 0x690a208
      0x690a4e0: i64 = mul 0x6909d90, 0x690a0d0
        0x6909d90: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %13
          0x690bce0: i64 = Register %13
        0x690a0d0: i64 = or 0x690aa28, Constant:i64<4503599627370496>
          0x690aa28: i64 = and 0x690c020, Constant:i64<4503599627370495>
            0x690c020: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %0
              0x6909df8: i64 = Register %0
            0x690a680: i64 = Constant<4503599627370495>
          0x6909cc0: i64 = Constant<4503599627370496>
      0x690a208: i64 = mulhu 0x690be80, 0x690a0d0
        0x690be80: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %14
          0x690a750: i64 = Register %14
        0x690a0d0: i64 = or 0x690aa28, Constant:i64<4503599627370496>
          0x690aa28: i64 = and 0x690c020, Constant:i64<4503599627370495>
            0x690c020: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %0
              0x6909df8: i64 = Register %0
            0x690a680: i64 = Constant<4503599627370495>
          0x6909cc0: i64 = Constant<4503599627370496>
    0x690cb18: i64 = select 0x690cab0, Constant:i64<1>, 0x690a7b8
      0x690cab0: i1 = setcc 0x690c978, 0x690c910, setult:ch
        0x690c978: i64 = add 0x690a5b0, 0x690c910
          0x690a5b0: i64 = mul 0x690be80, 0x690a0d0
            0x690be80: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %14
              0x690a750: i64 = Register %14
            0x690a0d0: i64 = or 0x690aa28, Constant:i64<4503599627370496>
              0x690aa28: i64 = and 0x690c020, Constant:i64<4503599627370495>
                0x690c020: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %0
                  0x6909df8: i64 = Register %0
                0x690a680: i64 = Constant<4503599627370495>
              0x6909cc0: i64 = Constant<4503599627370496>
          0x690c910: i64 = mulhu 0x690a138, 0x690a0d0
            0x690a138: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %15
              0x690c088: i64 = Register %15
            0x690a0d0: i64 = or 0x690aa28, Constant:i64<4503599627370496>
              0x690aa28: i64 = and 0x690c020, Constant:i64<4503599627370495>
                0x690c020: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %0
                  0x6909df8: i64 = Register %0
                0x690a680: i64 = Constant<4503599627370495>
              0x6909cc0: i64 = Constant<4503599627370496>
        0x690c910: i64 = mulhu 0x690a138, 0x690a0d0
          0x690a138: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %15
            0x690c088: i64 = Register %15
          0x690a0d0: i64 = or 0x690aa28, Constant:i64<4503599627370496>
            0x690aa28: i64 = and 0x690c020, Constant:i64<4503599627370495>
              0x690c020: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %0
                0x6909df8: i64 = Register %0
              0x690a680: i64 = Constant<4503599627370495>
            0x6909cc0: i64 = Constant<4503599627370496>
      0x690ab60: i64 = Constant<1>
      0x690a7b8: i64 = zero_extend 0x690c9e0
        0x690c9e0: i1 = setcc 0x690c978, 0x690a5b0, setult:ch
          0x690c978: i64 = add 0x690a5b0, 0x690c910
            0x690a5b0: i64 = mul 0x690be80, 0x690a0d0
              0x690be80: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %14
                0x690a750: i64 = Register %14
              0x690a0d0: i64 = or 0x690aa28, Constant:i64<4503599627370496>
                0x690aa28: i64 = and 0x690c020, Constant:i64<4503599627370495>
                  0x690c020: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %0

                  0x690a680: i64 = Constant<4503599627370495>
                0x6909cc0: i64 = Constant<4503599627370496>
            0x690c910: i64 = mulhu 0x690a138, 0x690a0d0
              0x690a138: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %15
                0x690c088: i64 = Register %15
              0x690a0d0: i64 = or 0x690aa28, Constant:i64<4503599627370496>
                0x690aa28: i64 = and 0x690c020, Constant:i64<4503599627370495>
                  0x690c020: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %0

                  0x690a680: i64 = Constant<4503599627370495>
                0x6909cc0: i64 = Constant<4503599627370496>
          0x690a5b0: i64 = mul 0x690be80, 0x690a0d0
            0x690be80: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %14
              0x690a750: i64 = Register %14
            0x690a0d0: i64 = or 0x690aa28, Constant:i64<4503599627370496>
              0x690aa28: i64 = and 0x690c020, Constant:i64<4503599627370495>
                0x690c020: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %0
                  0x6909df8: i64 = Register %0
                0x690a680: i64 = Constant<4503599627370495>
              0x6909cc0: i64 = Constant<4503599627370496>
  0x690cbe8: i64,glue = subc Constant:i64<0>, 0x690c978
    0x6909ec8: i64 = Constant<0>
    0x690c978: i64 = add 0x690a5b0, 0x690c910
      0x690a5b0: i64 = mul 0x690be80, 0x690a0d0
        0x690be80: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %14
          0x690a750: i64 = Register %14
        0x690a0d0: i64 = or 0x690aa28, Constant:i64<4503599627370496>
          0x690aa28: i64 = and 0x690c020, Constant:i64<4503599627370495>
            0x690c020: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %0
              0x6909df8: i64 = Register %0
            0x690a680: i64 = Constant<4503599627370495>
          0x6909cc0: i64 = Constant<4503599627370496>
      0x690c910: i64 = mulhu 0x690a138, 0x690a0d0
        0x690a138: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %15
          0x690c088: i64 = Register %15
        0x690a0d0: i64 = or 0x690aa28, Constant:i64<4503599627370496>
          0x690aa28: i64 = and 0x690c020, Constant:i64<4503599627370495>
            0x690c020: i64,ch = CopyFromReg 0x2cda6e0, Register:i64 %0
              0x6909df8: i64 = Register %0
            0x690a680: i64 = Constant<4503599627370495>
          0x6909cc0: i64 = Constant<4503599627370496>
In function: julia_paynehanek_18536
Stacktrace:
 [1] handle_error(::Cstring) at /home/troels/.julia/packages/LLVM/DAnFH/src/core/context.jl:103
 [2] macro expansion at /home/troels/.julia/packages/LLVM/DAnFH/src/base.jl:18 [inlined]
 [3] LLVMTargetMachineEmitToMemoryBuffer at /home/troels/.julia/packages/LLVM/DAnFH/lib/6.0/libLLVM_h.jl:2726 [inlined]
 [4] emit(::LLVM.TargetMachine, ::LLVM.Module, ::LLVM.API.LLVMCodeGenFileType) at /home/troels/.julia/packages/LLVM/DAnFH/src/targetmachine.jl:42
 [5] mcgen(::CUDAnative.CompilerJob, ::LLVM.Module, ::LLVM.Function) at /home/troels/.julia/packages/CUDAnative/KWTMt/src/compiler/mcgen.jl:87
 [6] macro expansion at /home/troels/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:228 [inlined]
 [7] macro expansion at /home/troels/.julia/packages/CUDAnative/KWTMt/src/compiler/driver.jl:209 [inlined]
 [8] macro expansion at /home/troels/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:228 [inlined]
 [9] #codegen#154(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.codegen), ::Symbol, ::CUDAnative.CompilerJob) at /home/troels/.julia/packages/CUDAnative/KWTMt/src/compiler/driver.jl:206
 [10] #codegen at ./none:0 [inlined]
 [11] #compile#153(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.compile), ::Symbol, ::CUDAnative.CompilerJob) at /home/troels/.julia/packages/CUDAnative/KWTMt/src/compiler/driver.jl:52
 [12] #compile at ./none:0 [inlined]
 [13] #compile#152 at /home/troels/.julia/packages/CUDAnative/KWTMt/src/compiler/driver.jl:33 [inlined]
 [14] #compile at ./none:0 [inlined] (repeats 2 times)
 [15] macro expansion at /home/troels/.julia/packages/CUDAnative/KWTMt/src/execution.jl:393 [inlined]
 [16] #cufunction#198(::Nothing, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.cufunction), ::typeof(CuArrays.mapreducedim_kernel_parallel), ::Type{Tuple{typeof(cos),typeof(Base.add_sum),CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},CartesianIndices{2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}},Int64,Int64}}) at /home/troels/.julia/packages/CUDAnative/KWTMt/src/execution.jl:360
 [17] cufunction(::Function, ::Type) at /home/troels/.julia/packages/CUDAnative/KWTMt/src/execution.jl:360
 [18] macro expansion at /home/troels/.julia/packages/CuArrays/OiLYC/src/mapreduce.jl:61 [inlined]
 [19] macro expansion at ./gcutils.jl:91 [inlined]
 [20] _mapreducedim!(::Function, ::Function, ::CuArray{Float32,2,Nothing}, ::CuArray{Float32,2,Nothing}) at /home/troels/.julia/packages/CuArrays/OiLYC/src/mapreduce.jl:58
 [21] mapreducedim!(::Function, ::Function, ::CuArray{Float32,2,Nothing}, ::CuArray{Float32,2,Nothing}) at ./reducedim.jl:274
 [22] _mapreduce_dim(::Function, ::Function, ::NamedTuple{(),Tuple{}}, ::CuArray{Float32,2,Nothing}, ::Int64) at ./reducedim.jl:317
 [23] mapreduce_impl at /home/troels/.julia/packages/GPUArrays/dhirJ/src/host/mapreduce.jl:78 [inlined]
 [24] #mapreduce#29 at /home/troels/.julia/packages/GPUArrays/dhirJ/src/host/mapreduce.jl:64 [inlined]
 [25] #mapreduce at ./none:0 [inlined]
 [26] _sum at ./reducedim.jl:679 [inlined]
 [27] #sum#588 at ./reducedim.jl:653 [inlined]
 [28] (::Base.var"#kw##sum")(::NamedTuple{(:dims,),Tuple{Int64}}, ::typeof(sum), ::Function, ::CuArray{Float32,2,Nothing}) at ./none:0
 [29] top-level scope at REPL[4]:1

Environment details
Details on Julia:

julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

Julia packages:

(v1.3) pkg> st CuArrays
    Status `~/.julia/environments/v1.3/Project.toml`
  [79e6a3ab] Adapt v1.0.0
  [fa961155] CEnum v0.2.0
  [3895d2a7] CUDAapi v2.1.0 #master (https://github.com/JuliaGPU/CUDAapi.jl.git)
  [c5f51814] CUDAdrv v5.0.1 #master (https://github.com/JuliaGPU/CUDAdrv.jl.git)
  [be33ccc6] CUDAnative v2.9.1 #master (https://github.com/JuliaGPU/CUDAnative.jl.git)
  [3a865a2d] CuArrays v1.7.0 #master (https://github.com/JuliaGPU/CuArrays.jl.git)
  [864edb3b] DataStructures v0.17.9
  [0c68f7d7] GPUArrays v2.0.1 #master (https://github.com/JuliaGPU/GPUArrays.jl.git)
  [1914dd2f] MacroTools v0.5.3
  [872c559c] NNlib v0.6.4
  [189a3867] Reexport v0.2.0

CUDA: toolkit and driver version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105
Sleort added the bug label Feb 3, 2020
Member

maleadt commented Feb 19, 2020

There are a couple of issues here. First of all, you're executing CPU code (cos, sin, etc.) on the GPU. With broadcast, we try to replace CPU functions with their GPU counterparts; this isn't as easy for mapreduce. Furthermore, it works for some arrays because GPUArrays falls back to a CPU reduction when dealing with small data: https://github.com/JuliaGPU/GPUArrays.jl/blob/02b3fb82f06c741c7542e331022463683c01c6f5/src/host/mapreduce.jl#L171-L174

julia> x = cu(rand(3,3));

julia> sum(cos, x)
8.2665415f0

julia> x = cu(rand(300,300));

julia> sum(cos, x)
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called cos(x::T) where T<:Union{Float32, Float64} in Base.Math at special/trig.jl:100, maybe you intended to call cos(x::Float32) in CUDAnative at /home/tim/Julia/pkg/CUDAnative/src/device/cuda/math.jl:6 instead?
│    Stacktrace:
│     [1] cos at special/trig.jl:100
│     [2] reduce_kernel at /home/tim/Julia/pkg/GPUArrays/src/host/mapreduce.jl:134
└ @ CUDAnative ~/Julia/pkg/CUDAnative/src/compiler/irgen.jl:111

On recent versions of Julia some of these math functions are implemented in Julia, and will result in normal output. However, switching to using CUDAnative.cos doesn't work because reduce needs to know the output type and executes an iteration (on the CPU) for that...
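
To illustrate the output-type point, here is a minimal CPU-only sketch of what "executing an iteration to learn the output type" could look like. The helper name `probe_output_type` is made up for this example and is not the GPUArrays implementation; it only shows why a function that exists solely on the device cannot be used in such a host-side step.

```julia
# Hypothetical helper, not GPUArrays code: guess the element type of
# `mapreduce(f, op, A)` by applying `f` and `op` to one element on the host.
function probe_output_type(f, op, A::AbstractArray)
    y = f(first(A))          # one evaluation of f, performed on the CPU
    return typeof(op(y, y))  # the reduction operator may widen the type
end

A = rand(Float32, 3, 3)                    # plain CPU array
probe_output_type(cos, +, A)               # Float32
probe_output_type(x -> x > 0.5f0, +, A)    # Int, because adding Bools promotes
```

If `f` were a device-only function, the call `f(first(A))` in this sketch is the step that would fail on the host.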

Author

Sleort commented Feb 20, 2020

I see... A (sub-optimal) fix could maybe be to include a fallback like

sum(f, x::CuArray; dims=:) = sum(f.(x); dims=dims)

? Although it allocates an intermediate array, at least it would work...
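
A minimal sketch of that map-then-reduce fallback, written against plain CPU arrays so it runs without a GPU. The name `sum_via_broadcast` is hypothetical and not part of CuArrays; whether the same pattern works on a CuArray depends on broadcast being able to compile `f` for the device.

```julia
# Hypothetical helper (not part of CuArrays): map first, then reduce.
# Broadcasting allocates a temporary array the same size as `x`.
sum_via_broadcast(f, x::AbstractArray; dims=:) = sum(f.(x); dims=dims)

x = rand(Float32, 3, 3)                  # plain CPU array
a = sum_via_broadcast(cos, x; dims=1)    # reduce over the first dimension
b = sum(cos, x; dims=1)                  # reference result from Base

size(a) == (1, 3)   # true
a ≈ b               # true
```

The cost of this approach is the temporary array; the benefit is that the mapping step goes through broadcast rather than through the reduction kernel.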

Member

maleadt commented Feb 20, 2020

I'm hoping JuliaGPU/CUDAnative.jl#334 will land sometime soon and we won't have to deal with that. The workaround (from user code, without a fallback in CuArrays) should work for you now already?

Author

Sleort commented Feb 20, 2020

Yeah, sure. The question was more about whether this should be included in CuArrays.jl. But if a more general/better solution is in the works, I'm happy to wait for that.

Member

maleadt commented Feb 24, 2020

The PR linked above fixes most of these issues: you can now safely reduce using CUDAnative.cos as the function, and the cufunc method substitution machinery is used to switch to compatible functions (to some extent).
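
For readers unfamiliar with the idea, method substitution by dispatch can be sketched in a few lines of CPU-only Julia. The names `device_equivalent` and `device_cos` are invented for this illustration and stand in for the substitution hook and a device intrinsic; this is not the actual CuArrays code.

```julia
# CPU-only illustration with made-up names.
device_cos(x) = cos(x)   # stand-in for a device-side implementation

device_equivalent(f) = f                        # default: leave f unchanged
device_equivalent(::typeof(cos)) = device_cos   # known function: swap it

device_equivalent(cos) === device_cos     # true
device_equivalent(abs) === abs            # true, no substitute registered
sum(device_equivalent(cos), [0.0, 0.0])   # 2.0
```

Only functions with a registered method are swapped, which is one reading of the "to some extent" caveat: anything else is passed through unchanged.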

Author

Sleort commented Feb 26, 2020

Great! Thanks a lot!

Author

Sleort commented Mar 8, 2020

I finally got some time for a closer look at this again, and I'm afraid you should reopen the issue, @maleadt. While reduction over the entire CuArray works:

julia> x = cu(rand(300,300));

julia> sum(CUDAnative.cos, x)
75751.85f0

reduction over only one dimension makes Julia crash:

julia> sum(CUDAnative.cos, x; dims=1)
ERROR: LLVM error: Program used external function '__nv_cosf' which could not be resolved!
Stacktrace:
 [1] handle_error(::Cstring) at /home/troels/.julia/packages/LLVM/pINgj/src/core/context.jl:103
 [2] _mapreduce_dim(::Function, ::Function, ::NamedTuple{(),Tuple{}}, ::CuArray{Float32,2,Nothing}, ::Int64) at ./reducedim.jl:317
 [3] mapreduce_impl at /home/troels/.julia/packages/GPUArrays/1wgPO/src/mapreduce.jl:79 [inlined]
 [4] #mapreduce#50 at /home/troels/.julia/packages/GPUArrays/1wgPO/src/mapreduce.jl:65 [inlined]
 [5] #mapreduce at ./none:0 [inlined]
 [6] _sum at ./reducedim.jl:679 [inlined]
 [7] #sum#588 at ./reducedim.jl:653 [inlined]
 [8] fatal: error thrown and no exception handler available.
ReadOnlyMemoryError()
unknown function (ip: 0x7f834c5d8fff)
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1631 [inlined]
jl_f__apply at /buildworker/worker/package_linux64/build/src/builtins.c:627
jl_f__apply_latest at /buildworker/worker/package_linux64/build/src/builtins.c:665
#invokelatest#1 at ./essentials.jl:709 [inlined]
invokelatest at ./essentials.jl:708 [inlined]
_start at ./client.jl:462
jfptr__start_2084.clone_1 at /home/troels/packages/julias/julia-1.3.1/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2135 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2305
unknown function (ip: 0x401931)
unknown function (ip: 0x401533)
__libc_start_main at /build/glibc-t7JzpG/glibc-2.30/csu/../csu/libc-start.c:308
unknown function (ip: 0x4015d4)
error in running finalizer: ReadOnlyMemoryError()
error in running finalizer: ReadOnlyMemoryError()
error in running finalizer: ReadOnlyMemoryError()
error in running finalizer: ReadOnlyMemoryError()
error in running finalizer: ReadOnlyMemoryError()
error in running finalizer: ReadOnlyMemoryError()
error in running finalizer: ReadOnlyMemoryError()
error in running finalizer: ReadOnlyMemoryError()
error in running finalizer: ReadOnlyMemoryError()
error in running finalizer: ReadOnlyMemoryError()
error in running finalizer: ReadOnlyMemoryError()
error in running finalizer: ReadOnlyMemoryError()
error in running finalizer: ReadOnlyMemoryError()
error in running finalizer: ReadOnlyMemoryError()
fatal: error thrown and no exception handler available.
ReadOnlyMemoryError()
unknown function (ip: 0x7f834c5d8fff)
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1631 [inlined]
jl_uv_call_close_callback at /buildworker/worker/package_linux64/build/src/jl_uv.c:92 [inlined]
jl_uv_closeHandle at /buildworker/worker/package_linux64/build/src/jl_uv.c:111
uv__finish_close at /workspace/srcdir/libuv/src/unix/core.c:277
uv__run_closing_handles at /workspace/srcdir/libuv/src/unix/core.c:291
uv_run at /workspace/srcdir/libuv/src/unix/core.c:361
jl_atexit_hook at /buildworker/worker/package_linux64/build/src/init.c:296
jl_exit at /buildworker/worker/package_linux64/build/src/jl_uv.c:629
jl_no_exc_handler at /buildworker/worker/package_linux64/build/src/task.c:413
unknown function (ip: 0x401ba0)
unknown function (ip: 0x401533)
__libc_start_main at /build/glibc-t7JzpG/glibc-2.30/csu/../csu/libc-start.c:308
unknown function (ip: 0x4015d4)

(after which Julia freezes)


Current package status:

(v1.3) pkg> st CuArrays
    Status `~/.julia/environments/v1.3/Project.toml`
  [79e6a3ab] Adapt v1.0.1
  [fa961155] CEnum v0.2.0
  [3895d2a7] CUDAapi v3.1.0
  [c5f51814] CUDAdrv v6.0.0
  [be33ccc6] CUDAnative v2.10.2
  [3a865a2d] CuArrays v1.7.3
  [864edb3b] DataStructures v0.17.10
  [0c68f7d7] GPUArrays v2.0.1
  [1914dd2f] MacroTools v0.5.4
  [872c559c] NNlib v0.6.6

Member

maleadt commented Mar 9, 2020

These fixes are not part of a release yet.

Author

Sleort commented Mar 10, 2020

> These fixes are not part of a release yet.

Ah. My mistake. Sorry. Never mind, then.
