This repository has been archived by the owner on May 27, 2021. It is now read-only.

Creating an MArray performs a dynamic allocation #340

Closed
vchuravy opened this issue Feb 12, 2019 · 12 comments

@vchuravy
Member

using CUDAnative, StaticArrays

function knl!()
  r_a = MArray{Tuple{4}, Float32}(undef)

  for k in 1:4
    r_a[k] = 0
  end

  nothing
end

@cuda threads=(5, 5) blocks=4000 knl!()

Fails sometimes with ERROR: LoadError: CUDA error: an illegal memory access was encountered (code #700, ERROR_ILLEGAL_ADDRESS)

========= CUDA-MEMCHECK
========= Invalid __global__ write of size 1
=========     at 0x00000018 in ./pointer.jl:118:ptxcall_knl__1
=========     by thread (4,4,0) in block (3733,0,0)
=========     Address 0x00000003 is out of bounds

There are two things going on here that I found odd/interesting. First of all, the write to an illegal address; but secondly:

julia> @device_code_llvm @cuda threads=(5, 5) blocks=4000 knl!()

define void @ptxcall_knl__22() {
entry:
  %0 = call %jl_value_t addrspace(10)* @ptx_gc_pool_alloc(i64 16)
  %1 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t*
  %2 = bitcast %jl_value_t* %1 to float*
  store float 0.000000e+00, float* %2, align 1, !tbaa !10
  %3 = getelementptr inbounds float, float* %2, i64 1
  store float 0.000000e+00, float* %3, align 1, !tbaa !10
  %4 = getelementptr inbounds float, float* %2, i64 2
  store float 0.000000e+00, float* %4, align 1, !tbaa !10
  %5 = getelementptr inbounds float, float* %2, i64 3
  store float 0.000000e+00, float* %5, align 1, !tbaa !10
  ret void
}

I could have sworn that we used to be able to turn this into an alloca; on CUDAnative v0.10.1 we got rid of this entire allocation:

entry:
  br label %L3.i

L3.i:                                             ; preds = %L31.i, %entry
  %value_phi.i = phi i64 [ 1, %entry ], [ %2, %L31.i ]
  %value_phi.i.off = add i64 %value_phi.i, -1
  %0 = icmp ugt i64 %value_phi.i.off, 3
  br i1 %0, label %L11.i.thread, label %L11.i

L11.i:                                            ; preds = %L3.i
  %1 = icmp eq i64 %value_phi.i, 4
  br i1 %1, label %julia_knl__3.exit, label %L31.i

L11.i.thread:                                     ; preds = %L3.i
  call fastcc void @ptx_throw_boundserror()
  call void asm sideeffect "trap;", ""() #0
  br label %L31.i

L31.i:                                            ; preds = %L11.i.thread, %L11.i
  %2 = add i64 %value_phi.i, 1
  br label %L3.i

julia_knl__3.exit:                                ; preds = %L11.i
  ret void
}

cc: @lcw who encountered this

@maleadt maleadt added the bug label Feb 12, 2019
@lcw

lcw commented Feb 13, 2019

Until this bug is fixed, is there an alternative way I can use to get an array of registers on the device?

@maleadt
Member

maleadt commented Feb 13, 2019

There's no special support; just use regular Julia constructs. For example, an NTuple with some convenience functions for creating a new tuple with items modified (i.e., emulating mutability).
But that's basically what StaticArrays implements.
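For concreteness, a minimal sketch of that pattern (the helper name `tuple_setindex` is made up for illustration; recent Julia versions also ship `Base.setindex` for tuples, and StaticArrays does essentially this internally):

```julia
# Emulate a small "mutable" array with an immutable NTuple: every
# "write" builds a new tuple, which the compiler can keep entirely
# in registers -- no heap allocation needed.
@inline function tuple_setindex(t::NTuple{N,T}, v, i::Integer) where {N,T}
    ntuple(j -> j == i ? convert(T, v) : t[j], Val(N))
end

function knl_tuple!()
    r_a = ntuple(_ -> 0.0f0, Val(4))  # NTuple{4,Float32} of zeros
    for k in 1:4
        r_a = tuple_setindex(r_a, 0.0f0, k)
    end
    nothing
end
```

Since tuples are immutable bits types, rebinding `r_a` compiles down to register moves rather than stores to GC-managed memory.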

@maleadt
Member

maleadt commented Feb 14, 2019

wrt. the illegal memory access: `malloc` apparently returns NULL when asked for that much dynamic memory.

@maleadt
Member

maleadt commented Feb 14, 2019

@maleadt
Member

maleadt commented Feb 14, 2019

And concerning the GC allocation: we used to run the GC lowering pass and fail if actual GC allocations remained after that, while nowadays we eagerly lower julia.gc_alloc_obj (i.e., pre-lowering) to a call to malloc. Apparently, the GC lowering pass performs some escape analysis and promotes certain GC allocations to regular allocas. I always thought that happened in the optimizer.

Not sure how to proceed here. Either we do some similar escape analysis during our GC lowering pass, or we start relying on the Julia GC lowering pass again but then it needs to be adapted not to emit platform specific IR as it does now (this is something @jonathanvdc is probably going to have a look at soon).
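For illustration, the promotion being described would rewrite the IR from the first comment so that the `ptx_gc_pool_alloc` call becomes a stack slot; a hand-written sketch (not actual pass output) could look like:

```llvm
define void @ptxcall_knl__22() {
entry:
  ; the 16-byte GC allocation promoted to a function-local alloca,
  ; which the backend can then keep in registers / local memory
  %arr = alloca [4 x float], align 4
  %base = getelementptr inbounds [4 x float], [4 x float]* %arr, i64 0, i64 0
  store float 0.000000e+00, float* %base, align 4
  ; ... remaining three stores as before ...
  ret void
}
```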

@maleadt maleadt changed the title Illegal memory access with MArray and allocation instead of an alloca Creating an MArray performs a dynamic allocation Feb 14, 2019
bors bot added a commit that referenced this issue Feb 14, 2019
343: Check for OOM when doing malloc. r=maleadt a=maleadt

Addresses #340

Crashes on `-g2` with the following MWE though:

```
using CUDAnative, StaticArrays

function knl!()
  r_a = MArray{Tuple{4}, Float32}(undef)

  for k in 1:4
    @inbounds r_a[k] = 0
  end

  nothing
end

@cuda threads=(5, 5) blocks=4000 knl!()

using CUDAdrv
CUDAdrv.synchronize()
```

Co-authored-by: Tim Besard <tim.besard@gmail.com>
@lcw

lcw commented Feb 14, 2019

Thanks for the clarification and for looking into this! I am excited to get back to optimizing my kernels once this issue is resolved.

@maleadt
Member

maleadt commented Feb 18, 2019

> Apparently, the GC lowering pass performs some escape analysis and promotes certain GC allocations to regular allocas.

Except that it doesn't; that's of course the job of AllocOpt, which is broken now that we eagerly lower julia.gc_alloc_obj. The solution is to lower intrinsics at the end of optimization, but that needs a way to disable Julia's own lowering.

@lcw

lcw commented Feb 28, 2019

Thanks for all your help on this issue so far.

I just tried #349 with Julia 1.2.0-DEV.388 (obtained from the nightly builds section of the Julia website) to see if it fixes this issue, and I now get a different error message. Is there something I am doing wrong?

The MWE is in the file bug.jl, which contains:

using StaticArrays
using CUDAdrv
using CUDAnative

function knl!()
  r_a = MArray{Tuple{4}, Float32}(undef)

  for k in 1:4
    r_a[k] = 0
  end

  nothing
end

@cuda threads=(5, 5) blocks=4000 knl!()

When I run the MWE I see:

$ ~/opt/julia/1.2.0-DEV.388/bin/julia --project
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.2.0-DEV.388 (2019-02-27)
 _/ |\__'_|_|_|\__'_|  |  Commit f44a37f333 (0 days old master)
|__/                   |

(nps_julialab_2019_Jan) pkg> status
    Status `~/research/code/nps_julialab_2019_Jan/Project.toml`
  [c5f51814] CUDAdrv v1.0.1+ #master (https://github.com/JuliaGPU/CUDAdrv.jl.git)
  [be33ccc6] CUDAnative v1.0.1+ #tb/skip_lower_intrinsics (https://github.com/JuliaGPU/CUDAnative.jl.git)
  [3a865a2d] CuArrays v0.9.1
  [ba82f77b] GPUifyLoops v0.1.0
  [90137ffa] StaticArrays v0.10.3

julia> include("bug.jl")
ERROR: Error while loading expression starting at /home/lwilcox/research/code/nps_julialab_2019_Jan/bug.jl:15
caused by [exception 1]
InvalidIRError: compiling knl!() resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to llvm.julia.gc_preserve_begin)
Stacktrace:
 [1] macro expansion at gcutils.jl:86
 [2] setindex! at /home/lwilcox/.julia/packages/StaticArrays/VyRz3/src/MArray.jl:130
 [3] knl! at /home/lwilcox/research/code/nps_julialab_2019_Jan/bug.jl:9
Reason: unsupported call to an unknown function (call to julia.pointer_from_objref)
Stacktrace:
 [1] pointer_from_objref at pointer.jl:143
 [2] macro expansion at gcutils.jl:87
 [3] setindex! at /home/lwilcox/.julia/packages/StaticArrays/VyRz3/src/MArray.jl:130
 [4] knl! at /home/lwilcox/research/code/nps_julialab_2019_Jan/bug.jl:9
Reason: unsupported call to an unknown function (call to llvm.julia.gc_preserve_end)
Stacktrace:
 [1] macro expansion at gcutils.jl:88
 [2] setindex! at /home/lwilcox/.julia/packages/StaticArrays/VyRz3/src/MArray.jl:130
 [3] knl! at /home/lwilcox/research/code/nps_julialab_2019_Jan/bug.jl:9
Stacktrace:
 [1] check_ir at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/compiler/validation.jl:77 [inlined]
 [2] compile(::CUDAnative.CompilerContext) at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/compiler/driver.jl:90
 [3] #compile#96 at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/compiler/driver.jl:38 [inlined]
 [4] compile at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/compiler/driver.jl:36 [inlined]
 [5] #compile#95 at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/compiler/driver.jl:18 [inlined]
 [6] compile at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/compiler/driver.jl:16 [inlined]
 [7] macro expansion at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/execution.jl:269 [inlined]
 [8] #cufunction#110(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(cufunction), ::typeof(knl!), ::Type{Tuple{}}) at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/execution.jl:240
 [9] cufunction(::Function, ::Type) at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/execution.jl:240
 [10] top-level scope at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/execution.jl:208
 [11] top-level scope at gcutils.jl:87
 [12] top-level scope at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/execution.jl:205
 [13] include at ./boot.jl:325 [inlined]
 [14] include_relative(::Module, ::String) at ./loading.jl:1041
 [15] include(::Module, ::String) at ./Base.jl:29
 [16] include(::String) at ./client.jl:443
 [17] top-level scope at REPL[2]:1

@maleadt
Member

maleadt commented Feb 28, 2019

I'm seeing the same behavior, while code_llvm seems to work fine:

julia> CUDAnative.code_llvm(knl!, Tuple{})

define void @julia_knl__29() {
top:
  ret void
}

Not sure what's up, I'll try to have a look tomorrow.

@lcw

lcw commented Feb 28, 2019

Thanks!

@maleadt
Member

maleadt commented Mar 1, 2019

Could you try again?

@lcw

lcw commented Mar 1, 2019

On the latest branch the error messages have gone away!
