This repository has been archived by the owner on May 27, 2021. It is now read-only.

Creating an MArray performs a dynamic allocation #340

Closed
vchuravy opened this issue Feb 12, 2019 · 12 comments

@vchuravy
Member

using CUDAnative, StaticArrays

function knl!()
  r_a = MArray{Tuple{4}, Float32}(undef)

  for k in 1:4
    r_a[k] = 0
  end

  nothing
end

@cuda threads=(5, 5) blocks=4000 knl!()

Fails sometimes with ERROR: LoadError: CUDA error: an illegal memory access was encountered (code #700, ERROR_ILLEGAL_ADDRESS)

========= CUDA-MEMCHECK
========= Invalid __global__ write of size 1
=========     at 0x00000018 in ./pointer.jl:118:ptxcall_knl__1
=========     by thread (4,4,0) in block (3733,0,0)
=========     Address 0x00000003 is out of bounds

There are two things going on here that I found odd/interesting. First of all, the write to an illegal address; but secondly:

julia> @device_code_llvm @cuda threads=(5, 5) blocks=4000 knl!()

define void @ptxcall_knl__22() {
entry:
  %0 = call %jl_value_t addrspace(10)* @ptx_gc_pool_alloc(i64 16)
  %1 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t*
  %2 = bitcast %jl_value_t* %1 to float*
  store float 0.000000e+00, float* %2, align 1, !tbaa !10
  %3 = getelementptr inbounds float, float* %2, i64 1
  store float 0.000000e+00, float* %3, align 1, !tbaa !10
  %4 = getelementptr inbounds float, float* %2, i64 2
  store float 0.000000e+00, float* %4, align 1, !tbaa !10
  %5 = getelementptr inbounds float, float* %2, i64 3
  store float 0.000000e+00, float* %5, align 1, !tbaa !10
  ret void
}

I could have sworn that we used to be able to turn this into an alloca; on CUDAnative v0.10.1 we got rid of this entire allocation:

entry:
  br label %L3.i

L3.i:                                             ; preds = %L31.i, %entry
  %value_phi.i = phi i64 [ 1, %entry ], [ %2, %L31.i ]
  %value_phi.i.off = add i64 %value_phi.i, -1
  %0 = icmp ugt i64 %value_phi.i.off, 3
  br i1 %0, label %L11.i.thread, label %L11.i

L11.i:                                            ; preds = %L3.i
  %1 = icmp eq i64 %value_phi.i, 4
  br i1 %1, label %julia_knl__3.exit, label %L31.i

L11.i.thread:                                     ; preds = %L3.i
  call fastcc void @ptx_throw_boundserror()
  call void asm sideeffect "trap;", ""() #0
  br label %L31.i

L31.i:                                            ; preds = %L11.i.thread, %L11.i
  %2 = add i64 %value_phi.i, 1
  br label %L3.i

julia_knl__3.exit:                                ; preds = %L11.i
  ret void
}

cc: @lcw who encountered this

@maleadt maleadt added the bug label Feb 12, 2019
@lcw

lcw commented Feb 13, 2019

Until this bug is fixed, is there an alternative way I can use to get an array of registers on the device?

@maleadt
Member

maleadt commented Feb 13, 2019

There's no special support; just use regular Julia constructs. For example, an NTuple with some convenience functions for creating a new tuple with items modified (i.e., emulating mutability).
But that's basically what StaticArrays implements.
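For concreteness, a minimal sketch of that pattern (the helper name `tuple_setindex` is made up for illustration; recent Julia versions also ship `Base.setindex` for tuples, and StaticArrays does essentially this internally):

```julia
# Emulate a small "mutable" array with an immutable NTuple: every
# "write" builds a new tuple, which the compiler can keep entirely
# in registers -- no heap allocation needed.
@inline function tuple_setindex(t::NTuple{N,T}, v, i::Integer) where {N,T}
    ntuple(j -> j == i ? convert(T, v) : t[j], Val(N))
end

function knl_tuple!()
    r_a = ntuple(_ -> 0.0f0, Val(4))  # NTuple{4,Float32} of zeros
    for k in 1:4
        r_a = tuple_setindex(r_a, 0.0f0, k)
    end
    nothing
end
```

Since tuples are immutable bits types, rebinding `r_a` compiles down to register moves rather than stores to GC-managed memory.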

@maleadt
Member

maleadt commented Feb 14, 2019

wrt. the illegal memory access: `malloc` apparently returns NULL when asked for that much dynamic memory.

@maleadt
Member

maleadt commented Feb 14, 2019

@maleadt
Member

maleadt commented Feb 14, 2019

And concerning the GC allocation: we used to run the GC lowering pass and fail if actual GC allocations remained after that, while nowadays we eagerly lower julia.gc_alloc_obj (i.e., pre-lowering) to a call to malloc. Apparently, the GC lowering pass performs some escape analysis and promotes certain GC allocations to regular allocas. I always thought that happened in the optimizer.

Not sure how to proceed here. Either we do some similar escape analysis during our GC lowering pass, or we start relying on the Julia GC lowering pass again but then it needs to be adapted not to emit platform specific IR as it does now (this is something @jonathanvdc is probably going to have a look at soon).
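For illustration, the promotion being described would rewrite the IR from the first comment so that the `ptx_gc_pool_alloc` call becomes a stack slot; a hand-written sketch (not actual pass output) could look like:

```llvm
define void @ptxcall_knl__22() {
entry:
  ; the 16-byte GC allocation promoted to a function-local alloca,
  ; which the backend can then keep in registers / local memory
  %arr = alloca [4 x float], align 4
  %base = getelementptr inbounds [4 x float], [4 x float]* %arr, i64 0, i64 0
  store float 0.000000e+00, float* %base, align 4
  ; ... remaining three stores as before ...
  ret void
}
```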

@maleadt maleadt changed the title Illegal memory access with MArray and allocation instead of an alloca Creating an MArray performs a dynamic allocation Feb 14, 2019
bors bot added a commit that referenced this issue Feb 14, 2019
343: Check for OOM when doing malloc. r=maleadt a=maleadt

Addresses #340

Crashes on `-g2` with the following MWE though:

```
using CUDAnative, StaticArrays

function knl!()
  r_a = MArray{Tuple{4}, Float32}(undef)

  for k in 1:4
    @inbounds r_a[k] = 0
  end

  nothing
end

@cuda threads=(5, 5) blocks=4000 knl!()

using CUDAdrv
CUDAdrv.synchronize()
```

Co-authored-by: Tim Besard <tim.besard@gmail.com>
@lcw

lcw commented Feb 14, 2019

Thanks for the clarification and for looking into this! I am excited to get back to optimizing my kernels once this issue is resolved.

@maleadt
Member

maleadt commented Feb 18, 2019

> Apparently, the GC lowering pass performs some escape analysis and promotes certain GC allocations to regular allocas.

Except that it doesn't; that's of course the job of AllocOpt, which is broken now that we eagerly lower julia.gc_alloc_obj. The solution is to lower intrinsics at the end of optimization, but that needs a way to disable Julia's own lowering.

@lcw

lcw commented Feb 28, 2019

Thanks for all your help on this issue so far.

I just tried #349 with Julia 1.2.0-DEV.388 (obtained from the nightly builds section of the Julia website) to see if it fixes this issue, and I now get a different error message. Is there something I am doing wrong?

The MWE is in the file bug.jl, which contains:

using StaticArrays
using CUDAdrv
using CUDAnative

function knl!()
  r_a = MArray{Tuple{4}, Float32}(undef)

  for k in 1:4
    r_a[k] = 0
  end

  nothing
end

@cuda threads=(5, 5) blocks=4000 knl!()

When I run the MWE I see:

$ ~/opt/julia/1.2.0-DEV.388/bin/julia --project
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.2.0-DEV.388 (2019-02-27)
 _/ |\__'_|_|_|\__'_|  |  Commit f44a37f333 (0 days old master)
|__/                   |

(nps_julialab_2019_Jan) pkg> status
    Status `~/research/code/nps_julialab_2019_Jan/Project.toml`
  [c5f51814] CUDAdrv v1.0.1+ #master (https://github.com/JuliaGPU/CUDAdrv.jl.git)
  [be33ccc6] CUDAnative v1.0.1+ #tb/skip_lower_intrinsics (https://github.com/JuliaGPU/CUDAnative.jl.git)
  [3a865a2d] CuArrays v0.9.1
  [ba82f77b] GPUifyLoops v0.1.0
  [90137ffa] StaticArrays v0.10.3

julia> include("bug.jl")
ERROR: Error while loading expression starting at /home/lwilcox/research/code/nps_julialab_2019_Jan/bug.jl:15
caused by [exception 1]
InvalidIRError: compiling knl!() resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to llvm.julia.gc_preserve_begin)
Stacktrace:
 [1] macro expansion at gcutils.jl:86
 [2] setindex! at /home/lwilcox/.julia/packages/StaticArrays/VyRz3/src/MArray.jl:130
 [3] knl! at /home/lwilcox/research/code/nps_julialab_2019_Jan/bug.jl:9
Reason: unsupported call to an unknown function (call to julia.pointer_from_objref)
Stacktrace:
 [1] pointer_from_objref at pointer.jl:143
 [2] macro expansion at gcutils.jl:87
 [3] setindex! at /home/lwilcox/.julia/packages/StaticArrays/VyRz3/src/MArray.jl:130
 [4] knl! at /home/lwilcox/research/code/nps_julialab_2019_Jan/bug.jl:9
Reason: unsupported call to an unknown function (call to llvm.julia.gc_preserve_end)
Stacktrace:
 [1] macro expansion at gcutils.jl:88
 [2] setindex! at /home/lwilcox/.julia/packages/StaticArrays/VyRz3/src/MArray.jl:130
 [3] knl! at /home/lwilcox/research/code/nps_julialab_2019_Jan/bug.jl:9
Stacktrace:
 [1] check_ir at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/compiler/validation.jl:77 [inlined]
 [2] compile(::CUDAnative.CompilerContext) at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/compiler/driver.jl:90
 [3] #compile#96 at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/compiler/driver.jl:38 [inlined]
 [4] compile at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/compiler/driver.jl:36 [inlined]
 [5] #compile#95 at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/compiler/driver.jl:18 [inlined]
 [6] compile at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/compiler/driver.jl:16 [inlined]
 [7] macro expansion at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/execution.jl:269 [inlined]
 [8] #cufunction#110(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(cufunction), ::typeof(knl!), ::Type{Tuple{}}) at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/execution.jl:240
 [9] cufunction(::Function, ::Type) at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/execution.jl:240
 [10] top-level scope at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/execution.jl:208
 [11] top-level scope at gcutils.jl:87
 [12] top-level scope at /home/lwilcox/.julia/packages/CUDAnative/YwBhJ/src/execution.jl:205
 [13] include at ./boot.jl:325 [inlined]
 [14] include_relative(::Module, ::String) at ./loading.jl:1041
 [15] include(::Module, ::String) at ./Base.jl:29
 [16] include(::String) at ./client.jl:443
 [17] top-level scope at REPL[2]:1

@maleadt
Member

maleadt commented Feb 28, 2019

I'm seeing the same behavior, while code_llvm seems to work fine:

julia> CUDAnative.code_llvm(knl!, Tuple{})

define void @julia_knl__29() {
top:
  ret void
}

Not sure what's up, I'll try to have a look tomorrow.

@lcw

lcw commented Feb 28, 2019

Thanks!

@maleadt
Member

maleadt commented Mar 1, 2019

Could you try again?

@lcw

lcw commented Mar 1, 2019

On the latest branch the error messages have gone away!
