
[WIP] fix permutedims #334

Closed · GiggleLiu wants to merge 2 commits

Conversation

@GiggleLiu (Contributor) commented Nov 9, 2020

Fixes its behavior on high-dimensional arrays; see issue #333.

@maleadt (Member) commented Nov 9, 2020

Why does this work with an NTuple of 16+ elements but not with a similar CartesianIndex? I'd rather not duplicate the Cartesian<->linear conversions here...

@GiggleLiu (Contributor, Author)

julia> x = (fill(2, 15)...,)
(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)

julia> @benchmark CartesianIndices($x)[3]
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     84.947 ns (0.00% GC)
  median time:      85.537 ns (0.00% GC)
  mean time:        86.939 ns (0.00% GC)
  maximum time:     183.274 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     963

julia> x = (fill(2, 16)...,)
(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)

julia> @benchmark CartesianIndices($x)[3]
BenchmarkTools.Trial: 
  memory estimate:  4.39 KiB
  allocs estimate:  73
  --------------
  minimum time:     3.644 μs (0.00% GC)
  median time:      3.713 μs (0.00% GC)
  mean time:        4.002 μs (0.98% GC)
  maximum time:     68.644 μs (92.66% GC)
  --------------
  samples:          10000
  evals/sample:     8

I think it might be related to tuple splatting. The current CartesianIndices getindex is:

function getindex(A::AbstractArray, I...)
    @_propagate_inbounds_meta
    error_if_canonical_getindex(IndexStyle(A), A, I...)
    _getindex(IndexStyle(A), A, to_indices(A, I)...)
end

A related issue is JuliaLang/julia#27398, but that appears to have been fixed back in 2018...
@Roger-luo do you know what is happening in this case?
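
For reference, one way to confirm this is an inference limit rather than an algorithmic cost is @inferred from the Test stdlib (a minimal check; the behavior in the comments is my expectation based on the allocations in the benchmark above, not captured output):

julia> using Test

julia> @inferred CartesianIndices((fill(2, 15)...,))[3];  # infers: returns a CartesianIndex{15}

julia> @inferred CartesianIndices((fill(2, 16)...,))[3];  # presumably throws: inference gives up past 15 dims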

@GiggleLiu (Contributor, Author) commented Nov 9, 2020

For n < 15, this version is slightly faster too, though I haven't benchmarked it on GPU yet.
NOTE: division is more time-consuming than multiplication, which is why l2c is much slower than c2l.
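
Since l2c and c2l aren't defined above, here is a minimal sketch of what such helpers could look like (my assumption of their shape, not necessarily this PR's code): a divrem-based linear-to-Cartesian conversion and a stride-based Cartesian-to-linear one, both column-major and 1-based.

# Hypothetical sketches, not the PR's actual definitions.

# l2c: linear index -> Cartesian index tuple; one divrem per dimension,
# which is what makes it division-bound.
l2c(sz::Tuple{}, l::Int) = ()
function l2c(sz::Tuple, l::Int)
    d, r = divrem(l - 1, sz[1])
    return (r + 1, l2c(Base.tail(sz), d + 1)...)
end

# c2l: Cartesian index tuple -> linear index; multiplications only.
function c2l(sz::NTuple{N,Int}, c::NTuple{N,Int}) where N
    l = c[1]
    stride = 1
    for i in 2:N
        stride *= sz[i-1]
        l += (c[i] - 1) * stride
    end
    return l
end

With these definitions, l2c((2, 3), 4) == (2, 2) and c2l((2, 3), (2, 2)) == 4, matching CartesianIndices((2, 3))[4] and LinearIndices((2, 3))[2, 2].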

julia> @benchmark l2c(x, i) setup=(x=(fill(2, 15)...,); i=1008)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     103.222 ns (0.00% GC)
  median time:      104.035 ns (0.00% GC)
  mean time:        119.728 ns (6.55% GC)
  maximum time:     4.683 μs (97.24% GC)
  --------------
  samples:          10000
  evals/sample:     939

julia> @benchmark CartesianIndices(x)[i] setup=(x=(fill(2, 15)...,); i=1008)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     105.539 ns (0.00% GC)
  median time:      106.783 ns (0.00% GC)
  mean time:        121.469 ns (6.21% GC)
  maximum time:     4.444 μs (97.12% GC)
  --------------
  samples:          10000
  evals/sample:     934

julia> @benchmark c2l(x, c) setup=(x=(fill(2, 15)...,); c=ntuple(x->rand(1:2), 15))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.292 ns (0.00% GC)
  median time:      21.870 ns (0.00% GC)
  mean time:        24.027 ns (3.14% GC)
  maximum time:     3.807 μs (99.26% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark LinearIndices(x)[c...] setup=(x=(fill(2, 15)...,); c=ntuple(x->rand(1:2), 15))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.138 ns (0.00% GC)
  median time:      29.124 ns (0.00% GC)
  mean time:        31.420 ns (2.47% GC)
  maximum time:     3.938 μs (98.96% GC)
  --------------
  samples:          10000
  evals/sample:     995

@GiggleLiu (Contributor, Author) commented Dec 11, 2020

@maleadt Maybe we should not use Base.CartesianIndex in GPUArrays. It causes dynamic invocation in tensor-network applications.

e.g.

julia> cz = CUDA.zeros(fill(2, 20)...);

julia> cz / Float32(0.3)
ERROR: InvalidIRError: compiling kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceArray{Float32,20,1}, Base.Broadcast.Broadcasted{Nothing,NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,20,1},NTuple{20,Bool},NTuple{20,Int64}},Float32}}, Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to getindex)
Stacktrace:
 [1] macro expansion at /home/leo/.julia/dev/GPUArrays/src/device/indexing.jl:81
 [2] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:61
Reason: unsupported dynamic function invocation (call to getindex)
Stacktrace:
 [1] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:62
Reason: unsupported dynamic function invocation (call to setindex!)
Stacktrace:
 [1] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:62
Reason: unsupported call through a literal pointer (call to jl_alloc_array_1d)
Stacktrace:
 [1] Array at boot.jl:406
 [2] map at tuple.jl:168
 [3] axes at abstractarray.jl:75
 [4] CartesianIndices at multidimensional.jl:264
 [5] macro expansion at /home/leo/.julia/dev/GPUArrays/src/device/indexing.jl:81
 [6] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:61
Reason: unsupported call to the Julia runtime (call to jl_f__apply_iterate)
Stacktrace:
 [1] map at tuple.jl:172
 [2] axes at abstractarray.jl:75
 [3] CartesianIndices at multidimensional.jl:264
 [4] macro expansion at /home/leo/.julia/dev/GPUArrays/src/device/indexing.jl:81
 [5] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:61
Reason: unsupported dynamic function invocation (call to CartesianIndices)
Stacktrace:
 [1] CartesianIndices at multidimensional.jl:264
 [2] macro expansion at /home/leo/.julia/dev/GPUArrays/src/device/indexing.jl:81
 [3] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:61
Stacktrace:
 [1] check_ir(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}, ::LLVM.Module) at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/validation.jl:123
 [2] macro expansion at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:239 [inlined]
 [3] macro expansion at /home/leo/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [4] codegen(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:237
 [5] compile(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:39
 [6] compile at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:35 [inlined]
 [7] cufunction_compile(::GPUCompiler.FunctionSpec; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/leo/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:310
 [8] cufunction_compile(::GPUCompiler.FunctionSpec) at /home/leo/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:305
 [9] check_cache(::Dict{UInt64,Any}, ::Any, ::Any, ::GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#12",Tuple{CUDA.CuKernelContext,CuDeviceArray{Float32,20,1},Base.Broadcast.Broadcasted{Nothing,NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,20,1},NTuple{20,Bool},NTuple{20,Int64}},Float32}},Int64}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:40
 [10] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:60 [inlined]
 [11] cached_compilation at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:65 [inlined]
 [12] cufunction(::GPUArrays.var"#broadcast_kernel#12", ::Type{Tuple{CUDA.CuKernelContext,CuDeviceArray{Float32,20,1},Base.Broadcast.Broadcasted{Nothing,NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,20,1},NTuple{20,Bool},NTuple{20,Int64}},Float32}},Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/leo/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:297
 [13] cufunction at /home/leo/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:294 [inlined]
 [14] launch_heuristic(::CUDA.CuArrayBackend, ::GPUArrays.var"#broadcast_kernel#12", ::CuArray{Float32,20}, ::Base.Broadcast.Broadcasted{Nothing,NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{Base.Broadcast.Extruded{CuArray{Float32,20},NTuple{20,Bool},NTuple{20,Int64}},Float32}}, ::Int64; maximize_blocksize::Bool) at /home/leo/.julia/packages/CUDA/YeS8q/src/gpuarrays.jl:19
 [15] launch_heuristic at /home/leo/.julia/packages/CUDA/YeS8q/src/gpuarrays.jl:17 [inlined]
 [16] copyto!(::CuArray{Float32,20}, ::Base.Broadcast.Broadcasted{Nothing,NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{CuArray{Float32,20},Float32}}) at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:66
 [17] copyto! at ./broadcast.jl:886 [inlined]
 [18] copy(::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{20},NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{CuArray{Float32,20},Float32}}) at ./broadcast.jl:862
 [19] materialize at ./broadcast.jl:837 [inlined]
 [20] broadcast_preserving_zero_d at ./broadcast.jl:826 [inlined]
 [21] /(::CuArray{Float32,20}, ::Float32) at ./arraymath.jl:55
 [22] top-level scope at REPL[16]:1

Shall I go ahead and remove all uses of CartesianIndices? Or should I submit a PR to Julia Base to change the behavior of CartesianIndices?

EDIT: I just looked into the Julia Base code; it is not easy to fix.

@maleadt (Member) commented Dec 14, 2020

Shall I go ahead to remove all CartesianIndices?

That seems like the nuclear option. As I already expressed in this PR, I'm really not a fan of duplicating functionality, let alone removing reuse of functionality that works just fine in almost all cases.

Could you isolate the actual problem with permutedims, or CartesianIndices in general? I'd rather parametrize the compiler such that these patterns do just work.

@GiggleLiu (Contributor, Author)

Closing because this issue is resolved by @kshyatt's PR upstream: JuliaLang/julia#40468.

Upgrading to Julia 1.7 gets rid of this issue.
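
A quick way to verify after upgrading is to rerun the repro from above (the comment reflects the expected outcome, not captured output):

julia> cz = CUDA.zeros(fill(2, 20)...);

julia> cz / Float32(0.3);   # should now compile and run without InvalidIRError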

@GiggleLiu GiggleLiu closed this Sep 16, 2021