
[WIP] fix permutedims #334

Closed · GiggleLiu wants to merge 2 commits

Conversation

@GiggleLiu (Contributor) commented Nov 9, 2020

Fixes its behavior on high-dimensional arrays; see issue #333.

@maleadt (Member) commented Nov 9, 2020

Why does this work with an NTuple of 16+ elements but not with a similar CartesianIndex? I'd rather not duplicate the Cartesian<->linear conversions here...

@GiggleLiu (Contributor, Author)

julia> x = (fill(2, 15)...,)
(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)

julia> @benchmark CartesianIndices($x)[3]
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     84.947 ns (0.00% GC)
  median time:      85.537 ns (0.00% GC)
  mean time:        86.939 ns (0.00% GC)
  maximum time:     183.274 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     963

julia> x = (fill(2, 16)...,)
(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)

julia> @benchmark CartesianIndices($x)[3]
BenchmarkTools.Trial: 
  memory estimate:  4.39 KiB
  allocs estimate:  73
  --------------
  minimum time:     3.644 μs (0.00% GC)
  median time:      3.713 μs (0.00% GC)
  mean time:        4.002 μs (0.98% GC)
  maximum time:     68.644 μs (92.66% GC)
  --------------
  samples:          10000
  evals/sample:     8

I think it might be related to tuple splatting. The current CartesianIndices getindex is:

function getindex(A::AbstractArray, I...)
    @_propagate_inbounds_meta
    error_if_canonical_getindex(IndexStyle(A), A, I...)
    _getindex(IndexStyle(A), A, to_indices(A, I)...)
end

A related issue is JuliaLang/julia#27398, but that appears to have been fixed back in 2018...
@Roger-luo do you know what is happening in this case?
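
For reference, one way to confirm this is an inference limit rather than an algorithmic cost is @inferred from the Test stdlib (a minimal check; the behavior in the comments is my expectation based on the allocations in the benchmark above, not captured output):

julia> using Test

julia> @inferred CartesianIndices((fill(2, 15)...,))[3];  # infers: returns a CartesianIndex{15}

julia> @inferred CartesianIndices((fill(2, 16)...,))[3];  # presumably throws: inference gives up past 15 dims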

@GiggleLiu (Contributor, Author) commented Nov 9, 2020

For n < 15, this version is slightly faster too, though I haven't benchmarked it on GPU yet.
NOTE: division is more time-consuming than multiplication, which is why l2c is much slower than c2l.
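
Since l2c and c2l aren't defined above, here is a minimal sketch of what such helpers could look like (my assumption of their shape, not necessarily this PR's code): a divrem-based linear-to-Cartesian conversion and a stride-based Cartesian-to-linear one, both column-major and 1-based.

# Hypothetical sketches, not the PR's actual definitions.

# l2c: linear index -> Cartesian index tuple; one divrem per dimension,
# which is what makes it division-bound.
l2c(sz::Tuple{}, l::Int) = ()
function l2c(sz::Tuple, l::Int)
    d, r = divrem(l - 1, sz[1])
    return (r + 1, l2c(Base.tail(sz), d + 1)...)
end

# c2l: Cartesian index tuple -> linear index; multiplications only.
function c2l(sz::NTuple{N,Int}, c::NTuple{N,Int}) where N
    l = c[1]
    stride = 1
    for i in 2:N
        stride *= sz[i-1]
        l += (c[i] - 1) * stride
    end
    return l
end

With these definitions, l2c((2, 3), 4) == (2, 2) and c2l((2, 3), (2, 2)) == 4, matching CartesianIndices((2, 3))[4] and LinearIndices((2, 3))[2, 2].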

julia> @benchmark l2c(x, i) setup=(x=(fill(2, 15)...,); i=1008)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     103.222 ns (0.00% GC)
  median time:      104.035 ns (0.00% GC)
  mean time:        119.728 ns (6.55% GC)
  maximum time:     4.683 μs (97.24% GC)
  --------------
  samples:          10000
  evals/sample:     939

julia> @benchmark CartesianIndices(x)[i] setup=(x=(fill(2, 15)...,); i=1008)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     105.539 ns (0.00% GC)
  median time:      106.783 ns (0.00% GC)
  mean time:        121.469 ns (6.21% GC)
  maximum time:     4.444 μs (97.12% GC)
  --------------
  samples:          10000
  evals/sample:     934

julia> @benchmark c2l(x, c) setup=(x=(fill(2, 15)...,); c=ntuple(x->rand(1:2), 15))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.292 ns (0.00% GC)
  median time:      21.870 ns (0.00% GC)
  mean time:        24.027 ns (3.14% GC)
  maximum time:     3.807 μs (99.26% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark LinearIndices(x)[c...] setup=(x=(fill(2, 15)...,); c=ntuple(x->rand(1:2), 15))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.138 ns (0.00% GC)
  median time:      29.124 ns (0.00% GC)
  mean time:        31.420 ns (2.47% GC)
  maximum time:     3.938 μs (98.96% GC)
  --------------
  samples:          10000
  evals/sample:     995

@GiggleLiu (Contributor, Author) commented Dec 11, 2020

@maleadt Maybe we should not use Base.CartesianIndex in GPUArrays. It causes dynamic invocation in tensor-network applications.

e.g.

julia> cz = CUDA.zeros(fill(2, 20)...);

julia> cz / Float32(0.3)
ERROR: InvalidIRError: compiling kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceArray{Float32,20,1}, Base.Broadcast.Broadcasted{Nothing,NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,20,1},NTuple{20,Bool},NTuple{20,Int64}},Float32}}, Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to getindex)
Stacktrace:
 [1] macro expansion at /home/leo/.julia/dev/GPUArrays/src/device/indexing.jl:81
 [2] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:61
Reason: unsupported dynamic function invocation (call to getindex)
Stacktrace:
 [1] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:62
Reason: unsupported dynamic function invocation (call to setindex!)
Stacktrace:
 [1] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:62
Reason: unsupported call through a literal pointer (call to jl_alloc_array_1d)
Stacktrace:
 [1] Array at boot.jl:406
 [2] map at tuple.jl:168
 [3] axes at abstractarray.jl:75
 [4] CartesianIndices at multidimensional.jl:264
 [5] macro expansion at /home/leo/.julia/dev/GPUArrays/src/device/indexing.jl:81
 [6] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:61
Reason: unsupported call to the Julia runtime (call to jl_f__apply_iterate)
Stacktrace:
 [1] map at tuple.jl:172
 [2] axes at abstractarray.jl:75
 [3] CartesianIndices at multidimensional.jl:264
 [4] macro expansion at /home/leo/.julia/dev/GPUArrays/src/device/indexing.jl:81
 [5] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:61
Reason: unsupported dynamic function invocation (call to CartesianIndices)
Stacktrace:
 [1] CartesianIndices at multidimensional.jl:264
 [2] macro expansion at /home/leo/.julia/dev/GPUArrays/src/device/indexing.jl:81
 [3] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:61
Stacktrace:
 [1] check_ir(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}, ::LLVM.Module) at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/validation.jl:123
 [2] macro expansion at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:239 [inlined]
 [3] macro expansion at /home/leo/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [4] codegen(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:237
 [5] compile(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:39
 [6] compile at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:35 [inlined]
 [7] cufunction_compile(::GPUCompiler.FunctionSpec; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/leo/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:310
 [8] cufunction_compile(::GPUCompiler.FunctionSpec) at /home/leo/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:305
 [9] check_cache(::Dict{UInt64,Any}, ::Any, ::Any, ::GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#12",Tuple{CUDA.CuKernelContext,CuDeviceArray{Float32,20,1},Base.Broadcast.Broadcasted{Nothing,NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,20,1},NTuple{20,Bool},NTuple{20,Int64}},Float32}},Int64}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:40
 [10] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:60 [inlined]
 [11] cached_compilation at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:65 [inlined]
 [12] cufunction(::GPUArrays.var"#broadcast_kernel#12", ::Type{Tuple{CUDA.CuKernelContext,CuDeviceArray{Float32,20,1},Base.Broadcast.Broadcasted{Nothing,NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,20,1},NTuple{20,Bool},NTuple{20,Int64}},Float32}},Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/leo/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:297
 [13] cufunction at /home/leo/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:294 [inlined]
 [14] launch_heuristic(::CUDA.CuArrayBackend, ::GPUArrays.var"#broadcast_kernel#12", ::CuArray{Float32,20}, ::Base.Broadcast.Broadcasted{Nothing,NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{Base.Broadcast.Extruded{CuArray{Float32,20},NTuple{20,Bool},NTuple{20,Int64}},Float32}}, ::Int64; maximize_blocksize::Bool) at /home/leo/.julia/packages/CUDA/YeS8q/src/gpuarrays.jl:19
 [15] launch_heuristic at /home/leo/.julia/packages/CUDA/YeS8q/src/gpuarrays.jl:17 [inlined]
 [16] copyto!(::CuArray{Float32,20}, ::Base.Broadcast.Broadcasted{Nothing,NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{CuArray{Float32,20},Float32}}) at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:66
 [17] copyto! at ./broadcast.jl:886 [inlined]
 [18] copy(::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{20},NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{CuArray{Float32,20},Float32}}) at ./broadcast.jl:862
 [19] materialize at ./broadcast.jl:837 [inlined]
 [20] broadcast_preserving_zero_d at ./broadcast.jl:826 [inlined]
 [21] /(::CuArray{Float32,20}, ::Float32) at ./arraymath.jl:55
 [22] top-level scope at REPL[16]:1

Shall I go ahead and remove all uses of CartesianIndices? Or should I submit a PR to Julia Base to change the behavior of CartesianIndices?

EDIT: I just looked into the Julia Base code; it is not easy to fix.

@maleadt (Member) commented Dec 14, 2020

Shall I go ahead to remove all CartesianIndices?

That seems like the nuclear option. As I already expressed in this PR, I'm really not a fan of duplicating functionality, let alone removing reuse of functionality that works just fine in almost all cases.

Could you isolate the actual problem with permutedims, or CartesianIndices in general? I'd rather parametrize the compiler such that these patterns do just work.

@GiggleLiu (Contributor, Author)

Closing because this issue is resolved by @kshyatt's PR upstream: JuliaLang/julia#40468.

Upgrading to Julia 1.7 gets rid of this issue.
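
A quick way to verify after upgrading is to rerun the repro from above (the comment reflects the expected outcome, not captured output):

julia> cz = CUDA.zeros(fill(2, 20)...);

julia> cz / Float32(0.3);   # should now compile and run without InvalidIRError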

@GiggleLiu GiggleLiu closed this Sep 16, 2021