Port accumulate! and findall from CUDA.jl #348

Closed
zhenwu0728 opened this issue May 9, 2024 · 8 comments
Labels: good first issue (Good for newcomers)

Comments

@zhenwu0728 (Contributor) commented May 9, 2024

Hi there, this is a really great package. Thanks for your great efforts!

My question is: is there any plan to support functions like accumulate!, cumsum, and findall in Metal.jl, as CUDA.jl does?
Or could you point me to any resources I could follow to implement these functions in Metal.jl myself?

Thanks!

@zhenwu0728 zhenwu0728 changed the title from "Any plan to support cumsum and findall like CUDA.jl?" to "Any plan to support accumulate! and findall like CUDA.jl?" on May 10, 2024
@christiangnrd (Contributor) commented:

Hi zhenwu0728, thank you for your interest!

The Metal Performance Shaders (MPS) library seems to provide functionality for cumsum. I have a local branch with the library wrappers implemented, but I won't be returning to it until I finish my thesis, so feel free to attempt it yourself. There are a few recent PRs that you could use as a template.

Another option for the cumsum and the others is to adapt the CUDA.jl implementations for Metal.jl.

@christiangnrd christiangnrd added the enhancement and help wanted (Extra attention is needed) labels on May 10, 2024
@maleadt (Member) commented May 13, 2024

> Another option for the cumsum and the others is to adapt the CUDA.jl implementations for Metal.jl.

Or, alternatively, port them to GPUArrays.jl, which would make them available to all back-ends.

@zhenwu0728 (Contributor, Author) commented:

> Or, alternatively, port them to GPUArrays.jl, which would make them available to all back-ends.

@maleadt Any ideas when this will happen? I've seen similar code for accumulate in CUDA.jl and AMDGPU.jl.

@maleadt (Member) commented May 14, 2024

> Any ideas when this will happen?

Probably only after GPUArrays.jl migrates to KernelAbstractions.jl, which is still some weeks to months off. So feel free to take a stab at a native (i.e. in Metal.jl) implementation first if you require this functionality.
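
For anyone attempting that, here is a minimal, untested sketch of what a native row-wise cumsum could look like using Metal.jl's kernel programming interface (@metal, thread_position_in_grid_1d). The kernel name, launch configuration, and overall structure are illustrative assumptions, not an established Metal.jl API:

using Metal

# Sketch: one thread per row, each thread scans its row sequentially (dims=2).
function rowwise_cumsum_kernel!(B, A)
    i = thread_position_in_grid_1d()
    if i <= size(A, 1)
        acc = zero(eltype(B))
        for k in 1:size(A, 2)
            @inbounds acc += A[i, k]
            @inbounds B[i, k] = acc
        end
    end
    return
end

A = MtlArray(rand(Float32, 1024, 16))
B = similar(A)
threads = 256
groups = cld(size(A, 1), threads)
@metal threads=threads groups=groups rowwise_cumsum_kernel!(B, A)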

@maleadt maleadt changed the title from "Any plan to support accumulate! and findall like CUDA.jl?" to "Port accumulate! and findall from CUDA.jl" on May 24, 2024
@maleadt maleadt added the good first issue (Good for newcomers) label and removed the help wanted (Extra attention is needed) label on May 24, 2024
@cncastillo commented Jun 19, 2024

If anyone else encounters this problem and really needs cumsum, I implemented a very simple version using GPUArrays.jl that does cumsum(x; dims=2) (the only version I needed, but it could easily be modified for dims=1):

# cumsum(x; dims=2), thanks to https://pde-on-gpu.vaw.ethz.ch/lecture10/
using GPUArrays  # provides AnyGPUMatrix, gpu_call, and @cartesianidx

function cumsum2(A::AnyGPUMatrix{T}) where {T}
    B = similar(A)
    # Launch one thread per row; each thread scans its row sequentially.
    gpu_call(B, A; name="cumsum!", elements=size(A, 1)) do ctx, B, A
        idx = @cartesianidx B
        i, j = Tuple(idx)
        cur_val = zero(T)
        for k in 1:size(A, 2)
            @inbounds cur_val += A[i, k]
            @inbounds B[i, k] = cur_val
        end
        return
    end
    return B
end
# Potential improvements: use shared memory and tuned block/grid sizes

At least in my tests, it seems to be comparable to CUDA's cumsum and multiple times faster than doing gpu(cumsum(cpu(x))).

# Benchmarking (run on a CUDA GPU; gpu/cpu are array-transfer helpers, e.g. CuArray/Array or Flux's gpu/cpu)
using BenchmarkTools, CUDA, Statistics
b1 = @benchmark CUDA.@sync cumsum2($A)                  # My cumsum
b2 = @benchmark CUDA.@sync cumsum($A; dims=2)           # CUDA
b3 = @benchmark CUDA.@sync gpu(cumsum(cpu($A); dims=2)) # Current workaround
speedup_compared_to_cuda = mean(b2.times) / mean(b1.times)
speedup_compared_to_current = mean(b3.times) / mean(b1.times)

# A = randn(Float32, 101, 8) |> gpu
# speedup_compared_to_cuda = 1.1598880188052891
# speedup_compared_to_current = 1.7993942371142466

# A = randn(Float32, 10_001, 800) |> gpu
# speedup_compared_to_cuda = 3.420802824285882
# speedup_compared_to_current = 89.42437204572116

This is also an issue for oneAPI.jl, but the GPUArrays.jl solution should apply to both.

@christiangnrd (Contributor) commented:

Resolved by #377 and #382.
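
For reference, once a Metal.jl release containing those PRs is available, the ported functionality should be reachable through the usual Base API on MtlArrays. A minimal usage sketch under that assumption (not verified against a specific release):

using Metal

x = MtlArray(rand(Float32, 1_000))
y = similar(x)

accumulate!(+, y, x)         # in-place prefix sum on the GPU
z = cumsum(x)                # out-of-place variant
idx = findall(>(0.5f0), x)   # indices of elements greater than 0.5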

@zhenwu0728 (Contributor, Author) commented:

Can you register a new version of the package? I would like to use these two functions. Thanks!

@christiangnrd (Contributor) commented:

#383
