Port accumulate! and findall from CUDA.jl #348

Closed
zhenwu0728 opened this issue May 9, 2024 · 8 comments
Labels: good first issue (Good for newcomers)

Comments

@zhenwu0728 (Contributor) commented May 9, 2024

Hi there, this is a really great package. Thanks for your great efforts!

My question is: is there any plan to support functions like accumulate!, cumsum, and findall in Metal.jl, as CUDA.jl does?
Or could you point me to any resources I could follow to implement these functions in Metal.jl myself?

Thanks!

@zhenwu0728 zhenwu0728 changed the title from "Any plan to support cumsum and findall like CUDA.jl?" to "Any plan to support accumulate! and findall like CUDA.jl?" on May 10, 2024
@christiangnrd (Contributor) commented:

Hi zhenwu0728, thank you for your interest!

The Metal Performance Shaders (MPS) library seems to provide functionality for cumsum. I have a local branch with the library wrappers implemented, but I won't be returning to it until I finish my thesis, so feel free to attempt it yourself. There are a few recent PRs that you could use as a template.

Another option for the cumsum and the others is to adapt the CUDA.jl implementations for Metal.jl.

@christiangnrd christiangnrd added the enhancement and help wanted (Extra attention is needed) labels on May 10, 2024
@maleadt (Member) commented May 13, 2024

> Another option for the cumsum and the others is to adapt the CUDA.jl implementations for Metal.jl.

Or, alternatively, port them to GPUArrays.jl, which would make them available to all back-ends.

@zhenwu0728 (Contributor, Author) commented:

> Or, alternatively, port them to GPUArrays.jl, which would make them available to all back-ends.

@maleadt Any ideas when this will happen? I've seen similar code for accumulate in CUDA.jl and AMDGPU.jl.

@maleadt (Member) commented May 14, 2024

> Any ideas when this will happen?

Probably only after GPUArrays.jl migrates to KernelAbstractions.jl, which is still some weeks to months off. So feel free to take a stab at a native (i.e. in Metal.jl) implementation first if you require this functionality.
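
For anyone attempting that, here is a minimal, untested sketch of what a native row-wise cumsum could look like using Metal.jl's kernel programming interface (@metal, thread_position_in_grid_1d). The kernel name, launch configuration, and overall structure are illustrative assumptions, not an established Metal.jl API:

using Metal

# Sketch: one thread per row, each thread scans its row sequentially (dims=2).
function rowwise_cumsum_kernel!(B, A)
    i = thread_position_in_grid_1d()
    if i <= size(A, 1)
        acc = zero(eltype(B))
        for k in 1:size(A, 2)
            @inbounds acc += A[i, k]
            @inbounds B[i, k] = acc
        end
    end
    return
end

A = MtlArray(rand(Float32, 1024, 16))
B = similar(A)
threads = 256
groups = cld(size(A, 1), threads)
@metal threads=threads groups=groups rowwise_cumsum_kernel!(B, A)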

@maleadt maleadt changed the title from "Any plan to support accumulate! and findall like CUDA.jl?" to "Port accumulate! and findall from CUDA.jl" on May 24, 2024
@maleadt maleadt added the good first issue (Good for newcomers) label and removed the help wanted (Extra attention is needed) label on May 24, 2024
@cncastillo commented Jun 19, 2024

If anyone else encounters this problem and really needs cumsum, I implemented a very simple version using GPUArrays.jl that does cumsum(x; dims=2) (the only version I needed, but it could easily be modified for dims=1):

# cumsum(x; dims=2), thanks to https://pde-on-gpu.vaw.ethz.ch/lecture10/
using GPUArrays  # provides AnyGPUMatrix, gpu_call, and @cartesianidx

function cumsum2(A::AnyGPUMatrix{T}) where {T}
    B = similar(A)
    # Launch one thread per row; each thread scans its row sequentially.
    gpu_call(B, A; name="cumsum!", elements=size(A, 1)) do ctx, B, A
        idx = @cartesianidx B
        i, j = Tuple(idx)
        cur_val = zero(T)
        for k in 1:size(A, 2)
            @inbounds cur_val += A[i, k]
            @inbounds B[i, k] = cur_val
        end
        return
    end
    return B
end
# Potential improvements: use shared memory and tuned block/grid sizes

At least in my tests, it seems to be comparable to CUDA's cumsum and multiple times faster than doing gpu(cumsum(cpu(x))).

# Benchmarking (run on a CUDA GPU; gpu/cpu are array-transfer helpers, e.g. CuArray/Array or Flux's gpu/cpu)
using BenchmarkTools, CUDA, Statistics
b1 = @benchmark CUDA.@sync cumsum2($A)                  # My cumsum
b2 = @benchmark CUDA.@sync cumsum($A; dims=2)           # CUDA
b3 = @benchmark CUDA.@sync gpu(cumsum(cpu($A); dims=2)) # Current workaround
speedup_compared_to_cuda = mean(b2.times) / mean(b1.times)
speedup_compared_to_current = mean(b3.times) / mean(b1.times)

# A = randn(Float32, 101, 8) |> gpu
# speedup_compared_to_cuda = 1.1598880188052891
# speedup_compared_to_current = 1.7993942371142466

# A = randn(Float32, 10_001, 800) |> gpu
# speedup_compared_to_cuda = 3.420802824285882
# speedup_compared_to_current = 89.42437204572116

This is also an issue for oneAPI.jl, but the GPUArrays.jl solution should apply to both.

@christiangnrd (Contributor) commented:

Resolved by #377 and #382.
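
For reference, once a Metal.jl release containing those PRs is available, the ported functionality should be reachable through the usual Base API on MtlArrays. A minimal usage sketch under that assumption (not verified against a specific release):

using Metal

x = MtlArray(rand(Float32, 1_000))
y = similar(x)

accumulate!(+, y, x)         # in-place prefix sum on the GPU
z = cumsum(x)                # out-of-place variant
idx = findall(>(0.5f0), x)   # indices of elements greater than 0.5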

@zhenwu0728 (Contributor, Author) commented:

Can you register a new version of the package? I would like to use these two functions. Thanks!

@christiangnrd (Contributor) commented:

#383
