
repeat for Julia 1.6 and higher #357

Closed
torfjelde wants to merge 7 commits

Conversation

torfjelde
Contributor

This PR adds an implementation of Base.repeat for arrays of arbitrary dimensionality. It unfortunately makes use of Base._RepeatInnerOuter, which is only available on Julia ≥ 1.6, but AFAIK this simplifies the implementation (plus my motivation was usage on Julia v1.6).

I haven't looked into an implementation for < 1.6, but I'm guessing we could at least reuse the kernel there too if we want to support earlier versions.
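For illustration only, here is a minimal CUDA.jl-style sketch of the output-parallel approach (hypothetical names, not the code in this PR): one thread per output element, with an integer division to recover the source index for the inner-repeat case of a vector.

```julia
using CUDA

# Hypothetical sketch (not this PR's code): inner-repeat a vector with one
# thread per *output* element. Each thread divides its output index to find
# the source index -- the division discussed further down in this thread.
function repeat_inner_kernel!(dst, src, inner)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dst)
        src_ind = (i - 1) ÷ inner + 1   # integer division per output element
        @inbounds dst[i] = src[src_ind]
    end
    return nothing
end

function repeat_inner(src::CuVector, inner::Int)
    dst = similar(src, length(src) * inner)
    kernel_threads = 256
    kernel_blocks = cld(length(dst), kernel_threads)
    @cuda threads=kernel_threads blocks=kernel_blocks repeat_inner_kernel!(dst, src, inner)
    return dst
end
```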

@maleadt
Member

maleadt commented May 15, 2021

GPUArrays is 1.6+ only, so that shouldn't matter 🙂

@torfjelde
Contributor Author

> GPUArrays is 1.6+ only, so that shouldn't matter

I thought this was the case because of the recent work you've done on GPUCompiler, but doesn't Project.toml say 1.5?

@torfjelde
Contributor Author

Tests are passing locally now, and I dropped the check for Julia v1.6 :)

@maleadt
Member

maleadt commented May 15, 2021

Oh whoops, I was confusing it with GPUCompiler. Yeah, we can probably just bump it to 1.6 here then.

@torfjelde
Contributor Author

Dope :) Did so now.

@maleadt
Member

maleadt commented May 17, 2021

Looking good. I'm still concerned by the multiple reads of the input data though. Would it not be better to flip the inner and outer repetition kernels around, launching one thread per element of the input, and having a for loop in the kernel looping over every output element corresponding to this input? That would also avoid the costly division that happens now to calculate the src_inds.

One potential issue is when the array to repeat is very small. A possible solution here would be to pass in the number of elements each thread has to process, so that it would still be possible to launch one thread per output element (with the number of elements to process then being one). This is similar to https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/, and would allow for a heuristic to determine a configuration (a simple one would be to ensure we have a few hundred threads).
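A rough sketch of that suggestion, again in CUDA.jl style with hypothetical names (not actual PR code): one thread per input element, a loop writing that element's copies, and a grid-stride outer loop so the launch configuration can be chosen independently of the input size.

```julia
using CUDA

# Hypothetical sketch of the flipped approach: one thread per *input* element
# writes all of its `inner` copies, so each input value is read exactly once
# and no division is needed. The grid-stride loop decouples the launch
# configuration from the input size, as in the linked NVIDIA post.
function repeat_inner_kernel2!(dst, src, inner)
    stride = blockDim().x * gridDim().x
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    while i <= length(src)
        @inbounds v = src[i]
        base = (i - 1) * inner
        for k in 1:inner
            @inbounds dst[base + k] = v
        end
        i += stride
    end
    return nothing
end
```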

@torfjelde
Contributor Author

> I'm still concerned by the multiple reads of the input data though. Would it not be better to flip the inner and outer repetition kernels around, launching one thread per element of the input, and having a for loop in the kernel looping over every output element corresponding to this input? That would also avoid the costly division that happens now to calculate the src_inds.

This was my first "naive" attempt, but then I thought "given the parallelizable nature of GPUs, surely it's better to parallelize over the larger of the two, i.e. the output, rather than using a sequential for-loop?". But this might very well be the wrong intuition, as I have no idea how expensive memory reads/thread spawns/etc. vs. loops are on a GPU :)

> One potential issue is when the array to repeat is very small. A possible solution here would be to pass in the number of elements each thread has to process, so that it would still be possible to launch one thread per output element (with the number of elements to process then being one). This is similar to https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/, and would allow for a heuristic to determine a configuration (a simple one would be to ensure we have a few hundred threads).

Lemme read this and I'll get back to you!

maleadt marked this pull request as draft on May 27, 2021
@mcabbott
Contributor

It should also be possible to implement repeat in terms of broadcasting. This gist has an attempt (which probably needs to be checked & tested):
https://gist.github.com/mcabbott/80ac43cca3bee8f57809155a5240519f
I wonder how this compares to this PR's version for speed?
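For context, a rough illustration of the broadcasting idea for the inner/outer cases of a vector (made-up helper names, not the gist's code): repetition is phrased as a broadcast between the reshaped input and a trivially sized auxiliary array, and `vec` recovers the expected shape.

```julia
# Rough sketch of repeat-via-broadcast for a vector (made-up helper, not the
# gist's code). `_first` ignores its second argument, so broadcasting only
# expands the shape; on the GPU the auxiliary `ones` array would need to be a
# device array as well.
_first(a, _) = a

repeat_inner_bc(x::AbstractVector, r::Integer) =
    vec(_first.(reshape(x, 1, :), ones(Int, r)))      # each element repeated r times

repeat_outer_bc(x::AbstractVector, n::Integer) =
    vec(_first.(reshape(x, :, 1), ones(Int, 1, n)))   # whole vector repeated n times
```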

@maleadt
Member

maleadt commented Jul 6, 2022

Superseded by #400

maleadt closed this on Jul 6, 2022