
repeat for Julia 1.6 and higher #357

Closed
torfjelde wants to merge 7 commits

Conversation

torfjelde
Contributor

This PR adds an implementation of Base.repeat for arrays of arbitrary dimensionality. It unfortunately makes use of Base._RepeatInnerOuter, which is only available on Julia ≥ 1.6, but AFAIK this simplifies the implementation (plus my motivation was usage on Julia v1.6).

I haven't looked into an implementation for < 1.6, but I'm guessing we could at least reuse the kernel there too if we want to support earlier versions.
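For illustration only, here is a minimal CUDA.jl-style sketch of the output-parallel approach (hypothetical names, not the code in this PR): one thread per output element, with an integer division to recover the source index for the inner-repeat case of a vector.

```julia
using CUDA

# Hypothetical sketch (not this PR's code): inner-repeat a vector with one
# thread per *output* element. Each thread divides its output index to find
# the source index -- the division discussed further down in this thread.
function repeat_inner_kernel!(dst, src, inner)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dst)
        src_ind = (i - 1) ÷ inner + 1   # integer division per output element
        @inbounds dst[i] = src[src_ind]
    end
    return nothing
end

function repeat_inner(src::CuVector, inner::Int)
    dst = similar(src, length(src) * inner)
    kernel_threads = 256
    kernel_blocks = cld(length(dst), kernel_threads)
    @cuda threads=kernel_threads blocks=kernel_blocks repeat_inner_kernel!(dst, src, inner)
    return dst
end
```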

@maleadt
Member

maleadt commented May 15, 2021

GPUArrays is 1.6+ only, so that shouldn't matter 🙂

@torfjelde
Contributor Author

> GPUArrays is 1.6+ only, so that shouldn't matter

I thought this was the case because of the recent work you've done on GPUCompiler, but doesn't Project.toml say 1.5?

@torfjelde
Contributor Author

Tests are passing locally now, and I dropped the check for Julia v1.6 :)

@maleadt
Member

maleadt commented May 15, 2021

Oh whoops, I was confusing it with GPUCompiler. Yeah, we can probably just bump it to 1.6 here then.

@torfjelde
Contributor Author

Dope :) Did so now.

@maleadt
Member

maleadt commented May 17, 2021

Looking good. I'm still concerned by the multiple reads of the input data though. Would it not be better to flip the inner and outer repetition kernels around, launching one thread per element of the input, and having a for loop in the kernel looping over every output element corresponding to this input? That would also avoid the costly division that happens now to calculate the src_inds.

One potential issue is when the array to repeat is very small. A possible solution here would be to pass in the number of elements each thread has to process, so that it would still be possible to launch one thread per output element (with the number of elements to process then being one). This is similar to https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/, and would allow for a heuristic to determine a configuration (a simple one would be to ensure we have a few hundred threads).
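A rough sketch of that suggestion, again in CUDA.jl style with hypothetical names (not actual PR code): one thread per input element, a loop writing that element's copies, and a grid-stride outer loop so the launch configuration can be chosen independently of the input size.

```julia
using CUDA

# Hypothetical sketch of the flipped approach: one thread per *input* element
# writes all of its `inner` copies, so each input value is read exactly once
# and no division is needed. The grid-stride loop decouples the launch
# configuration from the input size, as in the linked NVIDIA post.
function repeat_inner_kernel2!(dst, src, inner)
    stride = blockDim().x * gridDim().x
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    while i <= length(src)
        @inbounds v = src[i]
        base = (i - 1) * inner
        for k in 1:inner
            @inbounds dst[base + k] = v
        end
        i += stride
    end
    return nothing
end
```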

@torfjelde
Contributor Author

> I'm still concerned by the multiple reads of the input data though. Would it not be better to flip the inner and outer repetition kernels around, launching one thread per element of the input, and having a for loop in the kernel looping over every output element corresponding to this input? That would also avoid the costly division that happens now to calculate the src_inds.

This was my first "naive" attempt, but then I thought "given the parallelizable nature of GPUs, surely it's better to parallelize over the larger of the two, i.e. the output, rather than using a sequential for-loop?". But this might very well be the wrong intuition, as I have no idea how expensive memory reads/thread spawns/etc. vs. loops are on a GPU :)

> One potential issue is when the array to repeat is very small. A possible solution here would be to pass in the number of elements each thread has to process, so that it would still be possible to launch one thread per output element (with the number of elements to process then being one). This is similar to https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/, and would allow for a heuristic to determine a configuration (a simple one would be to ensure we have a few hundred threads).

Lemme read this and I'll get back to you!

maleadt marked this pull request as draft on May 27, 2021
@mcabbott
Contributor

It should also be possible to implement repeat in terms of broadcasting. This gist has an attempt (which probably needs to be checked & tested):
https://gist.github.com/mcabbott/80ac43cca3bee8f57809155a5240519f
I wonder how this compares to this PR's version for speed?
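For context, a rough illustration of the broadcasting idea for the inner/outer cases of a vector (made-up helper names, not the gist's code): repetition is phrased as a broadcast between the reshaped input and a trivially sized auxiliary array, and `vec` recovers the expected shape.

```julia
# Rough sketch of repeat-via-broadcast for a vector (made-up helper, not the
# gist's code). `_first` ignores its second argument, so broadcasting only
# expands the shape; on the GPU the auxiliary `ones` array would need to be a
# device array as well.
_first(a, _) = a

repeat_inner_bc(x::AbstractVector, r::Integer) =
    vec(_first.(reshape(x, 1, :), ones(Int, r)))      # each element repeated r times

repeat_outer_bc(x::AbstractVector, n::Integer) =
    vec(_first.(reshape(x, :, 1), ones(Int, 1, n)))   # whole vector repeated n times
```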

@maleadt
Member

maleadt commented Jul 6, 2022

Superseded by #400

maleadt closed this on Jul 6, 2022