repeat for Julia 1.6 and higher #357

Conversation
GPUArrays is 1.6+ only, so that shouldn't matter 🙂
I thought this was the case because of the recent work you have done on GPUCompiler, but doesn't Project.toml say 1.5?
Tests are passing locally now, and I dropped the check for Julia v1.6 :)
Oh whoops, I was confusing it with GPUCompiler. Yeah, we can probably just bump it to 1.6 here then.
Dope :) Did so now.
Looking good. I'm still concerned by the multiple reads of the input data though. Would it not be better to flip the inner and outer repetition kernels around, launching one thread per element of the input, and having a for loop in the kernel looping over every output element corresponding to this input? That would also avoid the costly division that happens now to calculate the corresponding input index.

One potential issue is when the array to repeat is very small. A possible solution here would be to pass in the number of elements each thread has to process, so that it would still be possible to launch one thread per output element (with the number of elements to process then being one). This is similar to https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/, and would allow for a heuristic to determine a launch configuration (a simple one would be to ensure we have at least a few hundred threads).
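For context, here is a minimal sketch of the one-thread-per-input, grid-stride pattern being suggested (this is not code from the PR): `repeat_inner_kernel!` is a hypothetical name, CUDA.jl syntax is assumed purely for illustration, and only the 1-D `inner` case is shown.

```julia
using CUDA

# Hypothetical sketch: grid-stride loop over the *input*, where each thread
# reads an input element once and writes all of its output copies.
function repeat_inner_kernel!(out, src, inner)
    # linear index of this thread across the whole grid
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    # grid-stride loop: this thread handles input elements i, i + stride, ...
    while i <= length(src)
        v = src[i]                 # single read of the input element
        base = (i - 1) * inner     # first output slot for this element
        for k in 1:inner           # write all `inner` copies
            out[base + k] = v
        end
        i += stride
    end
    return nothing
end

# Launch a fixed, modest number of threads; the grid-stride loop covers the rest.
src = CUDA.rand(Float32, 10_000)
inner = 4
out = CuArray{Float32}(undef, length(src) * inner)
@cuda threads=256 blocks=4 repeat_inner_kernel!(out, src, inner)
```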
This was my first "naive" attempt, but then I thought "given the parallelizable nature of GPUs, surely it's better to parallelize over the larger of the two, i.e. the output, rather than using a sequential for loop?". But this might very well be the wrong intuition, as I have no idea how expensive memory reads/thread spawns/etc. vs. loops are on a GPU :)

Lemme read this and I'll get back to you!
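For comparison, a sketch of the one-thread-per-output-element approach described above, again with a hypothetical kernel name and CUDA.jl syntax assumed; note the per-thread integer division and that each input element is read `inner` times.

```julia
using CUDA

# Hypothetical sketch: one thread per output element, mapping back to the input.
function repeat_inner_per_output!(out, src, inner)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        j = div(i - 1, inner) + 1   # integer division per output element
        out[i] = src[j]             # `inner` threads read the same src[j]
    end
    return nothing
end

src = CUDA.rand(Float32, 10_000)
inner = 4
out = CuArray{Float32}(undef, length(src) * inner)
threads = 256
blocks = cld(length(out), threads)  # one thread per output element
@cuda threads=threads blocks=blocks repeat_inner_per_output!(out, src, inner)
```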
It should also be possible to implement
Superseded by #400
This PR adds an implementation of `Base.repeat` for arrays of arbitrary dimensionality. It unfortunately makes use of `Base._RepeatInnerOuter`, which is only available on Julia ≥ 1.6, but AFAIK this simplifies the implementation (plus my motivation was usage on Julia v1.6). I haven't looked into an implementation for < 1.6, but I'm guessing we can at least re-use the kernel for those too if we want to support earlier versions.