Commit b1a8353

Included parallelised implementation of any/all for CPUs, plus cooperative GPU tests. Moved all README examples into Manual
anicusan committed Dec 12, 2024
1 parent 99247e6 commit b1a8353
Showing 17 changed files with 508 additions and 553 deletions.
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "AcceleratedKernels"
uuid = "6a4ca0a5-0e36-4168-a932-d9be78d558f1"
authors = ["Andrei-Leonard Nicusan <leonard@evophase.co.uk> and contributors"]
version = "0.2.1"
version = "0.2.2"

[deps]
ArgCheck = "dce04be8-c92d-5529-be00-80e4d2c0e197"
492 changes: 16 additions & 476 deletions README.md

Large diffs are not rendered by default.

9 changes: 5 additions & 4 deletions docs/src/api/accumulate.md
@@ -1,6 +1,7 @@
### Accumulate / Prefix Sum / Scan

```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.7. `accumulate`") # hide
```

```@docs
AcceleratedKernels.accumulate!
AcceleratedKernels.accumulate
```
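Since the README example was moved out in this commit, a minimal CPU sketch of the interface may help; the exact signature and `init` keyword are assumptions based on the GPU interfaces elsewhere in these docs, not verified against this version:

```julia
import AcceleratedKernels as AK

v = collect(1:10)

# Inclusive prefix sum; `init` is the neutral element of the operator
s = AK.accumulate(+, v; init=0)

@assert s == cumsum(v)
```

The same call works with a GPU array (e.g. a `MtlArray` or `CuArray`) in place of `v`.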
53 changes: 50 additions & 3 deletions docs/src/api/binarysearch.md
@@ -1,6 +1,53 @@
### Binary Search

```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.8. `searchsorted` and friends") # hide
Find the indices where elements `x` should be inserted into a sorted sequence `v` to maintain the sorted order. Effectively, this applies the Julia Base functions in parallel on a GPU using `foreachindex`.
- `searchsortedfirst!` (in-place), `searchsortedfirst` (allocating): index of first element in `v` >= `x[j]`.
- `searchsortedlast!`, `searchsortedlast`: index of last element in `v` <= `x[j]`.
- **Other names**: `thrust::lower_bound` / `thrust::upper_bound`, `std::lower_bound` / `std::upper_bound`.

Function signature:
```julia
# GPU
searchsortedfirst!(ix::AbstractGPUVector, v::AbstractGPUVector, x::AbstractGPUVector;
by=identity, lt=(<), rev::Bool=false,
block_size::Int=256)
searchsortedfirst(v::AbstractGPUVector, x::AbstractGPUVector;
by=identity, lt=(<), rev::Bool=false,
block_size::Int=256)
searchsortedlast!(ix::AbstractGPUVector, v::AbstractGPUVector, x::AbstractGPUVector;
by=identity, lt=(<), rev::Bool=false,
block_size::Int=256)
searchsortedlast(v::AbstractGPUVector, x::AbstractGPUVector;
by=identity, lt=(<), rev::Bool=false,
block_size::Int=256)

# CPU
searchsortedfirst!(ix::AbstractVector, v::AbstractVector, x::AbstractVector;
by=identity, lt=(<), rev::Bool=false,
max_tasks::Int=Threads.nthreads(), min_elems::Int=1000)
searchsortedfirst(v::AbstractVector, x::AbstractVector;
by=identity, lt=(<), rev::Bool=false,
max_tasks::Int=Threads.nthreads(), min_elems::Int=1000)
searchsortedlast!(ix::AbstractVector, v::AbstractVector, x::AbstractVector;
by=identity, lt=(<), rev::Bool=false,
max_tasks::Int=Threads.nthreads(), min_elems::Int=1000)
searchsortedlast(v::AbstractVector, x::AbstractVector;
by=identity, lt=(<), rev::Bool=false,
max_tasks::Int=Threads.nthreads(), min_elems::Int=1000)
```

Example:
```julia
import AcceleratedKernels as AK
using Metal

# Sorted array
v = MtlArray(rand(Float32, 100_000))
AK.merge_sort!(v)

# Elements `x` to place within `v` at indices `ix`
x = MtlArray(rand(Float32, 10_000))
ix = MtlArray{Int}(undef, 10_000)

AK.searchsortedfirst!(ix, v, x)
```
69 changes: 68 additions & 1 deletion docs/src/api/foreachindex.md
@@ -1,4 +1,71 @@
### General Looping
### General Looping: `foreachindex` / `foraxes`

General workhorses for converting normal Julia `for` loops into GPU code, for example:

```julia
# Copy kernel testing throughput
function cpu_copy!(dst, src)
for i in eachindex(src)
dst[i] = src[i]
end
end
```

Would be written for GPU as:

```julia
import AcceleratedKernels as AK

function gpu_copy!(dst, src)
AK.foreachindex(src) do i
dst[i] = src[i]
end
end
```

Yes, simply change `for i in eachindex(itr)` into `AK.foreachindex(itr) do i` to run it on GPUs / multithreaded - magic! (or just amazing language design)

This is a parallelised for-loop over the indices of an iterable; it converts normal Julia code to GPU kernels running one thread per index. On CPUs it executes static index ranges on `max_tasks` threads, with a user-defined minimum of `min_elems` elements processed by each thread; if only a single thread ends up being needed, the loop is inlined and executed without spawning threads.
- **Other names**: `Kokkos::parallel_for`, `RAJA::forall`, `thrust::transform`.


Example:
```julia
import AcceleratedKernels as AK

function f(a, b)
# Don't use global arrays inside a `foreachindex`; types must be known
@assert length(a) == length(b)
AK.foreachindex(a) do i
# Note that we don't have to explicitly pass b into the lambda
if b[i] > 0.5
a[i] = 1
else
a[i] = 0
end

# Showing arbitrary if conditions; can also be written as:
# @inbounds a[i] = b[i] > 0.5 ? 1 : 0
end
end

# Use any backend, e.g. CUDA, ROCm, oneAPI, Metal, or CPU
using oneAPI
v1 = oneArray{Float32}(undef, 100_000)
v2 = oneArray(rand(Float32, 100_000))
f(v1, v2)
```

All GPU functions allow you to specify a block size - this is often a power of two (mostly 64, 128, 256, 512); the optimum depends on the algorithm, input data and hardware - you can try the different values and `@time` or `@benchmark` them:
```julia
@time AK.foreachindex(f, itr_gpu, block_size=512)
```

Similarly, for performance on the CPU the overhead of spawning threads should be masked by processing more elements per thread (but there is no reason here to launch more threads than `Threads.nthreads()`, the number of threads Julia was started with); the optimum depends on how expensive `f` is - again, benchmarking is your friend:
```julia
@time AK.foreachindex(f, itr_cpu, max_tasks=16, min_elems=1000)
```


```@docs
AcceleratedKernels.foreachindex
8 changes: 1 addition & 7 deletions docs/src/api/map.md
@@ -1,12 +1,6 @@
### Map

```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.3. `map`") # hide
```

---

```@docs
AcceleratedKernels.map!
AcceleratedKernels.map
```
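As the README example was removed here, a short hedged sketch of the interface follows; the Base-like `map!(f, dst, src)` argument order is an assumption, not verified against this version:

```julia
import AcceleratedKernels as AK

v = collect(Float32, 1:5)

# Out-of-place: apply the function to each element in parallel
w = AK.map(x -> 2x, v)

# In-place variant writes into a preallocated destination
dst = similar(v)
AK.map!(x -> 2x, dst, v)

@assert w == dst == 2 .* v
```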
8 changes: 4 additions & 4 deletions docs/src/api/mapreduce.md
@@ -1,9 +1,9 @@
### MapReduce

```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.6. `mapreduce`") # hide
```
Equivalent to `reduce(op, map(f, iterable))`, without saving the intermediate mapped collection; can be used to e.g. split documents into words (map) and count the frequency thereof (reduce).
- **Other names**: `transform_reduce`, some `fold` implementations include the mapping function too.

**New in AcceleratedKernels 0.2.0: N-dimensional reductions via the `dims` keyword**

---

10 changes: 7 additions & 3 deletions docs/src/api/predicates.md
@@ -1,8 +1,12 @@
### Predicates

```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.9. `all` / `any`") # hide
Apply a predicate to check if all / any elements in a collection return true. This could be implemented as a reduction, but is better optimised by stopping the search once a `false` / `true` is found.
- **Other names**: not often implemented standalone on GPUs, typically included as part of a reduction.


```@docs
AcceleratedKernels.any
AcceleratedKernels.all
```

**Note on the `cooperative` keyword**: some older platforms crash when multiple threads write to the same memory location in a global array (e.g. old Intel Graphics). On other platforms this is well-defined as long as all threads write the same value - e.g. CUDA F4.2 says "If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined." This "cooperative" thread behaviour allows a faster implementation; if you have a platform that crashes - the only one I know of is Intel UHD Graphics - set `cooperative=false` to use a safer `mapreduce`-based implementation.
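A small CPU sketch of the predicates; the `cooperative` keyword is shown only in a comment, as it targets GPU backends:

```julia
import AcceleratedKernels as AK

v = collect(1:1000)

# Stop searching as soon as a matching / failing element is found
@assert AK.any(x -> x > 999, v) === true
@assert AK.all(x -> x > 0, v) === true

# On GPU platforms where cooperative writes crash, fall back to the
# mapreduce-based implementation:
# AK.any(x -> x > 999, v_gpu, cooperative=false)
```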
8 changes: 4 additions & 4 deletions docs/src/api/reduce.md
@@ -1,9 +1,9 @@
### Reductions

```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.5. `reduce`") # hide
```
Apply a custom binary operator reduction on all elements in an iterable; can be used to compute minima, sums, counts, etc.
- **Other names**: `Kokkos::parallel_reduce`, `fold`, `aggregate`.

**New in AcceleratedKernels 0.2.0: N-dimensional reductions via the `dims` keyword**
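A minimal CPU sketch of the interface; the `init` keyword and the `dims` behaviour are assumptions inferred from the 0.2.0 note above, not verified against this version:

```julia
import AcceleratedKernels as AK

m = reshape(collect(1:12), 3, 4)

# Whole-array reduction; `init` is the neutral element of the operator
@assert AK.reduce(+, m; init=0) == sum(m)

# New in 0.2.0: per-column sums via the `dims` keyword
colsums = AK.reduce(+, m; init=0, dims=1)
```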

---

64 changes: 60 additions & 4 deletions docs/src/api/sort.md
@@ -1,6 +1,62 @@
### `sort` and friends

```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.4. `sort` and friends") # hide
```
Sorting algorithms with a similar interface and default settings to the Julia Base ones, running on GPUs:
- `sort!` (in-place), `sort` (out-of-place)
- `sortperm!`, `sortperm`
- **Other names**: `sort`, `sort_team`, `sort_team_by_key`, `stable_sort` or variations in Kokkos, RAJA, Thrust that I know of.

Function signature:
```julia
sort!(v::AbstractGPUVector;
lt=isless, by=identity, rev::Bool=false, order::Base.Order.Ordering=Base.Order.Forward,
block_size::Int=256, temp::Union{Nothing, AbstractGPUVector}=nothing)

sortperm!(ix::AbstractGPUVector, v::AbstractGPUVector;
lt=isless, by=identity, rev::Bool=false, order::Base.Order.Ordering=Base.Order.Forward,
block_size::Int=256, temp::Union{Nothing, AbstractGPUVector}=nothing)
```

Specific implementations that the interfaces above forward to:
- `merge_sort!` (in-place), `merge_sort` (out-of-place) - sort arbitrary objects with custom comparisons.
- `merge_sort_by_key!`, `merge_sort_by_key` - sort a vector of keys along with a "payload", a vector of corresponding values.
- `merge_sortperm!`, `merge_sortperm`, `merge_sortperm_lowmem!`, `merge_sortperm_lowmem` - compute a sorting index permutation.

Function signature:
```julia
merge_sort!(v::AbstractGPUVector;
lt=(<), by=identity, rev::Bool=false, order::Ordering=Forward,
block_size::Int=256, temp::Union{Nothing, AbstractGPUVector}=nothing)

merge_sort_by_key!(keys::AbstractGPUVector, values::AbstractGPUVector;
lt=(<), by=identity, rev::Bool=false, order::Ordering=Forward,
block_size::Int=256,
temp_keys::Union{Nothing, AbstractGPUVector}=nothing,
temp_values::Union{Nothing, AbstractGPUVector}=nothing)

merge_sortperm!(ix::AbstractGPUVector, v::AbstractGPUVector;
lt=(<), by=identity, rev::Bool=false, order::Ordering=Forward,
inplace::Bool=false, block_size::Int=256,
temp_ix::Union{Nothing, AbstractGPUVector}=nothing,
temp_v::Union{Nothing, AbstractGPUVector}=nothing)

merge_sortperm_lowmem!(ix::AbstractGPUVector, v::AbstractGPUVector;
lt=(<), by=identity, rev::Bool=false, order::Ordering=Forward,
block_size::Int=256,
temp::Union{Nothing, AbstractGPUVector}=nothing)
```

Example:
```julia
import AcceleratedKernels as AK
using AMDGPU

v = ROCArray(rand(Int32, 100_000))
AK.sort!(v)
```

As GPU memory is more expensive, all functions in AcceleratedKernels.jl expose any temporary arrays they will use via the `temp` argument; you can supply your own buffers so the algorithms do not allocate additional GPU storage, e.g.:
```julia
v = ROCArray(rand(Float32, 100_000))
temp = similar(v)
AK.sort!(v, temp=temp)
```
38 changes: 34 additions & 4 deletions docs/src/api/using_backends.md
@@ -1,5 +1,35 @@
### Using Different Backends
```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.1. Using Different Backends") # hide
```

For any of the examples here, simply use a different GPU array and AcceleratedKernels.jl will pick the right backend:
```julia
# Intel Graphics
using oneAPI
v = oneArray{Int32}(undef, 100_000) # Empty array

# AMD ROCm
using AMDGPU
v = ROCArray{Float64}(1:100_000) # A range converted to Float64

# Apple Metal
using Metal
v = MtlArray(rand(Float32, 100_000)) # Transfer from host to device

# NVIDIA CUDA
using CUDA
v = CuArray{UInt32}(0:5:100_000) # Range with explicit step size

# Transfer GPU array back
v_host = Array(v)
```

All publicly-exposed functions have CPU implementations with unified parameter interfaces:

```julia
import AcceleratedKernels as AK
v = Vector(-1000:1000) # Normal CPU array
AK.reduce(+, v, max_tasks=Threads.nthreads())
```

Note that the `reduce` and `mapreduce` CPU implementations forward arguments to [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl), an excellent package for multithreading. The focus of AcceleratedKernels.jl is to provide a unified interface to high-performance implementations of common algorithmic kernels, for both CPUs and GPUs - if you need fine-grained control over threads, scheduling, or communication for specialised algorithms (e.g. with highly unequal workloads), consider using [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl) or [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl) directly.

There is ongoing work on multithreaded CPU `sort` and `accumulate` implementations - at the moment, they fall back to single-threaded algorithms; the rest of the library is fully parallelised for both CPUs and GPUs.
15 changes: 4 additions & 11 deletions prototype/truth_test.jl
@@ -6,15 +6,16 @@ using PProf

using KernelAbstractions
# using oneAPI
using CUDA
# using CUDA
using Metal

import AcceleratedKernels as AK


Random.seed!(0)


v = CuArray(1:100)
v = MtlArray(1:100)

@assert AK.any(x->x<0, v, cooperative=false) === false
@assert AK.any(x->x>99, v, cooperative=false) === true
@@ -35,18 +36,10 @@ println("simple all tests passed")



v = CuArray(1:10_000_000)
v = Array(1:10_000_000)

println("AcceleratedKernels any (reduce based):")
display(@benchmark(AK.any(x->x>9_999, v, cooperative=false)))

println("AcceleratedKernels any (coop based):")
display(@benchmark(AK.any(x->x>9_999, v, cooperative=true)))

println("Base any:")
display(@benchmark(any(x->x>9_999, v)))

println("CPU any:")
vh = Array(v)
display(@benchmark(any(x->x>9_999, vh)))

2 changes: 1 addition & 1 deletion src/AcceleratedKernels.jl
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
module AcceleratedKernels


# No functionality exported by this package before discussion with community
# No functions exported by default, to allow having the same names as Base without conflicts


# Internal dependencies