Commit b1a8353

Included parallelised implementation of any/all for CPUs, plus cooperative GPU tests. Moved all README examples into Manual
anicusan committed Dec 12, 2024
1 parent 99247e6 commit b1a8353
Showing 17 changed files with 508 additions and 553 deletions.
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "AcceleratedKernels"
uuid = "6a4ca0a5-0e36-4168-a932-d9be78d558f1"
authors = ["Andrei-Leonard Nicusan <leonard@evophase.co.uk> and contributors"]
version = "0.2.1"
version = "0.2.2"

[deps]
ArgCheck = "dce04be8-c92d-5529-be00-80e4d2c0e197"
492 changes: 16 additions & 476 deletions README.md

Large diffs are not rendered by default.

9 changes: 5 additions & 4 deletions docs/src/api/accumulate.md
@@ -1,6 +1,7 @@
### Accumulate / Prefix Sum / Scan

```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.7. `accumulate`") # hide
```

```@docs
AcceleratedKernels.accumulate!
AcceleratedKernels.accumulate
```
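Since the README example was moved out in this commit, a minimal CPU sketch of the interface may help; the exact signature and `init` keyword are assumptions based on the GPU interfaces elsewhere in these docs, not verified against this version:

```julia
import AcceleratedKernels as AK

v = collect(1:10)

# Inclusive prefix sum; `init` is the neutral element of the operator
s = AK.accumulate(+, v; init=0)

@assert s == cumsum(v)
```

The same call works with a GPU array (e.g. a `MtlArray` or `CuArray`) in place of `v`.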
53 changes: 50 additions & 3 deletions docs/src/api/binarysearch.md
@@ -1,6 +1,53 @@
### Binary Search

```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.8. `searchsorted` and friends") # hide
Find the indices where elements `x` should be inserted into a sorted sequence `v` to maintain the sorted order. Effectively, this applies the Julia Base functions in parallel on a GPU using `foreachindex`.
- `searchsortedfirst!` (in-place), `searchsortedfirst` (allocating): index of first element in `v` >= `x[j]`.
- `searchsortedlast!`, `searchsortedlast`: index of last element in `v` <= `x[j]`.
- **Other names**: `thrust::lower_bound` / `thrust::upper_bound`, `std::lower_bound` / `std::upper_bound`.

Function signature:
```julia
# GPU
searchsortedfirst!(ix::AbstractGPUVector, v::AbstractGPUVector, x::AbstractGPUVector;
by=identity, lt=(<), rev::Bool=false,
block_size::Int=256)
searchsortedfirst(v::AbstractGPUVector, x::AbstractGPUVector;
by=identity, lt=(<), rev::Bool=false,
block_size::Int=256)
searchsortedlast!(ix::AbstractGPUVector, v::AbstractGPUVector, x::AbstractGPUVector;
by=identity, lt=(<), rev::Bool=false,
block_size::Int=256)
searchsortedlast(v::AbstractGPUVector, x::AbstractGPUVector;
by=identity, lt=(<), rev::Bool=false,
block_size::Int=256)

# CPU
searchsortedfirst!(ix::AbstractVector, v::AbstractVector, x::AbstractVector;
by=identity, lt=(<), rev::Bool=false,
max_tasks::Int=Threads.nthreads(), min_elems::Int=1000)
searchsortedfirst(v::AbstractVector, x::AbstractVector;
by=identity, lt=(<), rev::Bool=false,
max_tasks::Int=Threads.nthreads(), min_elems::Int=1000)
searchsortedlast!(ix::AbstractVector, v::AbstractVector, x::AbstractVector;
by=identity, lt=(<), rev::Bool=false,
max_tasks::Int=Threads.nthreads(), min_elems::Int=1000)
searchsortedlast(v::AbstractVector, x::AbstractVector;
by=identity, lt=(<), rev::Bool=false,
max_tasks::Int=Threads.nthreads(), min_elems::Int=1000)
```

Example:
```julia
import AcceleratedKernels as AK
using Metal

# Sorted array
v = MtlArray(rand(Float32, 100_000))
AK.merge_sort!(v)

# Elements `x` to place within `v` at indices `ix`
x = MtlArray(rand(Float32, 10_000))
ix = MtlArray{Int}(undef, 10_000)

AK.searchsortedfirst!(ix, v, x)
```
69 changes: 68 additions & 1 deletion docs/src/api/foreachindex.md
@@ -1,4 +1,71 @@
### General Looping
### General Looping: `foreachindex` / `foraxes`

General workhorses for converting normal Julia `for` loops into GPU code, for example:

```julia
# Copy kernel testing throughput
function cpu_copy!(dst, src)
for i in eachindex(src)
dst[i] = src[i]
end
end
```

Would be written for GPU as:

```julia
import AcceleratedKernels as AK

function gpu_copy!(dst, src)
AK.foreachindex(src) do i
dst[i] = src[i]
end
end
```

Yes, simply change `for i in eachindex(itr)` into `AK.foreachindex(itr) do i` to run it on GPUs / multithreaded - magic! (or just amazing language design)

This is a parallelised for-loop over the indices of an iterable; it converts normal Julia code to GPU kernels running one thread per index. On CPUs it executes static index ranges on `max_tasks` threads, with a user-defined minimum of `min_elems` elements processed by each thread; if only a single thread ends up being needed, the loop is inlined and executed without spawning threads.
- **Other names**: `Kokkos::parallel_for`, `RAJA::forall`, `thrust::transform`.


Example:
```julia
import AcceleratedKernels as AK

function f(a, b)
# Don't use global arrays inside a `foreachindex`; types must be known
@assert length(a) == length(b)
AK.foreachindex(a) do i
# Note that we don't have to explicitly pass b into the lambda
if b[i] > 0.5
a[i] = 1
else
a[i] = 0
end

# Showing arbitrary if conditions; can also be written as:
# @inbounds a[i] = b[i] > 0.5 ? 1 : 0
end
end

# Use any backend, e.g. CUDA, ROCm, oneAPI, Metal, or CPU
using oneAPI
v1 = oneArray{Float32}(undef, 100_000)
v2 = oneArray(rand(Float32, 100_000))
f(v1, v2)
```

All GPU functions allow you to specify a block size - this is often a power of two (mostly 64, 128, 256, 512); the optimum depends on the algorithm, input data and hardware - you can try the different values and `@time` or `@benchmark` them:
```julia
@time AK.foreachindex(f, itr_gpu, block_size=512)
```

Similarly, for performance on the CPU the overhead of spawning threads should be masked by processing more elements per thread (but there is no reason here to launch more threads than `Threads.nthreads()`, the number of threads Julia was started with); the optimum depends on how expensive `f` is - again, benchmarking is your friend:
```julia
@time AK.foreachindex(f, itr_cpu, max_tasks=16, min_elems=1000)
```


```@docs
AcceleratedKernels.foreachindex
8 changes: 1 addition & 7 deletions docs/src/api/map.md
@@ -1,12 +1,6 @@
### Map

```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.3. `map`") # hide
```

---

```@docs
AcceleratedKernels.map!
AcceleratedKernels.map
```
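As the README example was removed here, a short hedged sketch of the interface follows; the Base-like `map!(f, dst, src)` argument order is an assumption, not verified against this version:

```julia
import AcceleratedKernels as AK

v = collect(Float32, 1:5)

# Out-of-place: apply the function to each element in parallel
w = AK.map(x -> 2x, v)

# In-place variant writes into a preallocated destination
dst = similar(v)
AK.map!(x -> 2x, dst, v)

@assert w == dst == 2 .* v
```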
8 changes: 4 additions & 4 deletions docs/src/api/mapreduce.md
@@ -1,9 +1,9 @@
### MapReduce

```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.6. `mapreduce`") # hide
```
Equivalent to `reduce(op, map(f, iterable))`, without saving the intermediate mapped collection; can be used to e.g. split documents into words (map) and count the frequency thereof (reduce).
- **Other names**: `transform_reduce`, some `fold` implementations include the mapping function too.

**New in AcceleratedKernels 0.2.0: N-dimensional reductions via the `dims` keyword**

---

10 changes: 7 additions & 3 deletions docs/src/api/predicates.md
@@ -1,8 +1,12 @@
### Predicates

```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.9. `all` / `any`") # hide
Apply a predicate to check if all / any elements in a collection return true. This could be implemented as a reduction, but is better optimised by stopping the search once a `false` / `true` is found.
- **Other names**: not often implemented standalone on GPUs, typically included as part of a reduction.


```@docs
AcceleratedKernels.any
AcceleratedKernels.all
```

**Note on the `cooperative` keyword**: some older platforms crash when multiple threads write to the same memory location in a global array (e.g. old Intel Graphics). On other platforms this is well-defined as long as all threads write the same value - e.g. CUDA F4.2 says "If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined." This "cooperative" thread behaviour allows a faster implementation; if you have a platform that crashes - the only one I know of is Intel UHD Graphics - set `cooperative=false` to use a safer `mapreduce`-based implementation.
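A small CPU sketch of the predicates; the `cooperative` keyword is shown only in a comment, as it targets GPU backends:

```julia
import AcceleratedKernels as AK

v = collect(1:1000)

# Stop searching as soon as a matching / failing element is found
@assert AK.any(x -> x > 999, v) === true
@assert AK.all(x -> x > 0, v) === true

# On GPU platforms where cooperative writes crash, fall back to the
# mapreduce-based implementation:
# AK.any(x -> x > 999, v_gpu, cooperative=false)
```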
8 changes: 4 additions & 4 deletions docs/src/api/reduce.md
@@ -1,9 +1,9 @@
### Reductions

```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.5. `reduce`") # hide
```
Apply a custom binary operator reduction on all elements in an iterable; can be used to compute minima, sums, counts, etc.
- **Other names**: `Kokkos::parallel_reduce`, `fold`, `aggregate`.

**New in AcceleratedKernels 0.2.0: N-dimensional reductions via the `dims` keyword**
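A minimal CPU sketch of the interface; the `init` keyword and the `dims` behaviour are assumptions inferred from the 0.2.0 note above, not verified against this version:

```julia
import AcceleratedKernels as AK

m = reshape(collect(1:12), 3, 4)

# Whole-array reduction; `init` is the neutral element of the operator
@assert AK.reduce(+, m; init=0) == sum(m)

# New in 0.2.0: per-column sums via the `dims` keyword
colsums = AK.reduce(+, m; init=0, dims=1)
```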

---

64 changes: 60 additions & 4 deletions docs/src/api/sort.md
@@ -1,6 +1,62 @@
### `sort` and friends

```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.4. `sort` and friends") # hide
```
Sorting algorithms with a similar interface and default settings to the Julia Base ones, running on GPUs:
- `sort!` (in-place), `sort` (out-of-place)
- `sortperm!`, `sortperm`
- **Other names**: `sort`, `sort_team`, `sort_team_by_key`, `stable_sort` or variations in Kokkos, RAJA, Thrust that I know of.

Function signature:
```julia
sort!(v::AbstractGPUVector;
lt=isless, by=identity, rev::Bool=false, order::Base.Order.Ordering=Base.Order.Forward,
block_size::Int=256, temp::Union{Nothing, AbstractGPUVector}=nothing)

sortperm!(ix::AbstractGPUVector, v::AbstractGPUVector;
lt=isless, by=identity, rev::Bool=false, order::Base.Order.Ordering=Base.Order.Forward,
block_size::Int=256, temp::Union{Nothing, AbstractGPUVector}=nothing)
```

Specific implementations that the interfaces above forward to:
- `merge_sort!` (in-place), `merge_sort` (out-of-place) - sort arbitrary objects with custom comparisons.
- `merge_sort_by_key!`, `merge_sort_by_key` - sort a vector of keys along with a "payload", a vector of corresponding values.
- `merge_sortperm!`, `merge_sortperm`, `merge_sortperm_lowmem!`, `merge_sortperm_lowmem` - compute a sorting index permutation.

Function signature:
```julia
merge_sort!(v::AbstractGPUVector;
lt=(<), by=identity, rev::Bool=false, order::Ordering=Forward,
block_size::Int=256, temp::Union{Nothing, AbstractGPUVector}=nothing)

merge_sort_by_key!(keys::AbstractGPUVector, values::AbstractGPUVector;
lt=(<), by=identity, rev::Bool=false, order::Ordering=Forward,
block_size::Int=256,
temp_keys::Union{Nothing, AbstractGPUVector}=nothing,
temp_values::Union{Nothing, AbstractGPUVector}=nothing)

merge_sortperm!(ix::AbstractGPUVector, v::AbstractGPUVector;
lt=(<), by=identity, rev::Bool=false, order::Ordering=Forward,
inplace::Bool=false, block_size::Int=256,
temp_ix::Union{Nothing, AbstractGPUVector}=nothing,
temp_v::Union{Nothing, AbstractGPUVector}=nothing)

merge_sortperm_lowmem!(ix::AbstractGPUVector, v::AbstractGPUVector;
lt=(<), by=identity, rev::Bool=false, order::Ordering=Forward,
block_size::Int=256,
temp::Union{Nothing, AbstractGPUVector}=nothing)
```

Example:
```julia
import AcceleratedKernels as AK
using AMDGPU

v = ROCArray(rand(Int32, 100_000))
AK.sort!(v)
```

As GPU memory is more expensive, all functions in AcceleratedKernels.jl expose any temporary arrays they will use via the `temp` argument; you can supply your own buffers so the algorithms do not allocate additional GPU storage, e.g.:
```julia
v = ROCArray(rand(Float32, 100_000))
temp = similar(v)
AK.sort!(v, temp=temp)
```
38 changes: 34 additions & 4 deletions docs/src/api/using_backends.md
@@ -1,5 +1,35 @@
### Using Different Backends
```@example
import AcceleratedKernels as AK # hide
AK.DocHelpers.readme_section("### 5.1. Using Different Backends") # hide
```

For any of the examples here, simply use a different GPU array and AcceleratedKernels.jl will pick the right backend:
```julia
# Intel Graphics
using oneAPI
v = oneArray{Int32}(undef, 100_000) # Empty array

# AMD ROCm
using AMDGPU
v = ROCArray{Float64}(1:100_000) # A range converted to Float64

# Apple Metal
using Metal
v = MtlArray(rand(Float32, 100_000)) # Transfer from host to device

# NVIDIA CUDA
using CUDA
v = CuArray{UInt32}(0:5:100_000) # Range with explicit step size

# Transfer GPU array back
v_host = Array(v)
```

All publicly-exposed functions have CPU implementations with unified parameter interfaces:

```julia
import AcceleratedKernels as AK
v = Vector(-1000:1000) # Normal CPU array
AK.reduce(+, v, max_tasks=Threads.nthreads())
```

Note that the `reduce` and `mapreduce` CPU implementations forward arguments to [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl), an excellent package for multithreading. The focus of AcceleratedKernels.jl is to provide a unified interface to high-performance implementations of common algorithmic kernels, for both CPUs and GPUs - if you need fine-grained control over threads, scheduling, or communication for specialised algorithms (e.g. with highly unequal workloads), consider using [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl) or [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl) directly.

There is ongoing work on multithreaded CPU `sort` and `accumulate` implementations - at the moment, they fall back to single-threaded algorithms; the rest of the library is fully parallelised for both CPUs and GPUs.
15 changes: 4 additions & 11 deletions prototype/truth_test.jl
@@ -6,15 +6,16 @@ using PProf

using KernelAbstractions
# using oneAPI
using CUDA
# using CUDA
using Metal

import AcceleratedKernels as AK


Random.seed!(0)


v = CuArray(1:100)
v = MtlArray(1:100)

@assert AK.any(x->x<0, v, cooperative=false) === false
@assert AK.any(x->x>99, v, cooperative=false) === true
@@ -35,18 +36,10 @@ println("simple all tests passed")



v = CuArray(1:10_000_000)
v = Array(1:10_000_000)

println("AcceleratedKernels any (reduce based):")
display(@benchmark(AK.any(x->x>9_999, v, cooperative=false)))

println("AcceleratedKernels any (coop based):")
display(@benchmark(AK.any(x->x>9_999, v, cooperative=true)))

println("Base any:")
display(@benchmark(any(x->x>9_999, v)))

println("CPU any:")
vh = Array(v)
display(@benchmark(any(x->x>9_999, vh)))

2 changes: 1 addition & 1 deletion src/AcceleratedKernels.jl
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
module AcceleratedKernels


# No functionality exported by this package before discussion with community
# No functions exported by default, to allow having the same names as Base without conflicts


# Internal dependencies