Included parallelised implementation of any/all for CPUs, plus cooperative GPU tests. Moved all README examples into Manual
Showing 17 changed files with 508 additions and 553 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,7 @@ | ||
### Accumulate / Prefix Sum / Scan | ||
|
||
```@example | ||
import AcceleratedKernels as AK # hide | ||
AK.DocHelpers.readme_section("### 5.7. `accumulate`") # hide | ||
``` | ||
|
||
```@docs | ||
AcceleratedKernels.accumulate! | ||
AcceleratedKernels.accumulate | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,53 @@ | ||
### Binary Search | ||
|
||
```@example | ||
import AcceleratedKernels as AK # hide | ||
AK.DocHelpers.readme_section("### 5.8. `searchsorted` and friends") # hide | ||
Find the indices where some elements `x` should be inserted into a sorted sequence `v` to maintain the sorted order. This effectively applies the Julia Base functions in parallel on a GPU using `foreachindex`. | ||
- `searchsortedfirst!` (in-place), `searchsortedfirst` (allocating): index of first element in `v` >= `x[j]`. | ||
- `searchsortedlast!`, `searchsortedlast`: index of last element in `v` <= `x[j]`. | ||
- **Other names**: `thrust::upper_bound`, `std::lower_bound`. | ||
|
||
Function signature: | ||
```julia | ||
# GPU | ||
searchsortedfirst!(ix::AbstractGPUVector, v::AbstractGPUVector, x::AbstractGPUVector; | ||
by=identity, lt=(<), rev::Bool=false, | ||
block_size::Int=256) | ||
searchsortedfirst(v::AbstractGPUVector, x::AbstractGPUVector; | ||
by=identity, lt=(<), rev::Bool=false, | ||
block_size::Int=256) | ||
searchsortedlast!(ix::AbstractGPUVector, v::AbstractGPUVector, x::AbstractGPUVector; | ||
by=identity, lt=(<), rev::Bool=false, | ||
block_size::Int=256) | ||
searchsortedlast(v::AbstractGPUVector, x::AbstractGPUVector; | ||
by=identity, lt=(<), rev::Bool=false, | ||
block_size::Int=256) | ||
|
||
# CPU | ||
searchsortedfirst!(ix::AbstractVector, v::AbstractVector, x::AbstractVector; | ||
by=identity, lt=(<), rev::Bool=false, | ||
max_tasks::Int=Threads.nthreads(), min_elems::Int=1000) | ||
searchsortedfirst(v::AbstractVector, x::AbstractVector; | ||
by=identity, lt=(<), rev::Bool=false, | ||
max_tasks::Int=Threads.nthreads(), min_elems::Int=1000) | ||
searchsortedlast!(ix::AbstractVector, v::AbstractVector, x::AbstractVector; | ||
by=identity, lt=(<), rev::Bool=false, | ||
max_tasks::Int=Threads.nthreads(), min_elems::Int=1000) | ||
searchsortedlast(v::AbstractVector, x::AbstractVector; | ||
by=identity, lt=(<), rev::Bool=false, | ||
max_tasks::Int=Threads.nthreads(), min_elems::Int=1000) | ||
``` | ||
|
||
Example: | ||
```julia | ||
import AcceleratedKernels as AK | ||
using Metal | ||
|
||
# Sorted array | ||
v = MtlArray(rand(Float32, 100_000)) | ||
AK.merge_sort!(v) | ||
|
||
# Elements `x` to place within `v` at indices `ix` | ||
x = MtlArray(rand(Float32, 10_000)) | ||
ix = MtlArray{Int}(undef, 10_000) | ||
|
||
AK.searchsortedfirst!(ix, v, x) | ||
``` |
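
As a small illustration of the index semantics described above, the sketch below uses only the Julia Base functions on the CPU, applied element by element. Going by the description, the `AK` functions should return the same indices for the same inputs; the arrays `v` and `x` here are made-up sample data.

```julia
# Sorted sequence and the elements to place within it (sample data)
v = [1, 3, 3, 5, 7]
x = [0, 3, 4, 8]

# Index of the first element in `v` that is >= x[j];
# length(v) + 1 if every element of `v` is smaller
first_ix = [searchsortedfirst(v, xj) for xj in x]

# Index of the last element in `v` that is <= x[j];
# 0 if every element of `v` is larger
last_ix = [searchsortedlast(v, xj) for xj in x]

println(first_ix)   # [1, 2, 4, 6]
println(last_ix)    # [0, 3, 3, 5]
```

Inserting `x[j]` at `first_ix[j]` places it before any equal elements, while inserting it at `last_ix[j] + 1` places it after them.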
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,6 @@ | ||
### Map | ||
|
||
```@example | ||
import AcceleratedKernels as AK # hide | ||
AK.DocHelpers.readme_section("### 5.3. `map`") # hide | ||
``` | ||
|
||
--- | ||
|
||
```@docs | ||
AcceleratedKernels.map! | ||
AcceleratedKernels.map | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,8 +1,12 @@ | ||
### Predicates | ||
|
||
```@example | ||
import AcceleratedKernels as AK # hide | ||
AK.DocHelpers.readme_section("### 5.9. `all` / `any`") # hide | ||
Apply a predicate to check if all / any elements in a collection return true. This could be implemented as a reduction, but it is better optimised by stopping the search once a false / true is found. | ||
- **Other names**: not often implemented standalone on GPUs, typically included as part of a reduction. | ||
|
||
|
||
```@docs | ||
AcceleratedKernels.any | ||
AcceleratedKernels.all | ||
``` | ||
|
||
**Note on the `cooperative` keyword**: some older platforms (e.g. old Intel Graphics) crash when multiple threads write to the same memory location in a global array; on others this is well-defined as long as all threads write the same value (e.g. CUDA F4.2 says "If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined."). This "cooperative" thread behaviour allows for a faster implementation; if you have a platform that crashes (the only one I know of is Intel UHD Graphics), set `cooperative=false` to use a safer `mapreduce`-based implementation. |
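
To make the two strategies concrete, here is a minimal CPU-only sketch in plain Julia. It is not the library's implementation: the function names `any_reduce` and `any_cooperative` are hypothetical, and the shared flag uses `Threads.Atomic` as a safe CPU stand-in for the plain global-memory write a GPU kernel would do.

```julia
# Reduction-based variant: always evaluates the predicate on every element
any_reduce(pred, v) = mapreduce(pred, |, v; init=false)

# "Cooperative" variant (hypothetical CPU analogue): every task that finds a
# match writes the same value `true` to one shared flag, and the other tasks
# skip their remaining work once the flag is set
function any_cooperative(pred, v)
    found = Threads.Atomic{Bool}(false)
    Threads.@threads for i in eachindex(v)
        found[] && continue          # another task already found a match
        if pred(v[i])
            found[] = true           # all writers store the same value
        end
    end
    return found[]
end

v = collect(1:10)
println(any_reduce(x -> x > 5, v))        # true
println(any_cooperative(x -> x > 5, v))   # true
println(any_cooperative(x -> x > 10, v))  # false
```

Because every writer stores the same value, the result does not depend on which write lands; that is the property the note above relies on.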
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,62 @@ | ||
### `sort` and friends | ||
|
||
```@example | ||
import AcceleratedKernels as AK # hide | ||
AK.DocHelpers.readme_section("### 5.4. `sort` and friends") # hide | ||
``` | ||
Sorting algorithms with a similar interface and default settings to the Julia Base ones, on GPUs: | ||
- `sort!` (in-place), `sort` (out-of-place) | ||
- `sortperm!`, `sortperm` | ||
- **Other names**: `sort`, `sort_team`, `sort_team_by_key`, `stable_sort` or variations in Kokkos, RAJA, Thrust that I know of. | ||
|
||
Function signature: | ||
```julia | ||
sort!(v::AbstractGPUVector; | ||
lt=isless, by=identity, rev::Bool=false, order::Base.Order.Ordering=Base.Order.Forward, | ||
block_size::Int=256, temp::Union{Nothing, AbstractGPUVector}=nothing) | ||
|
||
sortperm!(ix::AbstractGPUVector, v::AbstractGPUVector; | ||
lt=isless, by=identity, rev::Bool=false, order::Base.Order.Ordering=Base.Order.Forward, | ||
block_size::Int=256, temp::Union{Nothing, AbstractGPUVector}=nothing) | ||
``` | ||
|
||
Specific implementations that the interfaces above forward to: | ||
- `merge_sort!` (in-place), `merge_sort` (out-of-place) - sort arbitrary objects with custom comparisons. | ||
- `merge_sort_by_key!`, `merge_sort_by_key` - sort a vector of keys along with a "payload", a vector of corresponding values. | ||
- `merge_sortperm!`, `merge_sortperm`, `merge_sortperm_lowmem!`, `merge_sortperm_lowmem` - compute a sorting index permutation. | ||
|
||
Function signature: | ||
```julia | ||
merge_sort!(v::AbstractGPUVector; | ||
lt=(<), by=identity, rev::Bool=false, order::Ordering=Forward, | ||
block_size::Int=256, temp::Union{Nothing, AbstractGPUVector}=nothing) | ||
|
||
merge_sort_by_key!(keys::AbstractGPUVector, values::AbstractGPUVector; | ||
lt=(<), by=identity, rev::Bool=false, order::Ordering=Forward, | ||
block_size::Int=256, | ||
temp_keys::Union{Nothing, AbstractGPUVector}=nothing, | ||
temp_values::Union{Nothing, AbstractGPUVector}=nothing) | ||
|
||
merge_sortperm!(ix::AbstractGPUVector, v::AbstractGPUVector; | ||
lt=(<), by=identity, rev::Bool=false, order::Ordering=Forward, | ||
inplace::Bool=false, block_size::Int=256, | ||
temp_ix::Union{Nothing, AbstractGPUVector}=nothing, | ||
temp_v::Union{Nothing, AbstractGPUVector}=nothing) | ||
|
||
merge_sortperm_lowmem!(ix::AbstractGPUVector, v::AbstractGPUVector; | ||
lt=(<), by=identity, rev::Bool=false, order::Ordering=Forward, | ||
block_size::Int=256, | ||
temp::Union{Nothing, AbstractGPUVector}=nothing) | ||
``` | ||
|
||
Example: | ||
```julia | ||
import AcceleratedKernels as AK | ||
using AMDGPU | ||
|
||
v = ROCArray(rand(Int32, 100_000)) | ||
AK.sort!(v) | ||
``` | ||
|
||
As GPU memory is more expensive, all functions in AcceleratedKernels.jl expose any temporary arrays they will use (the `temp` argument); you can supply your own buffers to make the algorithms not allocate additional GPU storage, e.g.: | ||
```julia | ||
v = ROCArray(rand(Float32, 100_000)) | ||
temp = similar(v) | ||
AK.sort!(v, temp=temp) | ||
``` |
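
For readers unfamiliar with the "by key" and "sortperm" variants listed above, the following sketch shows the results they are meant to produce, using only Julia Base functions on the CPU. The variable names and data are illustrative; the `merge_sort_by_key!` and `merge_sortperm!` calls themselves are not run here.

```julia
# Keys with a corresponding "payload" of values (sample data)
ks = [3, 1, 2]
vals = ["c", "a", "b"]

# A sorting index permutation: ks[perm] is sorted
perm = sortperm(ks)

# Sorting by key reorders the keys and the payload together
sorted_ks = ks[perm]
sorted_vals = vals[perm]

println(perm)          # [2, 3, 1]
println(sorted_ks)     # [1, 2, 3]
println(sorted_vals)   # ["a", "b", "c"]
```

The in-place variants would overwrite the key and value vectors directly instead of building `sorted_ks` and `sorted_vals`.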
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,35 @@ | ||
### Using Different Backends | ||
```@example | ||
import AcceleratedKernels as AK # hide | ||
AK.DocHelpers.readme_section("### 5.1. Using Different Backends") # hide | ||
``` | ||
|
||
For any of the examples here, simply use a different GPU array and AcceleratedKernels.jl will pick the right backend: | ||
```julia | ||
# Intel Graphics | ||
using oneAPI | ||
v = oneArray{Int32}(undef, 100_000) # Uninitialised array | ||
|
||
# AMD ROCm | ||
using AMDGPU | ||
v = ROCArray{Float64}(1:100_000) # A range converted to Float64 | ||
|
||
# Apple Metal | ||
using Metal | ||
v = MtlArray(rand(Float32, 100_000)) # Transfer from host to device | ||
|
||
# NVIDIA CUDA | ||
using CUDA | ||
v = CuArray{UInt32}(0:5:100_000) # Range with explicit step size | ||
|
||
# Transfer GPU array back | ||
v_host = Array(v) | ||
``` | ||
|
||
All publicly-exposed functions have CPU implementations with unified parameter interfaces: | ||
|
||
```julia | ||
import AcceleratedKernels as AK | ||
v = Vector(-1000:1000) # Normal CPU array | ||
AK.reduce(+, v, max_tasks=Threads.nthreads()) | ||
``` | ||
|
||
Note that the `reduce` and `mapreduce` CPU implementations forward arguments to [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl), an excellent package for multithreading. The focus of AcceleratedKernels.jl is to provide a unified interface to high-performance implementations of common algorithmic kernels, for both CPUs and GPUs - if you need fine-grained control over threads, scheduling, communication for specialised algorithms (e.g. with highly unequal workloads), consider using [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl) or [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl) directly. | ||
|
||
There is ongoing work on multithreaded CPU `sort` and `accumulate` implementations - at the moment, they fall back to single-threaded algorithms; the rest of the library is fully parallelised for both CPUs and GPUs. |
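
The CPU keywords `max_tasks` and `min_elems` seen in the signatures can be read as "use at most this many tasks, and only if each gets at least this many elements". The sketch below is one possible interpretation in plain Julia, not the library's code (which, as noted above, forwards to OhMyThreads.jl); the function name `chunked_reduce` is hypothetical, and `init` is assumed to be a neutral element for `op` because it is applied once per chunk.

```julia
# Hypothetical chunked reduction illustrating `max_tasks` / `min_elems`
function chunked_reduce(op, v; init, max_tasks=Threads.nthreads(), min_elems=1000)
    # Use fewer tasks when the input is too small to give each `min_elems` items
    num_tasks = clamp(length(v) ÷ min_elems, 1, max_tasks)
    num_tasks == 1 && return reduce(op, v; init=init)

    # Split the input into contiguous chunks, one per task
    chunk_len = cld(length(v), num_tasks)
    chunks = Iterators.partition(v, chunk_len)

    # Reduce each chunk on its own task, then combine the partial results
    tasks = [Threads.@spawn reduce(op, c; init=init) for c in chunks]
    return reduce(op, fetch.(tasks); init=init)
end

v = Vector(-1000:1000)
println(chunked_reduce(+, v; init=0))   # 0
```

With the defaults, a 2001-element vector is split across at most two tasks, while anything shorter than 2000 elements is reduced on the calling task.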