Brackets wrapper for Kokkos (or "brak" for short) is a header-only library that provides a wrapper class to access Kokkos views with the bracket syntax of plain old data C arrays.
This library is especially useful if you want to start porting code to Kokkos by updating the data structures first, while keeping the loops untouched.
The recommended way to use the library is through CMake.
Get the library in your project:
git clone https://github.com/cexa-project/brak.git path/to/brak
In your main CMake file:
add_subdirectory(path/to/brak)
target_link_libraries(
my-lib
PRIVATE
Brak::brak
)
In your main CMake file:
include(FetchContent)
FetchContent_Declare(
brak
GIT_REPOSITORY https://github.com/CExA-project/brak.git
GIT_TAG master
)
FetchContent_MakeAvailable(brak)
target_link_libraries(
my-lib
PRIVATE
Brak::brak
)
Get, then install the project:
git clone https://github.com/cexa-project/brak.git
cd brak
cmake -B build -DCMAKE_INSTALL_PREFIX=path/to/install -DCMAKE_BUILD_TYPE=Release # other Kokkos options here if needed
cmake --install build
In your main CMake file:
find_package(Brak REQUIRED)
target_link_libraries(
my-lib
PRIVATE
Brak::brak
)
Alternatively, you can copy include/brak into your project and start using it.
You can build the tests with the CMake option BRAK_ENABLE_TESTS, and run them with ctest.
If no GPU is available when compiling with a GPU backend enabled, you have to disable the CMake option BRAK_ENABLE_GTEST_DISCOVER_TESTS.
You can build the examples with the CMake option BRAK_ENABLE_EXAMPLES.
They should be run individually.
Benchmarks are built with the CMake option BRAK_ENABLE_BENCHMARKS.
They should be run individually.
The API documentation is generated by Doxygen (1.9.1 or newer) and is enabled with the CMake option BRAK_ENABLE_DOCUMENTATION.
The private API is not included by default and is added with the option BRAK_ENABLE_DOCUMENTATION_DEVMODE.
The documentation is built with the target docs.
The library lets you wrap a Kokkos view and use it like a plain old data C array. If the number of pairs of brackets equals the rank of the view, the resulting object is a scalar:
#include <cassert>
#include <Kokkos_Core.hpp>
#include "brak/wrapper_subview.hpp"
// or
#include "brak/wrapper_array.hpp"
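void doSomething() {
    Kokkos::View<int ********, Kokkos::HostSpace> data{"data", 2, 2, 2, 2, 2, 2, 2, 2};
    // Pick one of the two wrappers (declaring both under the same name would not compile)
    brak::WrapperSubview dataWrapper{data};
    // or
    // brak::WrapperArray dataWrapper{data};
    // A full chain of 8 brackets reaches the scalar element
    dataWrapper[0][0][0][0][0][0][0][0] = 10;
    assert(data(0, 0, 0, 0, 0, 0, 0, 0) == 10);
}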
To achieve this, two implementations sharing the same API are proposed; they are described in the next section.
It is also possible to use the parentheses operator, as with Kokkos views:
dataWrapper(0, 0, 0, 0, 0, 0, 0, 0) = 20;
assert(data(0, 0, 0, 0, 0, 0, 0, 0) == 20);
Keep in mind, however, that stopping before the scalar value of the wrapped view results in a Brak object:
auto subDataWrapper = dataWrapper[0][0][0][0];
static_assert(!std::is_same_v<decltype(subDataWrapper), int ****>);
subDataWrapper[0][0][0][0] = 30;
assert(data(0, 0, 0, 0, 0, 0, 0, 0) == 30);
A Brak object of any depth level can be used with either brackets or parentheses:
subDataWrapper(0, 0, 0, 0) = 40;
assert(data(0, 0, 0, 0, 0, 0, 0, 0) == 40);
It is possible to retrieve the current wrapped view with the getView method:
auto subData = dataWrapper[0][0][0][0].getView();
static_assert(Kokkos::is_view<decltype(subData)>::value);
It is also possible to get the raw pointer of the current wrapped view with the dereference operator, although this may lead to unpredictable behavior:
auto subPointer = *(dataWrapper[0][0][0][0]);
static_assert(std::is_pointer_v<decltype(subPointer)>);
With this approach, the class brak::WrapperSubview wraps a view, and each call to the brackets operator returns a new instance of the class wrapping a subview whose rank is lowered by one.
The subview is unmanaged, in order to disable reference counting and improve performance.
Even so, this approach remains inefficient at both compile time and runtime, because some reference counting could not be disabled.
This implementation can still be useful: if you stop before the scalar value, the intermediate object returned by the brackets operator remains usable, since it wraps a subview.
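The idea of returning a lower-rank slice at each bracket call can be sketched without Kokkos. The following is a minimal illustration in plain C++, assuming a row-major buffer; `SliceWrapper` and `sliceDemo` are hypothetical names, and this is not the implementation of brak::WrapperSubview.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a wrapper around an unmanaged subview:
// a non-owning pointer plus the extents of the remaining dimensions.
template <typename T, std::size_t Rank>
struct SliceWrapper {
    T* data;                                // first element of this slice
    std::array<std::size_t, Rank> extents;  // extents of the remaining dimensions

    // Number of elements spanned by one step along the leading dimension.
    std::size_t leadingStride() const {
        std::size_t stride = 1;
        for (std::size_t i = 1; i < Rank; ++i) stride *= extents[i];
        return stride;
    }

    // Each bracket call builds a new wrapper around a slice of rank Rank - 1;
    // at rank 1 it returns a reference to the scalar instead.
    decltype(auto) operator[](std::size_t index) const {
        assert(index < extents[0]);
        if constexpr (Rank == 1) {
            return (data[index]);
        } else {
            std::array<std::size_t, Rank - 1> subExtents{};
            for (std::size_t i = 1; i < Rank; ++i) subExtents[i - 1] = extents[i];
            return SliceWrapper<T, Rank - 1>{data + index * leadingStride(), subExtents};
        }
    }
};

// Small usage example on a 2 x 3 x 4 array.
int sliceDemo() {
    std::vector<int> storage(2 * 3 * 4, 0);
    SliceWrapper<int, 3> wrapper{storage.data(), {2, 3, 4}};
    wrapper[1][2][3] = 10;     // full bracket chain reaches the scalar
    auto row = wrapper[1][2];  // partial chain: still a usable rank-1 slice
    row[0] = 7;
    return storage[(1 * 3 + 2) * 4 + 3] + storage[(1 * 3 + 2) * 4 + 0];
}
```

In this sketch every bracket call constructs a new object with recomputed extents, which hints at why an approach based on intermediate slices carries a per-access cost.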
With this alternative approach, the class brak::WrapperArray wraps a view, and each call to the brackets operator returns a sub-wrapper that also stores an array of the indices requested so far.
The sub-wrapper contains an unmanaged version of the initial view, in order to disable reference counting and improve performance.
This approach performs on par with Kokkos views.
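The index-accumulating idea can also be sketched in plain C++ without Kokkos. The code below is only an illustration under the assumption of a row-major buffer; `IndexWrapper` and `indexDemo` are hypothetical names, and this is not the implementation of brak::WrapperArray.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: Rank is the rank of the full array, Depth the number
// of indices collected so far by previous bracket calls.
template <typename T, std::size_t Rank, std::size_t Depth = 0>
struct IndexWrapper {
    T* data;                                // non-owning pointer to the full buffer
    std::array<std::size_t, Rank> extents;  // extents of the full array
    std::array<std::size_t, Rank> indices;  // indices recorded so far (first Depth entries)

    decltype(auto) operator[](std::size_t index) const {
        assert(index < extents[Depth]);
        auto next = indices;
        next[Depth] = index;
        if constexpr (Depth + 1 == Rank) {
            // All indices are known: compute the row-major offset once.
            std::size_t offset = 0;
            for (std::size_t i = 0; i < Rank; ++i) offset = offset * extents[i] + next[i];
            return (data[offset]);
        } else {
            // Not enough indices yet: only record the index, touch no data.
            return IndexWrapper<T, Rank, Depth + 1>{data, extents, next};
        }
    }
};

// Small usage example on a 2 x 3 x 4 array.
int indexDemo() {
    std::vector<int> storage(2 * 3 * 4, 0);
    IndexWrapper<int, 3> wrapper{storage.data(), {2, 3, 4}, {}};
    wrapper[1][2][3] = 10;
    auto partial = wrapper[1];  // intermediate object holding index 1 only
    partial[0][1] = 5;
    return storage[23] + storage[13];
}
```

Here the intermediate objects are small value types and the data is accessed only once all indices are known, which is one plausible reason such a design can stay close to a direct multi-index access.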
Benchmarks were run on an Intel Core i7-13800H CPU and an NVIDIA A500 GPU, with a release build (unless stated otherwise in the details). All times are in seconds. Performance ratios are given with their standard deviation in parentheses.
Implementation | Build Serial | Access Serial | Nested-for Serial | Parallel-for Serial | Parallel-for OpenMP | Parallel-for Cuda |
---|---|---|---|---|---|---|
Wrapper subview | 984 × 10⁻³ | 9.89 × 10⁻⁹ | 2337 × 10⁻³ | 726 × 10⁻³ | 353 × 10⁻³ | 97.3 × 10⁻³ |
Wrapper array | 805 × 10⁻³ | 0.392 × 10⁻⁹ | 36.1 × 10⁻³ | 451 × 10⁻³ | 309 × 10⁻³ | 89.1 × 10⁻³ |
Reference view | 768 × 10⁻³ | 1.14 × 10⁻⁹ | 58.5 × 10⁻³ | 443 × 10⁻³ | 332 × 10⁻³ | 87.7 × 10⁻³ |
Benchmarks are detailed in the next sections.
In terms of compilation time, building code that uses a subview wrapper is 1.28 (3 %) times slower than building code that uses a reference view, and building code that uses an array wrapper is 1.05 (3 %) times slower.
When accessing a single element, a subview wrapper is 8.7 (1 %) times slower than a view, and an array wrapper is 2.9 (1 %) times faster. The latter is due to reference counting being disabled for wrappers. The subview wrapper also disables it but benefits little from it; an execution time of the same order of magnitude can be obtained if the initial view is already unmanaged.
For a more realistic use of the arrays, a subview wrapper is 40 (1 %) times slower than a view, and an array wrapper is 1.6 (2 %) times faster. The subview wrapper handles frequent data accesses poorly. Starting from an already unmanaged view brings the performance of the subview wrapper close to that of Kokkos views.
For heavy element access, a subview wrapper is 1.64 (2 %) times slower than a view in CPU serial execution, 1.06 (7 %) times slower in CPU parallel execution, and 1.11 (0.4 %) times slower in GPU execution, so parallel execution tends to narrow the gap. An array wrapper is 1.02 (2 %) times slower in CPU serial execution, 1.07 (7 %) times faster in CPU parallel execution, and 1.02 (0.3 %) times slower in GPU execution, which shows that this implementation has a limited impact on performance.
This benchmark measures the time to compile, in debug mode, a function that creates a rank-8 view of dimensions 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 (256 elements) of 4-byte integers (1.024 kB), then sets its element (1, 1, 1, 1, 1, 1, 1, 1) to 10.
This benchmark uses a rank-8 view of dimensions 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 (256 elements) of 4-byte integers (1.024 kB). It measures the time to access the element (1, 1, 1, 1, 1, 1, 1, 1) and set it to 10.
This benchmark uses two rank-3 views of dimensions 30 × 30 × 30 (27 × 10³ elements each) of 4-byte integers (216 kB in total).
It measures the time to update one view from the other with a stencil, then to swap the two views.
This benchmark is loosely related to the heat equation.
Loops are performed using traditional nested for loops.
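The update-then-swap pattern described above can be sketched in plain C++ as follows. The actual stencil used by the benchmark is not given here, so the seven-point integer average below is an assumption chosen for illustration, and `flatIndex`, `stencilStep` and `runStencil` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical row-major index helper for an n x n x n array.
inline std::size_t flatIndex(std::size_t i, std::size_t j, std::size_t k, std::size_t n) {
    return (i * n + j) * n + k;
}

// One update: each interior cell of `next` receives the integer average of the
// cell and its six direct neighbours in `current` (assumed stencil).
void stencilStep(const std::vector<int>& current, std::vector<int>& next, std::size_t n) {
    assert(current.size() == n * n * n && next.size() == n * n * n);
    for (std::size_t i = 1; i + 1 < n; ++i) {
        for (std::size_t j = 1; j + 1 < n; ++j) {
            for (std::size_t k = 1; k + 1 < n; ++k) {
                const int sum = current[flatIndex(i, j, k, n)]
                              + current[flatIndex(i - 1, j, k, n)]
                              + current[flatIndex(i + 1, j, k, n)]
                              + current[flatIndex(i, j - 1, k, n)]
                              + current[flatIndex(i, j + 1, k, n)]
                              + current[flatIndex(i, j, k - 1, n)]
                              + current[flatIndex(i, j, k + 1, n)];
                next[flatIndex(i, j, k, n)] = sum / 7;
            }
        }
    }
}

// Runs `steps` updates, swapping the two arrays after each one,
// and returns the elapsed wall-clock time in seconds.
double runStencil(std::vector<int>& a, std::vector<int>& b, std::size_t n, int steps) {
    const auto start = std::chrono::steady_clock::now();
    for (int s = 0; s < steps; ++s) {
        stencilStep(a, b, n);
        std::swap(a, b);  // the freshly written array becomes the next input
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Swapping the two arrays avoids copying data between steps; boundary cells are never written in this sketch, so both arrays should start with the same boundary values.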
This benchmark uses a rank-6 view of dimensions 30 × 30 × 30 × 30 × 30 × 30 (729 × 10⁶ elements) of 4-byte integers (2.916 GB).
It measures the time for a Kokkos parallel_for loop to fill every element of the view with the sum of its coordinates.
The time spent launching the kernel is made negligible by the large size of the view.
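The work done for each element can be illustrated with a serial, plain C++ analogue. This sketch does not use Kokkos: a single loop over a flat index stands in for the parallel iteration space, and `fillWithCoordinateSum` is a hypothetical name, with a row-major layout assumed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Fills a row-major array of rank Rank so that each element holds the sum of
// its coordinates. Each iteration depends only on its own flat index, which is
// the property that lets a parallel loop distribute the iterations freely.
template <std::size_t Rank>
std::vector<int> fillWithCoordinateSum(const std::array<std::size_t, Rank>& extents) {
    std::size_t total = 1;
    for (std::size_t extent : extents) total *= extent;
    assert(total > 0 && "every extent must be non-zero");

    std::vector<int> data(total);
    for (std::size_t flat = 0; flat < total; ++flat) {
        std::size_t remainder = flat;
        int sum = 0;
        // Decode the flat index from the last (fastest varying) dimension to the first.
        for (std::size_t d = Rank; d-- > 0;) {
            sum += static_cast<int>(remainder % extents[d]);
            remainder /= extents[d];
        }
        data[flat] = sum;
    }
    return data;
}
```

For instance, `fillWithCoordinateSum<3>({2, 3, 4})` produces 24 elements, where the element at coordinates (1, 2, 3) holds 6.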