diff --git a/docs/contributing.md b/docs/contributing.md index 347dc688ec..04d701406f 100644 --- a/docs/contributing.md +++ b/docs/contributing.md @@ -1,7 +1,7 @@ --- has_children: true has_toc: true -nav_order: 4 +nav_order: 5 --- # Contributing diff --git a/docs/extended_api.md b/docs/extended_api.md new file mode 100644 index 0000000000..4fa0abe74d --- /dev/null +++ b/docs/extended_api.md @@ -0,0 +1,29 @@ +--- +has_children: true +has_toc: false +nav_order: 3 +--- + +# Extended API + +## [Headers](./extended_api/headers.md) + +### [\](./extended_api/headers/pipeline.md) + +## [Concepts](./extended_api/concepts.md) + +### [Group](./extended_api/concepts/group.md) + +### [Shape](./extended_api/concepts/shape.md) + +## [Synchronization library](./extended_api/synchronization_library.md) + +### [pipeline](./extended_api/synchronization_library/pipeline.md) + +### [pipeline_shared_state](./extended_api/synchronization_library/pipeline_shared_state.md) + +## [Asynchronous operations library](./extended_api/asynchronous_operations_library.md) + +### [aligned_size_t](./extended_api/asynchronous_operations_library/aligned_size_t.md) + +### [memcpy_async](./extended_api/asynchronous_operations_library/memcpy_async.md) diff --git a/docs/extended_api/asynchronous_operations_library.md b/docs/extended_api/asynchronous_operations_library.md new file mode 100644 index 0000000000..e4c31157bf --- /dev/null +++ b/docs/extended_api/asynchronous_operations_library.md @@ -0,0 +1,17 @@ +--- +parent: Extended API +has_children: true +has_toc: false +--- + +# Asynchronous operations library + +The asynchronous operations library provides components for asynchronous data movement. + +## Shapes + +| [aligned_size_t](./asynchronous_operations_library/aligned_size_t.md) | defines an extent of bytes with a statically defined alignment `(class template)` | + +## Asynchronous operations + +| [memcpy_async](./asynchronous_operations_library/memcpy_async.md) | asynchronously copies one buffer to another `(function template)` | diff --git a/docs/extended_api/asynchronous_operations_library/aligned_size_t.md b/docs/extended_api/asynchronous_operations_library/aligned_size_t.md new file mode 100644 index 0000000000..68b4bf973e --- /dev/null +++ b/docs/extended_api/asynchronous_operations_library/aligned_size_t.md @@ -0,0 +1,61 @@ +--- +grand_parent: Extended API +parent: Asynchronous operations library +--- + +# cuda::**aligned_size_t** + +Defined in header [``](../headers/barrier.md) + +Defined in header [``](../headers/pipeline.md) + +```c++ +template +struct aligned_size_t; +``` + +The class template `cuda::aligned_size_t` is a _shape_ representing an extent of bytes with a statically defined (address and size) alignment. + +## Template parameters + +| Alignment | the address & size alignement of the byte extent | + +## Data members + +| [align](./aligned_size_t/align.md) | the alignment of the byte extent | +| [value](./aligned_size_t/value.md) | the size of the byte extent | + +## Member functions + +| [(constructor)](./aligned_size_t/constructor.md) | constructs an _aligned size_ | +| (destructor) [implicitly declared] | trivial implicit destructor | +| operator= [implicitly declared] | trivial implicit copy/move assignment | +| operator size_t | implicit conversion to [`size_t`](https://en.cppreference.com/w/cpp/types/size_t) | + +## Notes + +If `value` is not a multiple of `align` the behavior is undefined. + +If `Alignment` is not a [valid alignment](https://en.cppreference.com/w/c/language/object#Alignment) the behavior is undefined. + +## Example + +```c++ +#include + +__global__ void example_kernel(void * dst, void * src, size_t size) +{ + cuda::barrier barrier; + init(&barrier, 1); + + // Implementation cannot make assumptions about alignment + cuda::memcpy_async(dst, src, size, barrier); + + // Implementation can assume that dst, src and size are 16-bytes aligned and may optimize accordingly + cuda::memcpy_async(dst, src, cuda::aligned_size_t<16>(size), barrier); + + barrier.arrive_and_wait(); +} +``` + +[See it on Godbolt](https://godbolt.org/z/v7Ev9E){: .btn } diff --git a/docs/extended_api/asynchronous_operations_library/aligned_size_t/align.md b/docs/extended_api/asynchronous_operations_library/aligned_size_t/align.md new file mode 100644 index 0000000000..31057005a0 --- /dev/null +++ b/docs/extended_api/asynchronous_operations_library/aligned_size_t/align.md @@ -0,0 +1,11 @@ +--- +nav_exclude: true +--- + +# cuda::aligned_size_t\::**align** + +```c++ +static constexpr size_t align = Alignment; +``` + +Represents the alignment (address and size) of the byte extent. diff --git a/docs/extended_api/asynchronous_operations_library/aligned_size_t/constructor.md b/docs/extended_api/asynchronous_operations_library/aligned_size_t/constructor.md new file mode 100644 index 0000000000..a17bb88ffa --- /dev/null +++ b/docs/extended_api/asynchronous_operations_library/aligned_size_t/constructor.md @@ -0,0 +1,15 @@ +--- +nav_exclude: true +--- + +# cuda::aligned_size_t\::**aligned_size_t** + +```c++ +explicit aligned_size_t(size_t size); +``` + +Constructs an `aligned_size_t` _shape_. + +## Notes + +If `size` is not a multiple of `Alignment` the behavior is undefined. diff --git a/docs/extended_api/asynchronous_operations_library/aligned_size_t/value.md b/docs/extended_api/asynchronous_operations_library/aligned_size_t/value.md new file mode 100644 index 0000000000..2faa2a2656 --- /dev/null +++ b/docs/extended_api/asynchronous_operations_library/aligned_size_t/value.md @@ -0,0 +1,11 @@ +--- +nav_exclude: true +--- + +# cuda::aligned_size_t\::**value** + +```c++ +size_t value; +``` + +Represents the size of the byte extent. diff --git a/docs/extended_api/asynchronous_operations_library/memcpy_async.md b/docs/extended_api/asynchronous_operations_library/memcpy_async.md new file mode 100644 index 0000000000..04c7666f40 --- /dev/null +++ b/docs/extended_api/asynchronous_operations_library/memcpy_async.md @@ -0,0 +1,64 @@ +--- +grand_parent: Extended API +parent: Asynchronous operations library +--- + +# cuda::**memcpy_async** + +Defined in header [``](../../api/synchronization_library/barrier.md) + +```c++ +template +void memcpy_async(void * destination, void const * source, Shape size, barrier & barrier); // (1) + +template +void memcpy_async(Group const & group, void * destination, void const * source, Shape size, barrier & barrier); // (2) +``` + +Defined in header [``](../headers/pipeline.md) + +```c++ +template +void memcpy_async(void * destination, void const * source, Shape size, pipeline & pipeline); // (3) + +template +void memcpy_async(Group const & group, void * destination, void const * source, Shape size, pipeline & pipeline); // (4) +``` + +Asynchronously copies `size` bytes from the memory location pointed to by `source` to the memory location pointed to by `destination`. +Both objects are reinterpreted as arrays of `unsigned char`. + +`cuda::memcpy_async` have similar constraints to [`std::memcpy`](https://en.cppreference.com/w/cpp/string/byte/memcpy), namely: +* If the objects overlap, the behavior is undefined. +* If either `destination` or `source` is an invalid or null pointer, the behavior is undefined (even if `count` is zero). +* If the objects are [potentially-overlapping](https://en.cppreference.com/w/cpp/language/object#Subobjects) the behavior is undefined. +* If the objects are not of [`TriviallyCopyable`](https://en.cppreference.com/w/cpp/named_req/TriviallyCopyable) type the program is ill-formed, no diagnostic required. + +If _Shape_ is [`cuda::aligned_size_t`](./aligned_size_t.md)), `source` and `destination` are both required to be aligned on [`cuda::aligned_size_t::align`](./aligned_size_t/align.md), else the behavior is undefined. + +If `pipeline` is in a _quitted state_ (see [`pipeline::quit`](../synchronization_library/pipeline/quit.md)), the behavior is undefined. + +1. Binds the asynchronous copy completion to `barrier` and issues the copy in the current thread. +2. Binds the asynchronous copy completion to `barrier` and cooperatively issues the copy across all threads in `group`. +3. Binds the asynchronous copy completion to `pipeline` and issues the copy in the current thread +4. Binds the asynchronous copy completion to `pipeline` and cooperatively issues the copy across all threads in `group`. + +## Template parameters + +| Group | a type satisfying the [_group concept_](../concepts/group.md) | +| Shape | a type satisfying the [_shape concept_](../concepts/shape.md) (see [`size_t`](https://en.cppreference.com/w/c/types/size_t) and [`cuda::aligned_size_t`](./aligned_size_t.md)) | + +## Parameters + +| group | the group of threads | +| destination | pointer to the memory location to copy to | +| source | pointer to the memory location to copy from | +| size | the number of bytes to copy | +| barrier | the barrier object used to wait on the copy completion | +| pipeline | the pipeline object used to wait on the copy completion | + +## Example + +```c++ +TODO +``` diff --git a/docs/extended_api/concepts.md b/docs/extended_api/concepts.md new file mode 100644 index 0000000000..a5c4c01dca --- /dev/null +++ b/docs/extended_api/concepts.md @@ -0,0 +1,11 @@ +--- +parent: Extended API +has_children: true +has_toc: false +nav_order: 1 +--- + +# Concepts + +| [Group](./concepts/group.md) | defines the requirements of a type that represents a group of cooperating threads | +| [Shape](./concepts/shape.md) | defines the requirements of a type that represents a byte extent with a particular memory layout | diff --git a/docs/extended_api/concepts/group.md b/docs/extended_api/concepts/group.md new file mode 100644 index 0000000000..ea1df4c1c9 --- /dev/null +++ b/docs/extended_api/concepts/group.md @@ -0,0 +1,46 @@ +--- +grand_parent: Extended API +parent: Concepts +--- + +# Group + +```c++ +struct Group { + static constexpr cuda::thread_scope thread_scope; + integral size() const; + integral thread_rank() const; + void sync() const; +}; +``` + +The _Group concept_ defines the requirements of a type that represents a group of cooperating threads. + +## Data members + +| thread_scope | the scope at which `Group::sync()` synchronizes memory operations and thread execution | + +## Member functions + +| size | returns the number of participating threads | +| thread_rank | returns a unique value for each participating thread (`0 <= Group::thread_rank() < Group::size()`) | +| sync | synchronizes the participating threads | + +## Notes + +This concept is defined for documentation purposes but is not materialized in the library. + +## Example + +```c++ +#include + +struct single_thread_group { + static constexpr cuda::thread_scope thread_scope = cuda::thread_scope::thread_scope_thread; + size_t size() const { return 1; } + size_t thread_rank() const { return 0; } + void sync() const { } +}; +``` + +[See it on Godbolt](https://godbolt.org/z/453r3s){: .btn } diff --git a/docs/extended_api/concepts/shape.md b/docs/extended_api/concepts/shape.md new file mode 100644 index 0000000000..af7b825a41 --- /dev/null +++ b/docs/extended_api/concepts/shape.md @@ -0,0 +1,37 @@ +--- +grand_parent: Extended API +parent: Concepts +--- + +# Shape + +```c++ +struct Shape { + operator size_t() const; +}; +``` + +The _Shape concept_ defines the requirements of a type that represents a byte extent with a particular memory layout. + +## Member functions + +| operator size_t | implicit conversion to [`size_t`](https://en.cppreference.com/w/cpp/types/size_t) | + +## Notes + +This concept is defined for documentation purposes but is not materialized in the library. + +## Example + +```c++ +// A size that carries an alignment hint +template +struct aligned_size { + static constexpr size_t align = Align; + size_t size; + aligned_size(size_t s) : size(s) {} + operator size_t() const { return size; } +}; +``` + +[See it on Godbolt](https://godbolt.org/z/hbajKo){: .btn } diff --git a/docs/extended_api/headers.md b/docs/extended_api/headers.md new file mode 100644 index 0000000000..ab9b29a799 --- /dev/null +++ b/docs/extended_api/headers.md @@ -0,0 +1,12 @@ +--- +parent: Extended API +has_children: true +has_toc: false +nav_order: 0 +--- + +# Headers + +## Synchronization library + +| [\](./headers/pipeline.md) | [Pipelines](./synchronization_library/pipeline.md) and corresponding [memcpy_async](./asynchronous_operations_library/memcpy_async.md) overloads | diff --git a/docs/extended_api/headers/pipeline.md b/docs/extended_api/headers/pipeline.md new file mode 100644 index 0000000000..fb9e4152ba --- /dev/null +++ b/docs/extended_api/headers/pipeline.md @@ -0,0 +1,119 @@ +--- +grand_parent: Extended API +parent: Headers +--- + +# \ + +This header is part of the [synchronization library](../synchronization_library.md). + +## Classes + +| [aligned_size_t](../asynchronous_operations_library/aligned_size_t.md) | defines an extent of bytes with a statically defined alignment `(class template)` | +| [pipeline](../synchronization_library/pipeline.md) | _pipeline_ class template `(class template)` | +| [pipeline_shared_state](../synchronization_library/pipeline_shared_state.md) | _pipeline shared state_ for inter-thread coordination `(class template)` | +| [pipeline_role](../synchronization_library/pipeline_role.md) | defines producer/consumer role for a thread participating in a _pipeline_ `(enum)` | + +## Functions + +| [make_pipeline](../synchronization_library/make_pipeline.md) | creates a _pipeline_ object `(function template)` | +| [pipeline_consumer_wait_prior](../synchronization_library/pipeline_consumer_wait_prior.md) | blocks the current thread until all operations committed up to a prior _pipeline stage_ complete `(function template)`| +| [pipeline_producer_commit](../synchronization_library/pipeline_producer_commit.md) | binds operations previously issued by the current thread to a _barrier_ `(function template)` | +| [memcpy_async](../asynchronous_operations_library/memcpy_async.md) | asynchronously copies one buffer to another `(function template)` | + +## Synopsis + +```c++ +namespace cuda { + template + struct aligned_size_t; + + enum class pipeline_role : /* unspecified */ { + producer, + consumer + }; + + template + class pipeline_shared_state; + + template + class pipeline; + + pipeline make_pipeline(); + + template + pipeline make_pipeline(const Group & group, pipeline_shared_state * shared_state); + + template + pipeline make_pipeline(const Group & group, pipeline_shared_state * shared_state, size_t producer_count); + + template + pipeline make_pipeline(const Group & group, pipeline_shared_state * shared_state, pipeline_role role); + + template + void pipeline_consumer_wait_prior(pipeline & pipeline); + + template + void pipeline_producer_commit(pipeline & pipeline, barrier & barrier); + + template + void memcpy_async(void * destination, void const * source, Shape size, pipeline & pipeline); + + template + void memcpy_async(Group const & group, void * destination, void const * source, Shape size, pipeline & pipeline); +} +``` + +## Class template `cuda::aligned_size_t` + +```c++ +template +struct aligned_size_t { + static constexpr size_t align = Alignment; + size_t value; + explicit aligned_size_t(size_t size); + operator size_t() const; +}; +``` + + +## Class template `cuda::pipeline_shared_state` + +```c++ +namespace cuda { + template + class pipeline_shared_state { + pipeline_shared_state() = default; + pipeline_shared_state(const pipeline_shared_state &) = delete; + pipeline_shared_state(pipeline_shared_state &&) = delete; + pipeline_shared_state & operator=(pipeline_shared_state &&) = delete; + pipeline_shared_state & operator=(const pipeline_shared_state &) = delete; + }; +} +``` + +## Class template `cuda::pipeline` + +```c++ +namespace cuda { + template + class pipeline { + pipeline(pipeline &&) = default; + pipeline(const pipeline &) = delete; + pipeline & operator=(pipeline &&) = delete; + pipeline & operator=(const pipeline &) = delete; + ~pipeline(); + + void producer_acquire(); + void producer_commit(); + void consumer_wait(); + template + bool consumer_wait_for(const std::chrono::duration & duration); + template + bool consumer_wait_until(const std::chrono::time_point & time_point); + void consumer_release(); + + bool quit(); + }; +} +``` diff --git a/docs/extended_api/synchronization_library.md b/docs/extended_api/synchronization_library.md new file mode 100644 index 0000000000..c3fa70926b --- /dev/null +++ b/docs/extended_api/synchronization_library.md @@ -0,0 +1,24 @@ +--- +parent: Extended API +has_children: true +has_toc: false +--- + +# Synchronization library + +The synchronization library provides components for thread and asynchronous operations coordination. + +## Synchronization types + +| [pipeline](./synchronization_library/pipeline.md) | _pipeline_ class template `(class template)` | +| [pipeline_shared_state](./synchronization_library/pipeline_shared_state.md) | _pipeline shared state_ for inter-thread coordination `(class template)` | +| [pipeline_role](./synchronization_library/pipeline_role.md) | defines producer/consumer role for a thread participating in a _pipeline_ `(enum)` | + +## Synchronization types factories + +| [make_pipeline](./synchronization_library/make_pipeline.md) | creates a _pipeline_ object `(function template)` | + +## Operations on synchronization types + +| [pipeline_consumer_wait_prior](./synchronization_library/pipeline_consumer_wait_prior.md) | blocks the current thread until all operations committed up to a prior _pipeline stage_ complete `(function template)`| +| [pipeline_producer_commit](./synchronization_library/pipeline_producer_commit.md) | binds operations previously issued by the current thread to a _barrier_ `(function template)` | diff --git a/docs/extended_api/synchronization_library/make_pipeline.md b/docs/extended_api/synchronization_library/make_pipeline.md new file mode 100644 index 0000000000..e5f1475750 --- /dev/null +++ b/docs/extended_api/synchronization_library/make_pipeline.md @@ -0,0 +1,85 @@ +--- +grand_parent: Extended API +parent: Synchronization library +--- + +# cuda::**make_pipeline** + +Defined in header [``](../headers/pipeline.md) + +```c++ +pipeline make_pipeline(); // (1) + +template +pipeline make_pipeline(const Group & group, pipeline_shared_state * shared_state); // (2) + +template +pipeline make_pipeline(const Group & group, pipeline_shared_state * shared_state, size_t producer_count); // (3) + +template +pipeline make_pipeline(const Group & group, pipeline_shared_state * shared_state, pipeline_role role); // (4) +``` + +1. Creates a _unified pipeline_ such that the calling thread is the only participating thread and performs both producer and consumer actions. +2. Creates a _unified pipeline_ such that all the threads in `group` are performing both producer and consumer actions. +3. Creates a _partitioned pipeline_ such that `producer_threads` number of threads in `group` are performing producer actions while the others + are performing consumer actions. +4. Creates a _partitioned pipeline_ where each thread's role is explicitly specified. + +All threads in `group` acquire collective ownership of the `shared_state` storage. + +`make_pipeline` must be invoked by every threads in `group` such that `group::sync` may be invoked. + +`shared_state` and `producer_count` must be uniform across all threads in `group`, else the behavior is undefined. + +`producer_count` must be strictly inferior to `group::size`, else the behavior is undefined. + +## Template parameters + +| Group | a type satisfying the [_Group concept_](../concepts/group.md) | + +## Parameters + +| group | the group of threads | +| shared_state | a pointer to an object of type [`cuda::pipeline_shared_state`](./pipeline_shared_state.md) with `Scope` including all the threads in `group` | +| producer_count | the number of _producer threads_ in the pipeline | +| role | the role of the current thread in the pipeline | + +## Return value + +A `cuda::pipeline` object. + +## Example + +```c++ +#include +#include + +// Disables `pipeline_shared_state` initialization warning +#pragma diag_suppress static_var_with_dynamic_init + +__global__ void example_kernel() +{ + __shared__ cuda::pipeline_shared_state pss_1; + __shared__ cuda::pipeline_shared_state pss_2; + __shared__ cuda::pipeline_shared_state pss_3; + + auto group = cooperative_groups::this_thread_block(); + + // Create a thread scoped pipeline + cuda::pipeline p_0 = cuda::make_pipeline(); + + // Create a unified block-scoped pipeline + cuda::pipeline p_1 = cuda::make_pipeline(group, &pss_1); + + // Create a partitioned block-scoped pipeline where half the threads are producers + size_t producer_count = group.size() / 2; + cuda::pipeline p_2 = cuda::make_pipeline(group, &pss_2, producer_count); + + // Create a partitioned block-scoped pipeline where all threads with an even thread_rank are producers + auto thread_role = (group.thread_rank() % 2) ? cuda::pipeline_role::producer : cuda::pipeline_role::consumer; + cuda::pipeline p_3 = cuda::make_pipeline(group, &pss_3, thread_role); +} +``` + +[See it on Godbolt](https://godbolt.org/z/Y1zv5G){: .btn } diff --git a/docs/extended_api/synchronization_library/pipeline.md b/docs/extended_api/synchronization_library/pipeline.md new file mode 100644 index 0000000000..88bb99df1e --- /dev/null +++ b/docs/extended_api/synchronization_library/pipeline.md @@ -0,0 +1,100 @@ +--- +grand_parent: Extended API +parent: Synchronization library +--- + +# cuda::**pipeline** + +Defined in header [``](../headers/pipeline.md) + +```c++ +template +class pipeline; +``` + +The class template `cuda::pipeline` provides a coordination mechanism allowing to pipeline multiple operations in a sequence of stages. + +A thread interacts with a _pipeline stage_ using the following pattern: +1. Acquire the pipeline stage +2. Commit some operations to the stage +3. Wait for the previously committed operations to complete +4. Release the pipeline stage + +For thread scopes other than `thread_scope_thread`, a [`pipeline_shared_state`](./pipeline_shared_state.md) is required to coordinate the participating threads. + +_Pipelines_ can be either _unified_ or _partitioned_. +In a _unified pipeline_, all the participating threads are both producers and consumers. +In a _partitioned pipeline_, each participating thread is either a producer or a consumer. + +## Template parameters + +### Scope + +A [`cuda::thread_scope`](../../api/synchronization_library/thread_scopes.md) denoting a scope including all the threads participating in the _pipeline_. + +## Member functions + +| (constructor) [deleted] | `pipeline` is not constructible | +| [(destructor)](./pipeline/destructor.md) | destroys the `pipeline` | +| operator= [deleted] | `pipeline` is not assignable | +| [producer_acquire](./pipeline/producer_acquire.md) | blocks the current thread until the next _pipeline stage_ is available | +| [producer_commit](./pipeline/producer_commit.md) | commits operations previously issued by the current thread to the current _pipeline stage_ | +| [consumer_wait](./pipeline/consumer_wait.md) | blocks the current thread until all operations committed to the current _pipeline stage_ complete | +| [consumer_wait_for](./pipeline/consumer_wait.md) | blocks the current thread until all operations committed to the current _pipeline stage_ complete or after the specified timeout duration | +| [consumer_wait_until](./pipeline/consumer_wait.md) | blocks the current thread until all operations committed to the current _pipeline stage_ complete or until specified time point has been reached | +| [consumer_release](./pipeline/consumer_release.md) | release the current _pipeline stage_ | +| [quit](./pipeline/quit.md) | quits current thread's participation in the _pipeline_ | + +## Notes + +A thread role cannot change during the lifetime of the pipeline object. + +## Example + +```c++ +#include +#include + +// Disables `pipeline_shared_state` initialization warning +#pragma diag_suppress static_var_with_dynamic_init + +template +__device__ void compute(T * ptr); + +template +__global__ void example_kernel(T * global1, T * global2, size_t subset_count) +{ + extern __shared__ T s[]; + auto group = cooperative_groups::this_thread_block(); + T * shared[2] = { s, s + 2 * group.size() }; + + // Create a CUDA pipeline + constexpr unsigned stages_count = 2; + __shared__ cuda::pipeline_shared_state shared_state; + auto pipeline = cuda::make_pipeline(group, &shared_state); + + // Prime the pipeline + pipeline.producer_acquire(); + cuda::memcpy_async(group, shared[0], &global1[0], sizeof(T) * group.size(), pipeline); + cuda::memcpy_async(group, shared[0] + group.size(), &global2[0], sizeof(T) * group.size(), pipeline); + pipeline.producer_commit(); + + // Pipelined copy/compute + for (size_t subset = 1; subset < subset_count; ++subset) { + pipeline.producer_acquire(); + cuda::memcpy_async(group, shared[subset % 2], &global1[subset * group.size()], sizeof(T) * group.size(), pipeline); + cuda::memcpy_async(group, shared[subset % 2] + group.size(), &global2[subset * group.size()], sizeof(T) * group.size(), pipeline); + pipeline.producer_commit(); + pipeline.consumer_wait(); + compute(shared[(subset - 1) % 2]); + pipeline.consumer_release(); + } + + // Drain the pipeline + pipeline.consumer_wait(); + compute(shared[(subset_count - 1) % 2]); + pipeline.consumer_release(); +} +``` + +[See it on Godbolt](https://godbolt.org/z/javfx9){: .btn } diff --git a/docs/extended_api/synchronization_library/pipeline/consumer_release.md b/docs/extended_api/synchronization_library/pipeline/consumer_release.md new file mode 100644 index 0000000000..cc48efd11f --- /dev/null +++ b/docs/extended_api/synchronization_library/pipeline/consumer_release.md @@ -0,0 +1,17 @@ +--- +nav_exclude: true +--- + +# cuda::pipeline\::**consumer_release** + +```c++ +void consumer_release(); +``` + +Releases the current _pipeline stage_. + +## Notes + +If this method is called from a _producer thread_ the behavior is undefined. + +If the pipeline is in a _quitted state_ (see [`pipeline::quit`](./quit.md)), the behavior is undefined. diff --git a/docs/extended_api/synchronization_library/pipeline/consumer_wait.md b/docs/extended_api/synchronization_library/pipeline/consumer_wait.md new file mode 100644 index 0000000000..932227a0fa --- /dev/null +++ b/docs/extended_api/synchronization_library/pipeline/consumer_wait.md @@ -0,0 +1,34 @@ +--- +nav_exclude: true +--- + +# cuda::pipeline\::**consumer_wait**, cuda::pipeline\::**consumer_wait_for**, cuda::pipeline\::**consumer_wait_until** + +```c++ +void consumer_wait(); // (1) + +template +bool consumer_wait_for(const std::chrono::duration & duration); // (2) + +template +bool consumer_wait_until(const std::chrono::time_point & time_point); // (3) +``` + +1. blocks the current thread until all operations committed to the current _pipeline stage_ complete +2. blocks the current thread until all operations committed to the current _pipeline stage_ complete or after the specified timeout duration +3. blocks the current thread until all operations committed to the current _pipeline stage_ complete or until specified time point has been reached + +## Parameters + +| duration | an object of type `cuda::std::chrono::duration` representing the maximum time to spend waiting | +| time_point | an object of type `cuda::std::chrono::time_point` representing the time when to stop waiting | + +## Return value + +`false` if the _wait_ timed out, `true` otherwise. + +## Notes + +If this method is called from a _producer thread_ the behavior is undefined. + +If the pipeline is in a _quitted state_ (see [`pipeline::quit`](./quit.md)), the behavior is undefined. diff --git a/docs/extended_api/synchronization_library/pipeline/destructor.md b/docs/extended_api/synchronization_library/pipeline/destructor.md new file mode 100644 index 0000000000..3e8cc3c92c --- /dev/null +++ b/docs/extended_api/synchronization_library/pipeline/destructor.md @@ -0,0 +1,15 @@ +--- +nav_exclude: true +--- + +# cuda::pipeline\::**~pipeline** + +```c++ +~pipeline(); +``` + +Destructs the pipeline. + +## Notes + +Calls [`cuda::pipeline::quit`](./quit.md) if it was not called by the current thread. diff --git a/docs/extended_api/synchronization_library/pipeline/producer_acquire.md b/docs/extended_api/synchronization_library/pipeline/producer_acquire.md new file mode 100644 index 0000000000..a04385f63c --- /dev/null +++ b/docs/extended_api/synchronization_library/pipeline/producer_acquire.md @@ -0,0 +1,17 @@ +--- +nav_exclude: true +--- + +# cuda::pipeline\::**producer_acquire** + +```c++ +void producer_acquire(); +``` + +Blocks the current thread until the next _pipeline stage_ is available. + +## Notes + +If this method is called from a _consumer thread_ the behavior is undefined. + +If the pipeline is in a _quitted state_ (see [`pipeline::quit`](./quit.md)), the behavior is undefined. diff --git a/docs/extended_api/synchronization_library/pipeline/producer_commit.md b/docs/extended_api/synchronization_library/pipeline/producer_commit.md new file mode 100644 index 0000000000..1295fc8de5 --- /dev/null +++ b/docs/extended_api/synchronization_library/pipeline/producer_commit.md @@ -0,0 +1,17 @@ +--- +nav_exclude: true +--- + +# cuda::pipeline\::**producer_commit** + +```c++ +void producer_commit(); +``` + +Commits operations previously issued by the current thread to the current _pipeline stage_. + +## Notes + +If this method is called from a _consumer thread_ the behavior is undefined. + +If the pipeline is in a _quitted state_ (see [`pipeline::quit`](./quit.md)), the behavior is undefined. diff --git a/docs/extended_api/synchronization_library/pipeline/quit.md b/docs/extended_api/synchronization_library/pipeline/quit.md new file mode 100644 index 0000000000..dd804ff7fb --- /dev/null +++ b/docs/extended_api/synchronization_library/pipeline/quit.md @@ -0,0 +1,19 @@ +--- +nav_exclude: true +--- + +# cuda::pipeline\::**quit** + +```c++ +bool quit(); +``` + +Quits the current thread's participation in the collective ownership of the corresponding _shared state_ ([`cuda::pipeline_shared_state`](../pipeline_shared_state.md)). Ownership of the _shared state_ is released by the last invoking thread. + +## Return value + +`true` if ownership of the _shared state_ was released, otherwise `false`. + +## Notes + +The behavior undefined if any operation other than [`~pipeline`](./destructor.md) is issued by the current thread after quitting. diff --git a/docs/extended_api/synchronization_library/pipeline_consumer_wait_prior.md b/docs/extended_api/synchronization_library/pipeline_consumer_wait_prior.md new file mode 100644 index 0000000000..5284aa205a --- /dev/null +++ b/docs/extended_api/synchronization_library/pipeline_consumer_wait_prior.md @@ -0,0 +1,57 @@ +--- +grand_parent: Extended API +parent: Synchronization library +--- + +# cuda::**pipeline_consumer_wait_prior** + +Defined in header [``](../headers/pipeline.md) + +```c++ +template +void pipeline_consumer_wait_prior(pipeline & pipeline); +``` + +Let *Stage* be the pipeline stage `Prior` stages before the current one (counting the current one). Blocks the current +thread until all operations committed to _pipeline stages_ up to *Stage* complete. All stages up to *Stage* (exclusive) +are implicitly released. + +## Template parameters + +| Prior | The index of the pipeline stage *Stage* (_see above_) counting up from the current one. The index of the current stage is `0`. | + +## Parameters + +| pipeline | The thread-scoped `cuda::pipeline` object to wait on | + +## Notes + +If the pipeline is in a _quitted state_ (see [`pipeline::quit`](./pipeline/quit.md)), the behavior is undefined. + +## Example + +```c++ +#include + +__global__ void example_kernel(uint64_t * global, size_t element_count) +{ + extern __shared__ uint64_t shared[]; + + cuda::pipeline pipe = cuda::make_pipeline(); + for (size_t i = 0; i < element_count; ++i) { + pipe.producer_acquire(); + cuda::memcpy_async(shared + i, global + i, sizeof(*global), pipe); + pipe.producer_commit(); + } + + // Wait for operations committed in all stages but the last one + cuda::pipeline_consumer_wait_prior<1>(pipe); + pipe.consumer_release(); + + // Wait for operations committed in all stages + cuda::pipeline_consumer_wait_prior<0>(pipe); + pipe.consumer_release(); +} +``` + +[See it on Godbolt](https://godbolt.org/z/j83v3G){: .btn } diff --git a/docs/extended_api/synchronization_library/pipeline_producer_commit.md b/docs/extended_api/synchronization_library/pipeline_producer_commit.md new file mode 100644 index 0000000000..b0fb640687 --- /dev/null +++ b/docs/extended_api/synchronization_library/pipeline_producer_commit.md @@ -0,0 +1,52 @@ +--- +grand_parent: Extended API +parent: Synchronization library +--- + +# cuda::**pipeline_producer_commit** + +Defined in header [``](../headers/pipeline.md) + +```c++ +template +void pipeline_producer_commit(pipeline & pipeline, barrier & barrier); +``` + +Binds operations previously issued by the current thread to the named `barrier` such that a `barrier::arrive` is performed on completion. The bind operation implicitly increments the barrier's current phase to account for the subsequent `barrier::arrive`, resulting in a net change of 0. + +## Parameters + +| pipeline | the thread-scoped `cuda::pipeline` object to wait on | +| barrier | the barrier to arrive on | + +## Notes + +If the pipeline is in a _quitted state_ (see [`pipeline::quit`](./pipeline/quit.md)), the behavior is undefined. + +## Example + +```c++ +#include + +// Disables `barrier` initialization warning +#pragma diag_suppress static_var_with_dynamic_init + +__global__ void example_kernel(uint64_t * global, size_t element_count) +{ + extern __shared__ uint64_t shared[]; + __shared__ cuda::barrier barrier; + + init(&barrier, 1); + cuda::pipeline pipe = cuda::make_pipeline(); + + pipe.producer_acquire(); + for (size_t i = 0; i < element_count; ++i) { + cuda::memcpy_async(shared + i, global + i, sizeof(*global), pipe); + } + pipeline_producer_commit(pipe, barrier); + barrier.arrive_and_wait(); + pipe.consumer_release(); +} +``` + +[See it on Godbolt](https://godbolt.org/z/x5n8zY){: .btn } diff --git a/docs/extended_api/synchronization_library/pipeline_role.md b/docs/extended_api/synchronization_library/pipeline_role.md new file mode 100644 index 0000000000..cb2dca0391 --- /dev/null +++ b/docs/extended_api/synchronization_library/pipeline_role.md @@ -0,0 +1,25 @@ +--- +grand_parent: Extended API +parent: Synchronization library +--- + +# cuda::**pipeline_role** + +Defined in header [``](../headers/pipeline.md) + +```c++ +enum class pipeline_role : /* unspecified */ { + producer, + consumer +}; +``` +`cuda::pipeline_role` specifies the role of a particular thread in a partitioned producer/consumer pipeline. + +## Constants + +| producer | a producer thread generates data (e.g. by issuing [`memcpy_async`](../asynchronous_operations_library/memcpy_async.md) operations) | +| consumer | a consumer thread consumes data (e.g. by waiting for previously [`memcpy_async`](../asynchronous_operations_library/memcpy_async.md) operations to complete) | + +## Example + +See [cuda::make_pipeline](./make_pipeline.md#example). diff --git a/docs/extended_api/synchronization_library/pipeline_shared_state.md b/docs/extended_api/synchronization_library/pipeline_shared_state.md new file mode 100644 index 0000000000..9d9a408a7f --- /dev/null +++ b/docs/extended_api/synchronization_library/pipeline_shared_state.md @@ -0,0 +1,52 @@ +--- +grand_parent: Extended API +parent: Synchronization library +--- + +# cuda::**pipeline_shared_state** + +Defined in header [``](../headers/pipeline.md) + +```c++ +template +class pipeline_shared_state; +``` + +The class template `cuda::pipeline_shared_state` is a storage type used to coordinate the threads participating in a `cuda::pipeline`. + +## Template parameters + +| Scope | A [`cuda::thread_scope`](../../api/synchronization_library/thread_scopes.md) denoting a scope including all the threads participating in the `cuda::pipeline`. `Scope` cannot be `thread_scope_thread`.| +| StagesCount | The number of stages for the _pipeline_. | + +## Member functions + +| [(constructor)](./pipeline_shared_state/constructor.md) | constructs a `pipeline_shared_state` | +| [(destructor)](./pipeline_shared_state/destructor.md) | destroys the `pipeline_shared_state` | +| operator= [deleted] | `pipeline_shared_state` is not assignable | + +## Example + +```c++ +#include + +// Disables `pipeline_shared_state` initialization warning +#pragma diag_suppress static_var_with_dynamic_init + +__global__ void example_kernel(char * device_buffer, char * sysmem_buffer) +{ + // Allocate a 2 stage block scoped shared state in shared memory + __shared__ cuda::pipeline_shared_state pss_1; + + // Allocate a 2 stage block scoped shared state in device memory + auto * pss_2 = new cuda::pipeline_shared_state; + + // Construct a 2 stage device scoped shared state in device memory + auto * pss_3 = new(device_buffer) cuda::pipeline_shared_state; + + // Construct a 2 stage system scoped shared state in system memory + auto * pss_4 = new(sysmem_buffer) cuda::pipeline_shared_state; +} +``` + +[See it on Godbolt](https://godbolt.org/z/xMMxYM){: .btn } diff --git a/docs/extended_api/synchronization_library/pipeline_shared_state/constructor.md b/docs/extended_api/synchronization_library/pipeline_shared_state/constructor.md new file mode 100644 index 0000000000..2f109a28e5 --- /dev/null +++ b/docs/extended_api/synchronization_library/pipeline_shared_state/constructor.md @@ -0,0 +1,40 @@ +--- +nav_exclude: true +--- + +# cuda::pipeline_shared_state\::**pipeline_shared_state** + +```c++ +pipeline_shared_state(); // (1) +pipeline_shared_state(const pipeline_shared_state &) = delete; // (2) +pipeline_shared_state(pipeline_shared_state &&) = delete; // (3) +``` + +1. Constructs the pipeline shared state. +2. Copy constructor is deleted. +3. Move constructor is deleted. + +## Notes + +Static declaration of `pipeline_shared_state` within device code currently emits the following warning: + +``` +warning: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function +``` + +It can be silenced using `#pragma diag_suppress static_var_with_dynamic_init`. + +## Example + +```c++ +#include + +#pragma diag_suppress static_var_with_dynamic_init + +__global__ void example_kernel() +{ + __shared__ cuda::pipeline_shared_state shared_state; +} +``` + +[See it on Godbolt](https://godbolt.org/z/n1zoea){: .btn } diff --git a/docs/extended_api/synchronization_library/pipeline_shared_state/destructor.md b/docs/extended_api/synchronization_library/pipeline_shared_state/destructor.md new file mode 100644 index 0000000000..123537fa71 --- /dev/null +++ b/docs/extended_api/synchronization_library/pipeline_shared_state/destructor.md @@ -0,0 +1,11 @@ +--- +nav_exclude: true +--- + +# cuda::pipeline_shared_state\::**~pipeline_shared_state** + +```c++ +~pipeline_shared_state(); +``` + +Destructs the pipeline shared state. diff --git a/docs/releases.md b/docs/releases.md index 9dbf82dcc5..9bf9b8bde6 100644 --- a/docs/releases.md +++ b/docs/releases.md @@ -1,7 +1,7 @@ --- has_children: true has_toc: true -nav_order: 3 +nav_order: 4 --- # Releases