diff --git a/docs/contributing.md b/docs/contributing.md
index 347dc688ec..04d701406f 100644
--- a/docs/contributing.md
+++ b/docs/contributing.md
@@ -1,7 +1,7 @@
 ---
 has_children: true
 has_toc: true
-nav_order: 4
+nav_order: 5
 ---
 
 # Contributing
diff --git a/docs/extended_api.md b/docs/extended_api.md
new file mode 100644
index 0000000000..4fa0abe74d
--- /dev/null
+++ b/docs/extended_api.md
@@ -0,0 +1,29 @@
+---
+has_children: true
+has_toc: false
+nav_order: 3
+---
+
+# Extended API
+
+## [Headers](./extended_api/headers.md)
+
+### [\<cuda/pipeline>](./extended_api/headers/pipeline.md)
+
+## [Concepts](./extended_api/concepts.md)
+
+### [Group](./extended_api/concepts/group.md)
+
+### [Shape](./extended_api/concepts/shape.md)
+
+## [Synchronization library](./extended_api/synchronization_library.md)
+
+### [pipeline](./extended_api/synchronization_library/pipeline.md)
+
+### [pipeline_shared_state](./extended_api/synchronization_library/pipeline_shared_state.md)
+
+## [Asynchronous operations library](./extended_api/asynchronous_operations_library.md)
+
+### [aligned_size_t](./extended_api/asynchronous_operations_library/aligned_size_t.md)
+
+### [memcpy_async](./extended_api/asynchronous_operations_library/memcpy_async.md)
diff --git a/docs/extended_api/asynchronous_operations_library.md b/docs/extended_api/asynchronous_operations_library.md
new file mode 100644
index 0000000000..e4c31157bf
--- /dev/null
+++ b/docs/extended_api/asynchronous_operations_library.md
@@ -0,0 +1,17 @@
+---
+parent: Extended API
+has_children: true
+has_toc: false
+---
+
+# Asynchronous operations library
+
+The asynchronous operations library provides components for asynchronous data movement.
+
+## Shapes
+
+| [aligned_size_t](./asynchronous_operations_library/aligned_size_t.md) | defines an extent of bytes with a statically defined alignment `(class template)` |
+
+## Asynchronous operations
+
+| [memcpy_async](./asynchronous_operations_library/memcpy_async.md) | asynchronously copies one buffer to another `(function template)` |
diff --git a/docs/extended_api/asynchronous_operations_library/aligned_size_t.md b/docs/extended_api/asynchronous_operations_library/aligned_size_t.md
new file mode 100644
index 0000000000..68b4bf973e
--- /dev/null
+++ b/docs/extended_api/asynchronous_operations_library/aligned_size_t.md
@@ -0,0 +1,61 @@
+---
+grand_parent: Extended API
+parent: Asynchronous operations library
+---
+
+# cuda::**aligned_size_t**
+
+Defined in header [`<cuda/barrier>`](../headers/barrier.md)
+
+Defined in header [`<cuda/pipeline>`](../headers/pipeline.md)
+
+```c++
+template<size_t Alignment>
+struct aligned_size_t;
+```
+
+The class template `cuda::aligned_size_t` is a _shape_ representing an extent of bytes with a statically defined (address and size) alignment.
+
+## Template parameters
+
+| Alignment | the address and size alignment of the byte extent |
+
+## Data members
+
+| [align](./aligned_size_t/align.md) | the alignment of the byte extent |
+| [value](./aligned_size_t/value.md) | the size of the byte extent      |
+
+## Member functions
+
+| [(constructor)](./aligned_size_t/constructor.md) | constructs an _aligned size_                                                      |
+| (destructor) [implicitly declared]               | trivial implicit destructor                                                       |
+| operator= [implicitly declared]                  | trivial implicit copy/move assignment                                             |
+| operator size_t                                  | implicit conversion to [`size_t`](https://en.cppreference.com/w/cpp/types/size_t) |
+
+## Notes
+
+If `value` is not a multiple of `align` the behavior is undefined.
+
+If `Alignment` is not a [valid alignment](https://en.cppreference.com/w/c/language/object#Alignment) the behavior is undefined.
+
+## Example
+
+```c++
+#include <cuda/barrier>
+
+__global__ void example_kernel(void * dst, void * src, size_t size)
+{
+    cuda::barrier<cuda::thread_scope_system> barrier;
+    init(&barrier, 1);
+
+    // Implementation cannot make assumptions about alignment
+    cuda::memcpy_async(dst, src, size, barrier);
+
+    // Implementation can assume that dst, src and size are 16-byte aligned and may optimize accordingly
+    cuda::memcpy_async(dst, src, cuda::aligned_size_t<16>(size), barrier);
+
+    barrier.arrive_and_wait();
+}
+```
+
+[See it on Godbolt](https://godbolt.org/z/v7Ev9E){: .btn }
diff --git a/docs/extended_api/asynchronous_operations_library/aligned_size_t/align.md b/docs/extended_api/asynchronous_operations_library/aligned_size_t/align.md
new file mode 100644
index 0000000000..31057005a0
--- /dev/null
+++ b/docs/extended_api/asynchronous_operations_library/aligned_size_t/align.md
@@ -0,0 +1,11 @@
+---
+nav_exclude: true
+---
+
+# cuda::aligned_size_t\<Alignment>::**align**
+
+```c++
+static constexpr size_t align = Alignment;
+```
+
+Represents the alignment (address and size) of the byte extent.
diff --git a/docs/extended_api/asynchronous_operations_library/aligned_size_t/constructor.md b/docs/extended_api/asynchronous_operations_library/aligned_size_t/constructor.md
new file mode 100644
index 0000000000..a17bb88ffa
--- /dev/null
+++ b/docs/extended_api/asynchronous_operations_library/aligned_size_t/constructor.md
@@ -0,0 +1,15 @@
+---
+nav_exclude: true
+---
+
+# cuda::aligned_size_t\<Alignment>::**aligned_size_t**
+
+```c++
+explicit aligned_size_t(size_t size);
+```
+
+Constructs an `aligned_size_t` _shape_.
+
+## Notes
+
+If `size` is not a multiple of `Alignment` the behavior is undefined.
diff --git a/docs/extended_api/asynchronous_operations_library/aligned_size_t/value.md b/docs/extended_api/asynchronous_operations_library/aligned_size_t/value.md
new file mode 100644
index 0000000000..2faa2a2656
--- /dev/null
+++ b/docs/extended_api/asynchronous_operations_library/aligned_size_t/value.md
@@ -0,0 +1,11 @@
+---
+nav_exclude: true
+---
+
+# cuda::aligned_size_t\<Alignment>::**value**
+
+```c++
+size_t value;
+```
+
+Represents the size of the byte extent.
diff --git a/docs/extended_api/asynchronous_operations_library/memcpy_async.md b/docs/extended_api/asynchronous_operations_library/memcpy_async.md
new file mode 100644
index 0000000000..04c7666f40
--- /dev/null
+++ b/docs/extended_api/asynchronous_operations_library/memcpy_async.md
@@ -0,0 +1,64 @@
+---
+grand_parent: Extended API
+parent: Asynchronous operations library
+---
+
+# cuda::**memcpy_async**
+
+Defined in header [`<cuda/barrier>`](../../api/synchronization_library/barrier.md)
+
+```c++
+template<typename Shape, thread_scope Scope>
+void memcpy_async(void * destination, void const * source, Shape size, barrier<Scope> & barrier);                        // (1)
+
+template<typename Group, typename Shape, thread_scope Scope>
+void memcpy_async(Group const & group, void * destination, void const * source, Shape size, barrier<Scope> & barrier);   // (2)
+```
+
+Defined in header [`<cuda/pipeline>`](../headers/pipeline.md)
+
+```c++
+template<typename Shape, thread_scope Scope>
+void memcpy_async(void * destination, void const * source, Shape size, pipeline<Scope> & pipeline);                      // (3)
+
+template<typename Group, typename Shape, thread_scope Scope>
+void memcpy_async(Group const & group, void * destination, void const * source, Shape size, pipeline<Scope> & pipeline); // (4)
+```
+
+Asynchronously copies `size` bytes from the memory location pointed to by `source` to the memory location pointed to by `destination`.
+Both objects are reinterpreted as arrays of `unsigned char`.
+
+`cuda::memcpy_async` has similar constraints to [`std::memcpy`](https://en.cppreference.com/w/cpp/string/byte/memcpy), namely:
+* If the objects overlap, the behavior is undefined.
+* If either `destination` or `source` is an invalid or null pointer, the behavior is undefined (even if `size` is zero).
+* If the objects are [potentially-overlapping](https://en.cppreference.com/w/cpp/language/object#Subobjects) the behavior is undefined.
+* If the objects are not of [`TriviallyCopyable`](https://en.cppreference.com/w/cpp/named_req/TriviallyCopyable) type the program is ill-formed, no diagnostic required.
+
+If _Shape_ is [`cuda::aligned_size_t`](./aligned_size_t.md), `source` and `destination` are both required to be aligned on [`cuda::aligned_size_t::align`](./aligned_size_t/align.md), else the behavior is undefined.
+
+If `pipeline` is in a _quitted state_ (see [`pipeline::quit`](../synchronization_library/pipeline/quit.md)), the behavior is undefined.
+
+1. Binds the asynchronous copy completion to `barrier` and issues the copy in the current thread.
+2. Binds the asynchronous copy completion to `barrier` and cooperatively issues the copy across all threads in `group`.
+3. Binds the asynchronous copy completion to `pipeline` and issues the copy in the current thread.
+4. Binds the asynchronous copy completion to `pipeline` and cooperatively issues the copy across all threads in `group`.
+
+## Template parameters
+
+| Group | a type satisfying the [_group concept_](../concepts/group.md)                                                                                                                  |
+| Shape | a type satisfying the [_shape concept_](../concepts/shape.md) (see [`size_t`](https://en.cppreference.com/w/c/types/size_t) and [`cuda::aligned_size_t`](./aligned_size_t.md)) |
+
+## Parameters
+
+| group       | the group of threads                                    |
+| destination | pointer to the memory location to copy to               |
+| source      | pointer to the memory location to copy from             |
+| size        | the number of bytes to copy                             |
+| barrier     | the barrier object used to wait on the copy completion  |
+| pipeline    | the pipeline object used to wait on the copy completion |
+
+## Example
+
+```c++
+TODO
+```
diff --git a/docs/extended_api/concepts.md b/docs/extended_api/concepts.md
new file mode 100644
index 0000000000..a5c4c01dca
--- /dev/null
+++ b/docs/extended_api/concepts.md
@@ -0,0 +1,11 @@
+---
+parent: Extended API
+has_children: true
+has_toc: false
+nav_order: 1
+---
+
+# Concepts
+
+| [Group](./concepts/group.md) | defines the requirements of a type that represents a group of cooperating threads                |
+| [Shape](./concepts/shape.md) | defines the requirements of a type that represents a byte extent with a particular memory layout |
diff --git a/docs/extended_api/concepts/group.md b/docs/extended_api/concepts/group.md
new file mode 100644
index 0000000000..ea1df4c1c9
--- /dev/null
+++ b/docs/extended_api/concepts/group.md
@@ -0,0 +1,46 @@
+---
+grand_parent: Extended API
+parent: Concepts
+---
+
+# Group
+
+```c++
+struct Group {
+  static constexpr cuda::thread_scope thread_scope;
+  integral size() const;
+  integral thread_rank() const;
+  void sync() const;
+};
+```
+
+The _Group concept_ defines the requirements of a type that represents a group of cooperating threads.
+
+## Data members
+
+| thread_scope | the scope at which `Group::sync()` synchronizes memory operations and thread execution |
+
+## Member functions
+
+| size        | returns the number of participating threads                                                        |
+| thread_rank | returns a unique value for each participating thread (`0 <= Group::thread_rank() < Group::size()`) |
+| sync        | synchronizes the participating threads                                                             |
+
+## Notes
+
+This concept is defined for documentation purposes but is not materialized in the library.
+
+## Example
+
+```c++
+#include <cuda/atomic>
+
+struct single_thread_group {
+    static constexpr cuda::thread_scope thread_scope = cuda::thread_scope::thread_scope_thread;
+    size_t size() const { return 1; }
+    size_t thread_rank() const { return 0; }
+    void sync() const { }
+};
+```
+
+[See it on Godbolt](https://godbolt.org/z/453r3s){: .btn }
diff --git a/docs/extended_api/concepts/shape.md b/docs/extended_api/concepts/shape.md
new file mode 100644
index 0000000000..af7b825a41
--- /dev/null
+++ b/docs/extended_api/concepts/shape.md
@@ -0,0 +1,37 @@
+---
+grand_parent: Extended API
+parent: Concepts
+---
+
+# Shape
+
+```c++
+struct Shape {
+    operator size_t() const;
+};
+```
+
+The _Shape concept_ defines the requirements of a type that represents a byte extent with a particular memory layout.
+
+## Member functions
+
+| operator size_t | implicit conversion to [`size_t`](https://en.cppreference.com/w/cpp/types/size_t) |
+
+## Notes
+
+This concept is defined for documentation purposes but is not materialized in the library.
+
+## Example
+
+```c++
+// A size that carries an alignment hint
+template <size_t Align>
+struct aligned_size {
+    static constexpr size_t align = Align;
+    size_t size;
+    aligned_size(size_t s) : size(s) {}
+    operator size_t() const { return size; }
+};
+```
+
+[See it on Godbolt](https://godbolt.org/z/hbajKo){: .btn }
diff --git a/docs/extended_api/headers.md b/docs/extended_api/headers.md
new file mode 100644
index 0000000000..ab9b29a799
--- /dev/null
+++ b/docs/extended_api/headers.md
@@ -0,0 +1,12 @@
+---
+parent: Extended API
+has_children: true
+has_toc: false
+nav_order: 0
+---
+
+# Headers
+
+## Synchronization library
+
+| [\<cuda/pipeline\>](./headers/pipeline.md) | [Pipelines](./synchronization_library/pipeline.md) and corresponding [memcpy_async](./asynchronous_operations_library/memcpy_async.md) overloads |
diff --git a/docs/extended_api/headers/pipeline.md b/docs/extended_api/headers/pipeline.md
new file mode 100644
index 0000000000..fb9e4152ba
--- /dev/null
+++ b/docs/extended_api/headers/pipeline.md
@@ -0,0 +1,119 @@
+---
+grand_parent: Extended API
+parent: Headers
+---
+
+# \<cuda/**pipeline**>
+
+This header is part of the [synchronization library](../synchronization_library.md).
+
+## Classes
+
+| [aligned_size_t](../asynchronous_operations_library/aligned_size_t.md)       | defines an extent of bytes with a statically defined alignment `(class template)`  |
+| [pipeline](../synchronization_library/pipeline.md)                           | _pipeline_ class template `(class template)`                                       |
+| [pipeline_shared_state](../synchronization_library/pipeline_shared_state.md) | _pipeline shared state_ for inter-thread coordination `(class template)`           |
+| [pipeline_role](../synchronization_library/pipeline_role.md)                 | defines producer/consumer role for a thread participating in a _pipeline_ `(enum)` |
+
+## Functions
+
+| [make_pipeline](../synchronization_library/make_pipeline.md)                               | creates a _pipeline_ object `(function template)`                                                                     |
+| [pipeline_consumer_wait_prior](../synchronization_library/pipeline_consumer_wait_prior.md) | blocks the current thread until all operations committed up to a prior _pipeline stage_ complete `(function template)`|
+| [pipeline_producer_commit](../synchronization_library/pipeline_producer_commit.md)         | binds operations previously issued by the current thread to a _barrier_ `(function template)`                         |
+| [memcpy_async](../asynchronous_operations_library/memcpy_async.md)                         | asynchronously copies one buffer to another `(function template)`                                                     |
+
+## Synopsis
+
+```c++
+namespace cuda {
+    template<size_t Alignment>
+    struct aligned_size_t;
+
+    enum class pipeline_role : /* unspecified */ {
+        producer,
+        consumer
+    };
+
+    template<thread_scope Scope, uint8_t StagesCount>
+    class pipeline_shared_state;
+
+    template<thread_scope Scope>
+    class pipeline;
+
+    pipeline<thread_scope_thread> make_pipeline();
+
+    template<class Group, thread_scope Scope, uint8_t StagesCount>
+    pipeline<Scope> make_pipeline(const Group & group, pipeline_shared_state<Scope, StagesCount> * shared_state);
+
+    template<class Group, thread_scope Scope, uint8_t StagesCount>
+    pipeline<Scope> make_pipeline(const Group & group, pipeline_shared_state<Scope, StagesCount> * shared_state, size_t producer_count);
+
+    template<class Group, thread_scope Scope, uint8_t StagesCount>
+    pipeline<Scope> make_pipeline(const Group & group, pipeline_shared_state<Scope, StagesCount> * shared_state, pipeline_role role);
+
+    template<uint8_t Prior>
+    void pipeline_consumer_wait_prior(pipeline<thread_scope_thread> & pipeline);
+
+    template<thread_scope Scope>
+    void pipeline_producer_commit(pipeline<thread_scope_thread> & pipeline, barrier<Scope> & barrier);
+
+    template<typename Shape, thread_scope Scope>
+    void memcpy_async(void * destination, void const * source, Shape size, pipeline<Scope> & pipeline);
+
+    template<typename Group, typename Shape, thread_scope Scope>
+    void memcpy_async(Group const & group, void * destination, void const * source, Shape size, pipeline<Scope> & pipeline);
+}
+```
+
+## Class template `cuda::aligned_size_t`
+
+```c++
+template<size_t Alignment>
+struct aligned_size_t {
+    static constexpr size_t align = Alignment;
+    size_t value;
+    explicit aligned_size_t(size_t size);
+    operator size_t() const;
+};
+```
+
+
+## Class template `cuda::pipeline_shared_state`
+
+```c++
+namespace cuda {
+    template<thread_scope Scope, uint8_t StagesCount>
+    class pipeline_shared_state {
+        pipeline_shared_state() = default;
+        pipeline_shared_state(const pipeline_shared_state &) = delete;
+        pipeline_shared_state(pipeline_shared_state &&) = delete;
+        pipeline_shared_state & operator=(pipeline_shared_state &&) = delete;
+        pipeline_shared_state & operator=(const pipeline_shared_state &) =  delete;
+    };
+}
+```
+
+## Class template `cuda::pipeline`
+
+```c++
+namespace cuda {
+    template<thread_scope Scope>
+    class pipeline {
+        pipeline(pipeline &&) = default;
+        pipeline(const pipeline &) = delete;
+        pipeline & operator=(pipeline &&) = delete;
+        pipeline & operator=(const pipeline &) = delete;
+        ~pipeline();
+
+        void producer_acquire();
+        void producer_commit();
+        void consumer_wait();
+        template<class Rep, class Period>
+        bool consumer_wait_for(const std::chrono::duration<Rep, Period> & duration);
+        template<class Clock, class Duration>
+        bool consumer_wait_until(const std::chrono::time_point<Clock, Duration> & time_point);
+        void consumer_release();
+
+        bool quit();
+    };
+}
+```
diff --git a/docs/extended_api/synchronization_library.md b/docs/extended_api/synchronization_library.md
new file mode 100644
index 0000000000..c3fa70926b
--- /dev/null
+++ b/docs/extended_api/synchronization_library.md
@@ -0,0 +1,24 @@
+---
+parent: Extended API
+has_children: true
+has_toc: false
+---
+
+# Synchronization library
+
+The synchronization library provides components for coordinating threads and asynchronous operations.
+
+## Synchronization types
+
+| [pipeline](./synchronization_library/pipeline.md)                           | _pipeline_ class template `(class template)`                                       |
+| [pipeline_shared_state](./synchronization_library/pipeline_shared_state.md) | _pipeline shared state_ for inter-thread coordination `(class template)`           |
+| [pipeline_role](./synchronization_library/pipeline_role.md)                 | defines producer/consumer role for a thread participating in a _pipeline_ `(enum)` |
+
+## Synchronization type factories
+
+| [make_pipeline](./synchronization_library/make_pipeline.md) | creates a _pipeline_ object `(function template)` |
+
+## Operations on synchronization types
+
+| [pipeline_consumer_wait_prior](./synchronization_library/pipeline_consumer_wait_prior.md) | blocks the current thread until all operations committed up to a prior _pipeline stage_ complete `(function template)`|
+| [pipeline_producer_commit](./synchronization_library/pipeline_producer_commit.md)         | binds operations previously issued by the current thread to a _barrier_ `(function template)`                         |
diff --git a/docs/extended_api/synchronization_library/make_pipeline.md b/docs/extended_api/synchronization_library/make_pipeline.md
new file mode 100644
index 0000000000..e5f1475750
--- /dev/null
+++ b/docs/extended_api/synchronization_library/make_pipeline.md
@@ -0,0 +1,85 @@
+---
+grand_parent: Extended API
+parent: Synchronization library
+---
+
+# cuda::**make_pipeline**
+
+Defined in header [`<cuda/pipeline>`](../headers/pipeline.md)
+
+```c++
+pipeline<thread_scope_thread> make_pipeline();                                                                                       // (1)
+
+template<class Group, thread_scope Scope, uint8_t StagesCount>
+pipeline<Scope> make_pipeline(const Group & group, pipeline_shared_state<Scope, StagesCount> * shared_state);                        // (2)
+
+template<class Group, thread_scope Scope, uint8_t StagesCount>
+pipeline<Scope> make_pipeline(const Group & group, pipeline_shared_state<Scope, StagesCount> * shared_state, size_t producer_count); // (3)
+
+template<class Group, thread_scope Scope, uint8_t StagesCount>
+pipeline<Scope> make_pipeline(const Group & group, pipeline_shared_state<Scope, StagesCount> * shared_state, pipeline_role role);    // (4)
+```
+
+1. Creates a _unified pipeline_ such that the calling thread is the only participating thread and performs both producer and consumer actions.
+2. Creates a _unified pipeline_ such that all the threads in `group` are performing both producer and consumer actions.
+3. Creates a _partitioned pipeline_ such that `producer_count` threads in `group` are performing producer actions while the others
+   are performing consumer actions.
+4. Creates a _partitioned pipeline_ where each thread's role is explicitly specified.
+
+All threads in `group` acquire collective ownership of the `shared_state` storage.
+
+`make_pipeline` must be invoked by every thread in `group` such that `group::sync` may be invoked.
+
+`shared_state` and `producer_count` must be uniform across all threads in `group`, else the behavior is undefined.
+
+`producer_count` must be strictly less than `group::size`, else the behavior is undefined.
+
+## Template parameters
+
+| Group | a type satisfying the [_Group concept_](../concepts/group.md) |
+
+## Parameters
+
+| group          | the group of threads                                                                                                                                |
+| shared_state   | a pointer to an object of type [`cuda::pipeline_shared_state<Scope>`](./pipeline_shared_state.md) with `Scope` including all the threads in `group` |
+| producer_count | the number of _producer threads_ in the pipeline                                                                                                    |
+| role           | the role of the current thread in the pipeline                                                                                                      |
+
+## Return value
+
+A `cuda::pipeline` object.
+
+## Example
+
+```c++
+#include <cuda/pipeline>
+#include <cooperative_groups.h>
+
+// Disables `pipeline_shared_state` initialization warning
+#pragma diag_suppress static_var_with_dynamic_init
+
+__global__ void example_kernel()
+{
+    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> pss_1;
+    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> pss_2;
+    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> pss_3;
+
+    auto group = cooperative_groups::this_thread_block();
+
+    // Create a thread scoped pipeline
+    cuda::pipeline<cuda::thread_scope_thread> p_0 = cuda::make_pipeline();
+
+    // Create a unified block-scoped pipeline
+    cuda::pipeline<cuda::thread_scope_block> p_1 = cuda::make_pipeline(group, &pss_1);
+
+    // Create a partitioned block-scoped pipeline where half the threads are producers
+    size_t producer_count = group.size() / 2;
+    cuda::pipeline<cuda::thread_scope_block> p_2 = cuda::make_pipeline(group, &pss_2, producer_count);
+
+    // Create a partitioned block-scoped pipeline where all threads with an odd thread_rank are producers
+    auto thread_role = (group.thread_rank() % 2) ? cuda::pipeline_role::producer : cuda::pipeline_role::consumer;
+    cuda::pipeline<cuda::thread_scope_block> p_3 = cuda::make_pipeline(group, &pss_3, thread_role);
+}
+```
+
+[See it on Godbolt](https://godbolt.org/z/Y1zv5G){: .btn }
diff --git a/docs/extended_api/synchronization_library/pipeline.md b/docs/extended_api/synchronization_library/pipeline.md
new file mode 100644
index 0000000000..88bb99df1e
--- /dev/null
+++ b/docs/extended_api/synchronization_library/pipeline.md
@@ -0,0 +1,100 @@
+---
+grand_parent: Extended API
+parent: Synchronization library
+---
+
+# cuda::**pipeline**
+
+Defined in header [`<cuda/pipeline>`](../headers/pipeline.md)
+
+```c++
+template<cuda::thread_scope Scope>
+class pipeline;
+```
+
+The class template `cuda::pipeline` provides a coordination mechanism that allows multiple operations to be pipelined in a sequence of stages.
+
+A thread interacts with a _pipeline stage_ using the following pattern:
+1. Acquire the pipeline stage
+2. Commit some operations to the stage
+3. Wait for the previously committed operations to complete
+4. Release the pipeline stage
+
+For thread scopes other than `thread_scope_thread`, a [`pipeline_shared_state`](./pipeline_shared_state.md) is required to coordinate the participating threads.
+
+_Pipelines_ can be either _unified_ or _partitioned_.
+In a _unified pipeline_, all the participating threads are both producers and consumers.
+In a _partitioned pipeline_, each participating thread is either a producer or a consumer.
+
+## Template parameters
+
+### Scope
+
+A [`cuda::thread_scope`](../../api/synchronization_library/thread_scopes.md) denoting a scope including all the threads participating in the _pipeline_.
+
+## Member functions
+
+| (constructor) [deleted]                            | `pipeline` is not constructible                                                                                                                  |
+| [(destructor)](./pipeline/destructor.md)           | destroys the `pipeline`                                                                                                                          |
+| operator= [deleted]                                | `pipeline` is not assignable                                                                                                                     |
+| [producer_acquire](./pipeline/producer_acquire.md) | blocks the current thread until the next _pipeline stage_ is available                                                                           |
+| [producer_commit](./pipeline/producer_commit.md)   | commits operations previously issued by the current thread to the current _pipeline stage_                                                       |
+| [consumer_wait](./pipeline/consumer_wait.md)       | blocks the current thread until all operations committed to the current _pipeline stage_ complete                                                |
+| [consumer_wait_for](./pipeline/consumer_wait.md)   | blocks the current thread until all operations committed to the current _pipeline stage_ complete or the specified timeout duration elapses      |
+| [consumer_wait_until](./pipeline/consumer_wait.md) | blocks the current thread until all operations committed to the current _pipeline stage_ complete or until the specified time point is reached   |
+| [consumer_release](./pipeline/consumer_release.md) | releases the current _pipeline stage_                                                                                                            |
+| [quit](./pipeline/quit.md)                         | quits current thread's participation in the _pipeline_                                                                                           |
+
+## Notes
+
+A thread's role cannot change during the lifetime of the pipeline object.
+
+## Example
+
+```c++
+#include <cuda/pipeline>
+#include <cooperative_groups.h>
+
+// Disables `pipeline_shared_state` initialization warning
+#pragma diag_suppress static_var_with_dynamic_init
+
+template <typename T>
+__device__ void compute(T * ptr);
+
+template <typename T>
+__global__ void example_kernel(T * global1, T * global2, size_t subset_count)
+{
+    extern __shared__ T s[];
+    auto group = cooperative_groups::this_thread_block();
+    T * shared[2] = { s, s + 2 * group.size() };
+
+    // Create a CUDA pipeline
+    constexpr unsigned stages_count = 2;
+    __shared__ cuda::pipeline_shared_state<cuda::thread_scope::thread_scope_block, stages_count> shared_state;
+    auto pipeline = cuda::make_pipeline(group, &shared_state);
+
+    // Prime the pipeline
+    pipeline.producer_acquire();
+    cuda::memcpy_async(group, shared[0], &global1[0], sizeof(T) * group.size(), pipeline);
+    cuda::memcpy_async(group, shared[0] + group.size(), &global2[0], sizeof(T) * group.size(), pipeline);
+    pipeline.producer_commit();
+
+    // Pipelined copy/compute
+    for (size_t subset = 1; subset < subset_count; ++subset) {
+        pipeline.producer_acquire();
+        cuda::memcpy_async(group, shared[subset % 2], &global1[subset * group.size()], sizeof(T) * group.size(), pipeline);
+        cuda::memcpy_async(group, shared[subset % 2] + group.size(), &global2[subset * group.size()], sizeof(T) * group.size(), pipeline);
+        pipeline.producer_commit();
+        pipeline.consumer_wait();
+        compute(shared[(subset - 1) % 2]);
+        pipeline.consumer_release();
+    }
+
+    // Drain the pipeline
+    pipeline.consumer_wait();
+    compute(shared[(subset_count - 1) % 2]);
+    pipeline.consumer_release();
+}
+```
+
+[See it on Godbolt](https://godbolt.org/z/javfx9){: .btn }
diff --git a/docs/extended_api/synchronization_library/pipeline/consumer_release.md b/docs/extended_api/synchronization_library/pipeline/consumer_release.md
new file mode 100644
index 0000000000..cc48efd11f
--- /dev/null
+++ b/docs/extended_api/synchronization_library/pipeline/consumer_release.md
@@ -0,0 +1,17 @@
+---
+nav_exclude: true
+---
+
+# cuda::pipeline\<Scope>::**consumer_release**
+
+```c++
+void consumer_release();
+```
+
+Releases the current _pipeline stage_.
+
+## Notes
+
+If this method is called from a _producer thread_ the behavior is undefined.
+
+If the pipeline is in a _quitted state_ (see [`pipeline::quit`](./quit.md)), the behavior is undefined.
diff --git a/docs/extended_api/synchronization_library/pipeline/consumer_wait.md b/docs/extended_api/synchronization_library/pipeline/consumer_wait.md
new file mode 100644
index 0000000000..932227a0fa
--- /dev/null
+++ b/docs/extended_api/synchronization_library/pipeline/consumer_wait.md
@@ -0,0 +1,34 @@
+---
+nav_exclude: true
+---
+
+# cuda::pipeline\<Scope>::**consumer_wait**, cuda::pipeline\<Scope>::**consumer_wait_for**, cuda::pipeline\<Scope>::**consumer_wait_until**
+
+```c++
+void consumer_wait();                                                                        // (1)
+
+template<class Rep, class Period>
+bool consumer_wait_for(const cuda::std::chrono::duration<Rep, Period> & duration);           // (2)
+
+template<class Clock, class Duration>
+bool consumer_wait_until(const cuda::std::chrono::time_point<Clock, Duration> & time_point); // (3)
+```
+
+1. blocks the current thread until all operations committed to the current _pipeline stage_ complete
+2. blocks the current thread until all operations committed to the current _pipeline stage_ complete or the specified timeout duration elapses
+3. blocks the current thread until all operations committed to the current _pipeline stage_ complete or the specified time point is reached
+
+## Parameters
+
+| duration   | an object of type `cuda::std::chrono::duration` representing the maximum time to spend waiting |
+| time_point | an object of type `cuda::std::chrono::time_point` representing the time when to stop waiting   |
+
+## Return value
+
+`false` if the _wait_ timed out, `true` otherwise.
+
+## Notes
+
+If this method is called from a _producer thread_, the behavior is undefined.
+
+If the pipeline is in a _quitted state_ (see [`pipeline::quit`](./quit.md)), the behavior is undefined.
diff --git a/docs/extended_api/synchronization_library/pipeline/destructor.md b/docs/extended_api/synchronization_library/pipeline/destructor.md
new file mode 100644
index 0000000000..3e8cc3c92c
--- /dev/null
+++ b/docs/extended_api/synchronization_library/pipeline/destructor.md
@@ -0,0 +1,15 @@
+---
+nav_exclude: true
+---
+
+# cuda::pipeline\<Scope>::**~pipeline**
+
+```c++
+~pipeline();
+```
+
+Destructs the pipeline.
+
+## Notes
+
+Calls [`cuda::pipeline<Scope>::quit`](./quit.md) if it was not called by the current thread.
diff --git a/docs/extended_api/synchronization_library/pipeline/producer_acquire.md b/docs/extended_api/synchronization_library/pipeline/producer_acquire.md
new file mode 100644
index 0000000000..a04385f63c
--- /dev/null
+++ b/docs/extended_api/synchronization_library/pipeline/producer_acquire.md
@@ -0,0 +1,17 @@
+---
+nav_exclude: true
+---
+
+# cuda::pipeline\<Scope>::**producer_acquire**
+
+```c++
+void producer_acquire();
+```
+
+Blocks the current thread until the next _pipeline stage_ is available.
+
+## Notes
+
+If this method is called from a _consumer thread_, the behavior is undefined.
+
+If the pipeline is in a _quitted state_ (see [`pipeline::quit`](./quit.md)), the behavior is undefined.
diff --git a/docs/extended_api/synchronization_library/pipeline/producer_commit.md b/docs/extended_api/synchronization_library/pipeline/producer_commit.md
new file mode 100644
index 0000000000..1295fc8de5
--- /dev/null
+++ b/docs/extended_api/synchronization_library/pipeline/producer_commit.md
@@ -0,0 +1,17 @@
+---
+nav_exclude: true
+---
+
+# cuda::pipeline\<Scope>::**producer_commit**
+
+```c++
+void producer_commit();
+```
+
+Commits operations previously issued by the current thread to the current _pipeline stage_.
+
+## Notes
+
+If this method is called from a _consumer thread_, the behavior is undefined.
+
+If the pipeline is in a _quitted state_ (see [`pipeline::quit`](./quit.md)), the behavior is undefined.
diff --git a/docs/extended_api/synchronization_library/pipeline/quit.md b/docs/extended_api/synchronization_library/pipeline/quit.md
new file mode 100644
index 0000000000..dd804ff7fb
--- /dev/null
+++ b/docs/extended_api/synchronization_library/pipeline/quit.md
@@ -0,0 +1,19 @@
+---
+nav_exclude: true
+---
+
+# cuda::pipeline\<Scope>::**quit**
+
+```c++
+bool quit();
+```
+
+Quits the current thread's participation in the collective ownership of the corresponding _shared state_ ([`cuda::pipeline_shared_state`](../pipeline_shared_state.md)). Ownership of the _shared state_ is released by the last invoking thread.
+
+## Return value
+
+`true` if ownership of the _shared state_ was released, otherwise `false`.
+
+## Notes
+
+The behavior is undefined if any operation other than [`~pipeline`](./destructor.md) is issued by the current thread after quitting.
diff --git a/docs/extended_api/synchronization_library/pipeline_consumer_wait_prior.md b/docs/extended_api/synchronization_library/pipeline_consumer_wait_prior.md
new file mode 100644
index 0000000000..5284aa205a
--- /dev/null
+++ b/docs/extended_api/synchronization_library/pipeline_consumer_wait_prior.md
@@ -0,0 +1,57 @@
+---
+grand_parent: Extended API
+parent: Synchronization library
+---
+
+# cuda::**pipeline_consumer_wait_prior**
+
+Defined in header [`<cuda/pipeline>`](../headers/pipeline.md)
+
+```c++
+template<uint8_t Prior>
+void pipeline_consumer_wait_prior(pipeline<thread_scope_thread> & pipeline);
+```
+
+Let *Stage* be the pipeline stage `Prior` stages before the current one (counting the current one). Blocks the current
+thread until all operations committed to _pipeline stages_ up to *Stage* complete. All stages up to *Stage* (exclusive)
+are implicitly released.
+
+## Template parameters
+
+| Prior | The index of the pipeline stage *Stage* (_see above_) counting up from the current one. The index of the current stage is `0`. |
+
+## Parameters
+
+| pipeline | The thread-scoped `cuda::pipeline` object to wait on |
+
+## Notes
+
+If the pipeline is in a _quitted state_ (see [`pipeline::quit`](./pipeline/quit.md)), the behavior is undefined.
+
+## Example
+
+```c++
+#include <cuda/pipeline>
+
+__global__ void example_kernel(uint64_t * global, size_t element_count)
+{
+    extern __shared__ uint64_t shared[];
+
+    cuda::pipeline<cuda::thread_scope_thread> pipe = cuda::make_pipeline();
+    for (size_t i = 0; i < element_count; ++i) {
+        pipe.producer_acquire();
+        cuda::memcpy_async(shared + i, global + i, sizeof(*global), pipe);
+        pipe.producer_commit();
+    }
+
+    // Wait for operations committed in all stages but the last one
+    cuda::pipeline_consumer_wait_prior<1>(pipe);
+    pipe.consumer_release();
+
+    // Wait for operations committed in all stages
+    cuda::pipeline_consumer_wait_prior<0>(pipe);
+    pipe.consumer_release();
+}
+```
+
+[See it on Godbolt](https://godbolt.org/z/j83v3G){: .btn }
diff --git a/docs/extended_api/synchronization_library/pipeline_producer_commit.md b/docs/extended_api/synchronization_library/pipeline_producer_commit.md
new file mode 100644
index 0000000000..b0fb640687
--- /dev/null
+++ b/docs/extended_api/synchronization_library/pipeline_producer_commit.md
@@ -0,0 +1,52 @@
+---
+grand_parent: Extended API
+parent: Synchronization library
+---
+
+# cuda::**pipeline_producer_commit**
+
+Defined in header [`<cuda/pipeline>`](../headers/pipeline.md)
+
+```c++
+template<thread_scope Scope>
+void pipeline_producer_commit(pipeline<thread_scope_thread> & pipeline, barrier<Scope> & barrier);
+```
+
+Binds operations previously issued by the current thread to the named `barrier` such that a `barrier::arrive` is performed on completion. The bind operation implicitly increments the barrier's current phase to account for the subsequent `barrier::arrive`, resulting in a net change of 0.
+
+## Parameters
+
+| pipeline | the thread-scoped `cuda::pipeline` object to commit to |
+| barrier  | the barrier to arrive on                               |
+
+## Notes
+
+If the pipeline is in a _quitted state_ (see [`pipeline::quit`](./pipeline/quit.md)), the behavior is undefined.
+
+## Example
+
+```c++
+#include <cuda/pipeline>
+
+// Disables `barrier` initialization warning
+#pragma diag_suppress static_var_with_dynamic_init
+
+__global__ void example_kernel(uint64_t * global, size_t element_count)
+{
+    extern __shared__ uint64_t shared[];
+    __shared__ cuda::barrier<cuda::thread_scope_block> barrier;
+
+    init(&barrier, 1);
+    cuda::pipeline<cuda::thread_scope_thread> pipe = cuda::make_pipeline();
+
+    pipe.producer_acquire();
+    for (size_t i = 0; i < element_count; ++i) {
+        cuda::memcpy_async(shared + i, global + i, sizeof(*global), pipe);
+    }
+    cuda::pipeline_producer_commit(pipe, barrier);
+    barrier.arrive_and_wait();
+    pipe.consumer_release();
+}
+```
+
+[See it on Godbolt](https://godbolt.org/z/x5n8zY){: .btn }
diff --git a/docs/extended_api/synchronization_library/pipeline_role.md b/docs/extended_api/synchronization_library/pipeline_role.md
new file mode 100644
index 0000000000..cb2dca0391
--- /dev/null
+++ b/docs/extended_api/synchronization_library/pipeline_role.md
@@ -0,0 +1,25 @@
+---
+grand_parent: Extended API
+parent: Synchronization library
+---
+
+# cuda::**pipeline_role**
+
+Defined in header [`<cuda/pipeline>`](../headers/pipeline.md)
+
+```c++
+enum class pipeline_role : /* unspecified */ {
+    producer,
+    consumer
+};
+```
+`cuda::pipeline_role` specifies the role of a particular thread in a partitioned producer/consumer pipeline.
+
+## Constants
+
+| producer | a producer thread generates data (e.g. by issuing [`memcpy_async`](../asynchronous_operations_library/memcpy_async.md) operations)                                  |
+| consumer | a consumer thread consumes data (e.g. by waiting for previously issued [`memcpy_async`](../asynchronous_operations_library/memcpy_async.md) operations to complete) |
+
+## Example
+
+See [cuda::make_pipeline](./make_pipeline.md#example).
diff --git a/docs/extended_api/synchronization_library/pipeline_shared_state.md b/docs/extended_api/synchronization_library/pipeline_shared_state.md
new file mode 100644
index 0000000000..9d9a408a7f
--- /dev/null
+++ b/docs/extended_api/synchronization_library/pipeline_shared_state.md
@@ -0,0 +1,52 @@
+---
+grand_parent: Extended API
+parent: Synchronization library
+---
+
+# cuda::**pipeline_shared_state**
+
+Defined in header [`<cuda/pipeline>`](../headers/pipeline.md)
+
+```c++
+template<cuda::thread_scope Scope, uint8_t StagesCount>
+class pipeline_shared_state;
+```
+
+The class template `cuda::pipeline_shared_state` is a storage type used to coordinate the threads participating in a `cuda::pipeline`.
+
+## Template parameters
+
+| Scope       | A [`cuda::thread_scope`](../../api/synchronization_library/thread_scopes.md) denoting a scope including all the threads participating in the `cuda::pipeline`. `Scope` cannot be `thread_scope_thread`.|
+| StagesCount | The number of stages for the _pipeline_.                                                                                                                                                               |
+
+## Member functions
+
+| [(constructor)](./pipeline_shared_state/constructor.md) | constructs a `pipeline_shared_state`      |
+| [(destructor)](./pipeline_shared_state/destructor.md)   | destroys the `pipeline_shared_state`      |
+| operator= [deleted]                                     | `pipeline_shared_state` is not assignable |
+
+## Example
+
+```c++
+#include <cuda/pipeline>
+
+// Disables `pipeline_shared_state` initialization warning
+#pragma diag_suppress static_var_with_dynamic_init
+
+__global__ void example_kernel(char * device_buffer, char * sysmem_buffer)
+{
+    // Allocate a 2 stage block scoped shared state in shared memory
+    __shared__ cuda::pipeline_shared_state<cuda::thread_scope::thread_scope_block, 2> pss_1;
+
+    // Allocate a 2 stage block scoped shared state in device memory
+    auto * pss_2 = new cuda::pipeline_shared_state<cuda::thread_scope::thread_scope_block, 2>;
+
+    // Construct a 2 stage device scoped shared state in device memory
+    auto * pss_3 = new(device_buffer) cuda::pipeline_shared_state<cuda::thread_scope::thread_scope_device, 2>;
+
+    // Construct a 2 stage system scoped shared state in system memory
+    auto * pss_4 = new(sysmem_buffer) cuda::pipeline_shared_state<cuda::thread_scope::thread_scope_system, 2>;
+}
+```
+
+[See it on Godbolt](https://godbolt.org/z/xMMxYM){: .btn }
diff --git a/docs/extended_api/synchronization_library/pipeline_shared_state/constructor.md b/docs/extended_api/synchronization_library/pipeline_shared_state/constructor.md
new file mode 100644
index 0000000000..2f109a28e5
--- /dev/null
+++ b/docs/extended_api/synchronization_library/pipeline_shared_state/constructor.md
@@ -0,0 +1,40 @@
+---
+nav_exclude: true
+---
+
+# cuda::pipeline_shared_state\<Scope, StagesCount>::**pipeline_shared_state**
+
+```c++
+pipeline_shared_state();                                       // (1)
+pipeline_shared_state(const pipeline_shared_state &) = delete; // (2)
+pipeline_shared_state(pipeline_shared_state &&) = delete;      // (3)
+```
+
+1. Constructs the pipeline shared state.
+2. Copy constructor is deleted.
+3. Move constructor is deleted.
+
+## Notes
+
+Static declaration of `pipeline_shared_state` within device code currently emits the following warning:
+
+```
+warning: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
+```
+
+It can be silenced using `#pragma diag_suppress static_var_with_dynamic_init`.
+
+## Example
+
+```c++
+#include <cuda/pipeline>
+
+#pragma diag_suppress static_var_with_dynamic_init
+
+__global__ void example_kernel()
+{
+    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> shared_state;
+}
+```
+
+[See it on Godbolt](https://godbolt.org/z/n1zoea){: .btn }
diff --git a/docs/extended_api/synchronization_library/pipeline_shared_state/destructor.md b/docs/extended_api/synchronization_library/pipeline_shared_state/destructor.md
new file mode 100644
index 0000000000..123537fa71
--- /dev/null
+++ b/docs/extended_api/synchronization_library/pipeline_shared_state/destructor.md
@@ -0,0 +1,11 @@
+---
+nav_exclude: true
+---
+
+# cuda::pipeline_shared_state\<Scope, StagesCount>::**~pipeline_shared_state**
+
+```c++
+~pipeline_shared_state();
+```
+
+Destructs the pipeline shared state.
diff --git a/docs/releases.md b/docs/releases.md
index 9dbf82dcc5..9bf9b8bde6 100644
--- a/docs/releases.md
+++ b/docs/releases.md
@@ -1,7 +1,7 @@
 ---
 has_children: true
 has_toc: true
-nav_order: 3
+nav_order: 4
 ---
 
 # Releases