NVIDIA · c0riolis · Nov 17, 2020 · Oct 21, 2020
diff --git a/docs/contributing.md b/docs/contributing.md
@@ -1,7 +1,7 @@
 ---
 has_children: true
 has_toc: true
-nav_order: 4
+nav_order: 5
 ---
 
 # Contributing

diff --git a/docs/extended_api.md b/docs/extended_api.md
@@ -0,0 +1,29 @@
+---
+has_children: true
+has_toc: false
+nav_order: 3
+---
+
+# Extended API
+
+## [Headers](./extended_api/headers.md)
+
+### [\<cuda/pipeline>](./extended_api/headers/pipeline.md)
+
+## [Concepts](./extended_api/concepts.md)
+
+### [Group](./extended_api/concepts/group.md)
+
+### [Shape](./extended_api/concepts/shape.md)
+
+## [Synchronization library](./extended_api/synchronization_library.md)
+
+### [pipeline](./extended_api/synchronization_library/pipeline.md)
+
+### [pipeline_shared_state](./extended_api/synchronization_library/pipeline_shared_state.md)
+
+## [Asynchronous operations library](./extended_api/asynchronous_operations_library.md)
+
+### [aligned_size_t](./extended_api/asynchronous_operations_library/aligned_size_t.md)
+
+### [memcpy_async](./extended_api/asynchronous_operations_library/memcpy_async.md)
diff --git a/docs/extended_api/asynchronous_operations_library.md b/docs/extended_api/asynchronous_operations_library.md
@@ -0,0 +1,17 @@
+---
+parent: Extended API
+has_children: true
+has_toc: false
+---
+
+# Asynchronous operations library
+
+The asynchronous operations library provides components for asynchronous data movement.
+
+## Shapes
+
+| [aligned_size_t](./asynchronous_operations_library/aligned_size_t.md) | defines an extent of bytes with a statically defined alignment `(class template)` |
+
+## Asynchronous operations
+
+| [memcpy_async](./asynchronous_operations_library/memcpy_async.md) | asynchronously copies one buffer to another `(function template)` |
diff --git a/docs/extended_api/asynchronous_operations_library/aligned_size_t.md b/docs/extended_api/asynchronous_operations_library/aligned_size_t.md
@@ -0,0 +1,61 @@
+---
+grand_parent: Extended API
+parent: Asynchronous operations library
+---
+
+# cuda::**aligned_size_t**
+
+Defined in header [`<cuda/barrier>`](../headers/barrier.md)
+
+Defined in header [`<cuda/pipeline>`](../headers/pipeline.md)
+
+```c++
+template<size_t Alignment>
+struct aligned_size_t;
+```
+
+The class template `cuda::aligned_size_t` is a _shape_ representing an extent of bytes with a statically defined (address and size) alignment.
+
+## Template parameters
+
+| Alignment | the address & size alignement of the byte extent |
+
+## Data members
+
+| [align](./aligned_size_t/align.md) | the alignment of the byte extent |
+| [value](./aligned_size_t/value.md) | the size of the byte extent      |
+
+## Member functions
+
+| [(constructor)](./aligned_size_t/constructor.md) | constructs an _aligned size_                                                      |
+| (destructor) [implicitly declared]               | trivial implicit destructor                                                       |
+| operator= [implicitly declared]                  | trivial implicit copy/move assignment                                             |
+| operator size_t                                  | implicit conversion to [`size_t`](https://en.cppreference.com/w/cpp/types/size_t) |
+
+## Notes
+
+If `value` is not a multiple of `align` the behavior is undefined.
+
+If `Alignment` is not a [valid alignment](https://en.cppreference.com/w/c/language/object#Alignment) the behavior is undefined.
+
+## Example
+
+```c++
+#include <cuda/barrier>
+
+__global__ void example_kernel(void * dst, void * src, size_t size)
+{
+    cuda::barrier<cuda::thread_scope_system> barrier;
+    init(&barrier, 1);
+
+    // Implementation cannot make assumptions about alignment
+    cuda::memcpy_async(dst, src, size, barrier);
+
+    // Implementation can assume that dst, src and size are 16-bytes aligned and may optimize accordingly
+    cuda::memcpy_async(dst, src, cuda::aligned_size_t<16>(size), barrier);
+
+    barrier.arrive_and_wait();
+}
+```
+
+[See it on Godbolt](https://godbolt.org/z/v7Ev9E){: .btn }
diff --git a/docs/extended_api/asynchronous_operations_library/aligned_size_t/align.md b/docs/extended_api/asynchronous_operations_library/aligned_size_t/align.md
@@ -0,0 +1,11 @@
+---
+nav_exclude: true
+---
+
+# cuda::aligned_size_t\<Alignment>::**align**
+
+```c++
+static constexpr size_t align = Alignment;
+```
+
+Represents the alignment (address and size) of the byte extent.
diff --git a/docs/extended_api/asynchronous_operations_library/aligned_size_t/constructor.md b/docs/extended_api/asynchronous_operations_library/aligned_size_t/constructor.md
@@ -0,0 +1,15 @@
+---
+nav_exclude: true
+---
+
+# cuda::aligned_size_t\<Alignment>::**aligned_size_t**
+
+```c++
+explicit aligned_size_t(size_t size);
+```
+
+Constructs an `aligned_size_t` _shape_.
+
+## Notes
+
+If `size` is not a multiple of `Alignment` the behavior is undefined.
diff --git a/docs/extended_api/asynchronous_operations_library/aligned_size_t/value.md b/docs/extended_api/asynchronous_operations_library/aligned_size_t/value.md
@@ -0,0 +1,11 @@
+---
+nav_exclude: true
+---
+
+# cuda::aligned_size_t\<Alignment>::**value**
+
+```c++
+size_t value;
+```
+
+Represents the size of the byte extent.
diff --git a/docs/extended_api/asynchronous_operations_library/memcpy_async.md b/docs/extended_api/asynchronous_operations_library/memcpy_async.md
@@ -0,0 +1,64 @@
+---
+grand_parent: Extended API
+parent: Asynchronous operations library
+---
+
+# cuda::**memcpy_async**
+
+Defined in header [`<cuda/barrier>`](../../api/synchronization_library/barrier.md)
+
+```c++
+template<typename Shape, thread_scope Scope>
+void memcpy_async(void * destination, void const * source, Shape size, barrier<Scope> & barrier);                        // (1)
+
+template<typename Group, typename Shape, thread_scope Scope>
+void memcpy_async(Group const & group, void * destination, void const * source, Shape size, barrier<Scope> & barrier);   // (2)
+```
+
+Defined in header [`<cuda/pipeline>`](../headers/pipeline.md)
+
+```c++
+template<typename Shape, thread_scope Scope>
+void memcpy_async(void * destination, void const * source, Shape size, pipeline<Scope> & pipeline);                      // (3)
+
+template<typename Group, typename Shape, thread_scope Scope>
+void memcpy_async(Group const & group, void * destination, void const * source, Shape size, pipeline<Scope> & pipeline); // (4)
+```
+
+Asynchronously copies `size` bytes from the memory location pointed to by `source` to the memory location pointed to by `destination`.
+Both objects are reinterpreted as arrays of `unsigned char`.
+
+`cuda::memcpy_async` have similar constraints to [`std::memcpy`](https://en.cppreference.com/w/cpp/string/byte/memcpy), namely:
+* If the objects overlap, the behavior is undefined.
+* If either `destination` or `source` is an invalid or null pointer, the behavior is undefined (even if `count` is zero).
+* If the objects are [potentially-overlapping](https://en.cppreference.com/w/cpp/language/object#Subobjects) the behavior is undefined.
+* If the objects are not of [`TriviallyCopyable`](https://en.cppreference.com/w/cpp/named_req/TriviallyCopyable) type the program is ill-formed, no diagnostic required.
+
+If _Shape_ is [`cuda::aligned_size_t`](./aligned_size_t.md)), `source` and `destination` are both required to be aligned on [`cuda::aligned_size_t::align`](./aligned_size_t/align.md), else the behavior is undefined.
+
+If `pipeline` is in a _quitted state_ (see [`pipeline::quit`](../synchronization_library/pipeline/quit.md)), the behavior is undefined.
+
+1. Binds the asynchronous copy completion to `barrier` and issues the copy in the current thread.
+2. Binds the asynchronous copy completion to `barrier` and cooperatively issues the copy across all threads in `group`.
+3. Binds the asynchronous copy completion to `pipeline` and issues the copy in the current thread
+4. Binds the asynchronous copy completion to `pipeline` and cooperatively issues the copy across all threads in `group`.
+
+## Template parameters
+
+| Group | a type satisfying the [_group concept_](../concepts/group.md)                                                                                                                  |
+| Shape | a type satisfying the [_shape concept_](../concepts/shape.md) (see [`size_t`](https://en.cppreference.com/w/c/types/size_t) and [`cuda::aligned_size_t`](./aligned_size_t.md)) |
+
+## Parameters
+
+| group       | the group of threads                                    |
+| destination | pointer to the memory location to copy to               |
+| source      | pointer to the memory location to copy from             |
+| size        | the number of bytes to copy                             |
+| barrier     | the barrier object used to wait on the copy completion  |
+| pipeline    | the pipeline object used to wait on the copy completion |
+
+## Example
+
+```c++
+TODO
+```
diff --git a/docs/extended_api/concepts.md b/docs/extended_api/concepts.md
@@ -0,0 +1,11 @@
+---
+parent: Extended API
+has_children: true
+has_toc: false
+nav_order: 1
+---
+
+# Concepts
+
+| [Group](./concepts/group.md) | defines the requirements of a type that represents a group of cooperating threads                |
+| [Shape](./concepts/shape.md) | defines the requirements of a type that represents a byte extent with a particular memory layout |
diff --git a/docs/extended_api/concepts/group.md b/docs/extended_api/concepts/group.md
@@ -0,0 +1,46 @@
+---
+grand_parent: Extended API
+parent: Concepts
+---
+
+# Group
+
+```c++
+struct Group {
+  static constexpr cuda::thread_scope thread_scope;
+  integral size() const;
+  integral thread_rank() const;
+  void sync() const;
+};
+```
+
+The _Group concept_ defines the requirements of a type that represents a group of cooperating threads.
+
+## Data members
+
+| thread_scope | the scope at which `Group::sync()` synchronizes memory operations and thread execution |
+
+## Member functions
+
+| size        | returns the number of participating threads                                                        |
+| thread_rank | returns a unique value for each participating thread (`0 <= Group::thread_rank() < Group::size()`) |
+| sync        | synchronizes the participating threads                                                             |
+
+## Notes
+
+This concept is defined for documentation purposes but is not materialized in the library.
+
+## Example
+
+```c++
+#include <cuda/atomic>
+
+struct single_thread_group {
+    static constexpr cuda::thread_scope thread_scope = cuda::thread_scope::thread_scope_thread;
+    size_t size() const { return 1; }
+    size_t thread_rank() const { return 0; }
+    void sync() const { }
+};
+```
+
+[See it on Godbolt](https://godbolt.org/z/453r3s){: .btn }
diff --git a/docs/extended_api/concepts/shape.md b/docs/extended_api/concepts/shape.md
@@ -0,0 +1,37 @@
+---
+grand_parent: Extended API
+parent: Concepts
+---
+
+# Shape
+
+```c++
+struct Shape {
+    operator size_t() const;
+};
+```
+
+The _Shape concept_ defines the requirements of a type that represents a byte extent with a particular memory layout.
+
+## Member functions
+
+| operator size_t | implicit conversion to [`size_t`](https://en.cppreference.com/w/cpp/types/size_t) |
+
+## Notes
+
+This concept is defined for documentation purposes but is not materialized in the library.
+
+## Example
+
+```c++
+// A size that carries an alignment hint
+template <size_t Align>
+struct aligned_size {
+    static constexpr size_t align = Align;
+    size_t size;
+    aligned_size(size_t s) : size(s) {}
+    operator size_t() const { return size; }
+};
+```
+
+[See it on Godbolt](https://godbolt.org/z/hbajKo){: .btn }
diff --git a/docs/extended_api/headers.md b/docs/extended_api/headers.md
@@ -0,0 +1,12 @@
+---
+parent: Extended API
+has_children: true
+has_toc: false
+nav_order: 0
+---
+
+# Headers
+
+## Synchronization library
+
+| [\<pipeline\>](./headers/pipeline.md) | [Pipelines](./synchronization_library/pipeline.md) and corresponding [memcpy_async](./asynchronous_operations_library/memcpy_async.md) overloads |