Merge copy-on-write feature branch into branch-23.04 (#12619)
This PR primarily contains changes from #11718 that enable the copy-on-write feature in cudf.

This PR introduces `copy-on-write`. As the name suggests, when copy-on-write is enabled, making a shallow copy of a column causes both columns to share the same memory; a true copy is triggered only when a write operation is performed on either the parent or any of its copies. Copy-on-write (`c-o-w`) can be enabled in two ways:

1. Setting the `CUDF_COPY_ON_WRITE` environment variable to `1` enables `c-o-w`; unsetting it disables `c-o-w`.
2. Setting the `copy_on_write` option in `cudf`: `cudf.set_option("copy_on_write", True)` enables it and `cudf.set_option("copy_on_write", False)` disables it.

Note: Copy-on-write is not enabled by default; it is being introduced as an opt-in.

A valid performance comparison can only be made between `copy_on_write=OFF` + `.copy(deep=True)` and `copy_on_write=ON` + `.copy(deep=False)`:

```python
In [1]: import cudf

In [2]: s = cudf.Series(range(0, 100000000))

# branch-23.02 : 1209MiB
# This-PR : 1209MiB

In [3]: s_copy = s.copy(deep=True) #branch-23.02
In [3]: s_copy = s.copy(deep=False) #This-PR

# branch-23.02 : 1973MiB
# This-PR : 1209MiB

In [4]: s
Out[4]: 
0                  0
1                  1
2                  2
3                  3
4                  4
              ...   
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64

In [5]: s_copy
Out[5]: 
0                  0
1                  1
2                  2
3                  3
4                  4
              ...   
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64

In [6]: s[2] = 10001

# branch-23.02 : 3121MiB
# This-PR : 3121MiB

In [7]: s
Out[7]: 
0                  0
1                  1
2              10001
3                  3
4                  4
              ...   
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64

In [8]: s_copy
Out[8]: 
0                  0
1                  1
2                  2
3                  3
4                  4
              ...   
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64
```

Stats on the performance and memory gains:

- [x] New copies add zero additional GPU memory overhead, i.e., users save 2x, 5x, 10x, ... 20x the memory usage when making 2x, 5x, 10x, ... 20x deep copies respectively. **So, the more you copy, the more you save 😉** (_as long as you don't write to all of them_)
- [x] **Copying times are now cut by 99%** for all dtypes when copy-on-write is enabled (`copy_on_write=OFF` + `.copy(deep=True)` vs `copy_on_write=ON` + `.copy(deep=False)`).

```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a': range(0, 1000000)})

In [3]: df = cudf.DataFrame({'a': range(0, 100000000)})

In [4]: df['b'] = df.a.astype('str')

# GPU memory usage
# branch-23.02 : 2345MiB
# This-PR : 2345MiB

In [5]: df
Out[5]: 
                 a         b
0                0         0
1                1         1
2                2         2
3                3         3
4                4         4
...            ...       ...
99999995  99999995  99999995
99999996  99999996  99999996
99999997  99999997  99999997
99999998  99999998  99999998
99999999  99999999  99999999

[100000000 rows x 2 columns]

In [6]: def make_two_copies(df, deep):
   ...:     return df.copy(deep=deep), df.copy(deep=deep)
   ...: 

In [7]: x, y = make_two_copies(df, deep=True) # branch-23.02
In [7]: x, y = make_two_copies(df, deep=False) # This PR

# GPU memory usage
# branch-23.02 : 6147MiB
# This-PR : 2345MiB

In [8]: %timeit make_two_copies(df, deep=True) # branch-23.02
In [8]: %timeit make_two_copies(df, deep=False) # This PR

# Execution times
# branch-23.02 : 135 ms ± 4.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This-PR : 100 µs ± 879 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

- [x] Even when `copy-on-write` is disabled, deep copies of `string`, `list` & `struct` columns **are now 99% faster**

```python

In [1]: import cudf

In [2]: s = cudf.Series(range(0, 100000000), dtype='str')

In [3]: %timeit s.copy(deep=True)


# branch-23.02 : 28.3 ms ± 5.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This PR : 19.9 µs ± 93.5 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [9]: s = cudf.Series([[1, 2], [2, 3], [3, 4], [4, 5], [6, 7]]* 10000000)

In [10]: %timeit s.copy(deep=True)
# branch-23.02 : 25.7 ms ± 5.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This-PR : 44.2 µs ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [4]: df = cudf.DataFrame({'a': range(0, 100000000), 'b': range(0, 100000000)[::-1]})

In [5]: s = df.to_struct()

In [6]: %timeit s.copy(deep=True)

# branch-23.02 : 42.5 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This-PR : 89.7 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```



- [x] Add pytests
- [x] Docs page explaining copy on write and how to enable/disable it.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #12619
galipremsagar authored Feb 16, 2023
1 parent e4ffcbb commit 506a479
Showing 29 changed files with 1,209 additions and 90 deletions.
186 changes: 186 additions & 0 deletions docs/cudf/source/developer_guide/library_design.md
@@ -229,6 +229,7 @@ Additionally, parameters are:
of `<X>` in bytes. This introduces a modest overhead and is **disabled by default**. Furthermore, this is a
*soft* limit. The memory usage might exceed the limit if too many buffers are unspillable.

(Buffer-design)=
#### Design

Spilling consists of two components:
@@ -314,3 +315,188 @@ The pandas API also includes a number of helper objects, such as `GroupBy` and `Rolling`.
cuDF implements corresponding objects with the same APIs.
Internally, these objects typically interact with cuDF objects at the Frame layer via composition.
However, for performance reasons they frequently access internal attributes and methods of `Frame` and its subclasses.


(copy-on-write-dev-doc)=

## Copy-on-write

This section describes the internal implementation details of the copy-on-write feature.
It is recommended that developers familiarize themselves with [the user-facing documentation](copy-on-write-user-doc) of this functionality before reading through the internals
below.

The core copy-on-write implementation relies on the `CopyOnWriteBuffer` class.
When the cudf option `"copy_on_write"` is `True`, `as_buffer` will always return a `CopyOnWriteBuffer`.
This subclass of `cudf.Buffer` contains all the mechanisms to enable copy-on-write behavior.
The class stores [weak references](https://docs.python.org/3/library/weakref.html) to every existing `CopyOnWriteBuffer` in `CopyOnWriteBuffer._instances`, a mapping from `ptr` keys to `WeakSet`s containing references to `CopyOnWriteBuffer` objects.
This means that all `CopyOnWriteBuffer`s that point to the same device memory are contained in the same `WeakSet` (corresponding to the same `ptr` key) in `CopyOnWriteBuffer._instances`.
This data structure is then used to determine whether or not to make a copy when a write operation is performed on a `Column` (see below).
If multiple buffers point to the same underlying memory, then a copy must be made whenever a modification is attempted.
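A minimal sketch of this registry is shown below, using the class and attribute names from the text (`CopyOnWriteBuffer`, `_instances`, `ptr`); the method bodies and the `_is_shared` helper are illustrative assumptions, not the actual cudf implementation:

```python
# Illustrative sketch only -- the names follow the text above, but the
# implementation details are assumptions, not cudf source code.
from collections import defaultdict
import weakref


class CopyOnWriteBuffer:
    # Maps a device pointer to the set of live buffers sharing that memory.
    # WeakSet entries disappear automatically when a buffer is garbage
    # collected, so the registry never keeps buffers alive on its own.
    _instances = defaultdict(weakref.WeakSet)

    def __init__(self, ptr, size):
        self._ptr = ptr
        self._size = size
        type(self)._instances[ptr].add(self)

    @property
    def _is_shared(self):
        # More than one live buffer pointing at this memory means a write
        # must first trigger a true copy.
        return len(type(self)._instances[self._ptr]) > 1
```

With this structure, deciding whether a write needs a copy reduces to checking how many buffers are registered under the same pointer.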


### Eager copies when exposing to third-party libraries

If a `Column`/`CopyOnWriteBuffer` is exposed to a third-party library via `__cuda_array_interface__`, we are no longer able to track whether or not the buffer is modified. Hence, whenever
someone accesses data through `__cuda_array_interface__`, we eagerly trigger the copy by calling
`_unlink_shared_buffers`, which ensures a true copy of the underlying device data is made and
unlinks the buffer from any shared "weak" references. Any future copy requests must also trigger a true physical copy (since we cannot track the lifetime of the third-party object). To handle this we also mark the `Column`/`CopyOnWriteBuffer` by setting
`obj._zero_copied=True`, indicating that any future shallow-copy request will trigger a true physical copy
rather than a copy-on-write shallow copy with weak references.
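The eager-copy path can be sketched as follows; host `bytearray`s stand in for device memory, `_unlink_shared_buffers` and `_zero_copied` are named in the text, and the bodies are illustrative assumptions rather than the real cudf code:

```python
# Illustrative sketch: host bytearrays stand in for device memory.
class Buffer:
    def __init__(self, data):
        self._data = bytearray(data)
        self._zero_copied = False

    def _unlink_shared_buffers(self):
        # Make a true copy so no other buffer aliases this memory.
        self._data = bytearray(self._data)

    @property
    def __cuda_array_interface__(self):
        # A raw pointer is about to escape to a third party; writes through
        # it cannot be tracked, so copy eagerly and permanently disable
        # copy-on-write behavior for this buffer.
        self._unlink_shared_buffers()
        self._zero_copied = True
        return {"data": (id(self._data), False), "version": 3}
```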

### Obtaining a read-only object

A read-only object can be quite useful for operations that will not
mutate the data. This can be achieved by calling `._get_cuda_array_interface(readonly=True)`, and creating a `SimpleNameSpace` object around it.
This will not trigger a deep copy even if the `CopyOnWriteBuffer`
has weak references. This API should only be used when the lifetime of the proxy object is restricted to cudf's internal code execution. Handing this out to external libraries or user-facing APIs will lead to untracked references and undefined copy-on-write behavior. We currently use this API for device to host
copies like in `ColumnBase.data_array_view(mode="read")` which is used for `Column.values_host`.


### Internal access to raw data pointers

Since it is unsafe to access the raw pointer associated with a buffer when
copy-on-write is enabled, in addition to the readonly proxy object described above,
access to the pointer is gated through `Buffer.get_ptr`. This method accepts a mode
argument through which the caller indicates how they will access the data associated
with the buffer. If only read-only access is required (`mode="read"`), this indicates
that the caller has no intention of modifying the buffer through this pointer.
In this case, any shallow copies are not unlinked. In contrast, if modification is
required one may pass `mode="write"`, provoking unlinking of any shallow copies.
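A hedged sketch of this mode-gated access (host memory stands in for device memory; `get_ptr` and its `mode` argument come from the text, the body is an illustration):

```python
# Illustrative sketch of mode-gated pointer access, not the cudf source.
class Buffer:
    def __init__(self, data):
        self._data = bytearray(data)

    def _unlink_shared_buffers(self):
        # Detach from any shared copies by making a true copy.
        self._data = bytearray(self._data)

    def get_ptr(self, *, mode):
        if mode == "write":
            # Caller may mutate through the pointer: detach first.
            self._unlink_shared_buffers()
        elif mode != "read":
            raise ValueError(f"invalid mode: {mode!r}")
        # id() stands in for the raw device pointer.
        return id(self._data)
```

Read access hands back the shared pointer unchanged; only write access pays the cost of unlinking.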


### Variable width data types
Weak references are implemented only for fixed-width data types, as these are the only column
types that can be mutated in place.
Requests for deep copies of variable-width data types always return shallow copies of the columns, because these
types don't support real in-place mutation of the data.
Internally, we mimic in-place mutations using `_mimic_inplace`, but the resulting data is always a deep copy of the underlying data.
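A hypothetical sketch of this pattern; the `_mimic_inplace` name comes from the text, but the body is an assumption for illustration:

```python
# Illustrative sketch: "in-place" mutation of a variable-width column is
# faked by building a fresh (deep-copied) result and adopting its data.
class Column:
    def __init__(self, data):
        self.data = data

    def _mimic_inplace(self, result, inplace=False):
        if inplace:
            # Adopt the freshly built data; the caller observes an in-place
            # update even though new memory was allocated for `result`.
            self.data = result.data
            return None
        return result
```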


### Examples

When copy-on-write is enabled, taking a shallow copy of a `Series` or a `DataFrame` does not
eagerly create a copy of the data. Instead, it produces a view that will be lazily
copied when a write operation is performed on the original or any of its copies.

Let's create a series:

```python
>>> import cudf
>>> cudf.set_option("copy_on_write", True)
>>> s1 = cudf.Series([1, 2, 3, 4])
```

Make a copy of `s1`:
```python
>>> s2 = s1.copy(deep=False)
```

Make another copy, but of `s2`:
```python
>>> s3 = s2.copy(deep=False)
```

Viewing the data and memory addresses shows that they all point to the same device memory:
```python
>>> s1
0 1
1 2
2 3
3 4
dtype: int64
>>> s2
0 1
1 2
2 3
3 4
dtype: int64
>>> s3
0 1
1 2
2 3
3 4
dtype: int64

>>> s1.data._ptr
139796315897856
>>> s2.data._ptr
139796315897856
>>> s3.data._ptr
139796315897856
```

Now, when we perform a write operation on one of them, say on `s2`, a new device copy is created
for `s2` and then modified:

```python
>>> s2[0:2] = 10
>>> s2
0 10
1 10
2 3
3 4
dtype: int64
>>> s1
0 1
1 2
2 3
3 4
dtype: int64
>>> s3
0 1
1 2
2 3
3 4
dtype: int64
```

If we inspect the memory address of the data, `s1` and `s3` still share the same address but `s2` has a new one:

```python
>>> s1.data._ptr
139796315897856
>>> s3.data._ptr
139796315897856
>>> s2.data._ptr
139796315899392
```

Now, performing a write operation on `s1` will trigger a new copy in device memory, because `s3`
still holds a weak reference to the same data:

```python
>>> s1[0:2] = 11
>>> s1
0 11
1 11
2 3
3 4
dtype: int64
>>> s2
0 10
1 10
2 3
3 4
dtype: int64
>>> s3
0 1
1 2
2 3
3 4
dtype: int64
```

If we inspect the memory address of the data, the addresses of `s2` and `s3` remain unchanged, but `s1`'s memory address has changed because of the copy performed during the write:

```python
>>> s2.data._ptr
139796315899392
>>> s3.data._ptr
139796315897856
>>> s1.data._ptr
139796315879723
```

cuDF's copy-on-write implementation is motivated by the pandas proposals documented here:
1. [Google doc](https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit#heading=h.iexejdstiz8u)
2. [Github issue](https://github.com/pandas-dev/pandas/issues/36195)
179 changes: 179 additions & 0 deletions docs/cudf/source/user_guide/copy-on-write.md
@@ -0,0 +1,179 @@
(copy-on-write-user-doc)=

# Copy-on-write

Copy-on-write is a memory management strategy that allows multiple cuDF objects containing the same data to refer to the same memory address as long as neither of them modify the underlying data.
With this approach, any operation that generates an unmodified view of an object (such as copies, slices, or methods like `DataFrame.head`) returns a new object that points to the same memory as the original.
However, when either the existing or new object is _modified_, a copy of the data is made prior to the modification, ensuring that the changes do not propagate between the two objects.
This behavior is best understood by looking at the examples below.

Copy-on-write is disabled by default in cuDF, so to use it, one must explicitly
opt in by setting a cuDF option. It is recommended to set copy-on-write at the beginning of
script execution: if the setting is changed mid-script, objects created while copy-on-write
was enabled will retain copy-on-write behavior, while objects created while it was disabled
will not, leading to unintended mixed behavior.

## Enabling copy-on-write

1. Use `cudf.set_option`:

```python
>>> import cudf
>>> cudf.set_option("copy_on_write", True)
```

2. Set the environment variable ``CUDF_COPY_ON_WRITE`` to ``1`` prior to the
launch of the Python interpreter:

```bash
CUDF_COPY_ON_WRITE="1" python -c "import cudf"
```

## Disabling copy-on-write


Copy-on-write can be disabled by setting the ``copy_on_write`` option to ``False``:

```python
>>> cudf.set_option("copy_on_write", False)
```

## Making copies

No additional code changes are required to make use of copy-on-write.

```python
>>> series = cudf.Series([1, 2, 3, 4])
```

Performing a shallow copy will create a new Series object pointing to the
same underlying device memory:

```python
>>> copied_series = series.copy(deep=False)
>>> series
0 1
1 2
2 3
3 4
dtype: int64
>>> copied_series
0 1
1 2
2 3
3 4
dtype: int64
```

When a write operation is performed on either ``series`` or
``copied_series``, a true physical copy of the data is created:

```python
>>> series[0:2] = 10
>>> series
0 10
1 10
2 3
3 4
dtype: int64
>>> copied_series
0 1
1 2
2 3
3 4
dtype: int64
```


## Notes

When copy-on-write is enabled, there is no longer a concept of views when
slicing or indexing. In this sense, indexing behaves as one would expect for
built-in Python containers like `list`, rather than for NumPy arrays.
Modifying a "view" created by cuDF will always trigger a copy and will not
modify the original object.
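The contrast can be shown without cuDF at all; this pure-Python sketch mirrors the semantics copy-on-write adopts:

```python
# Slicing a Python list produces a copy, so writes to the slice never
# reach the original -- the same user-visible semantics cuDF adopts under
# copy-on-write. (A NumPy slice, by contrast, is a writable view.)
lst = [1, 2, 3, 4]
sub = lst[0:2]
sub[0] = 10
assert sub == [10, 2]
assert lst == [1, 2, 3, 4]  # original untouched
```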

Copy-on-write produces much more consistent copy semantics. Since every object is a copy of the original, users no longer have to think about when modifications may unexpectedly happen in place. This brings consistency across operations and aligns cuDF's behavior with pandas when copy-on-write is enabled for both. Here is one example where pandas and cuDF are currently inconsistent without copy-on-write enabled:

```python

>>> import pandas as pd
>>> s = pd.Series([1, 2, 3, 4, 5])
>>> s1 = s[0:2]
>>> s1[0] = 10
>>> s1
0 10
1 2
dtype: int64
>>> s
0 10
1 2
2 3
3 4
4 5
dtype: int64

>>> import cudf
>>> s = cudf.Series([1, 2, 3, 4, 5])
>>> s1 = s[0:2]
>>> s1[0] = 10
>>> s1
0 10
1    2
dtype: int64
>>> s
0 1
1 2
2 3
3 4
4 5
dtype: int64
```

The above inconsistency is solved when copy-on-write is enabled:

```python
>>> import pandas as pd
>>> pd.set_option("mode.copy_on_write", True)
>>> s = pd.Series([1, 2, 3, 4, 5])
>>> s1 = s[0:2]
>>> s1[0] = 10
>>> s1
0 10
1 2
dtype: int64
>>> s
0 1
1 2
2 3
3 4
4 5
dtype: int64


>>> import cudf
>>> cudf.set_option("copy_on_write", True)
>>> s = cudf.Series([1, 2, 3, 4, 5])
>>> s1 = s[0:2]
>>> s1[0] = 10
>>> s1
0 10
1 2
dtype: int64
>>> s
0 1
1 2
2 3
3 4
4 5
dtype: int64
```


### Explicit deep and shallow copies comparison


| | Copy-on-Write enabled | Copy-on-Write disabled (default) |
|---------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| `.copy(deep=True)` | A true copy is made and changes don't propagate to the original object. | A true copy is made and changes don't propagate to the original object. |
| `.copy(deep=False)` | Memory is shared between the two objects, but any write operation on one object will trigger a true physical copy before the write is performed. Hence changes will not propagate to the original object. | Memory is shared between the two objects, and changes performed on one will propagate to the other object. |
1 change: 1 addition & 0 deletions docs/cudf/source/user_guide/index.md
@@ -13,4 +13,5 @@ guide-to-udfs
cupy-interop
options
PandasCompat
copy-on-write
```