Commit
This PR primarily contains the changes from #11718 that enable the copy-on-write feature in cudf.

This PR introduces `copy-on-write`. As the name suggests, when `copy-on-write` is enabled and a shallow copy of a column is made, both columns share the same memory. A true copy is triggered only when a write operation is performed on either the parent or any of its copies.

Copy-on-write (`c-o-w`) can be enabled in two ways:

1. Setting the `CUDF_COPY_ON_WRITE` environment variable to `1` enables `c-o-w`; unsetting it disables `c-o-w`.
2. Setting the `copy_on_write` option in `cudf` options: `cudf.set_option("copy_on_write", True)` enables it and `cudf.set_option("copy_on_write", False)` disables it.

Note: Copy-on-write is not enabled by default; it is being introduced as an opt-in.

A valid performance comparison can be made only between `copy_on_write=OFF` + `.copy(deep=True)` and `copy_on_write=ON` + `.copy(deep=False)`:

```python
In [1]: import cudf

In [2]: s = cudf.Series(range(0, 100000000))
# branch-23.02 : 1209MiB
# This-PR      : 1209MiB

In [3]: s_copy = s.copy(deep=True)   # branch-23.02
In [3]: s_copy = s.copy(deep=False)  # This-PR
# branch-23.02 : 1973MiB
# This-PR      : 1209MiB

In [4]: s
Out[4]:
0                  0
1                  1
2                  2
3                  3
4                  4
              ...
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64

In [5]: s_copy
Out[5]:
0                  0
1                  1
2                  2
3                  3
4                  4
              ...
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64

In [6]: s[2] = 10001
# branch-23.02 : 3121MiB
# This-PR      : 3121MiB

In [7]: s
Out[7]:
0                  0
1                  1
2              10001
3                  3
4                  4
              ...
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64

In [8]: s_copy
Out[8]:
0                  0
1                  1
2                  2
3                  3
4                  4
              ...
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64
```

Stats on the performance and memory gains:

- [x] New copies add zero GPU memory overhead, i.e., users will save 2x, 5x, 10x, ... 20x memory when making 2x, 5x, 10x, ... 20x copies, respectively, compared with deep copies. **So, the more you copy the more you save 😉** (_as long as you don't write to all of them_)
- [x] **Copying times are now cut by 99%** for all dtypes when copy-on-write is enabled (`copy_on_write=OFF` + `.copy(deep=True)` vs `copy_on_write=ON` + `.copy(deep=False)`).

```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a': range(0, 1000000)})

In [3]: df = cudf.DataFrame({'a': range(0, 100000000)})

In [4]: df['b'] = df.a.astype('str')
# GPU memory usage
# branch-23.02 : 2345MiB
# This-PR      : 2345MiB

In [5]: df
Out[5]:
                 a         b
0                0         0
1                1         1
2                2         2
3                3         3
4                4         4
...            ...       ...
99999995  99999995  99999995
99999996  99999996  99999996
99999997  99999997  99999997
99999998  99999998  99999998
99999999  99999999  99999999

[100000000 rows x 2 columns]

In [6]: def make_two_copies(df, deep):
   ...:     return df.copy(deep=deep), df.copy(deep=deep)
   ...:

In [7]: x, y = make_two_copies(df, deep=True)   # branch-23.02
In [7]: x, y = make_two_copies(df, deep=False)  # This PR
# GPU memory usage
# branch-23.02 : 6147MiB
# This-PR      : 2345MiB

In [8]: %timeit make_two_copies(df, deep=True)   # branch-23.02
In [8]: %timeit make_two_copies(df, deep=False)  # This PR
# Execution times
# branch-23.02 : 135 ms ± 4.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This-PR      : 100 µs ± 879 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

- [x] Even when `copy-on-write` is disabled, deep copies of `string`, `list` & `struct` columns **are now 99% faster**.

```python
In [1]: import cudf

In [2]: s = cudf.Series(range(0, 100000000), dtype='str')

In [3]: %timeit s.copy(deep=True)
# branch-23.02 : 28.3 ms ± 5.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This PR      : 19.9 µs ± 93.5 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [9]: s = cudf.Series([[1, 2], [2, 3], [3, 4], [4, 5], [6, 7]]* 10000000)

In [10]: %timeit s.copy(deep=True)
# branch-23.02 : 25.7 ms ± 5.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This-PR      : 44.2 µs ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [4]: df = cudf.DataFrame({'a': range(0, 100000000), 'b': range(0, 100000000)[::-1]})

In [5]: s = df.to_struct()

In [6]: %timeit s.copy(deep=True)
# branch-23.02 : 42.5 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This-PR      : 89.7 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

- [x] Add pytests
- [x] Docs page explaining copy on write and how to enable/disable it.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #12619
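
For readers unfamiliar with the mechanism, the copy-on-write behavior described above (shallow copies share memory, and the first write to a shared buffer triggers the real copy) can be sketched in plain Python. This is a toy model only and not how cudf implements it: `SharedBuffer`, `CowColumn`, and `shares_memory_with` are hypothetical names invented for illustration, and a Python list stands in for GPU memory.

```python
import weakref


class SharedBuffer:
    """Toy stand-in for a device buffer; remembers which columns use it."""

    def __init__(self, data):
        self.data = list(data)            # always owns a fresh list
        self.owners = weakref.WeakSet()   # live columns pointing at this buffer


class CowColumn:
    """Hypothetical column with copy-on-write semantics (not cudf's class)."""

    def __init__(self, data):
        self._buf = SharedBuffer(data)
        self._buf.owners.add(self)

    def copy(self):
        # Shallow copy: the new column points at the same buffer, so no
        # data is duplicated at this point.
        other = CowColumn.__new__(CowColumn)
        other._buf = self._buf
        self._buf.owners.add(other)
        return other

    def __getitem__(self, i):
        return self._buf.data[i]

    def __setitem__(self, i, value):
        # If another live column shares the buffer, detach first.
        if len(self._buf.owners) > 1:
            self._buf.owners.discard(self)
            self._buf = SharedBuffer(self._buf.data)  # the real copy happens here
            self._buf.owners.add(self)
        self._buf.data[i] = value

    def shares_memory_with(self, other):
        return self._buf is other._buf


# Usage, mirroring the Series example in the description:
s = CowColumn(range(5))
s_copy = s.copy()
print(s.shares_memory_with(s_copy))  # True: nothing copied yet
s[2] = 10001                         # the write triggers the copy
print(s.shares_memory_with(s_copy))  # False
print(s[2], s_copy[2])               # 10001 2
```

Tracking owners through a `WeakSet` means a column that has been garbage-collected no longer counts as a sharer, so a write to the last surviving column does not pay for an unnecessary copy. That is a design choice of this sketch, not a statement about cudf's internals.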