[ENH/QST] actually inplace updates in `setitem` and friends #11990

wence- · 2022-10-25T17:17:48Z

Context

As noted in #11085, in many cases (though inconsistently right now), obtaining a view on Series (probably a DataFrame as well) using iloc[:] inadvertently behaves with pseudo-copy-on-write semantics

import cudf
import numpy as np
s = cudf.Series([1, 2, 3])
sview = s.iloc[:]
s.iloc[[1, 2]] = [4, 5]
assert np.allclose(s.values, sview.values) # => False

sview = s.iloc[:]
s.iloc[0:2] = 3
assert np.allclose(s.values, sview.values) # => True

Note: pandas is moving towards all indexing behaving with copy semantics, so for some of these cases we've already skated to the right answer :)

Why does this happen?

Most (but not all) of the __setitem__-like calls into (e.g. copy_range, scatter) libcudf do not operate in place, but instead return a new cudf::column that must be wrapped up. As a consequence, to pretend like the operation was in place, we call _mimic_inplace(...) to switch out the backing data of the Column object we're doing __setitem__ on:

import cudf
s = cudf.Series([1, 2, 3])
old_data = s._column.data
s.iloc[1:3] = [4, 5]
new_data = s._column.data
assert old_data is new_data # => False

This is kind of fine as long as there's only one object holding on to the column data, but this breaks down as soon as we have views.

Why is the status quo problematic?

The current inconsistencies make implementing copy-on-write rather delicate (and in many cases provoke more copies than needed).

Operations that to the user do not provoke a copy can overflow GPU memory:

 # on a system with 32 GB gpu memory
 import cudf
 import cupy as cp
 import numpy as np
 df = cudf.DataFrame({f"{i}": cp.ones(10**9, dtype=np.uint8) for i in range(20)}) # about 20GB
 # expectation: this behaves in place, so the operation should fit in memory.
 df.iloc[[0, 2]] = list(range(20)) # => MemoryError: std::bad_alloc: out_of_memory: CUDA error at: rmm/mr/device/cuda_memory_resource.hpp

If the scatter/copy_foo operations in libcudf had an in place then we would have lower memory pressure (as point 2) and in the (common) case where we have a target table view, could avoid a memcopy of the whole table.

Possible solutions

I don't know the history as to why the libcudf generally tends to offer "return a copy" rather than "modify in place", but one could make an effort to offer in place versions of most functions. If these operations were available, then the Cython layer could switch to calling into them. In those cases where we really want a copy, we would allocate and copy into an empty table before calling into libcudf.

Edit: modification in place only works at the libcudf level for fixed-width column types (so no strings, lists), and having in- and out-of-place modification for every operation is too much work without some significant motivating use case.

Since we need a work-around that works for string/list columns that cannot by modified in-place anyway, I don't think this issue is a sufficiently motivating use case.

The above solution is a no-go, so what else could we do?

Given that we're trying to move to copy-on-write, we could go the other way and audit all places where __setitem__ really is in place, and break that connection. ~~Note that this is not actually copy-on-write, but copy-on-read so it's not a great option.~~ Something close to this probably is copy-on-write, so looks perhaps reasonable.
Change the way _mimic_inplace(self, other, inplace=True) works: rather than rewriting where self.data points to, we could instead memcopy from other.data back into self.data and then drop other. This maintains the same memory footprint right now, at the cost of (another) full memcopy, and makes __setitem__ really behave in place (even for views). As pointed out below, this doesn't work for non-fixed-width column dtypes.

The text was updated successfully, but these errors were encountered:

shwina · 2022-10-25T19:28:41Z

Regardless of what solution (if any) we choose longer term:

Change the way _mimic_inplace(self, other, inplace=True) works:

is an acceptable solution that we should be able to implement with very little code change, and should fix our behaviour for many of the buggy cases described.

shwina · 2022-10-25T20:52:36Z

Well, I guess I lied. Copying into the buffers works only for fixed width types, not variable width types (like strings)

wence- · 2022-10-26T14:54:54Z

Can we leverage the current work introducing copy-on-write semantics (cc @galipremsagar) to square this circle in a nice way?

If we want views to behave like copies what does that mean?

sr = cudf.Series(...)

sview = sr.iloc[:] # (or sr.copy() or sr.loc[:])

# this makes a copy at the libcudf layer right now, 
# which we try and fix up to make sview == sr 
# this works in some cases but not others
sr.iloc[...] = ... 

# What if we wanted sview != sr (copy-on-write)
# Then we would just have to detach the view from sr
# when making the update (which we do in a bunch of cases already)

galipremsagar · 2022-10-26T14:56:56Z

I think we can make the view mechanism work in cudf with weak references, let me think a bit about this and get back.

wence- · 2022-10-26T15:43:22Z

I think we can make the view mechanism work in cudf with weak references, let me think a bit about this and get back.

Note that I think we are "accidentally" doing this for many cases already ~~(so it's possible no weak references are required)~~

Edit: we need reference tracking for the cases where a lazy copy is modified using an operation that is actually in-place at the libcudf level.

wence- · 2022-11-01T18:41:12Z

To summarise a discussion with @galipremsagar, @vyasr, and @mroeschke.

Although there are at present inconsistencies in CUDF behaviour, they likely do not bite in too many cases (since people on the whole don't work with views).

The copy-on-write work in #11718 will (in an opt-in manner) remove the inconsistencies by removing the concept of a view (sharing data) and making everything a copy (albeit consed lazily).

In the fixed-width column case there might be a desire to expand the number of modifications in libcudf that actually operate in place (rather than being faked post-hoc via _mimic_inplace), but this becomes a much lower priority once copy-on-write becomes available (and then default).

wence- · 2022-11-28T18:03:21Z

Since this can bite in various circumstances here are some proposals:

Keep track of views and warn on read-after-write/write-after-write

When we create a view child keep a weakref to it on the parent Series/DataFrame (and vice versa). When we do a write into parent (child) if there are any weakrefs that are live, walk through them and mark the child (parent) as invalid: any subsequent read or write on this object would raise an error.

I suspect this is very similar to how the putative copy-on-write implementation keeps track of things and forces copies at the appropriate time. If so, we could probably piggy-back this warning/error system on that implementation (for the case when copy-on-write is off).

Restructure cuDF internals so that setitem/getitem are one level of indirection higher

The problems with views arise because views effectively take references to columns inside a DataFrame or Series which a subsequently write-after-read can't know about. If a view were of a Series and not a Column, then I think the problems in this issue would not exist.

I can't scope how much work this would be, but I suspect a lot.

wence- added feature request New feature or request Needs Triage Need team to review and classify labels Oct 25, 2022

wence- mentioned this issue Oct 26, 2022

[REVIEW] Copy on write implementation #11718

Merged

5 tasks

GregoryKimball added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Oct 30, 2022

wence- mentioned this issue Oct 31, 2022

[FEA] Copy on write as opt-in feature #12011

Closed

wence- self-assigned this Dec 1, 2022

wence- mentioned this issue Feb 16, 2023

[ENH]: Reworking of iloc and loc indexing #12793

Open

vyasr added this to cuDF Python Nov 5, 2024

github-project-automation bot moved this to Todo in cuDF Python Nov 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH/QST] actually inplace updates in `setitem` and friends #11990

[ENH/QST] actually inplace updates in `setitem` and friends #11990

wence- commented Oct 25, 2022 •

edited

Loading

shwina commented Oct 25, 2022

shwina commented Oct 25, 2022

wence- commented Oct 26, 2022

galipremsagar commented Oct 26, 2022 •

edited

Loading

wence- commented Oct 26, 2022 •

edited

Loading

wence- commented Nov 1, 2022

wence- commented Nov 28, 2022

[ENH/QST] actually inplace updates in __setitem__ and friends #11990

[ENH/QST] actually inplace updates in __setitem__ and friends #11990

Comments

wence- commented Oct 25, 2022 • edited Loading

Context

Why does this happen?

Why is the status quo problematic?

Possible solutions

shwina commented Oct 25, 2022

shwina commented Oct 25, 2022

wence- commented Oct 26, 2022

galipremsagar commented Oct 26, 2022 • edited Loading

wence- commented Oct 26, 2022 • edited Loading

wence- commented Nov 1, 2022

wence- commented Nov 28, 2022

Keep track of views and warn on read-after-write/write-after-write

Restructure cuDF internals so that setitem/getitem are one level of indirection higher

[ENH/QST] actually inplace updates in `setitem` and friends #11990

[ENH/QST] actually inplace updates in `setitem` and friends #11990

wence- commented Oct 25, 2022 •

edited

Loading

galipremsagar commented Oct 26, 2022 •

edited

Loading

wence- commented Oct 26, 2022 •

edited

Loading