Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH/QST] actually inplace updates in __setitem__ and friends #11990

Open
Tracked by #12793
wence- opened this issue Oct 25, 2022 · 7 comments
Open
Tracked by #12793

[ENH/QST] actually inplace updates in __setitem__ and friends #11990

wence- opened this issue Oct 25, 2022 · 7 comments
Assignees
Labels
2 - In Progress Currently a work in progress feature request New feature or request Python Affects Python cuDF API.

Comments

@wence-
Copy link
Contributor

wence- commented Oct 25, 2022

Context

As noted in #11085, in many cases (though inconsistently right now), obtaining a view on Series (probably a DataFrame as well) using iloc[:] inadvertently behaves with pseudo-copy-on-write semantics

import cudf
import numpy as np
s = cudf.Series([1, 2, 3])
sview = s.iloc[:]
s.iloc[[1, 2]] = [4, 5]
assert np.allclose(s.values, sview.values) # => False

sview = s.iloc[:]
s.iloc[0:2] = 3
assert np.allclose(s.values, sview.values) # => True

Note: pandas is moving towards all indexing behaving with copy semantics, so for some of these cases we've already skated to the right answer :)

Why does this happen?

Most (but not all) of the __setitem__-like calls into (e.g. copy_range, scatter) libcudf do not operate in place, but instead return a new cudf::column that must be wrapped up. As a consequence, to pretend like the operation was in place, we call _mimic_inplace(...) to switch out the backing data of the Column object we're doing __setitem__ on:

import cudf
s = cudf.Series([1, 2, 3])
old_data = s._column.data
s.iloc[1:3] = [4, 5]
new_data = s._column.data
assert old_data is new_data # => False

This is kind of fine as long as there's only one object holding on to the column data, but this breaks down as soon as we have views.

Why is the status quo problematic?

  1. The current inconsistencies make implementing copy-on-write rather delicate (and in many cases provoke more copies than needed).
  2. Operations that to the user do not provoke a copy can overflow GPU memory:
     # on a system with 32 GB gpu memory
     import cudf
     import cupy as cp
     import numpy as np
     df = cudf.DataFrame({f"{i}": cp.ones(10**9, dtype=np.uint8) for i in range(20)}) # about 20GB
     # expectation: this behaves in place, so the operation should fit in memory.
     df.iloc[[0, 2]] = list(range(20)) # => MemoryError: std::bad_alloc: out_of_memory: CUDA error at: rmm/mr/device/cuda_memory_resource.hpp
  3. If the scatter/copy_foo operations in libcudf had an in place then we would have lower memory pressure (as point 2) and in the (common) case where we have a target table view, could avoid a memcopy of the whole table.

Possible solutions

I don't know the history as to why the libcudf generally tends to offer "return a copy" rather than "modify in place", but one could make an effort to offer in place versions of most functions. If these operations were available, then the Cython layer could switch to calling into them. In those cases where we really want a copy, we would allocate and copy into an empty table before calling into libcudf.

Edit: modification in place only works at the libcudf level for fixed-width column types (so no strings, lists), and having in- and out-of-place modification for every operation is too much work without some significant motivating use case.

Since we need a work-around that works for string/list columns that cannot by modified in-place anyway, I don't think this issue is a sufficiently motivating use case.

The above solution is a no-go, so what else could we do?

  • Given that we're trying to move to copy-on-write, we could go the other way and audit all places where __setitem__ really is in place, and break that connection. Note that this is not actually copy-on-write, but copy-on-read so it's not a great option. Something close to this probably is copy-on-write, so looks perhaps reasonable.
  • Change the way _mimic_inplace(self, other, inplace=True) works: rather than rewriting where self.data points to, we could instead memcopy from other.data back into self.data and then drop other. This maintains the same memory footprint right now, at the cost of (another) full memcopy, and makes __setitem__ really behave in place (even for views). As pointed out below, this doesn't work for non-fixed-width column dtypes.
@wence- wence- added feature request New feature or request Needs Triage Need team to review and classify labels Oct 25, 2022
@shwina
Copy link
Contributor

shwina commented Oct 25, 2022

Regardless of what solution (if any) we choose longer term:

Change the way _mimic_inplace(self, other, inplace=True) works:

is an acceptable solution that we should be able to implement with very little code change, and should fix our behaviour for many of the buggy cases described.

@shwina
Copy link
Contributor

shwina commented Oct 25, 2022

Well, I guess I lied. Copying into the buffers works only for fixed width types, not variable width types (like strings)

@wence-
Copy link
Contributor Author

wence- commented Oct 26, 2022

Can we leverage the current work introducing copy-on-write semantics (cc @galipremsagar) to square this circle in a nice way?

If we want views to behave like copies what does that mean?

sr = cudf.Series(...)

sview = sr.iloc[:] # (or sr.copy() or sr.loc[:])

# this makes a copy at the libcudf layer right now, 
# which we try and fix up to make sview == sr 
# this works in some cases but not others
sr.iloc[...] = ... 

# What if we wanted sview != sr (copy-on-write)
# Then we would just have to detach the view from sr
# when making the update (which we do in a bunch of cases already)

@galipremsagar
Copy link
Contributor

galipremsagar commented Oct 26, 2022

I think we can make the view mechanism work in cudf with weak references, let me think a bit about this and get back.

@wence-
Copy link
Contributor Author

wence- commented Oct 26, 2022

I think we can make the view mechanism work in cudf with weak references, let me think a bit about this and get back.

Note that I think we are "accidentally" doing this for many cases already (so it's possible no weak references are required)

Edit: we need reference tracking for the cases where a lazy copy is modified using an operation that is actually in-place at the libcudf level.

@GregoryKimball GregoryKimball added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Oct 30, 2022
@wence-
Copy link
Contributor Author

wence- commented Nov 1, 2022

To summarise a discussion with @galipremsagar, @vyasr, and @mroeschke.

Although there are at present inconsistencies in CUDF behaviour, they likely do not bite in too many cases (since people on the whole don't work with views).

The copy-on-write work in #11718 will (in an opt-in manner) remove the inconsistencies by removing the concept of a view (sharing data) and making everything a copy (albeit consed lazily).

In the fixed-width column case there might be a desire to expand the number of modifications in libcudf that actually operate in place (rather than being faked post-hoc via _mimic_inplace), but this becomes a much lower priority once copy-on-write becomes available (and then default).

@wence-
Copy link
Contributor Author

wence- commented Nov 28, 2022

Since this can bite in various circumstances here are some proposals:

Keep track of views and warn on read-after-write/write-after-write

When we create a view child keep a weakref to it on the parent Series/DataFrame (and vice versa). When we do a write into parent (child) if there are any weakrefs that are live, walk through them and mark the child (parent) as invalid: any subsequent read or write on this object would raise an error.

I suspect this is very similar to how the putative copy-on-write implementation keeps track of things and forces copies at the appropriate time. If so, we could probably piggy-back this warning/error system on that implementation (for the case when copy-on-write is off).

Restructure cuDF internals so that setitem/getitem are one level of indirection higher

The problems with views arise because views effectively take references to columns inside a DataFrame or Series which a subsequently write-after-read can't know about. If a view were of a Series and not a Column, then I think the problems in this issue would not exist.

I can't scope how much work this would be, but I suspect a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress feature request New feature or request Python Affects Python cuDF API.
Projects
Status: Todo
Development

No branches or pull requests

4 participants