[ENH/QST] actually inplace updates in `__setitem__` and friends #11990
Comments
Regardless of what solution (if any) we choose longer term:
is an acceptable solution that we should be able to implement with very little code change, and should fix our behaviour for many of the buggy cases described.
Well, I guess I lied. Copying into the buffers works only for fixed-width types, not variable-width types (like strings).
Can we leverage the current work introducing copy-on-write semantics (cc @galipremsagar) to square this circle in a nice way? If we want views to behave like copies what does that mean?
I think we can make the view mechanism work in cudf with weak references. Let me think a bit about this and get back.
Note that I think we are "accidentally" doing this for many cases already.
Edit: we need reference tracking for the cases where a lazy copy is modified using an operation that is actually in-place at the libcudf level.
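To make the reference-tracking point concrete, here is a minimal pure-Python sketch of copy-on-write with weak references. It is not cuDF code: `CowColumn`, `_sharers`, and the list standing in for a device buffer are all hypothetical names chosen for illustration, and the real implementation in #11718 may look quite different.

```python
import weakref


class CowColumn:
    """Hypothetical toy column; a Python list stands in for the device buffer."""

    def __init__(self, data, _sharers=None):
        self.data = data
        # Every lazy copy of one buffer registers itself in the same WeakSet,
        # so dead copies drop out of the count automatically.
        self._sharers = _sharers if _sharers is not None else weakref.WeakSet()
        self._sharers.add(self)

    def copy(self):
        # Lazy copy: share the buffer now and defer the real copy.
        return CowColumn(self.data, self._sharers)

    def __setitem__(self, index, value):
        if len(self._sharers) > 1:
            # Another live column shares this buffer: detach before writing.
            self._sharers.discard(self)
            self.data = list(self.data)
            self._sharers = weakref.WeakSet([self])
        # The buffer is private at this point, so a true in-place write is safe.
        self.data[index] = value


a = CowColumn([1, 2, 3])
b = a.copy()
shared_before_write = b.data is a.data  # True: nothing has been copied yet

b[0] = 10                # triggers the deferred copy for b only
original_buffer = a.data
a[1] = 20                # a is the sole owner again, so no copy is made

print(a.data)  # [1, 20, 3]
print(b.data)  # [10, 2, 3]
```

The write to `a` after `b` has detached goes straight into the original buffer, which is the case where an operation that is genuinely in place at the lower level needs the sharer count to be accurate.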
To summarise a discussion with @galipremsagar, @vyasr, and @mroeschke. Although there are at present inconsistencies in cuDF behaviour, they likely do not bite in too many cases (since people on the whole don't work with views). The copy-on-write work in #11718 will (in an opt-in manner) remove the inconsistencies by removing the concept of a view (sharing data) and making everything a copy (albeit consed lazily). In the fixed-width column case there might be a desire to expand the number of modifications in libcudf that actually operate in place (rather than being faked post-hoc via
Since this can bite in various circumstances, here are some proposals:
Keep track of views and warn on read-after-write/write-after-write
When we create a view
I suspect this is very similar to how the putative copy-on-write implementation keeps track of things and forces copies at the appropriate time. If so, we could probably piggy-back this warning/error system on that implementation (for the case when copy-on-write is off).
Restructure cuDF internals so that setitem/getitem are one level of indirection higher
The problems with views arise because views effectively take references to columns inside a
I can't scope how much work this would be, but I suspect a lot.
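As a rough illustration of the first proposal, the sketch below tracks live views with a `weakref.WeakSet` and warns when a parent column with live views is written to. This is a toy model under stated assumptions, not cuDF code: `TrackedColumn` and `_views` are hypothetical, the write is modelled as an out-of-place update followed by rebinding, and only writes through the parent are tracked.

```python
import gc
import warnings
import weakref


class TrackedColumn:
    """Hypothetical toy column that remembers which views are still alive."""

    def __init__(self, data):
        self.data = data
        self._views = weakref.WeakSet()  # holds views without keeping them alive

    def view(self):
        v = TrackedColumn(self.data)  # shares the same buffer object
        self._views.add(v)
        return v

    def __setitem__(self, index, value):
        if len(self._views) > 0:
            warnings.warn(
                "writing to a column with live views; the views will not see the update",
                RuntimeWarning,
                stacklevel=2,
            )
        # Out-of-place update followed by rebinding, as in the status quo.
        new_data = list(self.data)
        new_data[index] = value
        self.data = new_data


col = TrackedColumn([1, 2, 3])
alias = col.view()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    col[0] = 10
n_with_view = len(caught)  # one warning: alias is still alive

del alias
gc.collect()  # make sure the dead view is dropped on any interpreter

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    col[1] = 20
n_without_view = len(caught)  # no warning: no live views remain
```

Because the set holds only weak references, tracking views does not extend their lifetime, so the warning disappears as soon as the last view is dropped.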
Context
As noted in #11085, in many cases (though inconsistently right now), obtaining a view on a `Series` (probably a `DataFrame` as well) using `iloc[:]` inadvertently behaves with pseudo-copy-on-write semantics.
Note: pandas is moving towards all indexing behaving with copy semantics, so for some of these cases we've already skated to the right answer :)
Why does this happen?
Most (but not all) of the `__setitem__`-like calls into `libcudf` (e.g. `copy_range`, `scatter`) do not operate in place, but instead return a new `cudf::column` that must be wrapped up. As a consequence, to pretend like the operation was in place, we call `_mimic_inplace(...)` to switch out the backing data of the `Column` object we're doing `__setitem__` on.
This is kind of fine as long as there's only one object holding on to the column data, but this breaks down as soon as we have views.
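The breakdown can be reproduced with a toy model in pure Python. This is only a sketch of the rebinding behaviour described above, not the actual cuDF implementation: `ToyColumn` is hypothetical, a list stands in for the device buffer, and the one-argument `_mimic_inplace` here is a simplified stand-in for the real method.

```python
class ToyColumn:
    """Hypothetical stand-in for a column: a thin wrapper around a buffer."""

    def __init__(self, data):
        self.data = data  # a plain list plays the role of the device buffer

    def view(self):
        # A view is a second wrapper around the very same buffer object.
        return ToyColumn(self.data)

    def _mimic_inplace(self, other):
        # Rebind our buffer to the result's buffer. Nothing is written into
        # the old buffer, so existing views keep seeing the old data.
        self.data = other.data

    def __setitem__(self, index, value):
        # Out-of-place "scatter": build a brand-new column...
        new_data = list(self.data)
        new_data[index] = value
        # ...then pretend the update happened in place.
        self._mimic_inplace(ToyColumn(new_data))


col = ToyColumn([1, 2, 3])
alias = col.view()
col[0] = 10

print(col.data)    # [10, 2, 3]
print(alias.data)  # [1, 2, 3]  (the view silently went stale)
```

With a single owner the rebinding is indistinguishable from an in-place write; the stale `alias` only appears once a second wrapper holds the old buffer.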
Why is the status quo problematic?
Possible solutions
I don't know the history as to why libcudf generally tends to offer "return a copy" rather than "modify in place", but one could make an effort to offer in-place versions of most functions. If these operations were available, then the Cython layer could switch to calling into them. In those cases where we really want a copy, we would allocate and copy into an empty table before calling into libcudf.
Edit: modification in place only works at the libcudf level for fixed-width column types (so no strings, lists), and having in- and out-of-place modification for every operation is too much work without some significant motivating use case.
Since we need a work-around that works for string/list columns, which cannot be modified in place anyway, I don't think this issue is a sufficiently motivating use case.
The above solution is a no-go, so what else could we do?
`__setitem__` really is in place, and break that connection. Note that this is not actually copy-on-write, but copy-on-read so it's not a great option.
Something close to this probably is copy-on-write, so looks perhaps reasonable.
Change the way `_mimic_inplace(self, other, inplace=True)` works: rather than rewriting where `self.data` points to, we could instead `memcopy` from `other.data` back into `self.data` and then drop `other`. This maintains the same memory footprint right now, at the cost of (another) full `memcopy`, and makes `__setitem__` really behave in place (even for views). As pointed out below, this doesn't work for non-fixed-width column dtypes.
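A minimal sketch of the copy-back idea, and of why it stops working for variable-width data, might look like this. Everything here is a toy model: `ToyColumn`, `_mimic_inplace_memcopy`, and `set_byte` are hypothetical names, and a `bytearray` stands in for a device buffer.

```python
class ToyColumn:
    """Hypothetical toy column backed by a bytearray."""

    def __init__(self, data):
        self.data = data

    def view(self):
        return ToyColumn(self.data)  # shares the same buffer object

    def _mimic_inplace_memcopy(self, other):
        # Copy the result back into our existing buffer instead of rebinding.
        # This only makes sense when both buffers have the same size.
        if len(other.data) != len(self.data):
            raise ValueError("buffer sizes differ; cannot copy back in place")
        self.data[:] = other.data  # overwrites contents, keeps the buffer object

    def set_byte(self, index, value):
        result = ToyColumn(bytearray(self.data))  # out-of-place result
        result.data[index] = value
        self._mimic_inplace_memcopy(result)       # extra full copy, then drop result


# Fixed-width case: the view observes the update.
fixed = ToyColumn(bytearray([1, 2, 3]))
alias = fixed.view()
fixed.set_byte(0, 10)
print(list(alias.data))  # [10, 2, 3]

# Variable-width case: a longer string needs a bigger character buffer,
# so the result no longer fits into the original allocation.
chars = ToyColumn(bytearray(b"abc"))
longer = ToyColumn(bytearray(b"abcdef"))
try:
    chars._mimic_inplace_memcopy(longer)
    copied = True
except ValueError:
    copied = False
```

The size check is the toy equivalent of the fixed-width restriction: once an update can change the buffer length, there is no existing allocation to copy back into.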