-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API / CoW: Copy NumPy arrays by default in DataFrame constructor #51731
API / CoW: Copy NumPy arrays by default in DataFrame constructor #51731
Conversation
and (dtype is None or is_dtype_equal(values.dtype, dtype)) | ||
and copy_on_sanitize | ||
): | ||
values = np.array(values, order="F", copy=copy_on_sanitize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the order="F" thing something we should be doing in general in cases with copy=True?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are operating on the transposed array when copying a DataFrame object, so not needed in that case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xref #44871 took a look at preserving order when doing copy. there are some tradeoffs here
pandas/core/frame.py
Outdated
@@ -685,6 +685,8 @@ def __init__( | |||
# INFO(ArrayManager) by default copy the 2D input array to get | |||
# contiguous 1D arrays | |||
copy = True | |||
elif using_copy_on_write() and isinstance(data, np.ndarray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential alternative is to express this in the inverse: .. and not isinstance(data, (Series, DataFrame))
.
That would ensure that also other "array-likes" that would coerce zero-copy into a numpy array but are not exactly np.ndarray would get copied by default (not fully sure how our constructor currently handles such array like objects, though).
I think in the end only for pandas objects we can keep track of references, so only in that case the default should be a shallow copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try and see what breaks
pandas/core/construction.py
Outdated
@@ -762,6 +764,9 @@ def _try_cast( | |||
|
|||
subarr = maybe_cast_to_integer_array(arr, dtype) | |||
else: | |||
subarr = np.array(arr, dtype=dtype, copy=copy) | |||
if using_copy_on_write(): | |||
subarr = np.array(arr, dtype=dtype, copy=copy, order="F") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the order="F" specifically needed for CoW? (and can you add a comment about it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is about #50756
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I understand that it's related to that, but I don't understand why we would only do it for CoW? The default is not to copy arrays (without CoW) at the moment, which creates this inefficient layout; but so if the user manually specifies copy=True in the constructor, why not directly copy it to the desired layout in all cases without the if/else check for CoW?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah did not think about this, will add
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on @jbrockmendel s comment above I think we should leave it out for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on @jbrockmendel s comment above I think we should leave it out for now
Leave it out in general (from this PR), or you mean not do it for the default mode for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely default mode and maybe also split the copy and the layout change into two PRs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reiterating preference for this to be separate. The two of you are much more familiar with the CoW logic than most of the rest of us; i get antsy when i see using_copy_on_write checks appearing in new places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed all order relevant changes, is more straightforward now
pandas/conftest.py
Outdated
np.random.randn(10, 3), | ||
index=index, | ||
columns=Index(["A", "B", "C"], name="exp"), | ||
copy=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the False "needed" here (did it otherwise give failures), or just for efficiency since this is an example case where we know the array is not owned by anyone else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copying here causes one test to fail, which is very weird(the failure). Haven't looked closer yet, but the test is useless as soon as your read_only pr is merged.
Want to understand what's off there nevertheless though
@@ -57,7 +57,10 @@ def test_fillna_on_column_view(self, using_copy_on_write): | |||
|
|||
# i.e. we didn't create a new 49-column block | |||
assert len(df._mgr.arrays) == 1 | |||
assert np.shares_memory(df.values, arr) | |||
if using_copy_on_write: | |||
assert not np.shares_memory(df.values, arr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or pass copy=False to the constructor instead?
Because now the check above about assert np.isnan(arr[:, 0]).all()
is kind of useless because arr
was copied and so of course will not be modified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, since this is inplace=True
and there are no other references to df
, shouldn't arr
be modified also with CoW before this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we are doing df[0] which is a chained assignment case. Updated the test to set copy=False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, missed the [0]
..
I think many of the users/cases passing a single ndarray to DataFrame expect that to be no-copy and for it would help if we could disentangle "this makes CoW behave better" from "this makes reductions more performant" as the motivation here |
This is just a solve two things with one pr thing here, happy to remove the order if you think this causes problems. the actual problem I am trying to solve is the following:
If we don’t do a copy in the constructor, updating the array will update all dataframes as well Edit: my motivation was that if we do a copy anyway, we can also change the order as well |
# Conflicts: # pandas/tests/frame/methods/test_transpose.py
|
||
# TODO(CoW): This should raise a chained assignment error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this to the list in the overview issue #48998
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx
assert df.to_numpy(copy=False).base is arr | ||
if using_copy_on_write: | ||
assert df.values.base is not arr | ||
assert df.to_numpy(copy=False).base is not arr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add something like df.to_numpy(copy=False).base is df.values.base
(because I think this was part of the intention of the test to verify that to_numpy(copy=False)
didn't make a copy, and not so much that DataFrame(arr)
doesn't make a copy)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -129,7 +129,10 @@ def test_transpose_get_view_dt64tzget_view(self): | |||
assert result._mgr.nblocks == 1 | |||
|
|||
rtrip = result._mgr.blocks[0].values | |||
assert np.shares_memory(arr._ndarray, rtrip._ndarray) | |||
if using_copy_on_write: | |||
assert not np.shares_memory(arr._ndarray, rtrip._ndarray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly here, for the intent of the test, I think we should still try to verify that df.T
shares the memory with df
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation looks good! (I don't have a strong opinion about splitting off the order="F" changes. Doing it separately might make it easier to do some extra perf tests on the examples that were identified below as slower, but personally I think it's also fine to just keep as is here in the PR).
Only added some comments on the tests.
Should we add some explicit tests to copy_view/test_constructors.py
?
@@ -306,18 +306,24 @@ def test_constructor_dtype_nocast_view_2d_array( | |||
assert df2._mgr.arrays[0].flags.c_contiguous | |||
|
|||
@td.skip_array_manager_invalid_test | |||
def test_1d_object_array_does_not_copy(self): | |||
def test_1d_object_array_does_not_copy(self, using_copy_on_write): | |||
# https://github.com/pandas-dev/pandas/issues/39272 | |||
arr = np.array(["a", "b"], dtype="object") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or add copy=False
here to keep the test as is? (and to keep coverage of that case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, done
# https://github.com/pandas-dev/pandas/issues/39272 | ||
arr = np.array([["a", "b"], ["c", "d"]], dtype="object") | ||
df = DataFrame(arr) | ||
assert np.shares_memory(df.values, arr) | ||
if using_copy_on_write: | ||
assert not np.shares_memory(df.values, arr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
elif ( | ||
using_copy_on_write() | ||
and isinstance(values, np.ndarray) | ||
and (dtype is None or is_dtype_equal(values.dtype, dtype)) | ||
and copy_on_sanitize | ||
): | ||
values = np.array(values, copy=copy_on_sanitize) | ||
values = _ensure_2d(values) | ||
|
||
elif isinstance(values, (np.ndarray, ExtensionArray, ABCSeries, Index)): | ||
# drop subclass info | ||
values = np.array(values, copy=copy_on_sanitize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, don't know if this changed from the original version or I missed it before, but what is this additional check exactly doing? Because it seems to do exactly the same as the next elif
block, and whenever you fullfill the new elif check, you would also fulfill the existing one (since the existing elif will pass whenever values
is an ndarray, but without the extra checks)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, this is a leftover from the order change, removed it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a whatsnew note?
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
One other question, but not for this PR: this does it for the DataFrame constructor; so we should still do a follow-up for the Series constructor as well? |
Good point, added |
Yes we should, will put up a pr when this is merged |
@@ -190,6 +190,10 @@ Copy-on-Write improvements | |||
of Series objects and specifying ``copy=False``, will now use a lazy copy | |||
of those Series objects for the columns of the DataFrame (:issue:`50777`) | |||
|
|||
- The :class:`DataFrame` constructor, when constructing from a NumPy array, | |||
will now copy the array by default to avoid mutating the :class:`DataFrame` | |||
when mutating the array. Specify ``copy=False`` to get the old behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a warning like (about copy=False): "in that case pandas does not guarantee correct Copy-on-Write behaviour in case the numpy array would get modified after creating the DataFrame"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
Thanks! |
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement. |
I'll open a backport pr |
…das-dev#51731) Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.