API / CoW: Copy NumPy arrays by default in DataFrame constructor #51731

phofl · 2023-03-01T23:47:08Z

Parent issue: API / CoW: copy/view behaviour when constructing DataFrame/Series from a numpy array #50776
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

jbrockmendel · 2023-03-03T01:07:30Z

pandas/core/internals/construction.py

+        and (dtype is None or is_dtype_equal(values.dtype, dtype))
+        and copy_on_sanitize
+    ):
+        values = np.array(values, order="F", copy=copy_on_sanitize)


is the order="F" thing something we should be doing in general in cases with copy=True?

We are operating on the transposed array when copying a DataFrame object, so not needed in that case

xref #44871 took a look at preserving order when doing copy. there are some tradeoffs here
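
For readers less familiar with the layout question, here is a minimal NumPy-only sketch of what `order="F"` changes when an array is copied. The array shape and the variable names are chosen for illustration only, and the sketch does not reproduce the pandas code path.

```python
import numpy as np

# A 2D array as users typically create it: row-major (C order).
arr = np.arange(6).reshape(2, 3)

# Copy into column-major (Fortran) order, as in the diff above.
f_copy = np.array(arr, order="F", copy=True)

# The copy holds the same values but owns separate memory.
same_values = bool((f_copy == arr).all())
shares = np.shares_memory(arr, f_copy)

# In the F-ordered copy each column is one contiguous run of memory,
# while a column of the C-ordered original is strided.
col_contiguous_f = f_copy[:, 0].flags.c_contiguous
col_contiguous_c = arr[:, 0].flags.c_contiguous

# Transposing the F-ordered copy yields a C-contiguous array.
transpose_is_c = f_copy.T.flags.c_contiguous
```

Whether the column-major layout is a net win depends on the operations performed afterwards, which is the tradeoff mentioned above.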

jorisvandenbossche · 2023-03-03T09:27:22Z

pandas/core/frame.py

@@ -685,6 +685,8 @@ def __init__(
                # INFO(ArrayManager) by default copy the 2D input array to get
                # contiguous 1D arrays
                copy = True
+            elif using_copy_on_write() and isinstance(data, np.ndarray):


Potential alternative is to express this in the inverse: .. and not isinstance(data, (Series, DataFrame)).

That would ensure that also other "array-likes" that would coerce zero-copy into a numpy array but are not exactly np.ndarray would get copied by default (not fully sure how our constructor currently handles such array like objects, though).
I think in the end only for pandas objects we can keep track of references, so only in that case the default should be a shallow copy?

Let's try and see what breaks
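
A rough sketch of the "inverse" check suggested above. The helper `copies_by_default` and the `ArrayLike` class are hypothetical names introduced here for illustration; they are not part of pandas, and the real condition in the diff additionally depends on `using_copy_on_write()`.

```python
import numpy as np
import pandas as pd


def copies_by_default(data) -> bool:
    """Hypothetical helper: should the constructor copy `data` by default?

    Only pandas objects can have their references tracked, so anything
    else (an ndarray or another array-like) would be copied.
    """
    return not isinstance(data, (pd.Series, pd.DataFrame))


class ArrayLike:
    """Toy array-like that can hand out its buffer via __array__."""

    def __init__(self, values):
        self._values = np.asarray(values)

    def __array__(self, dtype=None, copy=None):
        return self._values


ndarray_result = copies_by_default(np.array([1, 2]))
array_like_result = copies_by_default(ArrayLike([1, 2]))
series_result = copies_by_default(pd.Series([1, 2]))
frame_result = copies_by_default(pd.DataFrame({"a": [1, 2]}))
```

With an `isinstance(data, np.ndarray)` check, the `ArrayLike` instance would not be copied by default; with the inverse check it would be.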

jorisvandenbossche · 2023-03-03T09:29:34Z

pandas/core/construction.py

@@ -762,6 +764,9 @@ def _try_cast(

        subarr = maybe_cast_to_integer_array(arr, dtype)
    else:
-        subarr = np.array(arr, dtype=dtype, copy=copy)
+        if using_copy_on_write():
+            subarr = np.array(arr, dtype=dtype, copy=copy, order="F")


Why is the order="F" specifically needed for CoW? (and can you add a comment about it)

This is about #50756

Yes, I understand that it's related to that, but I don't understand why we would only do it for CoW. The default (without CoW) is currently not to copy arrays, which creates this inefficient layout. But if the user manually specifies copy=True in the constructor, why not copy directly to the desired layout in all cases, without the if/else check for CoW?

Ah did not think about this, will add

Based on @jbrockmendel's comment above, I think we should leave it out for now

Leave it out in general (from this PR), or you mean not do it for the default mode for now?

Definitely default mode and maybe also split the copy and the layout change into two PRs?

Reiterating my preference for this to be separate. The two of you are much more familiar with the CoW logic than most of the rest of us; I get antsy when I see using_copy_on_write checks appearing in new places

I removed all order-related changes; it is more straightforward now

jorisvandenbossche · 2023-03-03T09:34:09Z

pandas/conftest.py

+        np.random.randn(10, 3),
+        index=index,
+        columns=Index(["A", "B", "C"], name="exp"),
+        copy=False,


Is the False "needed" here (did it otherwise give failures), or just for efficiency since this is an example case where we know the array is not owned by anyone else?

Copying here causes one test to fail, and the failure is very weird. I haven't looked closer yet, but the test becomes useless as soon as your read_only PR is merged.

I still want to understand what's off there, though

jorisvandenbossche · 2023-03-03T09:38:10Z

pandas/tests/frame/methods/test_fillna.py

@@ -57,7 +57,10 @@ def test_fillna_on_column_view(self, using_copy_on_write):

        # i.e. we didn't create a new 49-column block
        assert len(df._mgr.arrays) == 1
-        assert np.shares_memory(df.values, arr)
+        if using_copy_on_write:
+            assert not np.shares_memory(df.values, arr)


Or pass copy=False to the constructor instead?

Because now the check above about assert np.isnan(arr[:, 0]).all() is kind of useless because arr was copied and so of course will not be modified.

Also, since this is inplace=True and there are no other references to df, shouldn't arr be modified also with CoW before this PR?

No, we are doing df[0] which is a chained assignment case. Updated the test to set copy=False

Ah, yes, missed the [0] ..

jbrockmendel · 2023-03-03T16:26:12Z

I think many of the users/cases passing a single ndarray to DataFrame expect that to be no-copy and for pd.DataFrame(arr).values to round-trip without making a copy. Asking them to pass copy=False isn't a huge burden, but it does add some friction to asking them to try out CoW, which I'm increasingly excited about.

It would help if we could disentangle "this makes CoW behave better" from "this makes reductions more performant" as the motivation here

phofl · 2023-03-03T16:31:34Z

This is just a case of solving two things with one PR; happy to remove the order change if you think it causes problems.

The actual problem I am trying to solve is the following:

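import numpy as np
from pandas import DataFrame

arr = np.array([[1, 2], [3, 4]])
df = DataFrame(arr)
# Do some operations that don't copy
arr[0, 0] = 100  # without a copy in the constructor, df sees this write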

If we don’t do a copy in the constructor, updating the array will update all dataframes as well

Edit: my motivation was that if we do a copy anyway, we can also change the order as well
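
To make the aliasing concrete, a small sketch that passes `copy` explicitly so the outcome does not depend on whether Copy-on-Write is enabled. The names `df_view` and `df_copy` are illustrative only.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4]])

# copy=False: the DataFrame is backed by the caller's buffer.
df_view = pd.DataFrame(arr, copy=False)

# copy=True: the DataFrame gets its own buffer.
df_copy = pd.DataFrame(arr, copy=True)

# Mutate the original array after both frames exist.
arr[0, 0] = 100

view_value = int(df_view.iloc[0, 0])  # follows the write to arr
copy_value = int(df_copy.iloc[0, 0])  # unaffected by the write
```

The PR makes the `copy=True` behaviour the default for NumPy input when Copy-on-Write is enabled, since pandas has no way to track references held by the array's owner.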

jorisvandenbossche · 2023-03-13T22:18:52Z

pandas/tests/frame/methods/test_fillna.py


+        # TODO(CoW): This should raise a chained assignment error


Added this to the list in the overview issue #48998

jorisvandenbossche · 2023-03-13T22:23:16Z

pandas/tests/frame/methods/test_to_numpy.py

-        assert df.to_numpy(copy=False).base is arr
+        if using_copy_on_write:
+            assert df.values.base is not arr
+            assert df.to_numpy(copy=False).base is not arr


Can you add something like df.to_numpy(copy=False).base is df.values.base (because I think this was part of the intention of the test to verify that to_numpy(copy=False) didn't make a copy, and not so much that DataFrame(arr) doesn't make a copy)
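
A sketch of the distinction being drawn here: whether the constructor copied and whether `to_numpy(copy=False)` copied are two separate checks. It uses `np.shares_memory` rather than comparing `.base` attributes, which is an assumption made for this sketch and not how the test in the PR is written.

```python
import numpy as np
import pandas as pd

arr = np.array([[1.0, 2.0], [3.0, 4.0]])

# Explicit copy, mirroring the new default under Copy-on-Write.
df = pd.DataFrame(arr, copy=True)

no_copy = df.to_numpy(copy=False)

# Check 1: the constructor copied, so the frame does not alias arr.
frame_aliases_input = np.shares_memory(df.values, arr)

# Check 2: to_numpy(copy=False) did not copy the frame's own data.
to_numpy_aliases_frame = np.shares_memory(no_copy, df.values)
```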

jorisvandenbossche · 2023-03-13T22:24:26Z

pandas/tests/frame/methods/test_transpose.py

@@ -129,7 +129,10 @@ def test_transpose_get_view_dt64tzget_view(self):
        assert result._mgr.nblocks == 1

        rtrip = result._mgr.blocks[0].values
-        assert np.shares_memory(arr._ndarray, rtrip._ndarray)
+        if using_copy_on_write:
+            assert not np.shares_memory(arr._ndarray, rtrip._ndarray)


Similarly here, for the intent of the test, I think we should still try to verify that df.T shares the memory with df?

jorisvandenbossche

Implementation looks good! (I don't have a strong opinion about splitting off the order="F" changes. Doing it separately might make it easier to do some extra perf tests on the examples that were identified below as slower, but personally I think it's also fine to just keep as is here in the PR).
Only added some comments on the tests.

Should we add some explicit tests to copy_view/test_constructors.py?

jorisvandenbossche · 2023-03-13T22:25:44Z

pandas/tests/frame/test_constructors.py

@@ -306,18 +306,24 @@ def test_constructor_dtype_nocast_view_2d_array(
            assert df2._mgr.arrays[0].flags.c_contiguous

    @td.skip_array_manager_invalid_test
-    def test_1d_object_array_does_not_copy(self):
+    def test_1d_object_array_does_not_copy(self, using_copy_on_write):
        # https://github.com/pandas-dev/pandas/issues/39272
        arr = np.array(["a", "b"], dtype="object")


Or add copy=False here to keep the test as is? (and to keep coverage of that case)

jorisvandenbossche · 2023-03-13T22:26:02Z

pandas/tests/frame/test_constructors.py

        # https://github.com/pandas-dev/pandas/issues/39272
        arr = np.array([["a", "b"], ["c", "d"]], dtype="object")
        df = DataFrame(arr)
-        assert np.shares_memory(df.values, arr)
+        if using_copy_on_write:
+            assert not np.shares_memory(df.values, arr)


jorisvandenbossche · 2023-03-15T20:07:17Z

pandas/core/internals/construction.py

+    elif (
+        using_copy_on_write()
+        and isinstance(values, np.ndarray)
+        and (dtype is None or is_dtype_equal(values.dtype, dtype))
+        and copy_on_sanitize
+    ):
+        values = np.array(values, copy=copy_on_sanitize)
+        values = _ensure_2d(values)
+
    elif isinstance(values, (np.ndarray, ExtensionArray, ABCSeries, Index)):
        # drop subclass info
        values = np.array(values, copy=copy_on_sanitize)


Sorry, I don't know if this changed from the original version or I missed it before, but what exactly is this additional check doing? It seems to do exactly the same as the next elif block, and whenever you fulfill the new elif check, you would also fulfill the existing one (since the existing elif will pass whenever values is an ndarray, without the extra checks)

Good point, this is a leftover from the order change, removed it

pandas/tests/copy_view/test_constructors.py

jorisvandenbossche

Add a whatsnew note?

jorisvandenbossche · 2023-03-15T20:14:57Z

One other question, but not for this PR: this does it for the DataFrame constructor; so we should still do a follow-up for the Series constructor as well?

phofl · 2023-03-15T20:16:34Z

Good point, added

phofl · 2023-03-15T20:16:58Z

Yes we should, will put up a pr when this is merged

jorisvandenbossche · 2023-03-15T20:18:49Z

doc/source/whatsnew/v2.0.0.rst

@@ -190,6 +190,10 @@ Copy-on-Write improvements
  of Series objects and specifying ``copy=False``, will now use a lazy copy
  of those Series objects for the columns of the DataFrame (:issue:`50777`)

+- The :class:`DataFrame` constructor, when constructing from a NumPy array,
+  will now copy the array by default to avoid mutating the :class:`DataFrame`
+  when mutating the array. Specify ``copy=False`` to get the old behavior.


Should we add a warning like (about copy=False): "in that case pandas does not guarantee correct Copy-on-Write behaviour in case the numpy array would get modified after creating the DataFrame"?
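
One user-side precaution that fits the caveat above, sketched under the assumption that the caller owns the array and can mark it read-only. This is not something the PR does; it only illustrates how a caller passing `copy=False` could rule out later writes through the array.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4]])

# User-side guard (an assumption of this sketch, not pandas behaviour):
# forbid writes through the original array before sharing it.
arr.flags.writeable = False

df = pd.DataFrame(arr, copy=False)

try:
    arr[0, 0] = 100
    mutated = True
except ValueError:
    # NumPy refuses to assign into a read-only array.
    mutated = False

first_value = int(df.iloc[0, 0])
```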

jorisvandenbossche · 2023-03-17T14:11:52Z

Thanks!

lumberbot-app · 2023-03-17T14:12:14Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 2.0.x
git pull

Cherry-pick the first parent branch of this PR on top of the older branch:

git cherry-pick -x -m1 d534007e4cf07b8a8070f0ff9fe8875e5566f2d8

You will likely have some merge/cherry-pick conflict here, fix them and commit:

git commit -am 'Backport PR #51731: API / CoW: Copy NumPy arrays by default in DataFrame constructor'

Push to a named branch:

git push YOURFORK 2.0.x:auto-backport-of-pr-51731-on-2.0.x

Create a PR against branch 2.0.x, I would have named this PR:

"Backport PR #51731 on branch 2.0.x (API / CoW: Copy NumPy arrays by default in DataFrame constructor)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

phofl · 2023-03-17T14:14:43Z

I'll open a backport pr

phofl added 2 commits March 1, 2023 22:48

Implement copy by default for numpy array

563257e

Fix tests

f3161a3

phofl marked this pull request as draft March 1, 2023 23:47

phofl and others added 4 commits March 2, 2023 01:34

Simplify

3a95311

Fix

17cf5ae

Remove

07aa26d

Merge branch 'main' into cow_copy_numpy_array

8e84d85

jbrockmendel reviewed Mar 3, 2023

jorisvandenbossche reviewed Mar 3, 2023

phofl added 3 commits March 3, 2023 10:55

Add comment

f3ccf0f

Fix

5cdc6ad

Change logic

3e384ea

phofl added 4 commits March 4, 2023 00:37

Merge remote-tracking branch 'upstream/main' into cow_copy_numpy_array

49ee53f

Fix tests

fcc7be2

Fix test

9223836

Fix test

a474bf5

phofl marked this pull request as ready for review March 3, 2023 23:45

Merge branch 'main' into cow_copy_numpy_array

d5a0268

mroeschke added the Copy / view semantics label Mar 6, 2023

Merge remote-tracking branch 'upstream/main' into cow_copy_numpy_array

0be7fc6

# Conflicts: # pandas/tests/frame/methods/test_transpose.py

jorisvandenbossche added this to the 2.0 milestone Mar 13, 2023

jorisvandenbossche added the Constructors Series/DataFrame/Index/pd.array Constructors label Mar 13, 2023

jorisvandenbossche changed the title ~~CoW: Implement copy by default for df construction with NumPy array~~ API / CoW: Copy NumPy arrays by default in DataFrame constructor Mar 13, 2023

jorisvandenbossche mentioned this pull request Mar 13, 2023

Copy-on-Write (PDEP-7) follow-up overview issue #48998

Open

38 tasks

jorisvandenbossche reviewed Mar 13, 2023

phofl and others added 3 commits March 15, 2023 14:34

Fix test

be9cb04

Merge branch 'main' into cow_copy_numpy_array

4bf3ee8

Fix array manager

e2eceec

jorisvandenbossche reviewed Mar 15, 2023

Remove elif

65965c6

jorisvandenbossche reviewed Mar 15, 2023

pandas/tests/copy_view/test_constructors.py

jorisvandenbossche approved these changes Mar 15, 2023

Update pandas/tests/copy_view/test_constructors.py

ecb756c

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

phofl added 2 commits March 15, 2023 21:15

Merge remote-tracking branch 'upstream/main' into cow_copy_numpy_array

8e837d9

Add whatsnew

177fbbc

jorisvandenbossche reviewed Mar 15, 2023

jorisvandenbossche mentioned this pull request Mar 15, 2023

CoW: Set copy=False in internal usages of Series/DataFrame constructors #51834

Merged

5 tasks

phofl and others added 2 commits March 16, 2023 13:10

Merge branch 'main' into cow_copy_numpy_array

2bbff3b

Add note

5bef4ba

jorisvandenbossche merged commit d534007 into pandas-dev:main Mar 17, 2023

lumberbot-app bot added the Still Needs Manual Backport label Mar 17, 2023

phofl deleted the cow_copy_numpy_array branch March 17, 2023 14:14

phofl added a commit to phofl/pandas that referenced this pull request Mar 17, 2023

API / CoW: Copy NumPy arrays by default in DataFrame constructor (pan…

3826ad7

…das-dev#51731) Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

phofl mentioned this pull request Mar 17, 2023

Backport PR #51731 on branch 2.0.x (API / CoW: Copy NumPy arrays by default in DataFrame constructor) #52047

Merged

phofl removed the Still Needs Manual Backport label Mar 17, 2023

jorisvandenbossche mentioned this pull request Mar 17, 2023

CoW: Set copy=False explicitly internally for Series and DataFrame in io/pytables #52032

Merged

thomasjpfan mentioned this pull request Apr 24, 2023

MNT Use copy=False when creating DataFrames scikit-learn/scikit-learn#26272

Merged

jorisvandenbossche mentioned this pull request Feb 16, 2024

PERF: DataFrame(ndarray) constructor ensure to copy to column-major layout #57459

Merged
