Skip to content

BUG: CoW and np.asarray leads to surprising issues #52823

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
bashtage opened this issue Apr 21, 2023 · 9 comments
Open

BUG: CoW and np.asarray leads to surprising issues #52823

bashtage opened this issue Apr 21, 2023 · 9 comments

Comments

@bashtage
Copy link
Contributor

bashtage commented Apr 21, 2023

Reproducible Example

import pandas as pd
import numpy as np

x = pd.Series([1,2,3.0])
y = np.asarray(x)
y[1:] = np.nan

pd.options.mode.copy_on_write = True
w = pd.Series([1,2,3.0])
z = np.asarray(w)
z[1:] = np.nan

Issue Description

This probably isn't a bug but it makes interaction outside of pandas considerably harder.

It seems that it would be good to somewhere have a formal way too ensure that casts to numpy arrays are always writable. Perhaps a writeable keyword in to_numpy?

Expected Behavior

Maybe leave the CoW behavior behind when moving outside of pandas.

Installed Versions

INSTALLED VERSIONS

commit : 478d340
python : 3.11.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22621
machine : AMD64
processor : AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 2.0.0
numpy : 1.23.5
pytz : 2023.3
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.0.1
Cython : 0.29.33
pytest : 7.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli :
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.0.dev0+1817.98c51a6
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@bashtage bashtage added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 21, 2023
@bashtage
Copy link
Contributor Author

As a side note, enabling CoW on statsmodels leads to ~300 test failures.

@bashtage
Copy link
Contributor Author

xref numpy/numpy#23625

@phofl
Copy link
Member

phofl commented Apr 21, 2023

This is intended for now. There was some discussion in another issue. We can reopen if you like.

Can you share a stats models build?

@bashtage
Copy link
Contributor Author

I"m just testing locally after statsmodels/statsmodels#8811 was reported. I am planning on setting up a canary build with CoW enabled, and I'll post that when it is done.

@DeaMariaLeon DeaMariaLeon added Copy / view semantics and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 21, 2023
@jbrockmendel
Copy link
Member

Maybe leave the CoW behavior behind when moving outside of pandas.

I'm leaning towards this (also on the front-end)

@jorisvandenbossche
Copy link
Member

As a side note, enabling CoW on statsmodels leads to ~300 test failures.

@bashtage To understand what was going on in this specific case: the issue was not that you were actually relying on being able to modify those arrays, but rather that you were passing those arrays to cython code that couldn't handle read-only (const) memoryviews (statsmodels/statsmodels#8820). Is that correct?
And what that the only/main source of failures, or are there more?

I think the cython issue is going to be a very common one for code passing data from DataFrames to cython algos.
(but also: making your code robust for const arrays can be useful in general, as there are other reasons you can end up with read-only arrays, for example when converted from pyarrow in certain ways)

@bashtage
Copy link
Contributor Author

A lot of them where caes where const was needed in Cython. There were others where a DataFrame was intentionally created, e.g., to use a groupby, and then asarray was used on that. The output of the asarray when then further modified in place (or attempted), which failed.

There are bigger question as to what asarray should return when getting a DataFrame, especially when extension arrays become much more common (i.e., strings).

@jorisvandenbossche
Copy link
Member

There were others where a DataFrame was intentionally created, e.g., to use a groupby, and then asarray was used on that. The output of the asarray when then further modified in place (or attempted), which failed.

For cases like this, where you know that you created the DataFrame yourself (not passed in by the user) or is result of a calculation that is never a view of original data (an aggregation), it might be nice if we can provide an easier way to get a writeable array.

I think the current recommendation would be: if you know you can safely write to the array, "just" change the flag:

# ... code that creates a dataframe
df = ...
arr = np.asarray(df)
# I know it is safe to ignore the readonly flag
df.flags.readonly = False
# ... code that modifies `arr` in place

On the one hand, this is not super nice, and we could think about some keyword in to_numpy() to override this? (so you could do arr = df.to_numpy(keep_writeable_i_know_what_im_doing=True) but with a decent name)
On the other hand, doing this more manually also makes it very explicit that this is just overwriting that flag (and not "ensuring" the return value is writeable by making a copy of it, as one would typically expect).

@jorisvandenbossche
Copy link
Member

Note that you can already do arr = df.to_numpy(copy=True) if you want to be sure to have a writeable result, and don't care about making the extra copy (that setting the flag like the above avoids).
So I think this already should give you what a potential arr = np.asarray(df, writable=True) would do as proposed in numpy/numpy#23625

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants