False positive SettingWithCopyWarning when just taking subset of columns #16550

Closed
jorisvandenbossche opened this issue May 31, 2017 · 7 comments · Fixed by #56614
Labels
Bug Copy / view semantics Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@jorisvandenbossche
Member

jorisvandenbossche commented May 31, 2017

From an example of @glemaitre: a simple case of taking a subset of the columns and then working further with it raises a warning (a case where I would have thought pandas should be able to detect that no warning is needed).

After some experimenting, it seems that it is only triggered when the frame is first printed:

Simple case raising the false positive warning:

In [47]: df = pd.DataFrame(np.random.randn(5, 5), columns=list('ABCDE'))

In [48]: df
Out[48]: 
          A         B         C         D         E
0  0.101315 -0.940874  0.848323 -1.114318  0.093271
1  0.085363  0.201148  0.852091 -0.000424 -0.490293
2 -0.227004 -0.882167 -0.153934  0.679528  2.049475
3  0.977241 -0.661771  1.367731 -0.675444  0.544696
4 -1.347269  1.286316 -0.742564  1.247596 -0.100017

In [49]: df = df[['A', 'B', 'C']]

In [50]: df['new'] = [1, 2, 3, 4, 5]
/home/joris/miniconda3/envs/dev/bin/ipython:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  #!/home/joris/miniconda3/envs/dev/bin/python

but when not displaying the frame, it does not give a warning:

In [44]: df = pd.DataFrame(np.random.randn(5, 5), columns=list('ABCDE'))

In [45]: df = df[['A', 'B', 'C']]

In [46]: df['new'] = [1, 2, 3, 4, 5]

This is on master:

In [51]: pd.show_versions()
/home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/xarray/core/formatting.py:16: FutureWarning: The pandas.tslib module is deprecated and will be removed in a future version.
  from pandas.tslib import OutOfBoundsDatetime

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-78-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0.dev+96.gef487d9
pytest: 3.0.3
pip: 9.0.1
setuptools: 34.2.0
Cython: 0.24.1
numpy: 1.11.3
scipy: 0.18.1
xarray: 0.9.5
IPython: 6.0.0
sphinx: 1.5.2
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.0.0
tables: 3.3.0
numexpr: 2.6.2
feather: 0.3.1
matplotlib: 2.0.2
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.3
lxml: None
bs4: 4.5.3
html5lib: 0.9999999
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.5
s3fs: 0.0.7
pandas_gbq: None
pandas_datareader: None

@jorisvandenbossche
Member Author

Ah, IPython is keeping a reference to the output (so the actual dataframe) when you display df, so that might interfere with pandas' detection?

@jreback
Contributor

jreback commented May 31, 2017

Could be. I don't think this is a bug. Maybe add a test to verify?

@jorisvandenbossche
Member Author

Tried it in a plain Python console and got the exact same issue (displaying the frame triggers a false positive warning later on), so it is not an IPython issue.

@jorisvandenbossche jorisvandenbossche added Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves labels Jun 1, 2017
@TomAugspurger
Contributor

@jorisvandenbossche the regular python REPL assigns the last output value to _. Did you assign anything else to that in between? e.g. this doesn't warn:

Python 3.6.1 (default, Apr  4 2017, 09:40:21)
[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd; import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(5, 5), columns=list('ABCDE'))
>>> df
          A         B         C         D         E
0 -0.677381 -1.461584 -1.122520 -0.426673 -0.123061
1 -0.040582 -1.679008 -1.299333  0.391486 -0.230494
2  1.337958 -0.168764 -2.475485 -0.022909 -1.010323
3  0.366143  1.017938  0.911115  0.269221 -1.118281
4 -1.162755  0.385203  0.405872  2.177704 -1.601564
>>> 2  # just to overwrite _
2
>>> df = df[['A', 'B', 'C']]
>>> df['new'] = [1, 2, 3, 4, 5]
>>>
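
The effect in this session can be reproduced without pandas. The sketch below is only an illustration and makes an assumption that is not confirmed anywhere in this thread: that the check depends on whether a weakly referenced parent object is still alive. `Frame`, `subset`, `parent_alive` and `run_session` are hypothetical names, not pandas API. `sys.__displayhook__` is the standard-library hook the plain REPL calls for a bare expression; it prints the value and stores it in `builtins._`.

```python
import gc
import sys
import weakref


class Frame:
    """Hypothetical stand-in for a DataFrame; this is not a pandas class."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.parent_ref = None  # weak reference to the frame this one was taken from

    def subset(self, columns):
        # Return a new object that only weakly remembers its parent.
        child = Frame(columns)
        child.parent_ref = weakref.ref(self)
        return child

    def parent_alive(self):
        gc.collect()  # do not rely on reference counting alone
        return self.parent_ref is not None and self.parent_ref() is not None


def run_session(overwrite_underscore):
    df = Frame("ABCDE")
    sys.__displayhook__(df)      # like typing `df` at the prompt: prints, sets builtins._
    if overwrite_underscore:
        sys.__displayhook__(2)   # like typing `2`: `_` no longer holds the frame
    df = df.subset("ABC")        # the name `df` is rebound to the subset
    return df.parent_alive()


kept_alive = run_session(overwrite_underscore=False)
released = run_session(overwrite_underscore=True)
print(kept_alive, released)
```

Under that assumption, the parent survives only while `_` still points at it, which matches the observation that displaying something else in between makes the warning go away.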

@jorisvandenbossche
Member Author

Ah, yes, of course!
So if the frame was not the last thing displayed, it indeed does not raise.

Could we (and would we want to) try to discriminate between such variables (like `_`) and actual variables?

@TomAugspurger
Contributor

TomAugspurger commented Jun 1, 2017 via email

@dumbledad

I have the same issue ("taking a subset of the columns and then further working with it"), but in my case the further work was already in the suggested form, which made the error message doubly confusing, i.e. this line of code

d_and_a.loc[:,'next_tool_id'] = d_and_a['tool_id'].shift(-1)

gave this error

Try using .loc[row_indexer,col_indexer] = value instead
