Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SettingWithCopy dependence on reference count #14150

Closed
pkch opened this issue Sep 4, 2016 · 9 comments · Fixed by #56614
Closed

SettingWithCopy dependence on reference count #14150

pkch opened this issue Sep 4, 2016 · 9 comments · Fixed by #56614
Labels
Bug Copy / view semantics Indexing Related to indexing on series/frames, not to indexes themselves Warnings Warnings that appear or should be added to pandas

Comments

@pkch
Copy link

pkch commented Sep 4, 2016

As a follow-up to my question, I believe the current approach is that pandas assumes that the programmer intends assignment to a child DF/S to be propagated to the parent DF/S as long as the parent has a non-zero reference count.

The problems like the ones I described would be avoided if child DataFrame always warned on assignment regardless of the reference count of the parent object - UNLESS a copy() method or DataFrame constructor was explicitly applied to the slice. (If desired query and/or filter can be documented as copy methods as well.)

(Making ref count incremented when the child is a view is not enough because the code that works normally might one day stop working - with SettingWithCopyWarning that but might be ignored if it appear on the client site - simply due to the change in the input data.)

If this approach is followed, the documentation can also be clarified as follows:

When applying indexing, filter (???) and some other operations (???) to a parent DF/S pandas may create a view or copy of the parent depending on conditions that are too complicated to describe. This creates a possibility of a subtle bug in the code when assigning to the child DF/S. To help catch such potential bugs, pandas assumes that the programmer intends assignment to a child DF/S to be propagated to the parent DF/S unless it is explicitly copied (using .copy(), .query(), DataFrame constructor, etc??? ); when pandas cannot guarantee such behavior, it generates SettingWithCopyWarning. Note that this happens regardless of whether the parent DF/S is reachable in the current context or even whether it still exists as an object.

@jreback
Copy link
Contributor

jreback commented Sep 4, 2016

pls show an explicit example

@pkch
Copy link
Author

pkch commented Sep 4, 2016

Examples (more context in my SO question):

import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data[['a', 'b']]
data = data['a']
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data
assert data[0] == 1

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a']
data = data['a']
new_data.loc[0] = 100 # no warning, propagates to data
assert data[0] == 100

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data = data.groupby('a')
new_data = data.filter(lambda g: len(g)==1)
new_data.loc[0, 'a'] = 100 # no warning, does not propagate to data
assert data.filter(lambda g: True).loc[0, 'a'] == 1

Edit to my original post: I think it would also be useful to have a method, say unlink(), that explicitly breaks the link between a DataFrame slice and its parent DataFrame. It's not the same as copy() because copy() always copies, while unlink() would be a no-op if the slice happens to already lose its link to the parent (e.g., due to the dtypes combination, in-memory arrangement, etc., or because it or or copy() was already called earlier).

@jreback
Copy link
Contributor

jreback commented Sep 4, 2016

you can just set:

is_copy = None

object dtypes always copy and are never views

this while earning bizness is a mess because we don't have copy-on-write

I don't think this is likely to be addressed in current pandas and will have to wait for pandas 2.0

that said if you want to whip up some tests that mimic current behavior we can put them in a special area so as to note behavior that should change

@pkch
Copy link
Author

pkch commented Sep 5, 2016

Of course, I understand that CopyOnWrite is the only proper resolution. But pandas 2.0 is probably 3-6 years away. I'm not sure about the amount of effort to make the warning more comprehensive (e.g., as I suggested or otherwise); if it's not that big maybe worth it? After all, in the meantime this is the only protection from these bugs.

@jreback
Copy link
Contributor

jreback commented Sep 5, 2016

you are misunderstanding
we are freezing for pandas 1.0 by the end of the year
2.0 is slated for next year - so it's not 3-6 years (where did that come from?)
but more like 1-1.5

@jreback
Copy link
Contributor

jreback commented Sep 5, 2016

you are certainly welcome to PR a fix for this - but it requires s non trivial effort

@pkch
Copy link
Author

pkch commented Sep 5, 2016

Oh sorry, I didn't realize pandas 2.0 is not that far. I naively assumed 1.0-2.0 is going to take comparable time to 0-1.0. In that case, fixing SettingWithCopyWarning is not a high priority.

@jreback
Copy link
Contributor

jreback commented Sep 5, 2016

yeah 1.x will be long term supported with very few
API changes

@jreback jreback added Testing pandas testing functions or related to the test suite Compat pandas objects compatability with Numpy or Python functions labels Sep 5, 2016
@jreback jreback added this to the 2.0 milestone Sep 5, 2016
@nickeubank
Copy link
Contributor

@pkch FYI, I made some progress on setting up a PR for keeping track of children of DataFrames (#12036) -- something you may have seen it looks like, but worth flagging. I set it aside since there were aspirations of CopyonWrite getting implemented in 1.0 (#11970), though now appears that it's been pushed to 2.0 (wesm/pandas2#10). You may find some of the components useful if you decide to try and get something in before 2.0.

@jreback jreback added Future and removed Future labels Mar 16, 2017
@mroeschke mroeschke added Warnings Warnings that appear or should be added to pandas Bug and removed Compat pandas objects compatability with Numpy or Python functions Testing pandas testing functions or related to the test suite labels Apr 10, 2020
@jbrockmendel jbrockmendel added the Indexing Related to indexing on series/frames, not to indexes themselves label Aug 5, 2020
@mroeschke mroeschke removed this from the 2.0 milestone Jan 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Copy / view semantics Indexing Related to indexing on series/frames, not to indexes themselves Warnings Warnings that appear or should be added to pandas
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants