Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy on Write via weakrefs #11207

Closed
wants to merge 2 commits into from

Conversation

nickeubank
Copy link
Contributor

Aims to eventually close #10954 , alternative to #10973

In the interest of advancing the discussion on this, here's a skeleton of a different copy on write implementation.

Currently

When setting on a view, converts to copy.

 df = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
 intermediate = df.loc[1:1,]
 intermediate['col1'] = -99


intermediate

Out[8]: 
   col1  col2
1   -99     4

df

Out[9]: 
   col1  col2
0     1     3
1     2     4

Chained-indexing will ALWAYS fail.

df = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
df.loc[1:1,]['col1'] = -99
df

Out[11]: 
   col1  col2
1     2     4

SettingWithCopy warning disabled, thought not fully removed (figure this is too early stage to really chase that down).

Forward Protection for loc setting (proof of concept, added oct 2):

 df = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
 intermediate = df.loc[1:1,]

 df.loc[1,'col1'] = -99
 intermediate

  Out[11]: 
   col1  col2
0     1     3
1     2     4

Goals / Places I'd love help

  • We've discussed keeps full-column slices as views, since we can guarantee these will always be views. I created an attribute called _is_column_view that can be set to True when a slice is a full column (e.g. new = df['col1'], but I'm not sure how to actually figure out if a slice is a full column, so right now it's initialized as False and just stays that way.
  • This has no protection against forward propagation -- situations where one sets values on a Frame and they propagate to views of that Frame. There IS a framework to address it, but it's incomplete. As it stands:
    • Frames now have a _children list attribute to keep references to their child views.
    • Before setting, there's a call to a function to try and convert the children from views to copies.
      But as it stand, I don't know how to really manage those references (i.e. how to put references to children in that _children list).

@jreback
@shoyer
@JanSchulz
@TomAugspurger

@jreback
Copy link
Contributor

jreback commented Sep 30, 2015

u need to have tests

doesn't matter if it's a demo or not

these are the first things u should write to demonstrate the API

@nickeubank
Copy link
Contributor Author

@jreback ok, added!

@nickeubank
Copy link
Contributor Author

Added an example of forward protection. NOT generalizable or at all robust, just illustrative for one setting.

@shoyer
Copy link
Member

shoyer commented Oct 2, 2015

This might work in a reasonably efficient way if you make _children a WeakValueDictionary (keyed by object id?). Certainly seems better than using garbage collection in totally ad-hoc ways...

@nickeubank
Copy link
Contributor Author

@shoyer would you mind explaining a little more? You mean just a floating global dictionary as opposed to an object attribute?

Main problem I'm having now (love suggestions @jreback @shoyer ): where to place _convert_children() call. Is there a single function that's always called when data is changed where I can put it? Or a short list of functions where it should be put?

@shoyer
Copy link
Member

shoyer commented Oct 2, 2015

The problem with making _children a list is that you can't efficiently remove items from a list. Putting children in the values of dictionary solves that problem, and WeakValueDictionary even solves the cleanup for you automatically, because it removes items when they are garbage collected.

@nickeubank
Copy link
Contributor Author

oh, oh, I see. Great!

@jreback
Copy link
Contributor

jreback commented Oct 2, 2015

this list of _children would only be a very small number of items, so it won't make much difference

@shoyer
Copy link
Member

shoyer commented Oct 2, 2015

I'm still a fan of using WeakValuesDictionary rather than rolling our own
solution. Either way weak refs are a more viable approach than hijacking
the garbage collector.

On Fri, Oct 2, 2015 at 2:52 PM, Jeff Reback notifications@github.com
wrote:

this list of _children would only be a very small number of items, so it
won't make much difference


Reply to this email directly or view it on GitHub
#11207 (comment).

@jreback
Copy link
Contributor

jreback commented Oct 2, 2015

well if you get it to work, all ears.

if you do

a = df['foo']

you don't have a ref, so not really sure what you are trying to track

@nickeubank
Copy link
Contributor Author

That's the one case we wanted to stay as a view (full column slices) right?
So that might be a feature not a bug...
On Fri, Oct 2, 2015 at 3:06 PM Jeff Reback notifications@github.com wrote:

well if you get it to work, all ears.

if you do

a = df['foo']

you don't have a ref, so not really sure what you are trying to track


Reply to this email directly or view it on GitHub
#11207 (comment).

@nickeubank
Copy link
Contributor Author

Also: I don't yet understand the internals well enough to know where all I
need to put self._children.append() and _convert_children() -- this is just
an attempt to figure out the machinery in one specific case of subsetting
and setting. Guidance on bottlenecks for where to put those calls to cover
all cases would be much appreciated.
On Fri, Oct 2, 2015 at 3:10 PM Nick Eubank nickeubank@gmail.com wrote:

That's the one case we wanted to stay as a view (full column slices)
right? So that might be a feature not a bug...
On Fri, Oct 2, 2015 at 3:06 PM Jeff Reback notifications@github.com
wrote:

well if you get it to work, all ears.

if you do

a = df['foo']

you don't have a ref, so not really sure what you are trying to track


Reply to this email directly or view it on GitHub
#11207 (comment).

@nickeubank
Copy link
Contributor Author

@jreback @shoyer OK, I think this is getting close to working... Will keep working on this, but if anyone has any input or sees problems, input appreciated!

Still trying to figure out columns... (i.e. to keep df['col1'] as a view).

@nickeubank nickeubank changed the title Simple copy on write skeleton Copy on Write via weakrefs Oct 30, 2015
@nickeubank
Copy link
Contributor Author

@shoyer @jreback OK, now keeps dictionary-like column slices (v = df['col1']) as views as our one special case.

@nickeubank
Copy link
Contributor Author

Moving to new thread:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Views, Copies, and the SettingWithCopyWarning Issue
4 participants