Copy on write using weakrefs #11500

nickeubank · 2015-11-01T16:41:21Z

Working model of copy-on-write. Aims to close #10954, alternative to #10973, extension of #11207.

Copy-on-Write Behavior:

Setting on a child doesn't affect the parent, but views are still used where possible for efficiency

parent = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
child = parent.loc[0:0,]
child._is_view

Out[1]: True

child.loc[0:0] = -88
child

Out[2]: 
   col1  col2
0   -88   -88

parent

Out[3]: 
   col1  col2
0     1     3
1     2     4

Setting on parent doesn't affect child

parent = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
child = parent.loc[0:0,]
child._is_view

Out[4]: True

parent.loc[0:0, 'col1'] = -88
child

Out[5]: 
   col1  col2
0     1     3

parent

Out[6]: 
   col1  col2
0   -88     3
1     2     4

One exception is dictionary-like access, where views are preserved
(as suggested here: #10954 (comment) )

parent = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
child = parent['col1']
child._is_view

Out[7]: True

parent.loc[0:0, 'col1'] = -88
child

Out[8]: 
0   -88
1     2
Name: col1, dtype: int64

parent

Out[9]: 
   col1  col2
0   -88     3
1     2     4

Safe for views of views

parent = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
child = parent.loc[0:0,]
child._is_view
Out[20]: True

child_of_child = child.loc[0:0,'col1':'col1']
child_of_child._is_view

Out[21]: True

child_of_child.loc[0:0, 'col1'] = -88
child_of_child

Out[22]: 
   col1
0   -88

child

Out[23]: 
   col1  col2
0     1     3

parent

Out[24]: 
   col1  col2
0     1     3
1     2     4

Chained indexing behavior now consistent

Will always fail unless the first call is a dictionary-like call for a single series (since that will always be a view)

Will fail if first call not dict-like

parent = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
parent.loc[0:0, 'col1']['col1'] = -88
parent

Out[10]: 
   col1  col2
0     1     3
1     2     4

Will always succeed if first call dict-like

parent = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
parent['col1'].loc[0:0] = -88
parent

Out[11]: 
   col1  col2
0   -88     3
1     2     4

To Do:

Get feedback on implementation;
if sound in behavior and principles, remove all SettingWithCopy machinery
Add docs

@jreback
@TomAugspurger
@shoyer
@JanSchulz
cc @ellisonbg
cc @CarstVaartjes

shoyer · 2015-11-01T20:07:02Z

First of all -- this is really great! Thank you for pushing on this.

A few specific design issues:

There are two obvious ways to do copy-on-write for views. We could copy the parent dataframe, or we could copy all the children (and their children, recursively). Right now, it looks like you're doing both? I think that's unnecessary. If I'm right, let's consider which option has more favorable performance.
Why are you using a list instead of a WeakValueDictionary? The latter seems like it might be a better choice -- it's automatically cleaned up when children are garbage collected. With your current implementation, I think you are missing the check to verify that your weakrefs are valid.
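To make the first question concrete, here is a toy sketch of the two strategies in plain Python. `ToyFrame`, `take_view`, and the two `set_*` methods are hypothetical names, not pandas APIs, and a Python list stands in for the underlying data block. It only models writes to the parent.

```python
import weakref


class ToyFrame:
    """Hypothetical stand-in for a DataFrame; a list plays the data block."""

    def __init__(self, data):
        self._data = data
        self._children = weakref.WeakSet()  # live views of this object

    def take_view(self):
        child = ToyFrame(self._data)  # shares the same list: a "view"
        self._children.add(child)
        return child

    def set_copying_parent(self, i, value):
        # Strategy A: the writer detaches itself; children keep the old buffer.
        if len(self._children) > 0:
            self._data = list(self._data)
            self._children = weakref.WeakSet()
        self._data[i] = value

    def set_copying_children(self, i, value):
        # Strategy B: every live child gets a private copy; the writer
        # then mutates its own buffer in place.
        for child in list(self._children):
            child._data = list(child._data)
        self._children = weakref.WeakSet()
        self._data[i] = value


parent_a = ToyFrame([1, 2])
child_a = parent_a.take_view()
parent_a.set_copying_parent(0, -88)

parent_b = ToyFrame([1, 2])
child_b = parent_b.take_view()
parent_b.set_copying_children(0, -88)

print(parent_a._data, child_a._data)
print(parent_b._data, child_b._data)
```

In both cases the parent sees the write and the child does not, so either strategy alone is enough for isolation. In this toy, strategy A copies one buffer per write while strategy B copies once per live child; which is cheaper in pandas would depend on the relative sizes of parent and children.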

Finally, we are definitely going to need some comprehensive performance checks. How does the ASV benchmark suite compare after this change? What types of uses used to be fast but now are slow? These will need to be carefully documented.

jreback · 2015-11-01T20:09:17Z

you need to have all of the original tests in here
some might work / be modified but you cannot eliminate tests

nickeubank · 2015-11-01T20:44:23Z

Thanks both for the feedback!

@shoyer

In the simple case -- a parent who is not a view and a child who is a view -- there won't be any copying of the parent because of the if statement at 1235 of generic.py. But I'll think about weirder cases like views of views...
I thought about this, but wasn't sure how to key a dictionary. Thankfully, weakrefs are actually self-cleaning -- if the object they refer to gets garbage collected, calling them returns None:

import weakref
import pandas as pd

test_df = pd.DataFrame()
test_wr = weakref.ref(test_df)
test_df = 'redirect_to_string'
test_wr() is None
Out[36]: True

But you are correct that I need to add an if child() is None check at the top of _convert_views_to_copies.
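A minimal sketch of that dead-reference check, assuming children are stored as a list of `weakref.ref` objects. `Node`, `add_child`, and `live_children` are hypothetical stand-ins; the PR's actual `_convert_views_to_copies` is not shown in this thread.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for a pandas object that tracks its views."""

    def __init__(self, name):
        self.name = name
        self._child_refs = []  # list of weakref.ref to child views

    def add_child(self, child):
        self._child_refs.append(weakref.ref(child))

    def live_children(self):
        live = []
        for ref in self._child_refs:
            child = ref()  # None once the child has been collected
            if child is None:
                continue
            live.append(child)
        # Drop dead references so the list does not keep stale entries.
        self._child_refs = [weakref.ref(c) for c in live]
        return live


parent = Node("parent")
kept = Node("kept")
dropped = Node("dropped")
parent.add_child(kept)
parent.add_child(dropped)

del dropped
gc.collect()  # CPython frees the object on `del`; collect() covers other runtimes

names = [c.name for c in parent.live_children()]
print(names)
```

Without the `is None` check, any code that calls methods on `ref()` would fail on the collected child.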

Could you please point me to some docs on this ASV? Never worked with it before.

@jreback Would you mind expanding on your comment / clarifying a little? I don't think I've deleted any tests. Do you mean that you think this seems promising enough it's worth digging into the testing modules to see what works and what fails with these modifications and edit the failing tests? Happy to do so, just wanted to see if people felt this path was sufficiently promising to warrant that work.

shoyer · 2015-11-01T20:54:03Z

@nickeubank

you're right that unnecessary copies aren't done if there are no views. My point is that it is unnecessary to copy both the parent AND its children -- only one is sufficient.
use id(obj) as the key. That's guaranteed to be unique as long as an object is in memory. The main reason I prefer using WeakValueDictionary to list is that the latter can be a memory leak -- if you repeatedly index a DataFrame, the list will grow without bound, even if the indexed dataframes themselves are garbage collected.
see here for ASV: http://pandas.pydata.org/pandas-docs/stable/contributing.html#running-the-performance-test-suite
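The `id(obj)` keying suggested above can be sketched as follows. `Frame` and `take_view` are hypothetical stand-ins rather than pandas code. The sketch only illustrates that a `WeakValueDictionary` entry disappears when its value is garbage collected, whereas a plain list of weakrefs keeps one dead entry per indexing call.

```python
import gc
import weakref


class Frame:
    """Hypothetical stand-in for a DataFrame that records its views."""

    def __init__(self):
        self._children = weakref.WeakValueDictionary()  # id(child) -> child
        self._child_list = []                           # list-based variant

    def take_view(self):
        child = Frame()
        self._children[id(child)] = child
        self._child_list.append(weakref.ref(child))
        return child


parent = Frame()
for _ in range(1000):
    parent.take_view()  # result discarded, so each child is freed right away

gc.collect()
print(len(parent._children))    # entries for dead children are gone
print(len(parent._child_list))  # one dead weakref per call remains
```

Note that `id()` values can be reused once an object is freed, which is safe here only because the dictionary drops an entry when its value dies.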

jreback · 2015-11-01T21:01:00Z

you have essentially turned off all of the old setting with copy tests
so these need to be adjusted to the new behavior
but all of the cases are valid

nickeubank · 2015-11-02T17:31:52Z

@jreback are you talking about runtime tests in core files like generic.py, or the nosetests in the test files like test_generic.py?

@shoyer Got it. Revising now. Also realized there is one place where this breaks, so I need to add another tweak to the implementation. Will ping when done.

One general question: Is there a better way to know if something is a view than ._is_view? Seems to give false positives -- makes it safe, but leads to some extra copying. For example:

df = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
df._is_view
Out[1]: False

df.loc[0,'col1'] = -88
df._is_view
Out[2]: True

nickeubank · 2015-11-06T23:52:35Z

@shoyer @jreback Performance checks

@shoyer Think your suggestions all integrated. Still probably some excess copying because _is_view is conservative (sometimes calls things views when they aren't), but not sure how to fix

@jreback I'm afraid I'm still a little unclear on what you're asking for. I know you do a lot on these sites so brevity is your friend, but could you please be a little more expansive / point to a specific example?

    before     after       ratio
  [eb66bccc] [cc6c9b1b]
+  116.74ms      1.99s     17.00  frame_methods.frame_insert_500_columns_end.time_frame_insert_500_columns_end
+  185.65ms      1.32s      7.10  plotting.plot_timeseries_period.time_plot_timeseries_period
+   17.64ms   120.52ms      6.83  timeseries.timeseries_series_offset_fast.time_timeseries_series_offset_fast
+   19.03ms   104.11ms      5.47  timeseries.timeseries_datetimeindex_offset_fast.time_timeseries_datetimeindex_offset_fast
+   43.55ms   234.87ms      5.39  sparse.sparse_frame_constructor.time_sparse_frame_constructor
+   45.87ms   109.18ms      2.38  frame_methods.frame_count_level_axis0_multi.time_frame_count_level_axis0_multi
-    1.78ms   781.24μs      0.44  replace.replace_replacena.time_replace_replacena
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

jreback · 2015-11-07T01:47:11Z

the perf issues need to be addressed

kawochen · 2015-11-07T09:05:15Z

pandas/core/generic.py

-        return self._get_item_cache(item)
+        result = self._get_item_cache(item)
+
+        if isinstance(item, str):


Why only str?

jreback · 2015-11-07T15:31:52Z

@nickeubank you changed code but you never fixed tests. e.g. SettingWithCopyError is no longer defined. So it's impossible to see if what you did works or not.

You will need to carefully address each of the existing tests, as these define the current API. These must all pass and/or change as appropriate. This ensures that the API is preserved. You can certainly change things, but these must be well-defined and explicitly communicated.

Generally having green PRs gives much more confidence. This is the point of tests, to try to map out the API as much as possible. To cover edge cases and to provide consistency among operations. This of course is not guaranteed, but helps a lot. Examining code only goes so far. Python is far too dynamic a language to just rely on code inspection for correctness.

nickeubank · 2015-11-07T16:05:17Z

@jreback Ah, ok -- thanks so much for the longer explanation! Will do!

nickeubank · 2015-11-21T16:51:16Z

@jreback thanks again for explanation on tests -- now see wisdom. Updated tests, and fixed a few issues that came up.

Open Issues

Currently the only place views are preserved is column slices of data frames with simple indices. Is there an analogous case for panels where we can always offer views?
Are single column slices from multi-index DataFrames always views (i.e. do we have same guarantee as with single-index)?
The attribute _is_column_view I added in generic.py does not appear to be preserved when a DataFrame is pickled and re-loaded. Any suggestions on how to fix? Had to comment out one test for that...

Other notes

I removed the copy option from merge, since it's pretty inconsistent with the new copy-on-write paradigm.

Performance
Fixed main performance issue, still some work to do on dates.

    before     after       ratio
   [eb66bccc] [f4f8e562]
+   14.66ms    98.57ms      6.73  timeseries.timeseries_datetimeindex_offset_fast.time_timeseries_datetimeindex_offset_fast
+  187.66ms      1.18s      6.30  plotting.plot_timeseries_period.time_plot_timeseries_period
+   16.61ms   100.92ms      6.07  timeseries.timeseries_series_offset_fast.time_timeseries_series_offset_fast
+   24.86ms    61.02ms      2.46  frame_methods.frame_getitem_single_column2.time_frame_getitem_single_column2
+   45.47ms   108.02ms      2.38  frame_methods.frame_count_level_axis0_multi.time_frame_count_level_axis0_multi
+   26.26ms    60.99ms      2.32  frame_methods.frame_getitem_single_column.time_frame_getitem_single_column
-    1.72ms   715.28μs      0.41  replace.replace_replacena.time_replace_replacena
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

cc: @shoyer

jreback · 2015-11-23T00:09:07Z

this should work, nothing special about multiindexes
these attributes should not be pickled. Simply make them a class attribute defaulted to None.

You are still removing LOTS of tests. Don't remove any! You need to individually go through and fix/correct them. You need to assert that the entire existing API is unchanged (which is defined via the tests). Clearly if something is asserting that SettingWithCopyWarning is raised it should be changed and tested that it is a view (or not a view as the case may be).

The entire point of the test suite is to avoid changing behavior accidentally.

nickeubank · 2015-12-01T04:31:21Z

@jreback

On 2: sorry, wasn't clear. It isn't attribute value I want to pickle; it's the existence of the attribute.

In[17]:
df = pd.DataFrame({"A": [1,2]})
df._is_column_view
Out[18]: False

In[19]:
df.to_pickle('test')
df2=pd.read_pickle('test')
df2._is_column_view
Traceback (most recent call last):

 File "<ipython-input-22-18fc2f02c57c>", line 1, in <module>
    df2._is_column_view

 File "/Users/Nick/github/pandas/pandas/core/generic.py", line 2290, in __getattr__
   return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute '_is_column_view'

Do I need to rebuild somewhere?

Will work on tests!

jreback · 2015-12-01T11:57:09Z

you need to define it as None on the class (e.g. notice the is_copy = None which you are removing), e.g.

class NDFrame(...):
     .....
     _is_column_view = None

then when you set it in an instance it is overridden only in that instance. This is a feature of python.
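A small illustration of that fallback, using a hypothetical `Base` class rather than `NDFrame`. `Base.__new__(Base)` is used here to mimic an object rebuilt without `__init__` having run, which is how pickle reconstructs instances by default.

```python
class Base:
    _is_column_view = None  # class-level default, shared by all instances

    def __init__(self):
        self.values = [1, 2]


plain = Base()
column = Base()
column._is_column_view = True  # stored on this instance only

# Mimic an instance rebuilt without __init__ (as default unpickling does).
restored = Base.__new__(Base)

print(plain._is_column_view)     # falls back to the class attribute
print(column._is_column_view)    # instance attribute shadows the default
print(restored._is_column_view)  # still resolves, no AttributeError
print("_is_column_view" in vars(column), "_is_column_view" in vars(plain))
```

Because the lookup falls through to the class, the attribute exists on every instance regardless of what was stored in the pickled state.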

nickeubank · 2015-12-16T19:31:32Z

@jreback @shoyer Quick update:

Most things working well and I think will work eventually. Currently need to address:

- Issues with __init__ in frame.py:
  - Need to prevent creation of views of numpy arrays. The copy-on-write machinery is all at the pandas level, so I can only support views of other pandas objects.
  - Need to make sure that, if passed a DataFrame, it creates an actual view and doesn't just share the ._data object (see Change pandas to never let objects share _data, create views instead #11855).
  - For some reason, if I just set copy=True as the default in __init__, I get a segmentation fault in test_frame.py. Need to figure that out.
- Need to prevent indexing operations from ever letting objects share _data objects instead of creating views (see Change pandas to never let objects share _data, create views instead #11855).
- Need to work through all tests.

However, at the moment I need to take a hiatus on this to work on a draft of my dissertation, so I'm going to have to set this aside for ~4 weeks. Haven't dropped -- I still really believe in the importance of this and that it will work -- but taking up more time than I have at the moment (especially given how much background research I'm having to learn about python to do all this...).

jreback · 2015-12-16T20:00:15Z

@nickeubank np, thanks for the effort so far!

jreback · 2016-01-11T13:49:28Z

xref #11970

as this might take a bit, feel free to reopen if updated

jreback · 2016-01-13T20:09:01Z

@nickeubank hmm, I can't seem to reopen this, very odd.

kawochen · 2016-01-13T20:14:02Z

because HEAD has changed since

nickeubank · 2016-01-13T20:21:41Z

k will reopen as new pr

nickeubank force-pushed the copy_on_write branch from 97f914b to 72692f5 Compare November 3, 2015 21:49

kawochen reviewed Nov 7, 2015

jreback added the API Design label Nov 7, 2015

nickeubank force-pushed the copy_on_write branch from a988404 to 8e4daac Compare November 12, 2015 04:34

nickeubank force-pushed the copy_on_write branch 2 times, most recently from 5098923 to aa337ef Compare November 20, 2015 18:26

Nick Eubank added 7 commits November 20, 2015 10:56

Basic working implementation

f35187a

now keeps columns as views

90ff5ea

check for dead weakrefs

4c9e5a9

store children in weakValueDictionary instead of list

0260b91

Remove redundant copying

8cf0bbd

Add back reference to origin df to protect against broken chains of v…

b41ef8f

…iews

make _is_view more conservative on multi-block objects

1814085

tweak so addition of new columns doesn't cause copying; improved tests, removed copy option from merge

nickeubank force-pushed the copy_on_write branch from aa337ef to 1814085 Compare November 20, 2015 18:56

jreback mentioned this pull request Dec 2, 2015

Improve(?) explanation of SettingWithCopy warning #11746

Merged

fix multi-index behavior

d62b9fa

nickeubank force-pushed the copy_on_write branch from 1dc1d98 to d62b9fa Compare December 4, 2015 22:35

fix test_merge

fba6993

nickeubank mentioned this pull request Dec 16, 2015

Change pandas to never let objects share _data, create views instead #11855

Closed

jreback closed this Jan 11, 2016

nickeubank mentioned this pull request Jan 13, 2016

ENH/INT: libpandas refactor #11970

Closed

nickeubank mentioned this pull request Jan 14, 2016

Copy on write using weakrefs (part 2) #12036

Closed

wesm mentioned this pull request Sep 5, 2016

Copy on write for views wesm/pandas2#10

Open

jorisvandenbossche added the Copy / view semantics label Sep 6, 2020

nickeubank mentioned this pull request Jun 28, 2021

Proposal for future copy / view semantics in indexing operations #36195

Closed
