Copy on write using weakrefs #11500

Closed
wants to merge 9 commits into from

nickeubank
Contributor

Working model of copy-on-write. Aims to close #10954, alternative to #10973, extension of #11207.

Copy-on-Write Behavior:

Setting on child doesn't affect parent, but still uses views when it can for efficiency

parent = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
child = parent.loc[0:0,]
child._is_view

Out[1]: True

child.loc[0:0] = -88
child

Out[2]: 
   col1  col2
0   -88   -88

parent

Out[3]: 
   col1  col2
0     1     3
1     2     4

Setting on parent doesn't affect child

parent = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
child = parent.loc[0:0,]
child._is_view

Out[4]: True

parent.loc[0:0, 'col1'] = -88
child

Out[5]: 
   col1  col2
0     1     3

parent

Out[6]: 
   col1  col2
0   -88     3
1     2     4

One exception is dictionary-like access, where views are preserved
(as suggested here: #10954 (comment) )

parent = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
child = parent['col1']
child._is_view

Out[7]: True

parent.loc[0:0, 'col1'] = -88
child

Out[8]: 
0   -88
1     2
Name: col1, dtype: int64

parent

Out[9]: 
   col1  col2
0   -88     3
1     2     4

Safe for views of views

parent = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
child = parent.loc[0:0,]
child._is_view
Out[20]: True

child_of_child = child.loc[0:0,'col1':'col1']
child_of_child._is_view

Out[21]: True

child_of_child.loc[0:0, 'col1'] = -88
child_of_child

Out[22]: 
   col1
0   -88

child

Out[23]: 
   col1  col2
0     1     3

parent

Out[24]: 
   col1  col2
0     1     3
1     2     4

Chained indexing behavior now consistent

Will always fail unless first call is a dictionary-like call for a single series (since that will always be a view)

Will fail if first call not dict-like

parent = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
parent.loc[0:0, 'col1']['col1'] = -88
parent

Out[10]: 
   col1  col2
0     1     3
1     2     4

Will always succeed if first call dict-like

parent = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
parent['col1'].loc[0:0] = -88
parent

Out[11]: 
   col1  col2
0   -88     3
1     2     4

To Do:

  • Get feedback on implementation;
  • if sound in behavior and principles, remove all SettingWithCopy machinery
  • Add docs

@jreback
@TomAugspurger
@shoyer
@JanSchulz
cc @ellisonbg
cc @CarstVaartjes

@shoyer
Member

shoyer commented Nov 1, 2015

First of all -- this is really great! Thank you for pushing on this.

A few specific design issues:

  1. There are two obvious ways to do copy-on-write for views. We could copy the parent dataframe, or we could copy all the children (and their children, recursively). Right now, it looks like you're doing both? I think that's unnecessary. If I'm right, let's consider which option has more favorable performance.
  2. Why are you using a list instead of a WeakValueDictionary? The latter seems like it might be a better choice -- it's automatically cleaned up when children are garbage collected. With your current implementation, I think you are missing the check to verify that your weakrefs are valid.
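
A minimal sketch of the single-copy idea in point 1, using plain Python lists rather than pandas internals. The `Table` class and its `view()` and `_detach()` methods are hypothetical names invented for illustration; this is not the PR's code. A write to a parent copies only its views (recursively), and a write to a view copies only that view and its own views, so the parent's storage is never duplicated.

```python
import weakref


class Table:
    """Toy stand-in for a DataFrame: rows live in a list that views share."""

    def __init__(self, rows, parent=None):
        self._rows = rows        # may be shared with a parent or with views
        self._parent = parent    # strong reference from view to parent
        # id(child) -> child; an entry disappears when the child is collected
        self._children = weakref.WeakValueDictionary()

    def view(self):
        """Return a child that shares this table's row storage (no copy yet)."""
        child = Table(self._rows, parent=self)
        self._children[id(child)] = child
        return child

    def _detach(self):
        """Give this table, and every view below it, private storage."""
        for child in list(self._children.values()):
            child._detach()
        self._rows = list(self._rows)
        if self._parent is not None:
            self._parent._children.pop(id(self), None)
            self._parent = None

    def __setitem__(self, i, value):
        if self._parent is not None:
            # Writing to a view: copy the view, leave the parent untouched.
            self._detach()
        else:
            # Writing to a parent: copy the views, keep the parent's storage.
            for child in list(self._children.values()):
                child._detach()
        self._rows[i] = value

    def __getitem__(self, i):
        return self._rows[i]


parent = Table([1, 2, 3])
child = parent.view()
grandchild = child.view()
shared_before_write = child._rows is parent._rows

parent[0] = -88
```

After the write, `parent[0]` is `-88` while `child[0]` and `grandchild[0]` still read `1`, matching the "setting on parent doesn't affect child" behavior described in the PR summary.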

Finally, we are definitely going to need some comprehensive performance checks. How does the ASV benchmark suite compare after this change? What types of uses used to be fast but now are slow? These will need to be carefully documented.

@jreback
Contributor

jreback commented Nov 1, 2015

you need to have all of the original tests in here
some might work / be modified but you cannot eliminate tests

@nickeubank
Contributor Author

Thanks both for the feedback!

@shoyer

  1. In the simple case -- a parent who is not a view and a child who is a view -- there won't be any copying of the parent because of the if statement at 1235 of generic.py. But I'll think about weirder cases like views of views...

  2. I thought about this, but wasn't sure how to key a dictionary. Thankfully, weakrefs are actually self-cleaning -- if the object they refer to gets cleaned up, they return None when called:

test_df = pd.DataFrame()
test_wr = weakref.ref(test_df)
test_df = 'redirect_to_string'
test_wr() is None
Out[36]: True

But you are correct that I need to add an if child() is None check at the top of _convert_views_to_copies.

  3. Could you please point me to some docs on this ASV? Never worked with it before.
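
The liveness check discussed in point 2 above might look roughly like the following. This is a sketch under assumptions: `Frame` is a hypothetical stand-in for a DataFrame and `live_children` is an invented helper, not the body of `_convert_views_to_copies`.

```python
import gc
import weakref


class Frame:
    """Hypothetical stand-in for a DataFrame; plain instances support weakrefs."""

    def __init__(self, name):
        self.name = name


def live_children(refs):
    """Dereference each weakref and keep only targets that are still alive."""
    alive = []
    for ref in refs:
        child = ref()            # returns None once the target has been collected
        if child is not None:
            alive.append(child)
    return alive


a, b = Frame("a"), Frame("b")
refs = [weakref.ref(a), weakref.ref(b)]

del b           # drop the only strong reference to "b"
gc.collect()    # CPython frees it right away; other interpreters may need this

names = [child.name for child in live_children(refs)]
```

Note that the dead reference stays in `refs`; the check only skips it, it does not shrink the list.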

@jreback Would you mind expanding on your comment / clarifying a little? I don't think I've deleted any tests. Do you mean that you think this seems promising enough it's worth digging into the testing modules to see what works and what fails with these modifications and edit the failing tests? Happy to do so, just wanted to see if people felt this path was sufficiently promising to warrant that work.

@shoyer
Member

shoyer commented Nov 1, 2015

@nickeubank

  1. you're right that unnecessary copies aren't done if there are no views. My point is that it is unnecessary to copy both the parent AND its children -- only one is sufficient.

  2. use id(obj) as the key. That's guaranteed to be unique as long as an object is in memory. The main reason I prefer using WeakValueDictionary to list is that the latter can be a memory leak -- if you repeatedly index a DataFrame, the list will grow unbounded in size, even if the indexed dataframes themselves are garbage collected.

  3. see here for ASV: http://pandas.pydata.org/pandas-docs/stable/contributing.html#running-the-performance-test-suite
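
To illustrate the memory argument in point 2, here is a small comparison of the two bookkeeping strategies. `Frame`, `track_with_list`, and `track_with_dict` are hypothetical names used only for this sketch; each loop iteration stands in for one indexing call whose result is discarded.

```python
import gc
import weakref


class Frame:
    """Hypothetical stand-in for a child DataFrame produced by indexing."""


def track_with_list(n):
    refs = []
    child = None
    for _ in range(n):
        child = Frame()
        refs.append(weakref.ref(child))    # dead refs are never removed
    child = None
    gc.collect()
    return refs


def track_with_dict(n):
    children = weakref.WeakValueDictionary()
    child = None
    for _ in range(n):
        child = Frame()
        children[id(child)] = child        # entry is dropped when child dies
    child = None
    gc.collect()
    return children


list_refs = track_with_list(1000)
dict_refs = track_with_dict(1000)
print(len(list_refs), len(dict_refs))
```

The list keeps one dead weakref per discarded child, while the WeakValueDictionary ends up empty once the children are collected.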

@jreback
Contributor

jreback commented Nov 1, 2015

you have essentially turned off all of the old setting with copy tests
so these need to be adjusted to the new behavior
but all of the cases are valid

@nickeubank
Contributor Author

@jreback are you talking about runtime tests in core files like generic.py, or the nosetests in the test files like test_generic.py?

@shoyer Got it. Revising now. Also realized one place where this breaks, so need to add another tweak to implementation. Will ping when done.

One general question: Is there a better way to know if something is a view than ._is_view? Seems to give false positives -- makes it safe, but leads to some extra copying. For example:

df = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
df._is_view
Out[1]: False

df.loc[0,'col1'] = -88
df._is_view
Out[2]: True

@nickeubank
Contributor Author

@shoyer @jreback Performance checks

@shoyer Think your suggestions all integrated. Still probably some excess copying because _is_view is conservative (sometimes calls things views when they aren't), but not sure how to fix

@jreback I'm afraid I'm still a little unclear on what you're asking for. I know you do a lot on these sites so brevity is your friend, but could you please be a little more expansive / point to a specific example?

    before     after       ratio
  [eb66bccc] [cc6c9b1b]
+  116.74ms      1.99s     17.00  frame_methods.frame_insert_500_columns_end.time_frame_insert_500_columns_end
+  185.65ms      1.32s      7.10  plotting.plot_timeseries_period.time_plot_timeseries_period
+   17.64ms   120.52ms      6.83  timeseries.timeseries_series_offset_fast.time_timeseries_series_offset_fast
+   19.03ms   104.11ms      5.47  timeseries.timeseries_datetimeindex_offset_fast.time_timeseries_datetimeindex_offset_fast
+   43.55ms   234.87ms      5.39  sparse.sparse_frame_constructor.time_sparse_frame_constructor
+   45.87ms   109.18ms      2.38  frame_methods.frame_count_level_axis0_multi.time_frame_count_level_axis0_multi
-    1.78ms   781.24μs      0.44  replace.replace_replacena.time_replace_replacena
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback
Contributor

jreback commented Nov 7, 2015

the perf issues need to be addressed

return self._get_item_cache(item)
result = self._get_item_cache(item)

if isinstance(item, str):
Contributor

Why only str?

@jreback
Contributor

jreback commented Nov 7, 2015

@nickeubank you changed code but you never fixed tests. e.g. SettingWithCopyError is no longer defined. So it's impossible to see if what you did works or not.

You will need to carefully address each of the existing tests, as these define the current API. These must all pass and/or change as appropriate. This ensures that the API is preserved. You can certainly change things, but these must be well-defined and explicitly communicated.

Generally having green PRs gives much more confidence. This is the point of tests, to try to map out the API as much as possible. To cover edge cases and to provide consistency among operations. This of course is not guaranteed, but helps a lot. Examining code only goes so far. Python is far too dynamic a language to just rely on code inspection for correctness.

@nickeubank
Contributor Author

@jreback Ah, ok -- thanks so much for the longer explanation! Will do!

@nickeubank nickeubank force-pushed the copy_on_write branch 2 times, most recently from 5098923 to aa337ef Compare November 20, 2015 18:26
@nickeubank
Contributor Author

@jreback thanks again for explanation on tests -- now see wisdom. Updated tests, and fixed a few issues that came up.

Open Issues

  1. Currently the only place views are preserved is column slices of data frames with simple indices. Is there an analogous case for panels where we can always offer views?

  2. Are single column slices from multi-index DataFrames always views (i.e. do we have same guarantee as with single-index)?

  3. The attribute _is_column_view I added in generic.py does not appear to be preserved when a DataFrame is pickled and re-loaded. Any suggestions on how to fix? Had to comment out one test for that...

Other notes

I removed the copy option from merge, since it's pretty inconsistent with the new copy-on-write paradigm.

Performance
Fixed main performance issue, still some work to do on dates.

    before     after       ratio
   [eb66bccc] [f4f8e562]
+   14.66ms    98.57ms      6.73  timeseries.timeseries_datetimeindex_offset_fast.time_timeseries_datetimeindex_offset_fast
+  187.66ms      1.18s      6.30  plotting.plot_timeseries_period.time_plot_timeseries_period
+   16.61ms   100.92ms      6.07  timeseries.timeseries_series_offset_fast.time_timeseries_series_offset_fast
+   24.86ms    61.02ms      2.46  frame_methods.frame_getitem_single_column2.time_frame_getitem_single_column2
+   45.47ms   108.02ms      2.38  frame_methods.frame_count_level_axis0_multi.time_frame_count_level_axis0_multi
+   26.26ms    60.99ms      2.32  frame_methods.frame_getitem_single_column.time_frame_getitem_single_column
-    1.72ms   715.28μs      0.41  replace.replace_replacena.time_replace_replacena
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

cc: @shoyer

@jreback
Contributor

jreback commented Nov 23, 2015

  1. this should work, nothing special about multiindexes
  2. these attributes should not be pickled. Simply make them a class attribute defaulted to None.

You are still removing LOTS of tests. Don't remove any! You need to individually go through and fix/correct them. You need to assert that the entire existing API is unchanged (which is defined via the tests). Clearly if something is asserting that SettingWithCopyWarning is raised it should be changed and tested that it is a view (or not a view as the case may be).

The entire point of the test suite is to avoid changing behavior accidentally.

@nickeubank
Contributor Author

@jreback

On 2: sorry, wasn't clear. It isn't attribute value I want to pickle; it's the existence of the attribute.

In[17]:
df = pd.DataFrame({"A": [1,2]})
df._is_column_view
Out[18]: False

In[19]:
df.to_pickle('test')
df2=pd.read_pickle('test')
df2._is_column_view
Traceback (most recent call last):

 File "<ipython-input-22-18fc2f02c57c>", line 1, in <module>
    df2._is_column_view

 File "/Users/Nick/github/pandas/pandas/core/generic.py", line 2290, in __getattr__
   return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute '_is_column_view'

Do I need to rebuild somewhere?

Will work on tests!

@jreback
Contributor

jreback commented Dec 1, 2015

you need to define it as None on the class (e.g. notice the is_copy = None which you are removing), e.g.

class NDFrame(...):
     .....
     _is_column_view = None

then when you set it in an instance it is overridden only in that instance. This is a feature of python.
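
A small sketch of why the class-level default helps after a pickle round trip. The `Frame` class here is a hypothetical stand-in, and the assumption (labeled in the code) is that only the data is serialized, so the flag is absent from the restored instance's own state. The last lines mimic the two steps unpickling performs for such a class: create the instance without calling `__init__`, then restore the saved state.

```python
class Frame:
    # Class-level default: visible on every instance, stored on none of them.
    _is_column_view = None

    def __init__(self, data):
        self.data = data

    def __getstate__(self):
        # Assumption for illustration: only the data is serialized.
        return {"data": self.data}

    def __setstate__(self, state):
        self.data = state["data"]


original = Frame([1, 2])
original._is_column_view = True    # overrides the default on this instance only

# Mimic unpickling: build an instance without __init__, then restore state.
restored = Frame.__new__(Frame)
restored.__setstate__(original.__getstate__())

print(original._is_column_view, restored._is_column_view, Frame._is_column_view)
```

The restored object falls back to the class attribute instead of raising AttributeError, and the override on `original` does not leak to other instances.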

@nickeubank
Contributor Author

@jreback @shoyer Quick update:

Most things working well and I think will work eventually. Currently need to address:

However, at the moment I need to take a hiatus on this to work on a draft of my dissertation, so I'm going to have to set this aside for ~4 weeks. Haven't dropped -- I still really believe in the importance of this and that it will work -- but taking up more time than I have at the moment (especially given how much background research I'm having to learn about python to do all this...).

@jreback
Contributor

jreback commented Dec 16, 2015

@nickeubank np, thanks for the effort so far!

@jreback
Contributor

jreback commented Jan 11, 2016

xref #11970

as this might take a bit, feel free to reopen if updated

@jreback
Contributor

jreback commented Jan 13, 2016

@nickeubank hmm, I can't seem to reopen this, very odd.

@kawochen
Contributor

because HEAD has changed since

@nickeubank
Contributor Author

k will reopen as new pr

Development

Successfully merging this pull request may close these issues.

Views, Copies, and the SettingWithCopyWarning Issue
5 participants