API: New copy / view semantics using Copy-on-Write #46958

Merged
merged 37 commits into from
Aug 20, 2022

jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented May 6, 2022

This is a port of the proof of concept using the ArrayManager in #41878 to the default BlockManager.

This PR is a start to implement the proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit / discussed in #36195
A very brief summary of the behaviour you get:

  • Any subset (so also a slice, single column access, etc) behaves as a copy (using CoW, or is already a copy)
  • DataFrame methods that return a new DataFrame return shallow copies (using CoW) if applicable (for now, this is only implemented / tested for reset_index and rename, needs to be expanded to other methods)
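
A minimal sketch of the behaviour summarised in the two bullets above, assuming a pandas release (1.5 or later) that ships the `mode.copy_on_write` option introduced by this PR. The frame and column names are made up for illustration.

```python
import pandas as pd

# Assumption: pandas >= 1.5, where the "mode.copy_on_write" option exists.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    # Any subset behaves as a copy: writing to it leaves df untouched.
    subset = df["a"]
    subset.iloc[0] = 100

    # rename returns a shallow copy guarded by CoW: the data is only
    # copied once the result is written to.
    renamed = df.rename(columns={"a": "A"})
    renamed.iloc[0, 0] = -1

    parent_first = int(df.loc[0, "a"])
    subset_first = int(subset.iloc[0])
    renamed_first = int(renamed.iloc[0, 0])

print(parent_first, subset_first, renamed_first)
```

With the option left at its default, the same writes may instead propagate to `df` or raise a `SettingWithCopyWarning`, which is the behaviour this proposal aims to replace.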

Implementation approach
This PR adds Copy-on-Write (CoW) functionality to the DataFrame/Series at the BlockManager level. It does this by adding a new .refs attribute to the BlockManager that, if populated, keeps a list of weakref references to the blocks it shares data with. For the BlockManager, this reference tracking is done per block, so len(mgr.blocks) == len(mgr.refs).
This ensures that when we modify a block of a child manager, we can check whether it is referencing (viewing) another block and, if needed, do a copy on write. Likewise, when we modify a block of a parent manager, we can check whether that block is being referenced by another manager and, if needed, do a copy on write in this parent frame. (Of course, a manager can be both parent and child at the same time, so both checks always happen.)
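
The parent/child checks described above can be sketched with a toy model. This is not the code in this PR: `ToyBlock`, `ToyManager`, `shallow_copy` and `set_value` are hypothetical names, and using `weakref.getweakrefcount` for the parent-side check is an assumption made for this illustration.

```python
import weakref

import numpy as np


class ToyBlock:
    """Hypothetical stand-in for a pandas Block: wraps one NumPy array."""

    def __init__(self, values):
        self.values = values


class ToyManager:
    """Hypothetical stand-in for a BlockManager with per-block refs."""

    def __init__(self, blocks, refs=None):
        self.blocks = list(blocks)
        # refs[i] is None, or a weakref to the parent block that blocks[i] views.
        self.refs = list(refs) if refs is not None else [None] * len(self.blocks)
        assert len(self.blocks) == len(self.refs)

    def shallow_copy(self):
        """Return a child manager whose blocks view this manager's data."""
        children = [ToyBlock(blk.values[:]) for blk in self.blocks]
        return ToyManager(children, [weakref.ref(blk) for blk in self.blocks])

    def _is_shared(self, i):
        # Child check: do we still reference a live parent block?
        is_child = self.refs[i] is not None and self.refs[i]() is not None
        # Parent check: does any weakref still point at our block?
        is_parent = weakref.getweakrefcount(self.blocks[i]) > 0
        return is_child or is_parent

    def set_value(self, i, pos, value):
        """Write into block i, copying first if its data is shared."""
        if self._is_shared(i):
            # Replace the block (rather than mutating it) so that weakrefs
            # held by other managers no longer point at our current block.
            self.blocks[i] = ToyBlock(self.blocks[i].values.copy())
            self.refs[i] = None
        self.blocks[i].values[pos] = value


# A child writes first: the child copies, the parent keeps its data.
parent = ToyManager([ToyBlock(np.array([1, 2, 3]))])
child = parent.shallow_copy()
child.set_value(0, 0, 100)
parent_values = parent.blocks[0].values.tolist()
child_values = child.blocks[0].values.tolist()

# A parent writes first: the parent copies, the child keeps its data.
parent2 = ToyManager([ToyBlock(np.array([1, 2, 3]))])
child2 = parent2.shallow_copy()
parent2.set_value(0, 0, 9)
parent2_values = parent2.blocks[0].values.tolist()
child2_values = child2.blocks[0].values.tolist()
```

Both checks run on every write, which mirrors the point that a manager can be a parent and a child at the same time.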


How to enable this new behaviour?
Currently this PR simply enables the new CoW behaviour, but of course that will need to be turned off before merging (which also means that some of the changes will need to be put behind a feature flag; so far I have only done that in some places).

I think that ideally (in the short term) users have a way to enable the future behaviour (e.g. using an option), but also have a way to enable additional warnings.
I already started adding an option, currently the boolean flag options.mode.copy_on_write=True|False:

  • Do we have a better name? I personally don't like that it uses "copy_on_write", because that is an internal implementation detail, and not something most end users really have to care about. But something like "new_copy_view_behaviour" is also not great.
  • In addition to True/False, we can probably add "warn" as a third option, which gives warnings in cases where behaviour would change.

Some notes:

  • Not everything is already implemented (there are a couple of TODO(CoW) in the code), although the majority for indexing / setitem is done.
  • This PR does not yet try to tackle copy/view behaviour for the constructors, or for numpy array access (.values). Given the size of this PR already, those can probably be done in separate PRs?
  • Most tests are already passing (with changes), but I still need to fix a few tests outside of /indexing.
  • We will also need to think about a way to test this (in a similar way as the ArrayManager with an environment variable?)

I will also pull out some of the changes into separate PRs (e.g. the new test file could already be discussed/reviewed separately (-> #46979), and the column_setitem is maybe also something that could be done as a precursor (-> #47074)).

@pandas-dev pandas-dev deleted a comment from jreback May 7, 2022

for blk in self.blocks:
    nb = blk.getitem_block_index(slobj)
    nbs.append(nb)
    nrefs.append(weakref.ref(blk))
Member

any particular reason to make the reference to the blk instead of its array blk.values? will that make a difference in the cases where blk.values is re-set?

Member Author

Good question. I don't remember if there was a specific technical reason to do so, but I think it seemed the easier option (since this is all handled at the BlockManager level). When keeping those references / when checking for references, it would otherwise be an additional level of indirection to check the blk.values instead of blk itself.

I suppose the two would generally be equivalent, except indeed in places where blk.values is re-set in place. I actually added one such case in this PR in Block.set_values (which is called in BlockManager.iset). In that case I should probably copy the Block and replace the block in the BM with the copied block, instead of copying the values in the existing block.
Do you know of other places where we currently re-set blk.values?
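
To make the difference concrete, here is a small sketch using a hypothetical `ToyBlock` class (not the pandas `Block`). It contrasts re-setting `.values` on an existing block with replacing the block object, and shows what each does to a weak reference that points at the block.

```python
import gc
import weakref

import numpy as np


class ToyBlock:
    """Hypothetical stand-in for a pandas Block: wraps one NumPy array."""

    def __init__(self, values):
        self.values = values


# Case A: re-set .values on the existing block.
blk_a = ToyBlock(np.array([1, 2, 3]))
view_a = blk_a.values[:]            # data held by some other (child) manager
ref_a = weakref.ref(blk_a)          # the child's reference to the parent block
blk_a.values = blk_a.values.copy()  # parent copies by re-setting .values
ref_a_still_alive = ref_a() is blk_a
still_shares_a = bool(np.shares_memory(blk_a.values, view_a))

# Case B: replace the block itself with a copied block.
blocks = [ToyBlock(np.array([1, 2, 3]))]
view_b = blocks[0].values[:]
ref_b = weakref.ref(blocks[0])
blocks[0] = ToyBlock(blocks[0].values.copy())
gc.collect()  # not needed on CPython, added for other interpreters
ref_b_still_alive = ref_b() is not None

print(ref_a_still_alive, still_shares_a, ref_b_still_alive)
```

In case A the weak reference still resolves to the block although the memory is no longer shared, so a check based on it would report sharing that no longer exists. In case B the reference dies together with the old block.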

@jbrockmendel
Member

@jorisvandenbossche im questioning the accuracy of the "nothing is 'just a view'" heuristic. Do any of the following cases break that:

  • df.T # single-block case
  • df.astype(df.dtypes, copy=False)
  • df.iloc[:]
  • df.stack()/unstack()
  • df.copy(deep=False)

@jorisvandenbossche
Member Author

jorisvandenbossche commented May 18, 2022

im questioning the accuracy of the "nothing is 'just a view'" heuristic

Do you mean in the current state of the PR? Or that those cases in general should/would break that in your opinion?

The goal is that none of those examples would break that general rule. But at the moment, this PR only added the CoW mechanism to indexing and explicitly to a few methods (copy, reset_index, rename; and in addition methods that are based on (re)indexing operations under the hood will also already follow it), so others won't yet follow the intended semantics.
But for example copy is already implemented, so df.copy(deep=False) will no longer return a "real" view (it returns a new dataframe with the same data in practice, but will stop being a view upon mutation through CoW).

stack/unstack already seem to return a copy on master / 1.4.

df.iloc[:] is already implemented and thus follows the outlined rules (returning a view in practice, but guarded with CoW).

The other ones (astype, transpose) are good cases I still need to test and/or implement.
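
For the two cases described above as already implemented (df.copy(deep=False) and df.iloc[:]), a rough check could look like the sketch below. It assumes a pandas release (1.5 or later) with the `mode.copy_on_write` option and uses `np.shares_memory` to observe whether a copy has happened; the frame itself is made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: pandas >= 1.5 with the "mode.copy_on_write" option available.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"a": [1, 2, 3]})

    shallow = df.copy(deep=False)
    sliced = df.iloc[:]

    # Before any write, both results still share memory with df.
    shared_before = bool(
        np.shares_memory(df["a"].to_numpy(), shallow["a"].to_numpy())
        and np.shares_memory(df["a"].to_numpy(), sliced["a"].to_numpy())
    )

    # Writing to either result triggers the copy; df keeps its values.
    shallow.iloc[0, 0] = 100
    sliced.iloc[1, 0] = 200

    shared_after = bool(
        np.shares_memory(df["a"].to_numpy(), shallow["a"].to_numpy())
        or np.shares_memory(df["a"].to_numpy(), sliced["a"].to_numpy())
    )
    df_values = df["a"].tolist()

print(shared_before, shared_after, df_values)
```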

@jbrockmendel
Member

Do you mean in the current state of the PR? Or that those cases in general should/would break that in your opinion?

Just in general, wanted to make sure that the question was written down in one of the relevant issues/PRs. Thanks for taking a look.

Member

@mroeschke mroeschke left a comment

Were you planning on some docs + whatsnew in a separate PR?

@mroeschke mroeschke mentioned this pull request Aug 15, 2022
@jorisvandenbossche
Member Author

There is one new test from a few days ago that still needs to be updated for CoW (the one failing build), will do that tomorrow.

Were you planning on some docs + whatsnew in a separate PR?

Yes, would prefer to do that in a separate PR.

@mroeschke mroeschke added this to the 1.5 milestone Aug 20, 2022
Member

@mroeschke mroeschke left a comment

Generally looks good to me, especially based on the testing. Would you still like to include this for 1.5?

@jorisvandenbossche
Member Author

Would you still like to include this for 1.5?

Yes, I would prefer that

Contributor

@jreback jreback left a comment

lgtm

are there benchmarks on CoW for typical operations? eg pick a chain of ops that are now view returning and compare vs existing

what is the magnitude of the change?

@mroeschke mroeschke merged commit 221f636 into pandas-dev:main Aug 20, 2022
@mroeschke
Member

Thanks @jorisvandenbossche! Maybe could reference or document the benchmarks in the follow-up documentation PR

@jbrockmendel
Member

way to go!

CloseChoice pushed a commit to CloseChoice/pandas that referenced this pull request Aug 21, 2022
* API: New copy / view semantics using Copy-on-Write

* fix more tests

* Handle CoW in BM.iset

* Handle CoW in xs

* add bunch of todo comments and usage warnings

* Insert None ref in BM.insert

* Ensure to not assume copy_on_write is set in case of ArrayManager

* Handle refs in BM._combine / test CoW in select_dtypes

* handle get_numeric_data for single block manager

* fix test_internals (get_numeric_data now uses CoW)

* handle refs in consolidation

* fix deep=None for ArrayManager

* update copy/view tests from other PR

* clean-up fast_xs workarounds now it returns a SingleBlockManager

* tracks refs in to_frame

* fixup after updata main and column_setitem + iloc inplace setitem changes (pandas-devgh-45333)

* fix inplace fillna + fixup new tests

* address comments + update some todo comments

* Update pandas/core/internals/managers.py

Co-authored-by: Matthew Roeschke <emailformattr@gmail.com>

* fixup linting

* update new copy_view tests to use get_array helper

* add comment to setitem

* switch default to False, ensure CoW copies only happen when enabled + add additional test build with CoW

* update type annotations

* Fix stata issue to avoid SettingWithCopyWarning in read_stata

* update type + option comment

* fixup new rename test

Co-authored-by: Matthew Roeschke <emailformattr@gmail.com>
@jorisvandenbossche jorisvandenbossche deleted the blockmanager-cow branch October 7, 2022 08:43
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022

5 participants