Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDEP-7: Consistent copy/view semantics in pandas with Copy-on-Write #51463

Merged
merged 18 commits into from
Nov 2, 2023

Conversation

jorisvandenbossche
Copy link
Member

This PR adds the google doc about the Copy-on-Write proposal that came out of the discussion in #36195 as a PDEP.
It predates the PDEPs, but bascially back then I wrote what would now be a PDEP, and since this is still actively being developed, I think it would be useful to have an "official" reference to the proposal, instead of pointing to my google doc.

The first commit is just a basic conversion of the google doc to the markdown. I will do a first follow-up commit updating some outdated parts (eg referencing the POC implementation).

I am fine with still getting review of the text / details here, or we can also take it as it is.

And then the question is also how to mark the status of the PDEP. Can we assume it is already accepted / under implementation? (it went through the somewhat vague pre-PDEP decision process of some kind of consensus / nobody loudly objecting anymore, and it's now available and documented as opt-in). But am also happy to let it pass through the PDEP process of approving this officially.

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

@jorisvandenbossche jorisvandenbossche added Copy / view semantics PDEP pandas enhancement proposal labels Feb 17, 2023
Copy link
Member

@lithomas1 lithomas1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing this up.

Left some comments on some outdated parts that could be freshened up.

web/pandas/pdeps/0007-copy-on-write.md Outdated Show resolved Hide resolved
web/pandas/pdeps/0007-copy-on-write.md Outdated Show resolved Hide resolved
web/pandas/pdeps/0007-copy-on-write.md Outdated Show resolved Hide resolved
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big +1

Not a blocker for this PR or anything, but perhaps the SettingWithCopy warning could be updated to also ask users to consider enabling copy-on-write?

@jorisvandenbossche
Copy link
Member Author

but perhaps the SettingWithCopy warning could be updated to also ask users to consider enabling copy-on-write?

That's a good idea to do at the point when we are more confident to point general users towards it. In general we will have to add warnings for behaviours that will change (eg for chained assignment), and so that might already replace some of the current SettingWithCopyWarnings as well.
In the current text, I mention adding warnings in general in the "Backward compatibility" section. I think that should be sufficient, or I can add this specific addition as well. In any case I tracked the idea in the CoW overview tracker: #48998

Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that some notes about the transition should be part of the PDEP. Namely, we change the SettingWithCopyWarning to suggest using the new mode, and whether we can implement other checks where people get no warning, but the behavior will change.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@phofl
Copy link
Member

phofl commented Jul 30, 2023

Updated a couple of sections to be more aligned with the current implementation, also mentioning things that we intend to do (warnings mode, upgrade, ...)

@jorisvandenbossche
Copy link
Member Author

@pandas-dev/pandas-core are there any more comments on the text of the proposal?
This is meant as a final request for comments, before calling for a vote on this PDEP.

Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of my comments are about grammar/syntax, except in one place where I think an example would be helpful.

I also wonder if we should do something to make it easier for users to identify defensive copying that is no longer needed, e.g., df_filtered = df[df["A"] > 1].copy() I'm not sure this is possible.

web/pandas/pdeps/0007-copy-on-write.md Outdated Show resolved Hide resolved
web/pandas/pdeps/0007-copy-on-write.md Outdated Show resolved Hide resolved
web/pandas/pdeps/0007-copy-on-write.md Outdated Show resolved Hide resolved
web/pandas/pdeps/0007-copy-on-write.md Outdated Show resolved Hide resolved
web/pandas/pdeps/0007-copy-on-write.md Outdated Show resolved Hide resolved
web/pandas/pdeps/0007-copy-on-write.md Outdated Show resolved Hide resolved
web/pandas/pdeps/0007-copy-on-write.md Show resolved Hide resolved
@phofl
Copy link
Member

phofl commented Oct 3, 2023

You won’t need a defensive copy anymore, ever. That’s one of the advantages

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Oct 3, 2023

You won’t need a defensive copy anymore, ever. That’s one of the advantages

Yes, but there is a lot of code out there that does have defensive copying going on (including things my staff has written), so the question is whether we help people identify it. For example, we could raise a warning whenever .copy() is called. Maybe we add a parameter to DataFrame.copy() such as DataFrame.copy(intended: bool = False) where we raise a warning if intended==False, which would help people pick up all the defensive copying in their code. If they really want a copy, they then have to use intended=True

@jbrockmendel
Copy link
Member

so the question is whether we help people identify it

If the alternative is adding a fluffy keyword, then id go for not helping people identify it, just document it.

@jorisvandenbossche can you confirm that the TL;DR here is pretty much "make official that we will enable CoW for all users"?

@jorisvandenbossche
Copy link
Member Author

can you confirm that the TL;DR here is pretty much "make official that we will enable CoW for all users"?

That's correct, essentially what we already implemented under the optional pd.options.mode.copy_on_write = True becomes the default behaviour (and only option) in 3.0

@jorisvandenbossche
Copy link
Member Author

Thanks Irv for the textual comments, will update for those.

You won’t need a defensive copy anymore, ever. That’s one of the advantages

Yes, but there is a lot of code out there that does have defensive copying going on (including things my staff has written), so the question is whether we help people identify it.

I am also not fond of adding another keyword for this, but in theory we could add some option to trigger a warning for this. For example, it should be quite easy to check within the copy method if the calling object fully owns its data and is called from a method chain (both kinds of checks we already do elsewhere), and in that case raise an (optional) warning.
That should catch a typical example of df_filtered = df[df["A"] > 1].copy().

I don't know if it is worth adding a warning for this, but at least technically it shouldn't be hard. The main question would be how to let users enable it (I wouldn't do with a keyword in copy(), because then you need to modify your copy() calls, and in that case you could just as well read the code and remove the copy() call in case like the above?)

@phofl
Copy link
Member

phofl commented Oct 3, 2023

Could we move this discussion to a separate issue? I don't think that this should be part of the PDEP since it's not really relevant for CoW generally speaking.

@lithomas1
Copy link
Member

/preview

@github-actions
Copy link
Contributor

github-actions bot commented Oct 4, 2023

No preview found for PR #51463. Did the docs build complete?

@phofl
Copy link
Member

phofl commented Oct 4, 2023

Just FYI: The preview thing is broken

@lithomas1
Copy link
Member

Yeah, I emailed Marc about this and he said he refreshed the token that expired.

Looks like something else is still going wrong :(
I had a look on the server, and it seems like the docs artifact from PRs aren't getting stored on the preview server.

cc @datapythonista for more info.

@datapythonista
Copy link
Member

Yes, sorry, I started to refresh the expired token but there were some problems and I couldn't finish until now. I see an unrelated error when uncompressing the artifact file in the server, but not sure if it's the specific file I was testing. I re-run the docs build here, when it finishes you can retry the preview again and hopefully it works, otherwise I'll have a look tomorrow. For other PRs, note that the preview is generated in the docs build, the /preview comment just posts the link, so previews won't be available for PRs when the docs build run while the server was broken (you can re-run the docs build to regenerate it as I did here).

@datapythonista
Copy link
Member

/preview

@github-actions
Copy link
Contributor

github-actions bot commented Oct 5, 2023

No preview found for PR #51463. Did the docs build complete?

jorisvandenbossche and others added 3 commits October 6, 2023 14:25
Co-authored-by: Irv Lustig <irv@princeton.com>
Co-authored-by: Irv Lustig <irv@princeton.com>
Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, this is ready for a vote.

@jorisvandenbossche
Copy link
Member Author

Started the vote at #55511!

@mroeschke
Copy link
Member

Looks like the vote passed in #55511, I think we're good to set this as approved?

@phofl phofl merged commit f3b9309 into pandas-dev:main Nov 2, 2023
7 checks passed
@phofl phofl added this to the 2.2 milestone Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Copy / view semantics PDEP pandas enhancement proposal
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants