-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: A cheap copy on a dataframe #19102
Comments
I am not sure how you think the above would work. If you are filtering, then adding columns you need to copy. You can avoid the memory penalty by reassigning to the original variable which will have the original garbage collected. I actually doubt this is an issue in practice unless you are very nearly at memory limits in any event.
We won't have copy-on-write in the current version of pandas, so this is an intractable problem. defensive copying is the solution, and to be honest its not a big deal to do this. Virtually every pandas operation copies anyhow. |
@has2k1 can you explain what this line is supposed to do? df2 = df[df.A>2].as_copy() # New method, as_copy Is that supposed to copy if and only if the underlying data is a view, and not otherwise? |
Yes |
@jreback , when there are 100s of operations that involve copying the penalty of an extra copy becomes significant. It even worthwhile to tease out the operations where the As the issue does not be a wider concern, I'll reevaluate the use patterns and maybe switch over to |
@has2k1 I find it hard to believe that a plotting library is actually affected by copy operations. good luck. |
I don't think we should be so dismissive to the needs of other projects. A plotting library can certainly care about memory usage (plotnine can work with a lot of data, because it does not necessarily plot it all, but calculate statistics (eg hist, boxplot) for different factors. I can imagine in those operations you regularly need to subset a dataframe) We have seen reports of IMO valid usecases in both plotnine and sklearn. IMO we should need to think about some way to give such power users a means to deal with this. |
With the deprecation, there is no cheap way tell pandas that the user knows that they are modifying a copied dataframe. Libraries that use pandas cannot afford to always
copy()
after sub-setting. My comment in the PR thread gives a specific example.If resurrecting
is_copy
is not an option, then a better solution could be a method that can be applied inline.The text was updated successfully, but these errors were encountered: