Skip to content

DEPR: flags #52153

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 24 commits into from
Closed

DEPR: flags #52153

wants to merge 24 commits into from

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

xref #51280

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you first create an issue to discuss this specific feature? You mentioned the idea of deprecating this in #51280, but the actual discussion there was mostly about __finalize__ and _metadata and how it relates to subclasses (i.e. the underlying mechanics of how it works), and not about the actual user-facing feature that flags/set_flags provide (do we have an idea if people are using this? do we think it is no longer useful? what are the alternatives?)

@jbrockmendel jbrockmendel mentioned this pull request Mar 24, 2023
@mroeschke
Copy link
Member

Right now probably -0 on deprecating. While I don't issues about this feature, it's seems like a way to force a no duplicate labels mode

@mroeschke
Copy link
Member

Actually I discovered that pandera has a general dataframe schema validation framework that covers the duplicate labels case, so I suppose I would be +0 on deprecating

https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.pandas.components.Index.html#pandera.api.pandas.components.Index

@gfyoung gfyoung added the Deprecate Functionality to remove in pandas label Aug 17, 2023
@jbrockmendel
Copy link
Member Author

@TomAugspurger any objection here? IIRC this was your thing

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 22, 2023

In #51280, it seems like there's not a ton of evidence that this is causing issues in real use cases. The one linked issue turned out to be unrelated, right (but that's about __finalize__ and _metadata, not flags. So maybe I'm missing some context)?

How do you reconcile

I'd be OK with deprecating flags and having allow_duplicate_labels checking move to third-party validation libraries.

with deprecating __finalize__? Wouldn't the 3rd party need that kind of post DataFrame creation hook to do the validation?

@jbrockmendel
Copy link
Member Author

What do you mean "my thing"?

Just that you championed implementing it and might have an opinion about keeping it.

How do you reconcile [making 3rd-party libraries responsible for disallowing non-unique columns] with deprecating __finalize__?

I imagine the 3rd party libraries wouldn't necessarily be implementing subclasses of pandas classes, so wouldn't use the hook at all.

@TomAugspurger
Copy link
Contributor

Yeah, I think we should keep it.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 27, 2023

In #28394 where this was implemented, @TomAugspurger wrote:

This fixes a vexing issue with using pandas for ETL pipelines, where accidentally introducing duplicate labels can lead to confusing downstream behavior (e.g. NDFrame.getitem not reducing dimensionality).

I do think the feature of allows_duplicate_labels=False is useful (now that I know about it), but the implementation creates a generic flags that we could possibly extend in the future with other possible flags. Since one of the motivations of getting rid of flags is because it affects performance in __finalize__(), I'd like to suggest a compromise, which would address the performance issue and can be implemented as a simple if test.

The proposal would be to move allows_duplicate_labels to a pandas option that is then set globally for all DataFrame objects that are created. Then it is just a matter of testing if the option is set, and you don't have to propagate the flags. I think the performance hit would be minimal with the default of False, but you then still have the feature where pandas can do the checking for duplicate labels, if it is useful in cases like what Tom described when he created the initial PR. And I think I've had cases where it would have helped with debugging some thorny issues.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 27, 2023

I think we'll need to sort out #52166 and anything else riding on __finalize__ before using performance as a motivator for this decision (unless duplicate label propagation specifically is slowing things down, rather than __finalize__ generally). Since being opened there's been a good amount of feedback on the usefulness of attrs.

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Dec 28, 2023

Tom is right that performance is mostly a red herring here. I think of it is being "directionally" behaving like a performance penalty but with a \epsilon magnitude. It is tiny but hits frequently.

The reason to deprecate is that this is basically-never-used* API surface (which is a constantly-complained-about pain point) with a use case that has viable alternatives**. Also that propagation is half-baked with no prospect of getting to fully-baked.

* Github search for allow_duplicate_labels shows 251 code results, all of which appear to be unrelatedallows_duplicate_labels shows more than I'm willing to sort through, but the first few pages look dominated by vendored copies of pandas and profiling output.

** My preferred alternative would be a 3rd party UniqueColumnsDataFrame that would have a check at __init__-time. I think this is more annotation-friendly than other alternatives.

@mroeschke
Copy link
Member

I also believe this functionality of "validation" is best served by a 3rd party library like pandera https://pandera.readthedocs.io/en/stable/dataframe_schemas.html?highlight=duplicate#column-validation

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 28, 2023 via email

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 28, 2023

A challenge with doing this in a 3rd party library is around propagating these attributes of a table. By placing this on the DataFrame and propagating it through operations, you can have an entire block of method calls for which this property is enforced. A 3rd-party library isn't going to be able to do that; users would need to add method calls at each place they want to validate the property. Do people have thoughts on the merit of that? To me, that seems worth the effort.

I agree. For me, I would find value in this particular flag allows_duplicate_labels to be in debugging pandas code. If I know that I want to not introduce duplicate labels, because by doing so it would indicate a bug, then turning on that flag would help me find the bug. Moving to a 3rd party library for that debugging feature creates a host of issues.

I view this as being similar to treating warnings as errors. We had a client that ran our code by adding warnings.filterwarnings("error") and it caught places where pandas was issuing warnings that members of our development team had ignored. There were also cases where the tests we ran did not generate the warnings, but when our code was called by the client's code, the warnings were emitted.

IMHO, providing options to users that assist with debugging is something we should do as much as we can. It's one reason I got behind pandas-stubs. It has helped me and my staff avoid coding mistakes, thus increasing productivity.

@jbrockmendel jbrockmendel added the Mothballed Temporarily-closed PR the author plans to return to label Jan 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Mothballed Temporarily-closed PR the author plans to return to
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants