Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: should values_for_factorize and _from_factorized round-trip missing values? #32673

Closed
jbrockmendel opened this issue Mar 13, 2020 · 15 comments
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Testing pandas testing functions or related to the test suite

Comments

@jbrockmendel
Copy link
Member

(fixtures make this so much harder to give a copy/pasteable example)

from pandas.tests.extension.json.test_json import *

dtype = JSONDtype()

data = make_data()
while len(data[0]) == len(data[1]):
        data = make_data()

data = JSONArray(data)

values = data._values_for_factorize()[0]
result = type(data)._from_factorized(values, data)
assert len(values) == len(result)  # <-- nope!

Am I wrong in thinking this assertion should hold? If we had a .equals method, i'd strengthen this assertion to assert result.equals(data)

cc @TomAugspurger @WillAyd

@TomAugspurger
Copy link
Contributor

I think that's the difference are the missing values. NA values (empty dict for JSONDtype I think?) are dropped.

@jorisvandenbossche
Copy link
Member

The _from_factorized implementation is clearly wrong:

@classmethod
def _from_factorized(cls, values, original):
return cls([UserDict(x) for x in values if x != ()])

it's indeed simply skipping missing values, instead of only not trying to convert them to a dict

@jorisvandenbossche jorisvandenbossche added ExtensionArray Extending pandas with custom dtypes or arrays. Testing pandas testing functions or related to the test suite labels Mar 13, 2020
@jbrockmendel
Copy link
Member Author

BooleanArray also doesn't roundtrip:

from pandas.tests.extension.test_boolean import *

dtype = BooleanDtype()
data = pd.array(make_data(), dtype=dtype)

values = data._values_for_factorize()[0]
>>> type(data)._from_factorized(values, data)
Traceback (most recent call last):
[...]
TypeError: Need to pass bool-like values

@jbrockmendel
Copy link
Member Author

is there concensus that the answer to the original question is "yes"?

@jorisvandenbossche
Copy link
Member

BooleanArray also doesn't roundtrip:

That's not a fully correct example. If you do an actual factorize in the meantime, the missing values (encoded as -1) are not included.

And, so my comment above about JSONArray was not fully correct: it can skip NAs, as right now NAs are never present in _from_factorized in practice (since we only call _from_factorized after factorize, which always drops missing values)

@jorisvandenbossche
Copy link
Member

is there concensus that the answer to the original question is "yes"?

I think this is clearly the case, yes, for valid values (that's the whole point of _values_for_factorize / _from_factorized combo).
But I think the actual question is if we also want / need to enforce this for missing values.

Purely for the factorize case we don't need that (I think), but if we want to use _values_for_factorize / _from_factorized for more general purposes, we might need to start requiring that (although that might be backwards incompatible).

Also possibly related here is my earlier comment in the EA interface issue about _values_for_factorize returning a mask instead of NA sentinel.

@jorisvandenbossche jorisvandenbossche changed the title TST/BUG?: should values_for_factorize and _from_factorized round-trip? API: should values_for_factorize and _from_factorized round-trip missing values? Mar 17, 2020
@jreback
Copy link
Contributor

jreback commented Mar 19, 2020

Also possibly related here is my earlier comment in the EA interface issue about _values_for_factorize returning a mask instead of NA sentinel.

this is a good idea.

@jreback jreback added this to the 1.1 milestone Mar 21, 2020
@TomAugspurger
Copy link
Contributor

@jorisvandenbossche do you have other use-cases in mind for _values_for_factorize? Otherwise, I'm perfectly happy to keep their scope limited to factorization, and document the expected behavior around missing values.

@jorisvandenbossche
Copy link
Member

I think some came up in the EA interface revisit discussion (#32586) ?
There is the question about whether we need both _values_for_factorize and _values_for_argsort. And one concrete case was if we can use _values_for_factorize in joining operations (I don't know what the requirement around missing values would be for this one).

@jorisvandenbossche
Copy link
Member

I'm perfectly happy to keep their scope limited to factorization, and document the expected behavior around missing values.

To be clear, in that case we don't need the PR #32798 ? (some clean-up of the PR might still be useful, but I mean then there is "no bug to fix")

@jbrockmendel
Copy link
Member Author

Otherwise, I'm perfectly happy to keep their scope limited to factorization, and document the expected behavior around missing values.

Do we know of a compelling use case where round-trip-ability would be actively un-desirable? The only extant case where it doesnt hold appears unintentional (cc @WillAyd correct me if im wrong here)

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche wondering if we can get consensus on a couple of weaker requirements:

  1. round-trip-ability for unique arrays? or for just unique sorted arrays?
  2. some form of idempotency?

@jorisvandenbossche
Copy link
Member

Since we were just discussing in the other issue about how this _values_for_factorize method is (currently) actually only internal to EA.factorize(), I don't see the need to hard-require round-trip-ability of arrays including missing values (to be clear, round-trip ability of the uniques (without missing values) is already required and tested).

If we are defining an interface to get values to be used in an indexing engine, for example, yes we can discuss that. But I would first discuss those use cases, before adding requirements to _values_for_factorize ?

What do you mean with "idempotency" in this context?

@jbrockmendel
Copy link
Member Author

What do you mean with "idempotency" in this context?

Basically theround-trip-ability of the uniques-without-missing-values that you mentioned.

We can stick a pin in this, as I've come around to "rip these out entirely"

@jorisvandenbossche
Copy link
Member

Basically theround-trip-ability of the uniques-without-missing-values that you mentioned.

As I already mentioned a few times: we already have this requirement (so I don't think it's something we need to agree on :-))

as I've come around to "rip these out entirely"

That would break several external projects, including GeoPandas. We could potentially deprecate it, but we at least need to keep it around for some time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants