API: should values_for_factorize and _from_factorized round-trip missing values? #32673
Comments
I think the difference is the missing values. NA values (an empty dict for JSONDtype, I think?) are dropped. |
The pandas/pandas/tests/extension/json/array.py Lines 66 to 68 in 059f9bf
it's indeed simply skipping missing values, instead of only not trying to convert them to a dict |
BooleanArray also doesn't roundtrip:
|
Is there consensus that the answer to the original question is "yes"? |
That's not a fully correct example. If you do an actual factorize in the meantime, the missing values (encoded as -1) are not included. And, so my comment above about JSONArray was not fully correct: it can skip NAs, as right now NAs are never present in |
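To illustrate the point about missing values being encoded as -1 and left out of the uniques, here is a minimal pure-Python sketch. `toy_factorize` is a hypothetical helper written for this illustration, not pandas' implementation, and it assumes `None` is the missing-value marker.

```python
def toy_factorize(values, na_value=None):
    """Toy factorize: return (codes, uniques).

    Missing entries get the code -1 and never appear in uniques.
    Hypothetical helper for illustration only, not the pandas algorithm.
    """
    codes = []
    uniques = []
    positions = {}  # value -> index of that value in uniques
    for v in values:
        if v is na_value:
            codes.append(-1)  # sentinel for "missing"
            continue
        if v not in positions:
            positions[v] = len(uniques)
            uniques.append(v)
        codes.append(positions[v])
    return codes, uniques


codes, uniques = toy_factorize([True, None, False, True])
print(codes)    # [0, -1, 1, 0]
print(uniques)  # [True, False]
```

Under this model the uniques only ever hold valid values, so a reconstruction step that skips missing values loses nothing when it is fed the output of an actual factorize.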
I think this is clearly the case, yes, for valid values (that's the whole point of Purely for the Also possibly related here is my earlier comment in the EA interface issue about |
this is a good idea. |
@jorisvandenbossche do you have other use-cases in mind for |
I think some came up in the EA interface revisit discussion (#32586) ? |
To be clear, in that case we don't need the PR #32798 ? (some clean-up of the PR might still be useful, but I mean then there is "no bug to fix") |
Do we know of a compelling use case where round-trip-ability would be actively undesirable? The only extant case where it doesn't hold appears unintentional (cc @WillAyd correct me if I'm wrong here) |
@jorisvandenbossche wondering if we can get consensus on a couple of weaker requirements:
|
Since we were just discussing in the other issue about how this If we are defining an interface to get values to be used in an indexing engine, for example, yes we can discuss that. But I would first discuss those use cases, before adding requirements to What do you mean with "idempotency" in this context? |
Basically the round-trip-ability of the uniques-without-missing-values that you mentioned. We can stick a pin in this, as I've come around to "rip these out entirely" |
As I already mentioned a few times: we already have this requirement (so I don't think it's something we need to agree on :-))
That would break several external projects, including GeoPandas. We could potentially deprecate it, but we at least need to keep it around for some time. |
(fixtures make this so much harder to give a copy/pasteable example)
Am I wrong in thinking this assertion should hold? If we had a .equals method, I'd strengthen this assertion to
assert result.equals(data)
cc @TomAugspurger @WillAyd
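For readers without the test fixtures at hand, the round trip in question can be sketched with a toy stand-in. Everything here is an assumption made for illustration: `ToyArray` and `LossyArray` are hypothetical list-backed classes that only borrow the method names `_values_for_factorize` and `_from_factorized`, with `None` as the missing-value marker. This is not the pandas test, nor any real extension array.

```python
class ToyArray:
    """Hypothetical list-backed array; None marks a missing value."""

    def __init__(self, data):
        self.data = list(data)

    def _values_for_factorize(self):
        # Return the raw values plus the sentinel that marks "missing".
        return self.data, None

    @classmethod
    def _from_factorized(cls, values, original):
        # Rebuild an array of the same type, keeping missing values as-is.
        return cls(values)

    def equals(self, other):
        return type(self) is type(other) and self.data == other.data


class LossyArray(ToyArray):
    """Variant whose reconstruction skips missing values."""

    @classmethod
    def _from_factorized(cls, values, original):
        return cls(v for v in values if v is not None)


# Round trip that preserves missing values.
data = ToyArray([{"a": 1}, None, {"b": 2}])
values, na_value = data._values_for_factorize()
result = ToyArray._from_factorized(values, data)
assert result.equals(data)

# Round trip that drops them: the same check no longer holds.
lossy = LossyArray([{"a": 1}, None])
lossy_values, _ = lossy._values_for_factorize()
lossy_result = LossyArray._from_factorized(lossy_values, lossy)
round_trips = lossy_result.equals(lossy)  # False
```

Whether the second behaviour counts as a bug depends on whether `_from_factorized` is ever handed values that still contain the missing-value marker, which is the question this issue raises.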