Fix StringArray.astype for category dtype #40450
Looks good to me, thanks!
This uncovered a bit of a larger issue, basically #37626, but for datetimes & period. Example:
import pandas as pd
t = pd.Series(["1/1/2021", "2/1/2021", None], dtype="period[M]")
s = pd.Series(["1/1/2021", "2/1/2021", None], dtype="string").astype("period[M]")
Example:
import pandas as pd
t = pd.Series(["1/1/2021", "2/1/2021"], dtype="object").astype("datetime64[ns]")
s = pd.Series(["1/1/2021", "2/1/2021"], dtype="string").astype("datetime64[ns]")
However, if it's ok, I'd open a new issue for it, since it'll be a larger revamp which will take me a few more days.
Yes, that's a more general issue, and can be left out of scope for this PR.
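For anyone reproducing the comparison above, here is a minimal sketch that runs the same cast from both an object-backed and a string-backed Series and reports what comes back. The helper name `try_astype` is hypothetical, and no outcome is claimed for the string-backed case, since that depends on the pandas version installed.

```python
import pandas as pd


def try_astype(values, via_dtype, target):
    """Hypothetical helper: build a Series as `via_dtype`, then cast to `target`.

    Returns the resulting Series, or the exception if the cast fails.
    """
    try:
        return pd.Series(values, dtype=via_dtype).astype(target)
    except Exception as exc:
        return exc


values = ["1/1/2021", "2/1/2021"]
from_object = try_astype(values, "object", "datetime64[ns]")
from_string = try_astype(values, "string", "datetime64[ns]")

# Print the type of each outcome so a Series and an exception are easy to tell apart.
print("object ->", type(from_object).__name__)
print("string ->", type(from_string).__name__)
```

If both lines print `Series`, comparing `from_object` and `from_string` with `Series.equals` shows whether the two paths agree.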
needs a whatsnew note for 1.2.4
expected = Series(["A", np.NaN], dtype="category")
tm.assert_series_equal(result, expected)

s = Series(["1/1/2021", "2/1/2021"], dtype="string")
can u add an example for Timedelta, Datetime w/time zone and Interval (all the EA types)
I added tests for all ExtensionArray dtypes. I had to exclude BooleanArray and IntervalArray since there's no way to parse them back from a list of strings. TimedeltaArray xfails due to #40478. For PeriodArray and DatetimeArray, the NaT values get converted to NA strings, but converting the NA strings back to NaT fails. I added XFails, unless expecting EA ⇒ StringArray ⇒ EA to roundtrip successfully is still up for debate.
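The EA ⇒ StringArray ⇒ EA round trip described above can be sketched as a small check. This is only an illustration, not the test added in the PR: `roundtrips` is a hypothetical helper, and the period cases are printed without an expected result because their outcome depends on the pandas version.

```python
import pandas as pd


def roundtrips(values, dtype):
    """Hypothetical helper: does dtype -> "string" -> dtype reproduce the input?"""
    try:
        original = pd.Series(values, dtype=dtype)
        back = original.astype("string").astype(dtype)
        return bool(back.equals(original))
    except Exception:
        # Any parsing or casting error counts as a failed round trip.
        return False


cases = [
    ([1, 2, 3], "Int64"),
    (["1/1/2021", "2/1/2021"], "period[M]"),
    (["1/1/2021", "2/1/2021", None], "period[M]"),
]
for values, dtype in cases:
    print(dtype, values, roundtrips(values, dtype))
```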
This doesn't need a whatsnew, it's only a regression on master
ok great
request.node.add_marker(mark)
if NaT in data and dtype in ("period[M]", "datetime64[ns]"):
    mark = pytest.mark.xfail(
        reason="TODO StringArray.astype() None to dtype.na_value conversion"
is there an issue for this?
I created one here: #40566 to track this
(["1/1/2021", "2/1/2021"], "period[M]"),
(["1/1/2021", "2/1/2021", NaT], "period[M]"),
(["1 Day", "59 Days", NaT], "timedelta64[ns]"),
# currently no way to parse BooleanArray, IntervalArray from a
BooleanArray can be parsed from string (see _from_sequence_of_strings, the general method)
I had an implementation that used _from_sequence_of_strings instead of _from_sequence in StringArray.astype(). That required bigger code changes. I'd like to merge this regression PR first and then implement _from_sequence_of_strings as part of #40566.
This reverts commit 94a8b58.
Why does Codecov say 0% coverage for the diff, when the added
looks fine, can you add a whatsnew note (bug fixes on 1.3 / conversion or extension types section). ping on greenish (also merge master)
@jreback added a Whatsnew, should be good for merge.
thanks @siboehm
This was failing due to elif np.issubdtype(dtype, np.floating) here, which fails for the pandas dtypes. StringArrays are now being cast to Categorical just as they were before #38530, when the casting still happened inside Block.
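To illustrate the failure mode described above, here is a minimal sketch showing that `np.issubdtype` raises when handed a pandas extension dtype, plus one way to guard the check. The helper `is_numpy_floating` is hypothetical and is not the change made in this PR; the last lines assume a pandas version that includes the fix.

```python
import numpy as np
import pandas as pd


def is_numpy_floating(dtype) -> bool:
    """Hypothetical helper: True only for NumPy floating dtypes."""
    # Guard first: np.issubdtype cannot interpret pandas extension dtypes.
    return isinstance(dtype, np.dtype) and np.issubdtype(dtype, np.floating)


# The unguarded call fails because NumPy cannot interpret a CategoricalDtype.
try:
    np.issubdtype(pd.CategoricalDtype(), np.floating)
    unguarded = "no error"
except TypeError as exc:
    unguarded = f"TypeError: {exc}"
print(unguarded)

print(is_numpy_floating(pd.CategoricalDtype()))
print(is_numpy_floating(np.dtype("float64")))

# With the fix in place, a string-backed Series casts to category,
# and the missing value stays missing.
result = pd.Series(["A", None], dtype="string").astype("category")
print(result.dtype, result.isna().tolist())
```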