-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Implement convert_dtypes #30929
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Dr-Irv conceptually this is ok,
very quick glance
but impl needs work and api needs discussion (name)
the api should be more similar to the options provided in infer_objects and of.to_numeric
So For I'm open on what name to use. There was some discussion between @jorisvandenbossche and me in #29752 and this was my last suggestion. |
That might be an option to add, but I don't that is a priority to add now (users can first do
Initially I had some reservation to use "nullable" in the name, but actually I think this is OK. String dtypes where already "nullable" before, but it's not using pd.NA, and we can maybe try to use the term "nullable dtype" consistently for those new dtypes that use pd.NA. Then that should be fine. I would maybe only use nullable_dtypes (with the "d"), since that's the term that is used elsewhere in APIs in pandas (eg Something else I was wondering: does this need to be a method? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the method always return a new object? (now it sometimes is, sometimes it is self
, in the Series case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the method always return a new object? (now it sometimes is, sometimes it is
self
, in the Series case)
I will make it return a new one.
@jreback @jorisvandenbossche So I have this all green now. More detailed review and comments are welcome. |
@jorisvandenbossche @WillAyd you seem to disagree on whether this should be on the 1.0.0 milestone..... |
@TomAugspurger all green. Made your suggested changes. Should be easier code to read now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the logic as written is super complicated because is nested. you need to de-nest this. and make it a simple series of if each one will astype and return or be caught.
@@ -945,3 +946,25 @@ work with ``NA``, and generally return ``NA``: | |||
in the future. | |||
|
|||
See :ref:`dsintro.numpy_interop` for more on ufuncs. | |||
|
|||
.. _missing_data.NA.Conversion: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
version added tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback this is a subsection of the whole pd.NA
section, which does have a version added tag of 1.0.0. So is a version added tag necessary if it goes in 1.0.0?
@jreback can we take a step back and first discuss what we are trying to achieve with this function? (because based on your comments, there is clearly either a misunderstanding or a disagreement on the purpose of the new function) I think for @Dr-Irv and me, the goal is to make it easier to experiment with the new nullable dtypes: the dtypes that use So the goal is not just to convert to any extension type. For example, the goal is not to convert an object column with timestamp objects with a timezone to a datetimetz dtype. First, because that's already what So therefore, this method is not meant to replace |
this function is too narrowly focused. a user searches the docs and sees If the purpose is to provide a convenient way to 'infer_dtypes', then let's simply do that with a few simple options to it. It is SO confusing that I somehow have a 'object' dtype, so we have a function to 'fix' it. The same with nullables, we want to 'fix' this too. So as I said I would be +1 on having 2 functions which do a very similiar thing under different namespaces is very confusing. |
In itself, I am certainly fine with adding the capabilities as an option to an existing function. And it's true that But, for me, a downside of adding it to Renaming But I want to stress again that for me the two use cases are rather distinct. The current |
@jreback @jorisvandenbossche I merged with latest master, and we're all green, so let me know if there is more to do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. just some minor typing comments. ping on green.
@jreback all green |
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon! If these instruction are inaccurate, feel free to suggest an improvement. |
thanks @Dr-Irv very nice. you have been very responsive on this PR! (and generally)! |
@Dr-Irv seems the automatic backport didn't work. If you can do #30929 (comment) would be amazing. |
@jreback So the issue with the backport has to do with whatsnew. I put the whatsnew in 1.1, so does that mean I should put it in 1.0.0 instead? |
ahh i c so yeah push a PR to master that fixes it in master (eg move to 1.0.0); we will just merge this to master and follow the backporting instructions above to backport to 1.0.0 |
@jreback wrote:
Submitted PR #31279
I think that if you merge #31279 to master, the automatic backport (meeseeksdev) will do the job??? If not, I presume it will give me the right instructions to do on that PR. |
@jreback you were a bit quick with merging. I was still discussing about the options (it's not because you pushed it the way you liked, that now all others are fine with it ;)). After such a long discussion, at least ask about it.
The backport will still need to be done manually, and then in the backport you can do a similar move of the whatsnew as you did in #31279. |
@jorisvandenbossche theb you should have put a block in the PR we have so many PRs |
|
||
Parameters | ||
---------- | ||
input_array : ExtensionArray or PandasArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ExtensionArray or PandasArray" is redundant, isnt it? is ndarray not allowed? either way, can input_array be annotated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel You're correct about the redundancy (this description resulted after lots of discussion above), and I think an ndarray would work, but it is probably untested.
With respect to annotation, the issue here is the ordering of imports, so if it were to be typed, it requires changes to _typing.py
and I didn't want to introduce that complexity to the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for explaining, my mistake not following the thread in real-time.
convert_string: bool = True, | ||
convert_integer: bool = True, | ||
convert_boolean: bool = True, | ||
) -> Dtype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we really need to get a DtypeObject in pandas._typing that excludes strings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR welcome! (heh, heh)
# Convert to types that support pd.NA | ||
|
||
def _convert_dtypes( | ||
self: ABCSeries, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we either a) not annotate self or b) use "Series" instead of ABCSeries (like we have for the return annotation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I wrote the code, I didn't know about the "Series"
annotation, and the return value was caught, so this could be fixed.
@jbrockmendel So now the question is whether these changes are worth a new PR, and whether that could also include doing something with the typing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no worires, ill do this in an upcoming "assorted cleanups" PR
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This implements
DataFrame.convert_dtypes()
andSeries.convert_dtypes()
, which will make it much easier to use the newpd.NA
functionality.Added documentation in the section about the new
pd.NA
functionality.I'm sure there will be comments about how I could have done this in a more/better/different way, and I'm open to resolving them so we get this into 1.0.