ENH: Implement convert_dtypes #30929

Merged: 28 commits merged into pandas-dev:master on Jan 24, 2020

Conversation

@Dr-Irv (Contributor) commented Jan 11, 2020

This implements DataFrame.convert_dtypes() and Series.convert_dtypes(), which will make it much easier to use the new pd.NA functionality.

Added documentation in the section about the new pd.NA functionality.
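For illustration (not part of the original PR description), a minimal sketch of what the new method does, assuming the nullable dtypes shipped in pandas 1.0 (Int64, string, boolean):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 3],              # plain int64
        "b": ["x", "y", None],       # object column holding strings and a missing value
        "c": [True, False, np.nan],  # object column, because of the NaN
    }
)

converted = df.convert_dtypes()
print(converted.dtypes)
# a      Int64
# b     string
# c    boolean

print(converted["b"][2])  # <NA> -- missing values are now pd.NA
```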

I'm sure there will be comments about how I could have done this in a more/better/different way, and I'm open to resolving them so we get this into 1.0.

@pep8speaks commented Jan 11, 2020

Hello @Dr-Irv! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-24 02:29:52 UTC

@jreback (Contributor) left a comment

@Dr-Irv conceptually this is ok,
very quick glance

but impl needs work and api needs discussion (name)

the api should be more similar to the options provided in infer_objects and df.to_numeric

@Dr-Irv (Contributor, Author) commented Jan 11, 2020

> @Dr-Irv conceptually this is ok,
> very quick glance
>
> but impl needs work and api needs discussion (name)
>
> the api should be more similar to the options provided in infer_objects and df.to_numeric

infer_objects has no parameters, so I don't see what options would apply there.

For df.to_numeric, options currently are errors and downcast. I don't think errors applies in this case (we're not parsing anything), but I could see that using the downcast idea for values of integer, signed and unsigned would apply. Is that what you are suggesting?
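For context, a short sketch (not from the PR) of the existing pd.to_numeric downcast options being referenced here:

```python
import pandas as pd

s = pd.Series([1, 2, 300])                   # int64
pd.to_numeric(s, downcast="integer").dtype   # int16 -- smallest signed type that fits
pd.to_numeric(s, downcast="signed").dtype    # int16 -- "signed" behaves like "integer"
pd.to_numeric(s, downcast="unsigned").dtype  # uint16
```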

I'm open on what name to use. There was some discussion between @jorisvandenbossche and me in #29752 and this was my last suggestion.

@jorisvandenbossche jorisvandenbossche added this to the 1.0.0 milestone Jan 13, 2020
@jorisvandenbossche (Member) commented:

> but I could see that using the downcast idea for values of integer, signed and unsigned would apply. Is that what you are suggesting?

That might be an option to add, but I don't think that is a priority now (users can first use to_numeric to downcast what they want before calling this new method, which is a bit verbose but perfectly possible right now), so I would first focus on getting the basics right / agreed.
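A sketch of that "verbose but perfectly possible" two-step workflow, written with the method name as eventually merged (convert_dtypes); the name was still under discussion at this point in the thread:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3000]})

# Step 1: downcast with the existing API.
downcast = df.apply(pd.to_numeric, downcast="integer")  # a -> int16

# Step 2: convert to the pd.NA-backed nullable dtype.
result = downcast.convert_dtypes()                      # a -> Int16
```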

> I'm open on what name to use. There was some discussion between @jorisvandenbossche and me in #29752 and this was my last suggestion.

Initially I had some reservations about using "nullable" in the name, but actually I think this is OK. String dtypes were already "nullable" before, but not using pd.NA, and we can maybe try to use the term "nullable dtype" consistently for those new dtypes that use pd.NA. Then that should be fine.

I would maybe only use nullable_dtypes (with the "d"), since that's the term that is used elsewhere in APIs in pandas (eg dtypes property, dtype= keyword, etc).


Something else I was wondering: does this need to be a method?
Yes, a method is certainly more discoverable. But to me this doesn't feel like a typical operation, and it's also kind of a temporary thing to try out (while waiting for it to become the default at some point), so a top-level function pd.as_nullable_dtypes(..) might also be fine?

@jorisvandenbossche (Member) left a comment

Should the method always return a new object? (Right now it sometimes is a new object and sometimes it is self, in the Series case.)

@Dr-Irv (Contributor, Author) left a comment

> Should the method always return a new object? (Right now it sometimes is a new object and sometimes it is self, in the Series case.)

I will make it return a new one.

@Dr-Irv (Contributor, Author) commented Jan 13, 2020

@jreback @jorisvandenbossche So I have this all green now. More detailed review and comments are welcome.

@jorisvandenbossche (Member) commented:

cc @TomAugspurger

@WillAyd WillAyd removed this from the 1.0.0 milestone Jan 13, 2020
@Dr-Irv (Contributor, Author) commented Jan 14, 2020

@jorisvandenbossche @WillAyd you seem to disagree on whether this should be on the 1.0.0 milestone.....

@Dr-Irv (Contributor, Author) commented Jan 14, 2020

@TomAugspurger all green. Made your suggested changes. Should be easier code to read now.

@jreback (Contributor) left a comment

the logic as written is super complicated because it is nested. you need to de-nest this, and make it a simple series of ifs where each one will astype and return, or be caught.

@@ -945,3 +946,25 @@ work with ``NA``, and generally return ``NA``:
in the future.

See :ref:`dsintro.numpy_interop` for more on ufuncs.

.. _missing_data.NA.Conversion:
@jreback (Contributor) commented on the diff:

version added tag

@Dr-Irv (Contributor, Author) replied:

@jreback this is a subsection of the whole pd.NA section, which does have a version added tag of 1.0.0. So is a version added tag necessary if it goes in 1.0.0?

@jorisvandenbossche jorisvandenbossche modified the milestone: 1.0.0 Jan 14, 2020
@jorisvandenbossche (Member) commented:

@jreback can we take a step back and first discuss what we are trying to achieve with this function? (because based on your comments, there is clearly either a misunderstanding or a disagreement on the purpose of the new function)

I think for @Dr-Irv and me, the goal is to make it easier to experiment with the new nullable dtypes: the dtypes that use pd.NA as missing value indicator (so yes, for us this is "about NA"), so at this point in time the string, integer and boolean nullable dtypes.

So the goal is not just to convert to any extension type. For example, the goal is not to convert an object column with timestamp objects with a timezone to a datetimetz dtype. First, because that's already what infer_objects() is for, and second because datetimetz is not a nullable dtype (in the sense of using pd.NA or having the same behaviour as dtypes with NA).

Therefore, this method is not meant to replace infer_objects (which is specifically meant to convert accidental object-dtyped columns to their proper dtype), and so we didn't talk about deprecating that. And that is also the reason for the specific name (and not something as general as convert_dtypes).
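To make the distinction concrete, a small sketch (again using the method name as eventually merged):

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype=object)  # an "accidental" object column

s.infer_objects().dtype   # int64 -- fixes up the accidental object dtype
s.convert_dtypes().dtype  # Int64 -- targets the pd.NA-backed nullable dtype

# An already-correct int64 column is left alone by infer_objects,
# but is still converted by the new method:
t = pd.Series([1, 2, 3])
t.infer_objects().dtype   # int64
t.convert_dtypes().dtype  # Int64
```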

@jreback (Contributor) commented Jan 14, 2020

@jorisvandenbossche

this function is too narrowly focused. a user searches the docs and sees .infer_objects and .as_nullable_dtypes. ok, which one shall I use? when should I use it?

If the purpose is to provide a convenient way to 'infer_dtypes', then let's simply do that with a few simple options. It is SO confusing that I somehow have an 'object' dtype, so we have a function to 'fix' it. The same with nullables; we want to 'fix' this too.

So as I said, I would be +1 on .infer_dtypes, which by default does what infer_objects does now (and deprecate that), along with keep_integer=True (needs a more informative name; does this mean convert to nullable or don't convert my integers?).

having 2 functions which do a very similar thing under different namespaces is very confusing.

@jreback added the Dtype Conversions (Unexpected or buggy dtype conversions) and Enhancement labels Jan 14, 2020
@jorisvandenbossche (Member) commented:

In itself, I am certainly fine with adding the capabilities as an option to an existing function. And it's true that infer_objects does similar things (returning the same dataframe but with inferred dtypes).

But, for me, a downside of adding it to infer_objects() is that the name "infer objects" does not fully cover what we want to do here, as the current as_nullable_dtypes in this PR does more than only inferring object columns. It eg also checks float columns to see if they can be nullable integer.

Renaming infer_objects to infer_dtypes as you propose can indeed be an option to overcome that naming issue. But I am not sure it is worth it to deprecate infer_objects for this. I want to note that this function is already a new version of the previously deprecated convert_objects. Putting users through a new deprecation cycle for the same functionality feels unneeded.

But I want to stress again that for me the two use cases are rather distinct. The current infer_objects tries to fix up object dtypes that should have been other dtypes (numeric, datetime), but are not for whatever reason (eg from reading excel files this sometimes happens, from constructing and enlarging a dataframe in steps, etc). The idea of the new function in this PR is not to "fix" dtypes, but to convert perfectly fine, properly inferred dtypes to dtypes that support pd.NA.
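A small sketch of that last point: the new method also converts a perfectly fine float column of whole numbers, which infer_objects would never touch:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, np.nan, 20])  # float64, nothing "wrong" to fix

s.infer_objects().dtype   # float64 -- not an object column, left alone
s.convert_dtypes().dtype  # Int64   -- whole-number floats get pd.NA support
```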

@Dr-Irv (Contributor, Author) commented Jan 23, 2020

@jreback @jorisvandenbossche I merged with latest master, and we're all green, so let me know if there is more to do.

@jreback (Contributor) left a comment

lgtm. just some minor typing comments. ping on green.

@Dr-Irv (Contributor, Author) commented Jan 24, 2020

@jreback all green

@jreback jreback merged commit 08f2d64 into pandas-dev:master Jan 24, 2020
@lumberbot-app (bot) commented Jan 24, 2020

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

1. Check out the backport branch and update it:
   $ git checkout 1.0.x
   $ git pull
2. Cherry-pick the first parent of this PR on top of the older branch:
   $ git cherry-pick 08f2d6411290bae362407b9fb25174bb01fb9040
3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
   $ git commit -am 'Backport PR #30929: ENH: Implement convert_dtypes'
4. Push to a named branch:
   $ git push YOURFORK 1.0.x:auto-backport-of-pr-30929-on-1.0.x
5. Create a PR against branch 1.0.x; I would have named this PR:

   "Backport PR #30929 on branch 1.0.x"

And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.

@jreback (Contributor) commented Jan 24, 2020

thanks @Dr-Irv very nice. you have been very responsive on this PR! (and generally)!

@jreback (Contributor) commented Jan 24, 2020

@Dr-Irv seems the automatic backport didn't work. If you can do #30929 (comment), that would be amazing.

@Dr-Irv (Contributor, Author) commented Jan 24, 2020

@jreback So the issue with the backport has to do with whatsnew. I put the whatsnew in 1.1, so does that mean I should put it in 1.0.0 instead?

@jreback (Contributor) commented Jan 24, 2020

ahh i c

so yeah push a PR to master that fixes it in master (eg move to 1.0.0); we will just merge this to master

and follow the backporting instructions above to backport to 1.0.0

@Dr-Irv (Contributor, Author) commented Jan 24, 2020

@jreback wrote:

> so yeah push a PR to master that fixes it in master (eg move to 1.0.0); we will just merge this to master

Submitted PR #31279

> and follow the backporting instructions above to backport to 1.0.0

I think that if you merge #31279 to master, the automatic backport (meeseeksdev) will do the job???

If not, I presume it will give me the right instructions to do on that PR.

@jorisvandenbossche (Member) commented Jan 24, 2020

@jreback you were a bit quick with merging. I was still discussing the options (it's not because you pushed it the way you liked that all others are now fine with it ;)). After such a long discussion, at least ask about it.


> I think that if you merge #31279 to master, the automatic backport (meeseeksdev) will do the job???

The backport will still need to be done manually, and then in the backport you can do a similar move of the whatsnew as you did in #31279.
If you want to do this, the instructions above #30929 (comment) should be more or less what needs to happen.

@jreback (Contributor) commented Jan 24, 2020

> @jreback you were a bit quick with merging. I was still discussing the options (it's not because you pushed it the way you liked that all others are now fine with it ;)). After such a long discussion, at least ask about it.
>
> I think that if you merge #31279 to master, the automatic backport (meeseeksdev) will do the job???
>
> The backport will still need to be done manually, and then in the backport you can do a similar move of the whatsnew as you did in #31279.
> If you want to do this, the instructions above #30929 (comment) should be more or less what needs to happen.

@jorisvandenbossche then you should have put a block on the PR

we have so many PRs
happy to have you review many more

jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Jan 24, 2020
jorisvandenbossche added a commit that referenced this pull request Jan 24, 2020
Co-authored-by: Irv Lustig <irv@princeton.com>
@Dr-Irv Dr-Irv deleted the asnullabletype branch January 24, 2020 16:00

Parameters
----------
input_array : ExtensionArray or PandasArray
@jbrockmendel (Member) commented:

"ExtensionArray or PandasArray" is redundant, isnt it? is ndarray not allowed? either way, can input_array be annotated?

@Dr-Irv (Contributor, Author) replied:

@jbrockmendel You're correct about the redundancy (this description resulted after lots of discussion above), and I think an ndarray would work, but it is probably untested.

With respect to annotation, the issue here is the ordering of imports, so if it were to be typed, it requires changes to _typing.py and I didn't want to introduce that complexity to the PR.

@jbrockmendel (Member) replied:

thanks for explaining, my mistake not following the thread in real-time.

convert_string: bool = True,
convert_integer: bool = True,
convert_boolean: bool = True,
) -> Dtype:
A Member commented:

we really need to get a DtypeObject in pandas._typing that excludes strings

@Dr-Irv (Contributor, Author) replied:

PR welcome! (heh, heh)
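For reference, the three keyword flags in the signature excerpt above can be toggled independently on the public method; a brief sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", np.nan]})

df.convert_dtypes().dtypes                       # x -> Int64,    y -> string
df.convert_dtypes(convert_string=False).dtypes   # x -> Int64,    y stays object
df.convert_dtypes(convert_integer=False).dtypes  # x stays int64, y -> string
```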

# Convert to types that support pd.NA

def _convert_dtypes(
self: ABCSeries,
@jbrockmendel (Member) commented:

should we either a) not annotate self or b) use "Series" instead of ABCSeries (like we have for the return annotation)?

@Dr-Irv (Contributor, Author) replied:

When I wrote the code, I didn't know about the "Series" annotation, and the return value was caught, so this could be fixed.

@jbrockmendel So now the question is whether these changes are worth a new PR, and whether that could also include doing something with the typing.

@jbrockmendel (Member) replied:

no worries, I'll do this in an upcoming "assorted cleanups" PR

Labels: Dtype Conversions (Unexpected or buggy dtype conversions), Enhancement

7 participants