ENH: Implement convert_dtypes #30929

Merged: 28 commits merged into pandas-dev:master on Jan 24, 2020

Conversation

@Dr-Irv (Contributor) commented Jan 11, 2020

This implements DataFrame.convert_dtypes() and Series.convert_dtypes(), which will make it much easier to use the new pd.NA functionality.

Added documentation in the section about the new pd.NA functionality.
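For illustration (not part of the original PR description), a minimal sketch of what the new method does, assuming the nullable dtypes shipped in pandas 1.0 (Int64, string, boolean):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 3],              # plain int64
        "b": ["x", "y", None],       # object column holding strings and a missing value
        "c": [True, False, np.nan],  # object column, because of the NaN
    }
)

converted = df.convert_dtypes()
print(converted.dtypes)
# a      Int64
# b     string
# c    boolean

print(converted["b"][2])  # <NA> -- missing values are now pd.NA
```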

I'm sure there will be comments about how I could have done this in a more/better/different way, and I'm open to resolving them so we get this into 1.0.

@pep8speaks commented Jan 11, 2020

Hello @Dr-Irv! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-24 02:29:52 UTC

@jreback (Contributor) left a comment

@Dr-Irv conceptually this is ok,
very quick glance

but impl needs work and api needs discussion (name)

the api should be more similar to the options provided in infer_objects and df.to_numeric

@Dr-Irv (Contributor, Author) commented Jan 11, 2020

> @Dr-Irv conceptually this is ok,
> very quick glance
>
> but impl needs work and api needs discussion (name)
>
> the api should be more similar to the options provided in infer_objects and df.to_numeric

infer_objects has no parameters, so I don't see what options would apply there.

For df.to_numeric, options currently are errors and downcast. I don't think errors applies in this case (we're not parsing anything), but I could see that using the downcast idea for values of integer, signed and unsigned would apply. Is that what you are suggesting?
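For context, a short sketch (not from the PR) of the existing pd.to_numeric downcast options being referenced here:

```python
import pandas as pd

s = pd.Series([1, 2, 300])                   # int64
pd.to_numeric(s, downcast="integer").dtype   # int16 -- smallest signed type that fits
pd.to_numeric(s, downcast="signed").dtype    # int16 -- "signed" behaves like "integer"
pd.to_numeric(s, downcast="unsigned").dtype  # uint16
```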

I'm open on what name to use. There was some discussion between @jorisvandenbossche and me in #29752 and this was my last suggestion.

@jorisvandenbossche jorisvandenbossche added this to the 1.0.0 milestone Jan 13, 2020
@jorisvandenbossche (Member) commented:

> but I could see that using the downcast idea for values of integer, signed and unsigned would apply. Is that what you are suggesting?

That might be an option to add, but I don't think that is a priority now (users can first use to_numeric to downcast what they want before calling this new method, which is a bit verbose but perfectly possible right now), so I would first focus on getting the basics right / agreed.
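A sketch of that "verbose but perfectly possible" two-step workflow, written with the method name as eventually merged (convert_dtypes); the name was still under discussion at this point in the thread:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3000]})

# Step 1: downcast with the existing API.
downcast = df.apply(pd.to_numeric, downcast="integer")  # a -> int16

# Step 2: convert to the pd.NA-backed nullable dtype.
result = downcast.convert_dtypes()                      # a -> Int16
```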

> I'm open on what name to use. There was some discussion between @jorisvandenbossche and me in #29752 and this was my last suggestion.

Initially I had some reservations about using "nullable" in the name, but actually I think this is OK. String dtypes were already "nullable" before, but not using pd.NA, and we can maybe try to use the term "nullable dtype" consistently for those new dtypes that use pd.NA. Then that should be fine.

I would maybe only use nullable_dtypes (with the "d"), since that's the term that is used elsewhere in APIs in pandas (eg dtypes property, dtype= keyword, etc).


Something else I was wondering: does this need to be a method?
Yes, a method is certainly more discoverable. But to me this doesn't feel like a typical operation, and it's also kind of a temporary thing to try out (while waiting for it to become the default at some point), so a top-level function pd.as_nullable_dtypes(..) might also be fine?

@jorisvandenbossche (Member) left a comment

Should the method always return a new object? (Right now it sometimes is a new object and sometimes it is self, in the Series case.)

@Dr-Irv (Contributor, Author) left a comment

> Should the method always return a new object? (Right now it sometimes is a new object and sometimes it is self, in the Series case.)

I will make it return a new one.

@Dr-Irv (Contributor, Author) commented Jan 13, 2020

@jreback @jorisvandenbossche So I have this all green now. More detailed review and comments are welcome.

@jorisvandenbossche (Member) commented:

cc @TomAugspurger

@WillAyd WillAyd removed this from the 1.0.0 milestone Jan 13, 2020
@Dr-Irv (Contributor, Author) commented Jan 14, 2020

@jorisvandenbossche @WillAyd you seem to disagree on whether this should be on the 1.0.0 milestone.....

@Dr-Irv (Contributor, Author) commented Jan 14, 2020

@TomAugspurger all green. Made your suggested changes. Should be easier code to read now.

@jreback (Contributor) left a comment

the logic as written is super complicated because it is nested. you need to de-nest this, and make it a simple series of ifs where each one will astype and return, or be caught.

@@ -945,3 +946,25 @@ work with ``NA``, and generally return ``NA``:
in the future.

See :ref:`dsintro.numpy_interop` for more on ufuncs.

.. _missing_data.NA.Conversion:
@jreback (Contributor) commented on the diff:

version added tag

@Dr-Irv (Contributor, Author) replied:

@jreback this is a subsection of the whole pd.NA section, which does have a version added tag of 1.0.0. So is a version added tag necessary if it goes in 1.0.0?

@jorisvandenbossche jorisvandenbossche modified the milestone: 1.0.0 Jan 14, 2020
@jorisvandenbossche (Member) commented:

@jreback can we take a step back and first discuss what we are trying to achieve with this function? (because based on your comments, there is clearly either a misunderstanding or a disagreement on the purpose of the new function)

I think for @Dr-Irv and me, the goal is to make it easier to experiment with the new nullable dtypes: the dtypes that use pd.NA as missing value indicator (so yes, for us this is "about NA"), so at this point in time the string, integer and boolean nullable dtypes.

So the goal is not just to convert to any extension type. For example, the goal is not to convert an object column with timestamp objects with a timezone to a datetimetz dtype. First, because that's already what infer_objects() is for, and second because datetimetz is not a nullable dtype (in the sense of using pd.NA or having the same behaviour as dtypes with NA).

Therefore, this method is not meant to replace infer_objects (which is specifically meant to convert accidental object-dtyped columns to their proper dtype), and so we didn't talk about deprecating that. And that is also the reason for the specific name (and not something as general as convert_dtypes).
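To make the distinction concrete, a small sketch (again using the method name as eventually merged):

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype=object)  # an "accidental" object column

s.infer_objects().dtype   # int64 -- fixes up the accidental object dtype
s.convert_dtypes().dtype  # Int64 -- targets the pd.NA-backed nullable dtype

# An already-correct int64 column is left alone by infer_objects,
# but is still converted by the new method:
t = pd.Series([1, 2, 3])
t.infer_objects().dtype   # int64
t.convert_dtypes().dtype  # Int64
```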

@jreback (Contributor) commented Jan 14, 2020

@jorisvandenbossche

this function is too narrowly focused. a user searches the docs and sees .infer_objects and .as_nullable_dtypes. ok, which one shall I use? when should I use it?

If the purpose is to provide a convenient way to 'infer_dtypes', then let's simply do that with a few simple options. It is SO confusing that I somehow have an 'object' dtype, so we have a function to 'fix' it. The same with nullables; we want to 'fix' this too.

So as I said, I would be +1 on .infer_dtypes, which by default does what infer_objects does now (and deprecate that), along with keep_integer=True (needs a more informative name; does this mean convert to nullable or don't convert my integers?).

having 2 functions which do a very similar thing under different namespaces is very confusing.

@jreback added the Dtype Conversions (Unexpected or buggy dtype conversions) and Enhancement labels Jan 14, 2020
@jorisvandenbossche (Member) commented:

In itself, I am certainly fine with adding the capabilities as an option to an existing function. And it's true that infer_objects does similar things (returning the same dataframe but with inferred dtypes).

But, for me, a downside of adding it to infer_objects() is that the name "infer objects" does not fully cover what we want to do here, as the current as_nullable_dtypes in this PR does more than only inferring object columns. It eg also checks float columns to see if they can be nullable integer.

Renaming infer_objects to infer_dtypes as you propose can indeed be an option to overcome that naming issue. But I am not sure it is worth it to deprecate infer_objects for this. I want to note that this function is already a new version of the previously deprecated convert_objects. Putting users through a new deprecation cycle for the same functionality feels unneeded.

But I want to stress again that for me the two use cases are rather distinct. The current infer_objects tries to fix up object dtypes that should have been other dtypes (numeric, datetime), but are not for whatever reason (eg from reading excel files this sometimes happens, from constructing and enlarging a dataframe in steps, etc). The idea of the new function in this PR is not to "fix" dtypes, but to convert perfectly fine, properly inferred dtypes to dtypes that support pd.NA.
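A small sketch of that last point: the new method also converts a perfectly fine float column of whole numbers, which infer_objects would never touch:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, np.nan, 20])  # float64, nothing "wrong" to fix

s.infer_objects().dtype   # float64 -- not an object column, left alone
s.convert_dtypes().dtype  # Int64   -- whole-number floats get pd.NA support
```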

@Dr-Irv (Contributor, Author) commented Jan 23, 2020

@jreback @jorisvandenbossche I merged with latest master, and we're all green, so let me know if there is more to do.

@jreback (Contributor) left a comment

lgtm. just some minor typing comments. ping on green.

@Dr-Irv (Contributor, Author) commented Jan 24, 2020

@jreback all green

@jreback jreback merged commit 08f2d64 into pandas-dev:master Jan 24, 2020
@lumberbot-app (bot) commented Jan 24, 2020

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

1. Check out the backport branch and update it:
   $ git checkout 1.0.x
   $ git pull
2. Cherry-pick the first parent of this PR on top of the older branch:
   $ git cherry-pick 08f2d6411290bae362407b9fb25174bb01fb9040
3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
   $ git commit -am 'Backport PR #30929: ENH: Implement convert_dtypes'
4. Push to a named branch:
   $ git push YOURFORK 1.0.x:auto-backport-of-pr-30929-on-1.0.x
5. Create a PR against branch 1.0.x; I would have named this PR:

   "Backport PR #30929 on branch 1.0.x"

And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.

@jreback (Contributor) commented Jan 24, 2020

thanks @Dr-Irv very nice. you have been very responsive on this PR! (and generally)!

@jreback (Contributor) commented Jan 24, 2020

@Dr-Irv seems the automatic backport didn't work. If you can do #30929 (comment), that would be amazing.

@Dr-Irv (Contributor, Author) commented Jan 24, 2020

@jreback So the issue with the backport has to do with whatsnew. I put the whatsnew in 1.1, so does that mean I should put it in 1.0.0 instead?

@jreback (Contributor) commented Jan 24, 2020

ahh i c

so yeah push a PR to master that fixes it in master (eg move to 1.0.0); we will just merge this to master

and follow the backporting instructions above to backport to 1.0.0

@Dr-Irv (Contributor, Author) commented Jan 24, 2020

@jreback wrote:

> so yeah push a PR to master that fixes it in master (eg move to 1.0.0); we will just merge this to master

Submitted PR #31279

> and follow the backporting instructions above to backport to 1.0.0

I think that if you merge #31279 to master, the automatic backport (meeseeksdev) will do the job???

If not, I presume it will give me the right instructions to do on that PR.

@jorisvandenbossche (Member) commented Jan 24, 2020

@jreback you were a bit quick with merging. I was still discussing the options (it's not because you pushed it the way you liked that all others are now fine with it ;)). After such a long discussion, at least ask about it.


> I think that if you merge #31279 to master, the automatic backport (meeseeksdev) will do the job???

The backport will still need to be done manually, and then in the backport you can do a similar move of the whatsnew as you did in #31279.
If you want to do this, the instructions above #30929 (comment) should be more or less what needs to happen.

@jreback (Contributor) commented Jan 24, 2020

> @jreback you were a bit quick with merging. I was still discussing the options (it's not because you pushed it the way you liked that all others are now fine with it ;)). After such a long discussion, at least ask about it.
>
> I think that if you merge #31279 to master, the automatic backport (meeseeksdev) will do the job???
>
> The backport will still need to be done manually, and then in the backport you can do a similar move of the whatsnew as you did in #31279.
> If you want to do this, the instructions above #30929 (comment) should be more or less what needs to happen.

@jorisvandenbossche then you should have put a block on the PR

we have so many PRs
happy to have you review many more

jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Jan 24, 2020
jorisvandenbossche added a commit that referenced this pull request Jan 24, 2020
Co-authored-by: Irv Lustig <irv@princeton.com>
@Dr-Irv Dr-Irv deleted the asnullabletype branch January 24, 2020 16:00

Parameters
----------
input_array : ExtensionArray or PandasArray
@jbrockmendel (Member) commented:

"ExtensionArray or PandasArray" is redundant, isnt it? is ndarray not allowed? either way, can input_array be annotated?

@Dr-Irv (Contributor, Author) replied:

@jbrockmendel You're correct about the redundancy (this description resulted after lots of discussion above), and I think an ndarray would work, but it is probably untested.

With respect to annotation, the issue here is the ordering of imports, so if it were to be typed, it requires changes to _typing.py and I didn't want to introduce that complexity to the PR.

@jbrockmendel (Member) replied:

thanks for explaining, my mistake not following the thread in real-time.

convert_string: bool = True,
convert_integer: bool = True,
convert_boolean: bool = True,
) -> Dtype:
A Member commented:

we really need to get a DtypeObject in pandas._typing that excludes strings

@Dr-Irv (Contributor, Author) replied:

PR welcome! (heh, heh)
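For reference, the three keyword flags in the signature excerpt above can be toggled independently on the public method; a brief sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", np.nan]})

df.convert_dtypes().dtypes                       # x -> Int64,    y -> string
df.convert_dtypes(convert_string=False).dtypes   # x -> Int64,    y stays object
df.convert_dtypes(convert_integer=False).dtypes  # x stays int64, y -> string
```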

# Convert to types that support pd.NA

def _convert_dtypes(
self: ABCSeries,
@jbrockmendel (Member) commented:

should we either a) not annotate self or b) use "Series" instead of ABCSeries (like we have for the return annotation)?

@Dr-Irv (Contributor, Author) replied:

When I wrote the code, I didn't know about the "Series" annotation, and the return value was caught, so this could be fixed.

@jbrockmendel So now the question is whether these changes are worth a new PR, and whether that could also include doing something with the typing.

@jbrockmendel (Member) replied:

no worries, I'll do this in an upcoming "assorted cleanups" PR

Labels: Dtype Conversions (Unexpected or buggy dtype conversions), Enhancement

7 participants