ENH: add use_nullable_dtypes option in read_parquet #31242


Merged (9 commits) on Nov 29, 2020

Conversation

jorisvandenbossche (Member):

xref #29752, #30929

Using some work I am doing in pyarrow (apache/arrow#6189), we are able to provide an option in read_parquet to directly use the new nullable dtypes, instead of first applying the default conversion (which, e.g., gives floats for ints with nulls) and doing the conversion afterwards.
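As an aside, the default conversion described here can be illustrated with plain pandas constructors, without a parquet round trip (a minimal sketch):

```python
import pandas as pd

# Default behaviour: an integer column with a missing value is cast to
# float64, because NumPy integers cannot hold NaN.
s_default = pd.Series([1, 2, None])
print(s_default.dtype)  # float64

# With the nullable extension dtype the integers are preserved and the
# missing value is represented as pd.NA instead.
s_nullable = pd.Series([1, 2, None], dtype="Int64")
print(s_nullable.dtype)  # Int64
print(s_nullable[2] is pd.NA)  # True
```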

@@ -116,13 +117,32 @@ def write(
**kwargs,
)

-    def read(self, path, columns=None, **kwargs):
+    def read(self, path, columns=None, use_nullable_dtypes=False, **kwargs):
Contributor:

how about use_extension_dtypes as more descriptive

jorisvandenbossche (Member, author):

> how about use_extension_dtypes as more descriptive

It doesn't use extension dtypes in general, only those types that use pd.NA
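A small sketch of that distinction: the pd.NA-based dtypes the option targets, versus an extension dtype like Categorical that is unaffected by it:

```python
import pandas as pd

# Dtypes targeted by use_nullable_dtypes: the ones whose missing-value
# marker is pd.NA.
int_arr = pd.array([1, None], dtype="Int64")
bool_arr = pd.array([True, None], dtype="boolean")
assert int_arr[1] is pd.NA and bool_arr[1] is pd.NA

# Categorical is also an extension dtype, but it is returned by default
# anyway and does not use pd.NA for missing values:
cat = pd.Categorical(["a", None])
assert cat[1] is not pd.NA
```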

jorisvandenbossche (Member, author):

See also #29752 for some discussion about naming this

Member:

FWIW I would also prefer use_extension_dtypes

jreback (Contributor) commented Jan 23, 2020:

Whatever keyword is agreed on here we should use in read_csv as well. The docstring should also reference convert_dtypes.
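For context, DataFrame.convert_dtypes is the existing way to apply this conversion after the fact (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, None], "b": ["x", "y", None]})
print(df.dtypes)  # a: float64, b: object  (default conversion)

# convert_dtypes switches to the pd.NA-based dtypes where possible.
converted = df.convert_dtypes()
print(converted.dtypes)  # a: Int64, b: string
```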

jorisvandenbossche (Member, author):

> FWIW I would also prefer use_extension_dtypes

Can you then try to provide some (counter) arguments? (I was initially also not fond of use_nullable_dtypes, so happy to find a better alternative).

For me, the main reason not to use use_extension_dtypes is that this option does not cause extension dtypes to be returned in general. For example, it does not affect categorical or datetimetz (those are already returned by default by pyarrow), nor period or interval (those can be returned based on metadata saved in the parquet file / pyarrow extension types); in both cases, extension types will be returned even with use_extension_dtypes=False. In contrast, I find use_nullable_dtypes clearer in communicating the intent*.
In addition, and more semantically, "extension" types can give the impression of being about "external" extension types (but this is a problem with the term in general, so not that relevant here).

*I think we are going to need some terminology to denote "the dtypes that use pd.NA as missing value indicator". Also for our communication about it (and when discussing it), in the docs, etc., it would be good to have a term we can use consistently. I think "nullable dtypes" is an option for this (we have already used "nullable integer dtype" in the docs for a while), although it is certainly not ideal, since strictly speaking other dtypes are also "nullable" (floats, object, datetime), just in a different way.
Maybe having this more general discussion can help us find matching keyword names afterwards.

WillAyd (Member) commented Jan 23, 2020:

Sure as some quick counter arguments:

  • The semantics are unclear to an end user; I would think most consider np.float to be nullable, which this wouldn't affect
  • Some of the arguments for its clarity are specific to parquet, but I think they become more ambiguous if we reuse the same keyword for other parsers (which I hope we would)
  • If we added more extension types in the future that aren't just focused on NA handling, then we have to add another keyword on top of this to parse, which just makes things more complicated

The third point is probably the one I think is the biggest issue.

TomAugspurger (Contributor):

> If we added more extension types in the future that aren't just focused on NA handling then we have to add another keyword on top of this to parse, which just makes things more complicated

This seems unlikely to me. The issue here is that we have multiple representations for the same data (int64, Int64, bool, boolean, etc.), and we want to make it easy to use the new representation, not the one the dataset was written with. Do you foresee any other cases where we would want to do this transformation? The only one I potentially see is if we for some reason added a floating-point extension type.
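A small sketch of the "multiple representations of the same data" point (the examples are illustrative):

```python
import pandas as pd

# The same values under the two representations: NumPy-backed int64 and
# its pd.NA-capable counterpart Int64.
legacy = pd.Series([1, 2, 3], dtype="int64")
nullable = pd.Series([1, 2, 3], dtype="Int64")
assert (legacy == nullable).all()

# Only the nullable representation can hold a missing value without
# changing its dtype:
with_na = pd.Series([1, None, 3], dtype="Int64")
print(with_na.dtype)  # Int64
```

For what it's worth, the floating-point case Tom mentions did later materialize: a nullable Float64 dtype was added in pandas 1.2.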

jorisvandenbossche (Member, author):

Note: I moved the general discussion above to the original issue: #29752

jreback (Contributor) commented Apr 10, 2020:

@jorisvandenbossche status of this?

jorisvandenbossche (Member, author):

Well, for me this is ready, but I am not sure everybody agrees on the keyword name (including you?).
I also pinged the more general discussion in #29752.

simonjayhawkins (Member) left a comment:

@jorisvandenbossche can you merge master to resolve conflicts

simonjayhawkins (Member):

@jorisvandenbossche can you rebase.

jorisvandenbossche (Member, author):

Let's first decide in #29752

@@ -184,6 +204,12 @@ def write(
)

def read(self, path, columns=None, **kwargs):
use_nullable_dtypes = kwargs.pop("use_nullable_dtypes", False)
Contributor:

we should have a global option to turn this on (pls add an issue for this)

Contributor:

I think this is generally worth it; if you can add an issue for this / a PR is welcome too! (not blocking for this PR)
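The hunk above uses the common pop-with-default pattern for engine-specific keywords; a standalone sketch (the error message and return value here are illustrative, not the exact code in the PR):

```python
def read(path, columns=None, **kwargs):
    # Pop the keyword with a default so the remaining kwargs can still be
    # passed through to the underlying engine untouched.
    use_nullable_dtypes = kwargs.pop("use_nullable_dtypes", False)
    if use_nullable_dtypes:
        # An engine without support rejects the option explicitly rather
        # than silently ignoring it.
        raise ValueError("use_nullable_dtypes is not supported for this engine")
    return path, columns, kwargs  # hand off to the real reader here
```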

jorisvandenbossche (Member, author):

Updated with latest master, and addressed some comments (versionadded, additional tests)

jbrockmendel (Member):

linting issue

jorisvandenbossche (Member, author):

More comments on this?

@jorisvandenbossche jorisvandenbossche added this to the 1.2 milestone Nov 28, 2020
if LooseVersion(self.api.__version__) > "0.15.1.dev":
import pandas as pd

mapping = {
Contributor:

Can you instead import from the arrays locations?

jorisvandenbossche (Member, author):

We also import eg DataFrame from the main namespace in this file

@@ -828,6 +828,35 @@ def test_additional_extension_types(self, pa):
)
check_round_trip(df, pa)

@td.skip_if_no("pyarrow", min_version="0.15.1.dev")
Contributor:

same comment as above

@jreback jreback merged commit 7b400b3 into pandas-dev:master Nov 29, 2020
jreback (Contributor) commented Nov 29, 2020:

thanks @jorisvandenbossche. See the comment above for a follow-on issue.

@jorisvandenbossche jorisvandenbossche deleted the parquet-nullable-types branch November 29, 2020 16:13
jorisvandenbossche (Member, author):

I think #29752 already covers that for now. We first need more IO readers supporting it anyhow, before a global option would be useful.

Labels: IO Parquet
6 participants