ENH: add use_nullable_dtypes option in read_parquet #31242


Merged (9 commits) on Nov 29, 2020

Conversation

jorisvandenbossche (Member):

xref #29752, #30929

Using some work I am doing in pyarrow (apache/arrow#6189), we are able to provide an option in read_parquet to directly use the new nullable dtypes, instead of first applying the default conversion (which, e.g., gives floats for ints with nulls) and doing the conversion afterwards.
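As an aside, the default conversion described here can be illustrated with plain pandas constructors, without a parquet round trip (a minimal sketch):

```python
import pandas as pd

# Default behaviour: an integer column with a missing value is cast to
# float64, because NumPy integers cannot hold NaN.
s_default = pd.Series([1, 2, None])
print(s_default.dtype)  # float64

# With the nullable extension dtype the integers are preserved and the
# missing value is represented as pd.NA instead.
s_nullable = pd.Series([1, 2, None], dtype="Int64")
print(s_nullable.dtype)  # Int64
print(s_nullable[2] is pd.NA)  # True
```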

@@ -116,13 +117,32 @@ def write(
**kwargs,
)

-    def read(self, path, columns=None, **kwargs):
+    def read(self, path, columns=None, use_nullable_dtypes=False, **kwargs):
Contributor:

how about use_extension_dtypes as more descriptive

jorisvandenbossche (Member, author):

> how about use_extension_dtypes as more descriptive

It doesn't use extension dtypes in general, only those types that use pd.NA
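A small sketch of that distinction: the pd.NA-based dtypes the option targets, versus an extension dtype like Categorical that is unaffected by it:

```python
import pandas as pd

# Dtypes targeted by use_nullable_dtypes: the ones whose missing-value
# marker is pd.NA.
int_arr = pd.array([1, None], dtype="Int64")
bool_arr = pd.array([True, None], dtype="boolean")
assert int_arr[1] is pd.NA and bool_arr[1] is pd.NA

# Categorical is also an extension dtype, but it is returned by default
# anyway and does not use pd.NA for missing values:
cat = pd.Categorical(["a", None])
assert cat[1] is not pd.NA
```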

jorisvandenbossche (Member, author):

See also #29752 for some discussion about naming this

Member:

FWIW I would also prefer use_extension_dtypes

jreback (Contributor) commented Jan 23, 2020:

Whatever keyword is agreed on here we should use in read_csv as well. The docstring should also reference convert_dtypes.
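For context, DataFrame.convert_dtypes is the existing way to apply this conversion after the fact (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, None], "b": ["x", "y", None]})
print(df.dtypes)  # a: float64, b: object  (default conversion)

# convert_dtypes switches to the pd.NA-based dtypes where possible.
converted = df.convert_dtypes()
print(converted.dtypes)  # a: Int64, b: string
```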

jorisvandenbossche (Member, author):

> FWIW I would also prefer use_extension_dtypes

Can you then try to provide some (counter) arguments? (I was initially also not fond of use_nullable_dtypes, so happy to find a better alternative).

For me, the main reason not to use use_extension_dtypes is that this option does not cause extension dtypes to be returned in general. For example, it does not affect categorical or datetimetz (those are already returned by default by pyarrow), nor period or interval (those can be returned based on metadata saved in the parquet file / pyarrow extension types); in both cases, extension types will be returned even with use_extension_dtypes=False. In contrast, I find use_nullable_dtypes clearer in communicating the intent*.
In addition, and more semantically, "extension" types can give the impression of being about "external" extension types (but this is a problem with the term in general, so not that relevant here).

*I think we are going to need some terminology to denote "the dtypes that use pd.NA as missing value indicator". Also for our communication about it (and when discussing it), in the docs, etc., it would be good to have a term we can use consistently. I think "nullable dtypes" is an option for this (we have already used "nullable integer dtype" in the docs for a while), although it is certainly not ideal, since strictly speaking other dtypes are also "nullable" (floats, object, datetime), just in a different way.
Maybe having this more general discussion can help us find matching keyword names afterwards.

WillAyd (Member) commented Jan 23, 2020:

Sure as some quick counter arguments:

  • The semantics are unclear to an end user; I would think most consider np.float to be nullable, which this wouldn't affect
  • Some of the arguments for its clarity are specific to parquet, but I think they become more ambiguous if we reuse the same keyword for other parsers (which I hope we would)
  • If we added more extension types in the future that aren't just focused on NA handling, then we have to add another keyword on top of this to parse, which just makes things more complicated

The third point is probably the one I think is the biggest issue.

TomAugspurger (Contributor):

> If we added more extension types in the future that aren't just focused on NA handling then we have to add another keyword on top of this to parse, which just makes things more complicated

This seems unlikely to me. The issue here is that we have multiple representations for the same data (int64, Int64, bool, boolean, etc.), and we want to make it easy to use the new representation, not the one the dataset was written with. Do you foresee any other cases where we would want to do this transformation? The only one I potentially see is if we for some reason added a floating-point extension type.
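A small sketch of the "multiple representations of the same data" point (the examples are illustrative):

```python
import pandas as pd

# The same values under the two representations: NumPy-backed int64 and
# its pd.NA-capable counterpart Int64.
legacy = pd.Series([1, 2, 3], dtype="int64")
nullable = pd.Series([1, 2, 3], dtype="Int64")
assert (legacy == nullable).all()

# Only the nullable representation can hold a missing value without
# changing its dtype:
with_na = pd.Series([1, None, 3], dtype="Int64")
print(with_na.dtype)  # Int64
```

For what it's worth, the floating-point case Tom mentions did later materialize: a nullable Float64 dtype was added in pandas 1.2.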

jorisvandenbossche (Member, author):

Note: I moved the general discussion above to the original issue: #29752

jreback (Contributor) commented Apr 10, 2020:

@jorisvandenbossche status of this?

jorisvandenbossche (Member, author):

Well, for me this is ready, but I am not sure everybody agrees on the keyword name (including you?).
I also pinged the more general discussion in #29752.

simonjayhawkins (Member) left a comment:

@jorisvandenbossche can you merge master to resolve conflicts

simonjayhawkins (Member):

@jorisvandenbossche can you rebase.

jorisvandenbossche (Member, author):

Let's first decide in #29752

@@ -184,6 +204,12 @@ def write(
)

def read(self, path, columns=None, **kwargs):
use_nullable_dtypes = kwargs.pop("use_nullable_dtypes", False)
Contributor:

we should have a global option to turn this on (pls add an issue for this)

Contributor:

I think this is generally worth it; if you can add an issue for this / a PR is welcome too! (not blocking for this PR)
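The hunk above uses the common pop-with-default pattern for engine-specific keywords; a standalone sketch (the error message and return value here are illustrative, not the exact code in the PR):

```python
def read(path, columns=None, **kwargs):
    # Pop the keyword with a default so the remaining kwargs can still be
    # passed through to the underlying engine untouched.
    use_nullable_dtypes = kwargs.pop("use_nullable_dtypes", False)
    if use_nullable_dtypes:
        # An engine without support rejects the option explicitly rather
        # than silently ignoring it.
        raise ValueError("use_nullable_dtypes is not supported for this engine")
    return path, columns, kwargs  # hand off to the real reader here
```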

jorisvandenbossche (Member, author):

Updated with latest master, and addressed some comments (versionadded, additional tests)

jbrockmendel (Member):

linting issue

jorisvandenbossche (Member, author):

More comments on this?

@jorisvandenbossche jorisvandenbossche added this to the 1.2 milestone Nov 28, 2020
if LooseVersion(self.api.__version__) > "0.15.1.dev":
import pandas as pd

mapping = {
Contributor:

Can you instead import from the arrays locations?

jorisvandenbossche (Member, author):

We also import eg DataFrame from the main namespace in this file

@@ -828,6 +828,35 @@ def test_additional_extension_types(self, pa):
)
check_round_trip(df, pa)

@td.skip_if_no("pyarrow", min_version="0.15.1.dev")
Contributor:

same comment as above

@jreback jreback merged commit 7b400b3 into pandas-dev:master Nov 29, 2020
jreback (Contributor) commented Nov 29, 2020:

thanks @jorisvandenbossche. See the comment above for a follow-on issue.

@jorisvandenbossche jorisvandenbossche deleted the parquet-nullable-types branch November 29, 2020 16:13
jorisvandenbossche (Member, author):

I think #29752 already covers that for now. We first need more IO readers supporting it anyhow, before a global option would be useful.

Labels: IO Parquet
6 participants