ENH: add use_nullable_dtypes option in read_parquet #31242
Conversation
pandas/io/parquet.py (Outdated)

@@ -116,13 +117,32 @@ def write(
             **kwargs,
         )

-    def read(self, path, columns=None, **kwargs):
+    def read(self, path, columns=None, use_nullable_dtypes=False, **kwargs):
how about use_extension_dtypes as more descriptive
> how about use_extension_dtypes as more descriptive
It doesn't use extension dtypes in general, only those types that use pd.NA
See also #29752 for some discussion about naming this
FWIW I would also prefer use_extension_dtypes
Whatever keyword is agreed here we should use in read_csv as well. Also should have a docstring reference to convert_dtypes.
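For reference, convert_dtypes already performs this kind of conversion as a post-processing step; a minimal sketch of the behavior being referenced (the column name is made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, None]})
df["a"].dtype          # float64: ints with missing values fall back to float
df2 = df.convert_dtypes()
df2["a"].dtype         # Int64: the nullable integer dtype, backed by pd.NA
```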
Can you then try to provide some (counter) arguments? (I was initially also not fond of this name.)

For me, the main reason to not use use_extension_dtypes is that the option does not enable extension dtypes in general, only those dtypes that use pd.NA.*

*I think we are going to need some terminology to denote "the dtypes that use pd.NA".
Sure, as some quick counter arguments:

The third point is probably the one I think is most of an issue.
This seems unlikely to me. The issue here is that we have multiple representations of the same data (int64/Int64, bool/boolean, etc.), and we want to make it easy to use the new representation rather than the one the dataset was written with. Do you foresee any other cases where we would want to do this transformation? The only one I potentially see is if we for some reason added a floating-point extension type.
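To illustrate the pairs of representations mentioned above (a sketch, not code from the PR):

```python
import pandas as pd

pd.Series([1, 2, 3]).dtype                      # int64 (default)
pd.Series([1, 2, None]).dtype                   # float64 (nulls force a cast)
pd.Series([1, 2, None], dtype="Int64").dtype    # Int64 (nullable, pd.NA)
pd.Series([True, None], dtype="boolean").dtype  # boolean (nullable, pd.NA)
```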
Note: I moved the general discussion above to the original issue: #29752
@jorisvandenbossche status of this?
Well, for me this is ready, but I am not sure everybody agrees on the keyword name (including you?)
@jorisvandenbossche can you merge master to resolve conflicts
@jorisvandenbossche can you rebase.
Let's first decide in #29752
@@ -184,6 +204,12 @@ def write(
         )

     def read(self, path, columns=None, **kwargs):
+        use_nullable_dtypes = kwargs.pop("use_nullable_dtypes", False)
we should have a global option to turn this on (pls add an issue for this)
I think this is generally worth it; if you can add an issue for this, a PR is welcome too! (not blocking for this PR)
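For an engine that cannot produce pd.NA-backed dtypes, the popped keyword presumably has to be rejected rather than silently ignored; a hedged sketch of one way to handle it (class name and message are illustrative, not necessarily what this PR does):

```python
class HypotheticalEngineImpl:
    def read(self, path, columns=None, **kwargs):
        # Pop the option so it is not forwarded to the underlying reader.
        use_nullable_dtypes = kwargs.pop("use_nullable_dtypes", False)
        if use_nullable_dtypes:
            # Fail loudly rather than silently returning default dtypes.
            raise ValueError(
                "The 'use_nullable_dtypes' argument is not supported "
                "by this engine"
            )
        # ... delegate to the engine's own parquet reader ...
```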
Updated with latest master, and addressed some comments (versionadded, additional tests)
linting issue
More comments on this?
        if LooseVersion(self.api.__version__) > "0.15.1.dev":
            import pandas as pd

            mapping = {
can you instead import from the arrays locations.
We also import e.g. DataFrame from the main namespace in this file
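The mapping literal is truncated in the snippet above. A sketch of what such a dict plausibly contains, given that apache/arrow#6189 adds a types_mapper hook to Table.to_pandas (the exact contents in the PR may differ):

```python
import pyarrow
import pandas as pd

# Arrow type -> pandas nullable (pd.NA-backed) dtype.
mapping = {
    pyarrow.int8(): pd.Int8Dtype(),
    pyarrow.int16(): pd.Int16Dtype(),
    pyarrow.int32(): pd.Int32Dtype(),
    pyarrow.int64(): pd.Int64Dtype(),
    pyarrow.uint8(): pd.UInt8Dtype(),
    pyarrow.uint16(): pd.UInt16Dtype(),
    pyarrow.uint32(): pd.UInt32Dtype(),
    pyarrow.uint64(): pd.UInt64Dtype(),
    pyarrow.bool_(): pd.BooleanDtype(),
    pyarrow.string(): pd.StringDtype(),
}

# types_mapper is called with each Arrow type; returning None falls back
# to the default conversion, so dict.get can be passed directly:
# df = arrow_table.to_pandas(types_mapper=mapping.get)
```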
pandas/tests/io/test_parquet.py (Outdated)

@@ -828,6 +828,35 @@ def test_additional_extension_types(self, pa):
         )
         check_round_trip(df, pa)

+    @td.skip_if_no("pyarrow", min_version="0.15.1.dev")
same comment as above
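For context, a sketch of what a test behind this decorator might look like, assuming the test module's existing imports (pd, td, tm) and the pa fixture; the actual test added in the PR may differ:

```python
@td.skip_if_no("pyarrow", min_version="0.15.1.dev")
def test_use_nullable_dtypes(self, pa):
    import pyarrow
    import pyarrow.parquet

    # Write an Arrow table with a null in an int64 column, then read it
    # back with the new keyword and expect the nullable Int64 dtype.
    table = pyarrow.table({"a": pyarrow.array([1, 2, None], pyarrow.int64())})
    with tm.ensure_clean() as path:
        pyarrow.parquet.write_table(table, path)
        result = pd.read_parquet(path, use_nullable_dtypes=True)
    expected = pd.DataFrame({"a": pd.array([1, 2, None], dtype="Int64")})
    tm.assert_frame_equal(result, expected)
```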
Thanks @jorisvandenbossche. See comment above for a follow-on issue.
I think #29752 already covers that for now. We first need more IO methods supporting it anyhow, before a global option would be useful.
xref #29752, #30929
Using some work I am doing in pyarrow (apache/arrow#6189), we are able to provide an option in read_parquet to directly use the new nullable dtypes, instead of first using the default conversion (which e.g. gives floats for ints with nulls) and doing the conversion afterwards.
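A hypothetical usage sketch (file and column names are made up) of what this enables:

```python
import pandas as pd

# Default conversion: integer columns containing nulls come back as
# float64 with NaN.
df = pd.read_parquet("data.parquet")

# With the new keyword: the same columns come back as Int64, using pd.NA,
# with no intermediate cast to float.
df = pd.read_parquet("data.parquet", use_nullable_dtypes=True)
```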