ENH: Add nullable dtypes to read_csv #40687

Closed
wants to merge 41 commits into from

Conversation

lithomas1
Member

@lithomas1 lithomas1 commented Mar 29, 2021

@lithomas1 lithomas1 added the Enhancement, IO CSV (read_csv, to_csv), and NA - MaskedArrays (Related to pd.NA and nullable extension arrays) labels Mar 30, 2021
@gfyoung
Member

gfyoung commented Mar 30, 2021

As a whole looks pretty good, though with the growing number of null-related parameters, I think we will at some point need to condense into one "super parameter" (e.g. dict of some sorts).

@@ -2005,7 +2005,8 @@ def maybe_convert_numeric(
     set na_values,
     bint convert_empty=True,
     bint coerce_numeric=False,
-) -> ndarray:
+    bint convert_to_nullable_integer=False,
+) -> "ArrayLike":
Member

can this be done at a higher level (i.e. not in the cython code)?

Member Author

That would probably require a substantial refactoring of the Python Parser code, which calls this directly.

FWIW, maybe_convert_objects also has a convert_to_nullable_integer param.

@lithomas1 lithomas1 closed this Mar 31, 2021
@lithomas1 lithomas1 reopened this Mar 31, 2021
@lithomas1 lithomas1 closed this Mar 31, 2021
@lithomas1 lithomas1 reopened this Mar 31, 2021
@jorisvandenbossche
Member

jorisvandenbossche commented Mar 31, 2021

@lithomas1 Cool! Thanks for working on this.

We also have a nullable float dtype nowadays. I know that this is less useful at this point (since float already supports missing values, in contrast to integer/boolean), but I think that a use_nullable_dtypes keyword should probably also result in nullable float.

EDIT: and actually also the nullable "string" dtype.
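
For reference, the four nullable dtypes mentioned in this thread can be constructed directly with `pd.array`. This is only a minimal sketch of what those dtypes look like; it does not show the `read_csv` behavior proposed in this PR, and it assumes a pandas version that already ships the nullable `Float64` dtype.

```python
import pandas as pd

# Nullable extension arrays; missing values are stored via a mask, not NaN.
ints = pd.array([1, None, 3], dtype="Int64")
bools = pd.array([True, None, False], dtype="boolean")
floats = pd.array([1.5, None, 3.0], dtype="Float64")
strings = pd.array(["a", None, "c"], dtype="string")

for arr in (ints, bools, floats, strings):
    # Print each dtype alongside its missing-value mask.
    print(arr.dtype, arr.isna().tolist())
```

In each case the `None` entry is treated as missing, so integer and boolean columns keep their type instead of being upcast to float64 or object.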

> As a whole looks pretty good, though with the growing number of null-related parameters, I think we will at some point need to condense into one "super parameter" (e.g. dict of some sorts).

@gfyoung I don't think this keyword is very related to the other read_csv null-related keywords? (or which ones were you thinking about?) And the use_nullable_dtypes keyword is also consistent with other IO methods (#29752, #31242)

"A": pd_array([True, NA, False], dtype="boolean"),
"B": pd_array([False, True, NA], dtype="boolean"),
"C": np.array([np.nan, np.nan, np.nan], dtype="float64"),
"D": np.array([True, False, True], dtype="bool"),
Member

We should probably use "boolean" dtype if use_nullable_dtypes=True, regardless of whether NAs were actually present or not?

Member Author

@lithomas1 lithomas1 Apr 1, 2021

An all-NA column has an unknown dtype, so we follow NumPy and return float64. Currently, a column has to contain an integer or boolean value as well as NAs in order to return a nullable EA type, so a column of floats or bools will just return the NumPy types.

This is just here as a test case to make sure I'm not doing the inferencing stuff wrong.
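
The rule described above can be sketched in plain Python. This is a hypothetical illustration only, not the PR's Cython implementation: the helper name `infer_result_dtype` and the use of `None` to stand in for a missing value are assumptions made for the example.

```python
from typing import Optional, Sequence


def infer_result_dtype(values: Sequence[Optional[object]]) -> str:
    """Hypothetical helper: name the dtype a column would get under the rule above."""
    has_na = any(v is None for v in values)
    present = [v for v in values if v is not None]

    if not present:
        # All-NA column: dtype is unknown, so follow NumPy and use float64.
        return "float64"
    if all(isinstance(v, bool) for v in present):
        # Nullable boolean only when NAs are actually present.
        return "boolean" if has_na else "bool"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        # Nullable integer only when NAs are actually present.
        return "Int64" if has_na else "int64"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        # Floats already represent missing values, so they stay NumPy float64.
        return "float64"
    return "object"


print(infer_result_dtype([True, None, False]))  # column "A" in the test above
print(infer_result_dtype([None, None, None]))   # column "C"
print(infer_result_dtype([True, False, True]))  # column "D"
```

Under this sketch, the reviewer's suggestion would amount to returning "boolean" in the second branch regardless of `has_na`.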

@gfyoung
Member

gfyoung commented Mar 31, 2021

> @gfyoung I don't think this keyword is very related to the other read_csv null-related keywords? (or which ones were you thinking about?) And the use_nullable_dtypes keyword is also consistent with other IO methods (#29752, #31242)

@jorisvandenbossche: They are related broadly speaking, but the consistency argument is fair as well.

I was just saying it's something to consider in the future, especially if we continue to add more arguments to an already long signature for read_csv (and the other I/O methods).

Contributor

@jreback jreback left a comment

Can you do a precursor PR which makes the lower-level changes (e.g. to maybe_convert_*) and fully unit tests those? Then we can layer the reader changes on top in this PR.

@lithomas1
Member Author

@jreback I think I will first try to get this feature-complete by adding in support for nullable float and string dtypes. Then I can split out the inferencing changes. Marking this as draft in the meantime.

@lithomas1 lithomas1 marked this pull request as draft April 2, 2021 20:37
elif use_nullable_dtypes and arr.dtype == np.object_:
    # Maybe convert StringArray & catch error for non-strings
    try:
        arr = StringArray(arr)
Member

The constructors for StringArray and ArrowStringArray are incompatible. To allow the global default for StringDtype backend storage (pyarrow or python) to be used here in the future (after #39908 is merged), we should probably call StringDtype.construct_array_type._from_sequence.
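
A minimal sketch of that suggestion, assuming a pandas version where `StringDtype` can be instantiated without arguments. Note that `_from_sequence` is a private extension-array constructor, so this illustrates the idea rather than a supported public API.

```python
import pandas as pd

# Resolve the array class from the dtype instead of hard-coding StringArray,
# so whichever storage backend the dtype selects is used.
dtype = pd.StringDtype()
array_type = dtype.construct_array_type()

# _from_sequence is private; it builds the array from plain scalars.
arr = array_type._from_sequence(["a", None, "c"], dtype=dtype)
print(type(arr).__name__, arr.isna().tolist())
```

Going through the dtype keeps the calling code independent of the concrete array class, which is the point of the review comment.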

@lithomas1 lithomas1 marked this pull request as ready for review June 4, 2021 20:17
@jreback
Contributor

jreback commented Oct 4, 2021

@lithomas1 status of this?

@lithomas1
Member Author

Depends on #41412, but we need to discuss the correct behavior in #42973.

@jreback
Contributor

jreback commented Jan 16, 2022

@lithomas1 this might be easier these days?

@lithomas1
Member Author

Still blocked by at least #45168.

@jreback
Contributor

jreback commented Mar 6, 2022

@lithomas1 status of this?

@mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. Feel free to reopen when you have the time to revisit. Closing to clear out the queue.

@mroeschke mroeschke closed this Jun 10, 2022
Development

Successfully merging this pull request may close these issues.

ENH: add option to get nullable dtypes to pd.read_csv
8 participants