-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add nullable dtypes to read_csv #40687
Conversation
lithomas1
commented
Mar 29, 2021
•
edited
Loading
edited
- closes ENH: add option to get nullable dtypes to pd.read_csv #36712
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
As a whole looks pretty good, though with the growing number of null-related parameters, I think we will at some point need to condense into one "super parameter" (e.g. |
pandas/_libs/lib.pyx
Outdated
@@ -2005,7 +2005,8 @@ def maybe_convert_numeric( | |||
set na_values, | |||
bint convert_empty=True, | |||
bint coerce_numeric=False, | |||
) -> ndarray: | |||
bint convert_to_nullable_integer=False, | |||
) -> "ArrayLike": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be done at a higher level (i.e. not in the cython code)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would probably require a substantial refactoring of the Python Parser code, which calls this directly.
FWIW, maybe_convert_objects also has a convert_to_nullable_integer param.
@lithomas1 Cool! Thanks for working on this. We now also have a nullable float dtype nowadays. I know that this is less useful at this point (since float already supports missing values, in contrast to integer/boolean), but I think that a EDIT: and actually also the nullable "string" dtype.
@gfyoung I don't think this keyword is very related to the other read_csv null-related keywords? (or which ones were you thinking about?) And the |
"A": pd_array([True, NA, False], dtype="boolean"), | ||
"B": pd_array([False, True, NA], dtype="boolean"), | ||
"C": np.array([np.nan, np.nan, np.nan], dtype="float64"), | ||
"D": np.array([True, False, True], dtype="bool"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably use "boolean" dtype if use_nullable_dtypes=True
regardless of whether NA's where actually present or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All NAs has unknown dtype, so we follow numpy and return float64. Currently, there has to be an integer or boolean value and NAs in order to return a nullable EA type, so a column of a float or bool will just return the numpy types.
This is just here as a test case to make sure I'm not doing the inferencing stuff wrong.
@jorisvandenbossche: They are related broadly speaking, but the consistency argument is fair as well. I was just saying it's something to consider in the future, especially if we continue to add more arguments to an already long signature for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you do a pre-cursor PR which does the lower level changes (e.g. to maybe_convert_*) and fully unit tests those. then we can layer to reader changes in this PR.
@jreback I think I will first try to get this feature-complete by adding in support for nullable float and string dtypes. Then I can spit out the inferencing changes. Marking this as draft in the meantime. |
pandas/_libs/parsers.pyx
Outdated
elif use_nullable_dtypes and arr.dtype == np.object_: | ||
# Maybe convert StringArray & catch error for non-strings | ||
try: | ||
arr = StringArray(arr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constructors for StringArray and ArrowStringArray are incompatible. To allow the global default for StringDtype backend storage (pyarrow or python) to be used here in the future (after #39908 is merged), we should probably call StringDtype.construct_array_type._from_sequence.
@lithomas1 status of this? |
@lithomas1 this might be easier these days? |
Still stuck by at least #45168. |
@lithomas1 status of this? |
Thanks for the pull request, but it appears to have gone stale. Feel free to reopen when you have the time to revisit. Closing to clear out the queue. |