ENH: Add nullable dtypes to read_csv #40687

lithomas1 · 2021-03-29T23:52:55Z

closes ENH: add option to get nullable dtypes to pd.read_csv #36712
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

pandas/tests/dtypes/test_inference.py

gfyoung · 2021-03-30T02:18:02Z

As a whole looks pretty good, though with the growing number of null-related parameters, I think we will at some point need to condense into one "super parameter" (e.g. dict of some sorts).

jbrockmendel · 2021-03-30T03:33:26Z

pandas/_libs/lib.pyx

@@ -2005,7 +2005,8 @@ def maybe_convert_numeric(
    set na_values,
    bint convert_empty=True,
    bint coerce_numeric=False,
-) -> ndarray:
+    bint convert_to_nullable_integer=False,
+) -> "ArrayLike":


Can this be done at a higher level (i.e. not in the Cython code)?

That would probably require a substantial refactoring of the Python Parser code, which calls this directly.

FWIW, maybe_convert_objects also has a convert_to_nullable_integer param.

jorisvandenbossche · 2021-03-31T20:14:06Z

@lithomas1 Cool! Thanks for working on this.

We also have a nullable float dtype nowadays. I know that this is less useful at this point (since float already supports missing values, in contrast to integer/boolean), but I think that a use_nullable_dtypes keyword should probably also result in nullable float.

EDIT: and actually also the nullable "string" dtype.

> As a whole looks pretty good, though with the growing number of null-related parameters, I think we will at some point need to condense into one "super parameter" (e.g. dict of some sorts).

@gfyoung I don't think this keyword is very related to the other read_csv null-related keywords? (or which ones were you thinking about?) And the use_nullable_dtypes keyword is also consistent with other IO methods (#29752, #31242)
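As a hedged illustration of the nullable float and string dtypes mentioned in this comment (not code from this PR): the sketch below builds each one with the public `pd.array` constructor and contrasts the missing-value marker with a plain numpy float array. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Plain numpy float64: the missing value is represented as NaN.
numpy_floats = np.array([1.5, np.nan], dtype="float64")

# Nullable float extension dtype ("Float64", capital F): missing is pd.NA.
nullable_floats = pd.array([1.5, None], dtype="Float64")

# Nullable string extension dtype: missing is also pd.NA.
nullable_strings = pd.array(["a", None], dtype="string")

print(numpy_floats.dtype)             # float64
print(nullable_floats.dtype)          # Float64
print(nullable_floats[1] is pd.NA)    # True
print(nullable_strings[1] is pd.NA)   # True
```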

jorisvandenbossche · 2021-03-31T20:24:29Z

pandas/tests/io/parser/test_na_values.py

+                    "A": pd_array([True, NA, False], dtype="boolean"),
+                    "B": pd_array([False, True, NA], dtype="boolean"),
+                    "C": np.array([np.nan, np.nan, np.nan], dtype="float64"),
+                    "D": np.array([True, False, True], dtype="bool"),


We should probably use "boolean" dtype if use_nullable_dtypes=True regardless of whether NAs were actually present or not?

An all-NA column has an unknown dtype, so we follow numpy and return float64. Currently, a column has to contain both integer or boolean values and NAs in order to return a nullable EA type, so a float column, or a bool column without NAs, will just return the numpy types.

This is just here as a test case to make sure I'm not getting the inference wrong.
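For comparison, here is a rough sketch of what `read_csv` infers for similar columns without the keyword proposed in this PR, and how nullable dtypes can be requested per column through the existing `dtype=` argument. The CSV text and the extra column `E` are made up for illustration; this is not the PR's test case.

```python
import io

import pandas as pd

# A: bool with a missing value, C: all missing, D: bool without missing,
# E: integers with a missing value (hypothetical sample data).
csv_text = "A,C,D,E\nTrue,,True,1\n,,False,\nFalse,,True,3\n"

# Default inference: numpy dtypes only.
default = pd.read_csv(io.StringIO(csv_text))
print(default.dtypes)
# A is upcast to object, C and E become float64, D stays bool.

# Requesting nullable extension dtypes explicitly, column by column.
explicit = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"A": "boolean", "D": "boolean", "E": "Int64"},
)
print(explicit.dtypes)
# A and D are boolean, E is Int64, and the all-NA column C is still float64.
```

Note that with an explicit `dtype=`, column `D` becomes "boolean" even though it has no missing values, which is the behaviour asked about in the review comment above.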

gfyoung · 2021-03-31T20:46:57Z

> @gfyoung I don't think this keyword is very related to the other read_csv null-related keywords? (or which ones were you thinking about?) And the use_nullable_dtypes keyword is also consistent with other IO methods (#29752, #31242)

@jorisvandenbossche: They are related broadly speaking, but the consistency argument is fair as well.

I was just saying it's something to consider in the future, especially if we continue to add more arguments to an already long signature for read_csv (and the other I/O methods).

jreback

Can you do a precursor PR which makes the lower-level changes (e.g. to maybe_convert_*) and fully unit tests those? Then we can layer the reader changes on top in this PR.

lithomas1 · 2021-04-02T20:37:36Z

@jreback I think I will first try to get this feature-complete by adding in support for nullable float and string dtypes. Then I can split out the inference changes. Marking this as draft in the meantime.

simonjayhawkins · 2021-05-22T11:18:00Z

pandas/_libs/parsers.pyx

+    elif use_nullable_dtypes and arr.dtype == np.object_:
+        # Maybe convert StringArray & catch error for non-strings
+        try:
+            arr = StringArray(arr)


The constructors for StringArray and ArrowStringArray are incompatible. To allow the global default for StringDtype backend storage (pyarrow or python) to be used here in the future (after #39908 is merged), we should probably call StringDtype().construct_array_type()._from_sequence(...) instead of the StringArray constructor.
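A minimal sketch of this suggestion, assuming `construct_array_type` is called on a `StringDtype` instance and that the private `_from_sequence` constructor is given an object ndarray containing NaN. The input values are made up, and this is not the code from the PR.

```python
import numpy as np
import pandas as pd

# Object-dtype input, similar in shape to what the parser would hand over.
values = np.array(["a", np.nan, "b"], dtype=object)

dtype = pd.StringDtype()
# Resolve the array class from the dtype instead of naming StringArray
# directly, so that a different storage backend could be picked up.
array_type = dtype.construct_array_type()
result = array_type._from_sequence(values, dtype=dtype)

print(type(result).__name__)
print(result.isna().tolist())  # [False, True, False]
```

Outside pandas internals, `pd.array(values, dtype="string")` is the public route to a string array.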

jreback · 2021-10-04T00:17:16Z

@lithomas1 status of this?

lithomas1 · 2021-10-16T22:52:54Z

Depends on #41412, but we need to discuss the correct behavior in #42973.

jreback · 2022-01-16T17:57:25Z

@lithomas1 this might be easier these days?

lithomas1 · 2022-01-16T23:36:24Z

Still blocked by at least #45168.

jreback · 2022-03-06T23:19:05Z

@lithomas1 status of this?

mroeschke · 2022-06-10T21:16:49Z

Thanks for the pull request, but it appears to have gone stale. Feel free to reopen when you have the time to revisit. Closing to clear out the queue.

lithomas1 and others added 4 commits March 29, 2021 16:51

Add nullable dtypes to read_csv

148fc9d

Merge branch 'master' into use-nullable-csv

053ecd7

Updates

a70f3a4

Merge branch 'use-nullable-csv' of github-other.com:lithomas1/pandas into use-nullable-csv

1726c29

lithomas1 added the labels Enhancement, IO CSV (read_csv, to_csv), and NA - MaskedArrays (Related to pd.NA and nullable extension arrays) Mar 30, 2021

lithomas1 requested review from jreback, jbrockmendel, WillAyd, gfyoung and jorisvandenbossche March 30, 2021 01:46

gfyoung reviewed Mar 30, 2021

jbrockmendel reviewed Mar 30, 2021

lithomas1 added 2 commits March 30, 2021 08:50

More thorough testing

2504be6

Merge branch 'master' of https://github.com/pandas-dev/pandas into use-nullable-csv

89f4032

lithomas1 closed this Mar 31, 2021

lithomas1 reopened this Mar 31, 2021

Optimizations & Found a bug!

63733dc

lithomas1 closed this Mar 31, 2021

lithomas1 reopened this Mar 31, 2021

jorisvandenbossche reviewed Mar 31, 2021

lithomas1 and others added 2 commits April 1, 2021 12:59

Merge branch 'master' of https://github.com/pandas-dev/pandas into use-nullable-csv

ea63fb2

Merge branch 'master' into use-nullable-csv

738c340

jreback requested changes Apr 2, 2021

lithomas1 marked this pull request as draft April 2, 2021 20:37

lithomas1 and others added 5 commits May 11, 2021 07:01

Remove failing test

418e1d2

Changes from code review

25a6c4d

Merge branch 'master' of https://github.com/pandas-dev/pandas into stringarray-nan

47d68f7

typo

8257dbd

Update lib.pyi

922436a

simonjayhawkins reviewed May 22, 2021

lithomas1 and others added 10 commits May 29, 2021 11:03

Update lib.pyx

2f28086

Update lib.pyx

3ee2198

Merge branch 'master' of https://github.com/pandas-dev/pandas into stringarray-nan

9426a52

Updates

3ee55f2

Update lib.pyx

fe4981a

Update lib.pyx

a66948a

Update lib.pyx

e852719

disallow invalid nans in stringarray constructor

91b73bb

Merge branch 'master' into stringarray-nan

42ec3e4

add to _from_sequence and fixes

41f49d2

lithomas1 marked this pull request as ready for review June 4, 2021 20:17

lithomas1 and others added 4 commits June 4, 2021 13:19

Merge branch 'master' into use-nullable-csv

156d29f

Update to make work

033580f

Merge branch 'stringarray-nan' into use-nullable-csv

f437d77

Merge branch 'master' into use-nullable-csv

e4ed02e

lithomas1 mentioned this pull request Jul 19, 2021

CI: Fastparquet upgrade broke CI #42588

Closed

lithomas1 mentioned this pull request Aug 10, 2021

DIS: Should the keyword use_nullable_dtypes use nullable dtypes in the absence of nulls #42973

Closed

mroeschke closed this Jun 10, 2022
