[DISCUSS] State of nullable type support in cuDF #5754

shwina · 2020-07-23T21:01:27Z

This issue documents the current state of nullable type support in cuDF and the plan for the near future as far as nullable types are concerned, and is the result of a discussion with @brandon-b-miller and @kkraus14:

Nullable ("uppercase") types (e.g., Int64, Float32, etc.) will be the default type of columns in cuDF. We will drop non-nullable ("lowercase") types (int64, float32, etc.). Thus:
```
>>> import cudf
>>> a = cudf.Series([1, 2, 3])
>>> a.dtype
Int64Dtype()
```
However, for backward compatibility:
- cuDF uppercase types will compare equal to their corresponding Pandas/numpy lowercase type. Thus cudf.Int64Dtype() == np.int64
- cuDF objects can still be constructed from objects of lowercase types, and one can still specify a lowercase type (e.g., dtype="int32") when constructing a cuDF object.
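To make the first compatibility rule concrete, here is a minimal sketch of a dtype whose equality check also accepts the matching lowercase NumPy type. `NullableInt64Dtype` is a hypothetical stand-in written for illustration, not cuDF's actual class, and it only relies on `np.dtype()` to normalise whatever it is compared against.

```python
import numpy as np


class NullableInt64Dtype:
    """Hypothetical uppercase dtype (illustration only, not a cuDF class)."""

    numpy_dtype = np.dtype("int64")

    def __eq__(self, other):
        if isinstance(other, NullableInt64Dtype):
            return True
        try:
            # np.dtype() accepts np.int64, "int64", np.dtype("int64"), ...
            return np.dtype(other) == self.numpy_dtype
        except TypeError:
            # Anything NumPy cannot interpret as a dtype is simply unequal.
            return False


upper = NullableInt64Dtype()
print(upper == np.int64)    # compares equal to the lowercase NumPy type
print(upper == "int64")     # and to its string spelling
print(upper == np.float64)  # but not to an unrelated dtype
```

A real implementation would also have to decide how hashing should behave, since objects that compare equal here do not hash alike.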
The .to_pandas() methods in cuDF will always return a Pandas object with uppercase type. Users must call astype() on that result if they want a lowercase type instead. If there are nulls in the output, they will have to do .to_pandas().to_numpy(na_value=...). This puts the onus on the user to choose an appropriate na_value.
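As a rough illustration of what this means for users, the sketch below uses plain pandas objects as stand-ins for the result of `.to_pandas()` (no cuDF calls are made). The fill value `-1` is an arbitrary choice for the example; picking it is exactly the decision left to the user.

```python
import numpy as np
import pandas as pd

# Stand-ins for what .to_pandas() would return: Series with an uppercase dtype.
no_nulls = pd.Series([1, 2, 3], dtype="Int64")
with_nulls = pd.Series([1, None, 3], dtype="Int64")

# No nulls: astype() is enough to get back a lowercase dtype.
lowercase = no_nulls.astype("int64")

# With nulls: the user picks the na_value (and a dtype that can hold it).
filled = with_nulls.to_numpy(dtype="int64", na_value=-1)
as_float = with_nulls.to_numpy(dtype="float64", na_value=np.nan)

print(lowercase.dtype, filled, as_float)
```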
One complication is calling .to_pandas() on Index objects. Since Pandas doesn't (AFAICT) support indexes with uppercase dtype, it's an open question what we should do in this situation. We can either return a Pandas index of object type, or a Pandas index of lowercase type.
Float columns containing both nan and null: since this is something Pandas doesn't support yet, we will convert nulls to nans when .to_pandas() is called on a float column with both nan and null.
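A small sketch of why the distinction is lost: it models a float column as a values array plus a validity mask (an assumption made for illustration, using NumPy only) and folds the nulls into NaN, which is what the conversion described above amounts to.

```python
import numpy as np

# Illustrative model of a float column: raw values plus a validity mask.
values = np.array([1.0, np.nan, 3.0, 4.0])    # position 1 is a genuine NaN
valid = np.array([True, True, False, True])   # position 2 is null

# Converting nulls to NaN: masked-out entries become NaN as well.
converted = np.where(valid, values, np.nan)

# After conversion, the NaN at position 1 and the null at position 2
# are indistinguishable.
print(np.isnan(converted))
```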
Testing: many of our tests compare a cuDF result with the Pandas result for the same operation:
```
expect = pd.func(...)
got = cudf.func(...)
assert_eq(expect, got)
```
Inside the assert_eq function, we will convert got to an object with lowercase types (using the .to_pandas() approach described above: astype(), or to_numpy(na_value=...) when nulls are present) before comparing with expect.
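One possible shape for that conversion step, sketched with pandas only. `to_lowercase` is a hypothetical helper (not part of cuDF's test utilities), it handles just `Int64` and `Float64` for brevity, and it assumes nulls should become NaN so that the result matches what pandas itself produces for integers with missing values.

```python
import numpy as np
import pandas as pd


def to_lowercase(series: pd.Series) -> pd.Series:
    """Hypothetical helper: nullable (uppercase) Series -> NumPy-backed Series."""
    dtype = series.dtype
    if not isinstance(dtype, (pd.Int64Dtype, pd.Float64Dtype)):
        return series  # already lowercase, or a type this sketch does not handle
    if series.isna().any():
        # Nulls cannot live in an int64 array, so fall back to float64 + NaN.
        data = series.to_numpy(dtype="float64", na_value=np.nan)
        return pd.Series(data, index=series.index, name=series.name)
    return series.astype(dtype.numpy_dtype)


expect = pd.Series([1, None, 3])               # pandas result: float64 with NaN
got = pd.Series([1, None, 3], dtype="Int64")   # stand-in for a converted cuDF result
pd.testing.assert_series_equal(expect, to_lowercase(got))
```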
Documentation: this is a substantial change to cuDF and introduces some differences in behaviour compared to Pandas. That being said, most cuDF users should be largely unaffected by most of the above changes as we have had nullable types from the beginning. Still, we should accompany this release with a blog post explaining these changes, and make sure to explain them in our docs.


kkraus14 · 2020-07-23T21:31:51Z

cc @BradReesWork @afender @JohnZed @dantegd @thomcom @cwharris @trxcllnt @benfred @EvenOldridge @quasiben @rjzamora @jakirkham as this can have significant downstream impact on libraries that rely on cuDF.

Please loop in anyone else this can impact and let us know if there are any issues with the proposed changes.

cwharris · 2020-07-28T17:48:02Z

cuSpatial mainly uses non-nullable float32/float64 and int types, and I think it will be unaffected by these changes. If there is an impact, fixing it should be straightforward in our case.

> Inside the assert_eq function, we will convert got to an object with lowercase types (using the approach described in (2)), before comparing with expect.

If I recall correctly, assert_eq relies on Pandas’s assert utilities, which is why we need to convert to non-nullable types. Is this a good opportunity to implement our own testing comparators? That could serve as both a good reference and the de facto standard for how cuDF types map to Pandas types, and allow us to add special consideration for certain types. It’s possible this would eliminate “unless it’s this type, and then test this other way and fill null results with...” logic in our tests - I think this would lead to more consistency, fewer bugs, faster tests, and less time spent writing tests.

cwharris · 2020-07-28T17:50:29Z

I think “upgrading”/converting Pandas types to cudf types, then using custom comparators is a good approach - if we choose to write our own.
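A rough sketch of what I mean, using a pandas nullable Series as a stand-in for the cuDF side. `assert_nullable_equal` is a hypothetical name, and this only covers the simple case where the expected values can be cast to the nullable dtype of the result.

```python
import numpy as np
import pandas as pd


def assert_nullable_equal(expect: pd.Series, got: pd.Series) -> None:
    """Hypothetical comparator: upgrade `expect` to got's nullable dtype, then compare."""
    upgraded = expect.astype(got.dtype)      # e.g. float64 with NaN -> Int64 with <NA>
    expect_null = upgraded.isna().to_numpy()
    got_null = got.isna().to_numpy()
    assert (expect_null == got_null).all(), "null positions differ"
    valid = ~got_null
    # Only the non-null entries are compared by value.
    assert upgraded[valid].tolist() == got[valid].tolist(), "values differ"


expect = pd.Series([1.0, np.nan, 3.0])         # typical pandas result
got = pd.Series([1, None, 3], dtype="Int64")   # stand-in for the nullable result
assert_nullable_equal(expect, got)
```

With this shape, the null handling lives in one place instead of being repeated as special cases across tests.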

shwina · 2020-07-28T18:12:33Z

@cwharris Opened #5788 for discussion about cuDF testing strategy, as I think it's a broader topic than this. Hope that's OK :)

github-actions · 2021-02-19T16:25:42Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

brandon-b-miller · 2021-02-19T17:27:42Z

As far as this issue is concerned, it seems to be stuck in sort of a catch-22. We have to use NumPy types, otherwise interop with other libraries breaks, but that means we're stuck representing our data with lowercase dtypes at the user level.

If we do bite the bullet and force uppercase dtypes, it mangles a lot of the cuDF internals that rely on being able to use the NumPy dtype API to do things like resolve result dtypes on operations involving our own objects. We end up needing to write an ExtensionDtype for all of the nullable dtypes we support that pandas does not. This seems like a headache for both us developers and users. A native cuDF dtype system could make the internal logic look clean, but not solve the headache on the user side - in fact it may worsen the problem for users by forcing them to understand an entirely new type system, even if it's very numpy-like.
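To illustrate the kind of internal logic I mean, here is a small sketch. `resolve` is a hypothetical helper, not actual cuDF code: it leans on `np.dtype` and `np.result_type` to pick a result dtype, and gives up when NumPy cannot interpret one of the inputs, which is what I would expect for pandas' uppercase extension dtypes.

```python
import numpy as np
import pandas as pd


def resolve(*dtypes):
    """Hypothetical helper: result dtype via NumPy promotion, or None on failure."""
    try:
        return np.result_type(*(np.dtype(d) for d in dtypes))
    except TypeError:
        # NumPy could not interpret one of the inputs as a dtype.
        return None


lower = resolve(np.int32, np.float32)              # NumPy promotion rules apply
upper = resolve(pd.Int32Dtype(), pd.Int64Dtype())  # extension dtypes, not NumPy dtypes
print(lower, upper)
```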

To be honest, I am not entirely sure what the best course of action is besides onboarding the entire Python data science ecosystem onto a more general type system that works for all libraries.

vyasr · 2022-07-12T21:04:38Z

Closing this since most of the originally discussed issues have been addressed, contingent upon the decision that we will always use lowercase dtypes even though all of our dtypes are nullable. When we see a pandas 2.0 rc (where we expect to see more complete support for nullable data) we can start reevaluating our dtype handling to see what changes will be needed.

shwina added the bug and Needs Triage labels Jul 23, 2020

shwina added the Python label and removed the bug and Needs Triage labels Jul 23, 2020

kkraus14 added the proposal label Jul 24, 2020

shwina mentioned this issue Jul 28, 2020

[DISCUSS] cuDF testing strategy #5788

Closed

shwina mentioned this issue Aug 12, 2020

Converting DataFrame with nullable types to DataFrame with non-nullable types? pandas-dev/pandas#35694

Open

kkraus14 mentioned this issue Sep 23, 2020

[DISCUSSION] None conversion to pandas #5388

Closed

wphicks mentioned this issue Oct 13, 2020

[Fea] Data imputation limited by null conversion rapidsai/cuml#2966

Closed

github-actions bot added the inactive-90d label Feb 19, 2021

brandon-b-miller removed the inactive-90d label Feb 19, 2021

vyasr closed this as completed Jul 12, 2022
