Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DISCUSS] State of nullable type support in cuDF #5754

Closed
shwina opened this issue Jul 23, 2020 · 7 comments
Closed

[DISCUSS] State of nullable type support in cuDF #5754

shwina opened this issue Jul 23, 2020 · 7 comments
Labels
proposal Change current process or code Python Affects Python cuDF API.

Comments

@shwina
Copy link
Contributor

shwina commented Jul 23, 2020

This issue documents the current state of nullable type support in cuDF and the plan for the near future as far as nullable types are concerned, and is the result of a discussion with @brandon-b-miller and @kkraus14:

  1. Nullable ("uppercase") types (i.e., Int64, Float32 etc.,) will be the default type of columns in cuDF. We will drop non-nullable ("lowercase") types (int64, float32, etc.,). Thus:

    >>> a = cudf.Series([1, 2, 3])
    >>> a.dtype
    Int64Dtype()

    However, for backward compatibility:

    • cuDF uppercase types will compare equal to their corresponding Pandas/numpy lowercase type. Thus cudf.Int64Dtype() == np.int64
    • cuDF objects can still be constructed from objects of lowercase types, and one can still specify a lowercase type (e.g., dtype="int32") when constructing a cuDF object.
  2. The .to_pandas() methods in cuDF will always return a Pandas object with uppercase type. Users must call astype() on that result if they want a lowercase type instead. If there are nulls in the output, they will have to do .to_pandas().to_numpy(na_value=...). This puts the onus on the user to choose an appropriate na_value.

  3. One complication is calling .to_pandas() on Index objects. Since Pandas doesn't (AFAICT) support indexes with uppercase dtype, it's an open question what we should do in this situation. We can either return a Pandas index of object type, or a Pandas index of lowercase type.

  4. Float columns containing both nan and null: since this is something Pandas doesn't support yet, we will convert nulls to nans when .to_pandas() is called on a float column with both nan and null.

  5. Testing: many of our tests compare a cuDF result with the Pandas result for the same operation:

    expect = pd.func(...)
    got = cudf.func(...)
    assert_eq(expect, got)

    Inside the assert_eq function, we will convert got to an object with lowercase types (using the approach described in (2)), before comparing with expect.

  6. Documentation: this is a substantial change to cuDF and introduces some differences in behaviour compared to Pandas. That being said, most cuDF users should be largely unaffected by most of the above changes as we have had nullable types from the beginning. Still, we should accompany this release with a blog post explaining these changes, and make sure to explain them in our docs.

@shwina shwina added bug Something isn't working Needs Triage Need team to review and classify labels Jul 23, 2020
@shwina shwina added Python Affects Python cuDF API. and removed bug Something isn't working Needs Triage Need team to review and classify labels Jul 23, 2020
@kkraus14
Copy link
Collaborator

cc @BradReesWork @afender @JohnZed @dantegd @thomcom @cwharris @trxcllnt @benfred @EvenOldridge @quasiben @rjzamora @jakirkham as this can have significant downstream impact on libraries that rely on cuDF.

Please loop in anyone else this can impact and let us know if there's any issues with the proposed changes.

@kkraus14 kkraus14 added the proposal Change current process or code label Jul 24, 2020
@cwharris
Copy link
Contributor

Cuspatial mainly uses non-nullable float 32/64 and int, and I think it will be unaffected by these changes. If there is an impact, fixing it should be straightforward in our case.

Inside the assert_eq function, we will convert got to an object with lowercase types (using the approach described in (2)), before comparing with expect.

If I recall correctly, assert_eq relies on Pandas’s assert utilities, which is why we need to convert from non-nullable. Is this a good opportunity to implement our own testing comparators? That could serve as both a good reference and the defacto for how cuDF types map to Pandas types, and allow us to add special consideration for certain types. It’s possible this would eliminate “unless it’s this type, and then test this other way and fill null results with...” logic in our tests - I think this would lead to more consistency, less bugs, faster tests, and less time spent writing tests.

@cwharris
Copy link
Contributor

I think “upgrading”/converting Pandas types to cudf types, then using custom comparators is a good approach - if we choose to write our own.

@shwina
Copy link
Contributor Author

shwina commented Jul 28, 2020

@cwharris Opened #5788 for discussion about cuDF testing strategy, as I think it's a broader topic than this. Hope that's OK :)

@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@brandon-b-miller
Copy link
Contributor

As far as this issue is concerned, it seems to be stuck in sort of a catch-22. We have to use numpy types otherwise interop with other libraries breaks, but that means we're stuck representing our data with lowercase dtypes at the user level.

If we do bite the bullet and force uppercase dtypes, it mangles a lot of the cuDF internals that rely on being able to use the NumPy dtype API to do things like resolve result dtypes on operations involving our own objects. We end up needing to write an ExtensionDtype for all of the nullable dtypes we support that pandas does not. This seems like a headache for both us developers and users. A native cuDF dtype system could make the internal logic look clean, but not solve the headache on the user side - in fact it may worsen the problem for users by forcing them to understand an entirely new type system, even if it's very numpy-like.

To be honest I am not entirely sure what the best course of action is besides onboarding the entire python data science ecosystem onto a more general type system that works for all libraries.

@vyasr
Copy link
Contributor

vyasr commented Jul 12, 2022

Closing this since most of the originally discussed issues have been addressed, contingent upon the decision that we will always use lowercase dtypes even though all of our dtypes are nullable. When we see a pandas 2.0 rc (where we expect to see more complete support for nullable data) we can start reevaluating our dtype handling to see what changes will be needed.

@vyasr vyasr closed this as completed Jul 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal Change current process or code Python Affects Python cuDF API.
Projects
None yet
Development

No branches or pull requests

5 participants