-
Notifications
You must be signed in to change notification settings - Fork 930
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DISCUSS] State of nullable type support in cuDF #5754
Comments
cc @BradReesWork @afender @JohnZed @dantegd @thomcom @cwharris @trxcllnt @benfred @EvenOldridge @quasiben @rjzamora @jakirkham as this can have significant downstream impact on libraries that rely on cuDF. Please loop in anyone else this can impact and let us know if there's any issues with the proposed changes. |
Cuspatial mainly uses non-nullable float 32/64 and int, and I think it will be unaffected by these changes. If there is an impact, fixing it should be straightforward in our case.
If I recall correctly, assert_eq relies on Pandas’s assert utilities, which is why we need to convert from non-nullable. Is this a good opportunity to implement our own testing comparators? That could serve as both a good reference and the defacto for how cuDF types map to Pandas types, and allow us to add special consideration for certain types. It’s possible this would eliminate “unless it’s this type, and then test this other way and fill null results with...” logic in our tests - I think this would lead to more consistency, less bugs, faster tests, and less time spent writing tests. |
I think “upgrading”/converting Pandas types to cudf types, then using custom comparators is a good approach - if we choose to write our own. |
This issue has been labeled |
As far as this issue is concerned, it seems to be stuck in sort of a catch-22. We have to use numpy types otherwise interop with other libraries breaks, but that means we're stuck representing our data with lowercase dtypes at the user level. If we do bite the bullet and force uppercase dtypes, it mangles a lot of the cuDF internals that rely on being able to use the NumPy dtype API to do things like resolve result dtypes on operations involving our own objects. We end up needing to write an ExtensionDtype for all of the nullable dtypes we support that pandas does not. This seems like a headache for both us developers and users. A native cuDF dtype system could make the internal logic look clean, but not solve the headache on the user side - in fact it may worsen the problem for users by forcing them to understand an entirely new type system, even if it's very numpy-like. To be honest I am not entirely sure what the best course of action is besides onboarding the entire python data science ecosystem onto a more general type system that works for all libraries. |
Closing this since most of the originally discussed issues have been addressed, contingent upon the decision that we will always use lowercase dtypes even though all of our dtypes are nullable. When we see a pandas 2.0 rc (where we expect to see more complete support for nullable data) we can start reevaluating our dtype handling to see what changes will be needed. |
This issue documents the current state of nullable type support in cuDF and the plan for the near future as far as nullable types are concerned, and is the result of a discussion with @brandon-b-miller and @kkraus14:
Nullable ("uppercase") types (i.e.,
Int64
,Float32
etc.,) will be the default type of columns in cuDF. We will drop non-nullable ("lowercase") types (int64
,float32
, etc.,). Thus:However, for backward compatibility:
cudf.Int64Dtype() == np.int64
dtype="int32"
) when constructing a cuDF object.The
.to_pandas()
methods in cuDF will always return a Pandas object with uppercase type. Users must callastype()
on that result if they want a lowercase type instead. If there are nulls in the output, they will have to do.to_pandas().to_numpy(na_value=...)
. This puts the onus on the user to choose an appropriate na_value.One complication is calling
.to_pandas()
onIndex
objects. Since Pandas doesn't (AFAICT) support indexes with uppercase dtype, it's an open question what we should do in this situation. We can either return a Pandas index ofobject
type, or a Pandas index of lowercase type.Float columns containing both
nan
andnull
: since this is something Pandas doesn't support yet, we will convert nulls to nans when.to_pandas()
is called on a float column with bothnan
andnull
.Testing: many of our tests compare a cuDF result with the Pandas result for the same operation:
Inside the
assert_eq
function, we will convertgot
to an object with lowercase types (using the approach described in (2)), before comparing withexpect
.Documentation: this is a substantial change to cuDF and introduces some differences in behaviour compared to Pandas. That being said, most cuDF users should be largely unaffected by most of the above changes as we have had nullable types from the beginning. Still, we should accompany this release with a blog post explaining these changes, and make sure to explain them in our docs.
The text was updated successfully, but these errors were encountered: