[DISCUSSION] None conversion to pandas #5388

Closed
rgsl888prabhu opened this issue Jun 4, 2020 · 3 comments
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@rgsl888prabhu
Contributor

Is your feature request related to a problem? Please describe.
pandas and CuPy cannot represent None in integer-typed arrays, so cuDF currently fills nulls with -1 when converting. With the ongoing work to support unsigned dtypes this becomes a problem, because -1 is not representable in an unsigned integer.
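A minimal sketch of why the -1 fill does not carry over to unsigned dtypes, using plain NumPy on the host. The arrays here are made up for illustration and are not taken from cuDF.

```python
import numpy as np

# Signed case: -1 fits in the dtype, so it can stand in for a null
# (although a genuine -1 in the data then looks identical to a null).
signed = np.array([10, -1, 30], dtype=np.int8)

# Unsigned case: -1 has no representation. Casting it wraps around
# to the largest value of the dtype instead of raising.
wrapped = np.array([-1], dtype=np.int64).astype(np.uint8)

print(signed)      # the -1 survives as-is
print(wrapped[0])  # 255
```

The wrapped value happens to be the dtype's maximum, which is one of the candidate sentinels discussed in this issue.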

Describe the solution you'd like

  • pandas 1.0 has nullable integer types, so we could wait and depend on those.
    (Note: we currently rely on Numba to copy the device array to a host array, and I'm not sure how None would be handled there.)

  • Or we could use the maximum value of the dtype (or 0) for unsigned integers, similar to the -1 used for signed integers.
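The second option could look roughly like the sketch below. `null_sentinel` and `fill_nulls` are hypothetical helper names invented for this example, not cuDF functions, and the boolean mask (True meaning null) is an assumption about how nulls are passed in.

```python
import numpy as np


def null_sentinel(dtype):
    """Pick a fill value for nulls: -1 for signed, dtype max for unsigned."""
    dtype = np.dtype(dtype)
    if dtype.kind == "u":
        return np.iinfo(dtype).max
    if dtype.kind == "i":
        return -1
    raise TypeError(f"no integer sentinel for dtype {dtype}")


def fill_nulls(values, is_null):
    """Return a copy of `values` with null slots overwritten by the sentinel."""
    out = values.copy()
    out[is_null] = null_sentinel(values.dtype)
    return out


values = np.array([1, 2, 3], dtype=np.uint8)
is_null = np.array([False, True, False])
filled = fill_nulls(values, is_null)
print(filled)  # [  1 255   3]
```

Whichever sentinel is chosen, it collides with a legitimate value of the dtype, so the null information is lost after conversion. That trade-off is the same one the signed -1 fill already makes.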

Open for discussion and suggestions.

@rgsl888prabhu rgsl888prabhu added feature request New feature or request Needs Triage Need team to review and classify and removed Needs Triage Need team to review and classify labels Jun 4, 2020
@brandon-b-miller
Contributor

brandon-b-miller commented Jun 4, 2020

The plan for Pandas 1.0 nullable integer support, at least so far, is to do something like this:

https://github.com/rapidsai/cudf/blob/branch-0.15/python/cudf/cudf/core/column/column.py#L124-L135

There, the mask determines where the resulting Nones are placed. For pandas 1.0+, the plan is to do more or less the same with pd.NA: let Numba copy whatever values sit in the null slots, then mask them out afterwards.
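As a rough host-side illustration of the mask-then-pd.NA idea, pandas' public `pd.arrays.IntegerArray` constructor accepts a values array plus a boolean mask. This is only a sketch: the input arrays are made up, and the boolean mask with True meaning null follows the pandas convention rather than anything cuDF-specific.

```python
import numpy as np
import pandas as pd

# Host copy of the data. The value under the null slot (99) is arbitrary,
# standing in for "whatever Numba copied" from device memory.
values = np.array([1, 99, 3], dtype=np.uint8)
is_null = np.array([False, True, False])  # True marks a null

# The mask hides the arbitrary value; the nullable dtype follows the
# NumPy dtype of `values`, so uint8 becomes UInt8.
series = pd.Series(pd.arrays.IntegerArray(values, is_null))

print(series.dtype)  # UInt8
print(series[1])     # <NA>
```

No sentinel is needed in this form, so unsigned dtypes keep their full value range.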

In general, I think to_pandas should return a pandas object with the appropriate nullable dtype (pd.Int64Dtype, pd.BooleanDtype, etc.). However, doing this cleanly might involve using those nullable types as the dtype of our cuDF objects, which would be a fairly sweeping change; at that point we might as well invent our own cuDF dtype.

@shwina
Contributor

shwina commented Jun 12, 2020

@kkraus14 suggested routing our to_pandas and from_pandas through Arrow, so that to_pandas() is essentially to_arrow().to_pandas(). That leaves us to worry only about translating cuDF objects to Arrow objects. If/when Arrow starts converting to pandas nullable types, we get that "for free".

@kkraus14 kkraus14 added the Python Affects Python cuDF API. label Jun 15, 2020
@kkraus14
Collaborator

Defer to #5754 instead
