[REVIEW] Support Pandas 1.0+ #4546
Created #5581 to track residual items here. I tested cuML, cuGraph, cuSpatial, and cuSignal and submitted PRs to those repos where I found test failures when used together with this PR and pandas 1.0. I was not able to put together an environment in which I could both build cuDF from source and blazing at the same time.
Actually, given dependency change, need to coordinate with ops
rerun tests
Hmm - still seeing
You'll probably see that the containers have the old version to start, but then get updated when they pull the new version. I see it with the containers having the older version. That said the
For the GPU tests we do an explicit update so they show the old version, but install the correct one:
These are the 24 tests of
@mike-wendt thanks for clarifying the installation behavior, that makes sense.
Closes #3957
Making this for now to track what has been done / needs to be done to support Pandas. I see this as having several parts:
- `pd.NA` and determining where and to what degree `null` in cuDF should behave the same
- `pandas<1.0` behavior

The plan is to update this PR using test failures as a way to start to compile what changes need to be made and work on addressing them.
Breaking Changes
- `Series` raise different errors
- `Series.astype` doesn't accept keyword arguments anymore. This makes the places where we do something like `sr.astype('category', ordered=ordered)` or `sr.astype('datetime', format='%y%m%d')` broken. We need to either update our API or update how/if we test against pandas. See "CLN: remove unused categories/ordered handling in astype" (pandas-dev/pandas#28646).
- `Series` objects no longer causes a cast of the whole series.
- `del` them
- `pd.api.types.is_bool_type` breaks
- `MultiIndex` `__getitem__` for some types of tuples.
- `Series.replace` on a categorical column, Pandas now adjusts the resulting dtype to reflect the remaining categories instead of the new categories.
- `Rolling.count` yields a different answer in pandas 1.0, possibly due to "BUG: Series rolling count ignores min_periods" (pandas-dev/pandas#30923).
- `Series.str.cat` behaves differently for various values of the input parameter `others`. We now get a series of nans if `others` is an index and actually a `TypeError` for certain inputs that worked before.

Places we no longer need to circumvent pandas
- `DataFrame` with a homogenous categorical type now yields a categorical dataframe instead of `object`
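
For reference, the `Series.astype` item under Breaking Changes can be illustrated on the pandas side alone. This is only a sketch, assuming pandas 1.0 or later; the sample values and variable names are made up, and it does not claim to show how cuDF will resolve the API question:

```python
import pandas as pd

sr = pd.Series(["b", "a", "b"])  # illustrative data, not from the cuDF tests

# pandas 1.0+ no longer accepts extra keyword arguments in astype,
# so the old spelling is expected to raise a TypeError.
try:
    sr.astype("category", ordered=True)
    kwargs_accepted = True
except TypeError:
    kwargs_accepted = False

# One pandas-side alternative: pass the options through a dtype object.
cat = sr.astype(pd.CategoricalDtype(ordered=True))

# Format-driven datetime parsing goes through pd.to_datetime instead of astype.
dates = pd.to_datetime(pd.Series(["200131", "200201"]), format="%y%m%d")

print(kwargs_accepted, cat.cat.ordered, dates.iloc[0])
```

Tests that compare against pandas could use spellings like these on the pandas side regardless of which keyword arguments cuDF keeps.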
Tasks around `pd.NA`
- `to_pandas` and `from_pandas` properly converts to and from the new nullable datatypes in pandas, for various cuDF objects such as dataframes, series, and indexes
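
To make the target of the `pd.NA` tasks concrete, here is a small pandas-only sketch of how a nullable dtype differs from the pre-1.0 NaN-based representation. It assumes pandas 1.0 or later and uses made-up values; the cuDF side (`to_pandas` / `from_pandas`) is left out because it needs a GPU environment:

```python
import pandas as pd

values = [1, None, 3]  # illustrative data with one missing entry

# Default construction: the missing value forces a cast to float64 with NaN.
legacy = pd.Series(values)

# Nullable integer dtype: the data stays integer and the gap is pd.NA.
nullable = pd.Series(values, dtype="Int64")

# Comparisons propagate the missing value instead of returning False.
compared = nullable == 1

print(legacy.dtype, nullable.dtype, compared.dtype)
print(nullable.iloc[1] is pd.NA)
```

A round trip through cuDF would presumably need to preserve the `Int64` / `boolean` dtypes and the position of the missing entries rather than falling back to `float64`.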