[REVIEW] Support Pandas 1.0+ #4546

Merged
merged 128 commits into from
Jun 26, 2020

Conversation

brandon-b-miller
Contributor

@brandon-b-miller brandon-b-miller commented Mar 17, 2020

Closes #3957

Opening this now to track what has been done and what still needs to be done to support pandas 1.0+. I see this as having several parts:

  • Fixing things that are outright broken due to various API changes or updates to pandas internals
  • Implementing compatibility with pd.NA and determining where and to what degree null in cuDF should behave the same
  • Removing instances in our code where extra work is required to circumvent some aspect of pandas<1.0 behavior

The plan is to update this PR using test failures as a way to start to compile what changes need to be made and work on addressing them.
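One way to organize the third part (dropping pandas<1.0 workarounds) is to gate any remaining legacy code paths on the installed pandas version. The sketch below is an assumption about how such a gate could look, not code from this PR; `pandas_at_least` is a hypothetical helper name.

```python
def pandas_at_least(version, minimum=(1, 0)):
    """Return True if a pandas version string is at least `minimum`.

    Hypothetical helper for illustration only. Only the major and minor
    components are compared, so "1.0.5" and "1.0.0" are treated alike.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= minimum


# The two versions that appear in the CI logs in this thread.
new_enough = pandas_at_least("1.0.5")
too_old = pandas_at_least("0.25.3")
```

In real code the argument would typically be `pd.__version__`, with the workaround kept only on the branch where the check is false.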

Breaking Changes

  • FrozenNDArray has been removed, causing cudf MultiIndex.from_pandas to fail
  • Binops between Series raise different errors
  • Series.astype doesn't accept keyword arguments anymore. This breaks the places where we do something like sr.astype('category', ordered=ordered) or sr.astype('datetime', format='%y%m%d'). We need to either update our API or update how/if we test against pandas. See CLN: remove unused categories/ordered handling in astype (pandas-dev/pandas#28646)
  • In some cases, assigning items into Series objects no longer causes a cast of the whole series.
  • Names are now a property of the base Index class in Pandas instead of an attribute, so we can't del them
  • The metadata concerning the DataFrame index that Pandas writes to parquet files has changed in some cases; this may be related to the new null behavior in Pandas.
  • A change in the function pd.api.types.is_bool_type breaks MultiIndex.__getitem__ for some types of tuples.
  • A bug in pandas causes test_json_writer to fail when reading a column of bools. See read_json with typ="series" of json list of bools results in timestamps/Exception (pandas-dev/pandas#31464)
  • When using Series.replace on a categorical column, Pandas now adjusts the resulting dtype to reflect the remaining categories instead of the new categories.
  • Pandas Rolling.count yields a different answer in pandas 1.0, possibly due to BUG: Series rolling count ignores min_periods (pandas-dev/pandas#30923)
  • Pandas Series.str.cat behaves differently for various values of the input parameter others. We now get a series of NaNs if others is an index, and a TypeError for certain inputs that worked before.
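For the Series.astype item above, here is a minimal sketch of spellings that work on the pandas side without keyword arguments. It assumes a dtype object for categoricals and `pd.to_datetime` for formatted dates. This is plain pandas rather than cuDF, and it is not necessarily the approach this PR settles on.

```python
import pandas as pd

sr = pd.Series(["a", "b", "a"])

# Instead of sr.astype("category", ordered=True):
# carry `ordered` on the dtype object itself.
ordered_dtype = pd.CategoricalDtype(categories=["a", "b"], ordered=True)
cat = sr.astype(ordered_dtype)

# Instead of sr.astype("datetime", format="%y%m%d"):
# parse the strings with an explicit format.
dates = pd.to_datetime(pd.Series(["200317", "200626"]), format="%y%m%d")
```

Tests that compare cuDF against pandas could build the expected pandas result this way while the cuDF API decision is still open.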

Places we no longer need to circumvent pandas

  • Transposing a DataFrame with a homogeneous categorical type now yields a categorical dataframe instead of object

Tasks around pd.NA

  • Ensure to_pandas and from_pandas properly convert to and from the new nullable datatypes in pandas, for various cuDF objects such as dataframes, series, and indexes
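For context on what the nullable datatypes look like on the pandas side, here is a small sketch assuming pandas 1.0 or newer. It only shows pandas behavior; how cuDF's to_pandas and from_pandas should map onto these dtypes is the open task above and is not shown.

```python
import pandas as pd

# Nullable integer column: the missing entry is pd.NA and the dtype stays integer
# instead of being upcast to float64 with NaN.
s = pd.Series([1, None, 3], dtype="Int64")
missing = s.isna().tolist()
total = s.sum()  # missing values are skipped by default

# Nullable boolean dtype, added in pandas 1.0.
b = pd.Series([True, None, False], dtype="boolean")
```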

@brandon-b-miller brandon-b-miller added 2 - In Progress Currently a work in progress pandas Python Affects Python cuDF API. labels Mar 17, 2020
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner March 17, 2020 17:56
@brandon-b-miller
Contributor Author

Created #5581 to track residual items here. I tested cuML, cuGraph, cuSpatial, and cuSignal and submitted PRs to those repos where I found test failures when used together with this PR and pandas 1.0. I was not able to put together an environment in which I could both build cuDF from source and blazing at the same time.

@kkraus14 kkraus14 added 5 - Merge After Dependencies 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 5 - Merge After Dependencies 5 - Ready to Merge Testing and reviews complete, ready to merge labels Jun 25, 2020
Collaborator

@kkraus14 kkraus14 left a comment

Actually, given dependency change, need to coordinate with ops

@kkraus14
Collaborator

rerun tests

@brandon-b-miller
Contributor Author

rerun tests

1 similar comment
@mike-wendt
Contributor

rerun tests

@rapidsai rapidsai deleted a comment from kkraus14 Jun 26, 2020
@brandon-b-miller
Contributor Author

Hmm - still seeing 09:37:02 pandas 0.25.3 py36hb3f55d8_0 conda-forge here for some reason.

@mike-wendt
Contributor

mike-wendt commented Jun 26, 2020

Hmm - still seeing 09:37:02 pandas 0.25.3 py36hb3f55d8_0 conda-forge here for some reason.

Where? Can you link a job?

You'll probably see that the containers have the old version to start, but then get updated when they pull the new version.

I do see the containers starting with the older version. That said, the meta.yaml files are updated, so when the conda builds reach that step they pull the correct pandas version.

BUILD START: ['cudf-0.15.0a-py36_g78e2ed556_1954.tar.bz2']
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /opt/conda/envs/rapids/conda-bld/cudf_1593185621143/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh


The following NEW packages will be INSTALLED:

...
    pandas:            1.0.5-py36h830a2c2_0                conda-forge

For the GPU tests we do an explicit update, so the logs show the old version first but then install the correct one:

>>>> Activate conda env...

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /opt/conda/envs/gdf

  added / updated specs:
    - cudatoolkit=10.0
    - rapids-build-env=0.15
    - rapids-notebook-env=0.15
    - rmm=0.15


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    librmm-0.15.0a200626       |cuda10.0_g4ffb643_242          49 KB  rapidsai-nightly
    pandas-1.0.5               |   py37h0da4684_0        10.1 MB  conda-forge
    rapids-build-env-0.15.0a200625|cuda10.0_py37_g58e8b8f_73           9 KB  rapidsai-nightly
    rapids-notebook-env-0.15.0a200625|cuda10.0_py37_g58e8b8f_73           8 KB  rapidsai-nightly
    rmm-0.15.0a200626          |py37_g4ffb643_242         1.9 MB  rapidsai-nightly
    ------------------------------------------------------------
                                           Total:        12.1 MB

The following NEW packages will be INSTALLED:

  librmm             rapidsai-nightly/linux-64::librmm-0.15.0a200626-cuda10.0_g4ffb643_242
  rapids-build-env   rapidsai-nightly/linux-64::rapids-build-env-0.15.0a200625-cuda10.0_py37_g58e8b8f_73
  rapids-notebook-e~ rapidsai-nightly/linux-64::rapids-notebook-env-0.15.0a200625-cuda10.0_py37_g58e8b8f_73
  rmm                rapidsai-nightly/linux-64::rmm-0.15.0a200626-py37_g4ffb643_242

The following packages will be UPDATED:

  pandas                              0.25.3-py37hb3f55d8_0 --> 1.0.5-py37h0da4684_0
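To confirm from a saved CI log which version a job ended up with, the "UPDATED" section can be parsed mechanically. The following is a hedged sketch: `updated_packages` is a hypothetical helper, and the line format it expects is taken only from the log excerpt above, so other conda versions may print something different.

```python
import re

# Matches lines of the form "  <name>   <old-spec> --> <new-spec>".
# Assumption: this mirrors the "UPDATED" section shown in the log above.
UPDATE_LINE = re.compile(
    r"^\s*(?P<name>\S+)\s+(?P<old>\S+)\s+-->\s+(?P<new>\S+)\s*$"
)


def updated_packages(log_text):
    """Map package name -> (old_spec, new_spec) for each 'old --> new' line."""
    updates = {}
    for line in log_text.splitlines():
        match = UPDATE_LINE.match(line)
        if match:
            updates[match["name"]] = (match["old"], match["new"])
    return updates


log = """
The following packages will be UPDATED:

  pandas                              0.25.3-py37hb3f55d8_0 --> 1.0.5-py37h0da4684_0
"""
old_spec, new_spec = updated_packages(log)["pandas"]
old_version = old_spec.split("-")[0]
new_version = new_spec.split("-")[0]
```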

@mike-wendt
Contributor

mike-wendt commented Jun 26, 2020

These are the 24 tests of cudf.tests.test_index.test_integer_index_apis that are failing: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-gpu-build/28131/testReport/

@brandon-b-miller
Contributor Author

@mike-wendt thanks for clarifying the installation behavior, that makes sense.

Labels
5 - Ready to Merge Testing and reviews complete, ready to merge Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support for Pandas 1.0.0
10 participants