[REVIEW] Support Pandas 1.0+ #4546

brandon-b-miller · 2020-03-17T17:56:37Z

Making this for now to track what has been done / needs to be done to support Pandas. I see this as having several parts:

- Fixing things that are outright broken due to various API changes or updates to pandas internals
- Implementing compatibility with `pd.NA` and determining where and to what degree null in cuDF should behave the same
- Removing instances in our code where extra work is required to circumvent some aspect of pandas<1.0 behavior

The plan is to update this PR using test failures as a way to start to compile what changes need to be made and work on addressing them.
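For reference, the `pd.NA` semantics that cuDF nulls would be compared against can be sketched with pandas alone. This is a minimal illustration, assuming pandas 1.0 or later is installed; it does not touch cuDF and makes no claim about how cuDF nulls behave.

```python
import pandas as pd

# Comparisons with pd.NA propagate the missing value,
# whereas a comparison with a float NaN returns a plain False.
eq_na = pd.NA == 1
eq_nan = float("nan") == 1

# Logical operators follow three-valued (Kleene) logic:
# the result is known only when the other operand decides it.
or_true = pd.NA | True     # True regardless of the missing operand
and_false = pd.NA & False  # False regardless of the missing operand
or_false = pd.NA | False   # still unknown

# pd.NA has no truth value, so using it in a boolean context raises.
try:
    bool(pd.NA)
    truthiness_error = None
except TypeError as exc:
    truthiness_error = exc

print(eq_na, eq_nan, or_true, and_false, or_false)
```

Each of these is a place where a null in another library could reasonably differ, which is why the degree of matching needs an explicit decision.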

Breaking Changes

Places we no longer need to circumvent pandas

Transposing a DataFrame with a homogeneous categorical type now yields a categorical DataFrame instead of object

Tasks around `pd.NA`

Ensure `to_pandas` and `from_pandas` properly convert to and from the new nullable data types in pandas for the various cuDF objects, such as DataFrames, Series, and Indexes
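As a point of reference for this task, the nullable data types on the pandas side look like the following. This is a pandas-only sketch, assuming pandas 1.0 or later; the variable names are illustrative, and the cuDF conversion itself is not shown.

```python
import pandas as pd

# Default construction: a missing value forces integers to float64 with NaN.
legacy = pd.Series([1, None, 3])

# Nullable extension dtypes keep the logical type and mark gaps with pd.NA.
ints = pd.Series([1, None, 3], dtype="Int64")
bools = pd.Series([True, None, False], dtype="boolean")
strs = pd.Series(["a", None, "c"], dtype="string")

print(legacy.dtype, ints.dtype, bools.dtype, strs.dtype)
print(ints[1] is pd.NA, bools[1] is pd.NA)
```

A round trip that preserves nulls would need to land on dtypes like these rather than on `float64` or `object`.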

brandon-b-miller · 2020-06-25T20:42:44Z

Created #5581 to track residual items here. I tested cuML, cuGraph, cuSpatial, and cuSignal and submitted PRs to those repos where I found test failures when used together with this PR and pandas 1.0. I was not able to put together an environment in which I could both build cuDF from source and blazing at the same time.

kkraus14

Actually, given dependency change, need to coordinate with ops

kkraus14 · 2020-06-26T03:39:42Z

rerun tests

brandon-b-miller · 2020-06-26T13:19:40Z

rerun tests

mike-wendt · 2020-06-26T14:33:27Z

rerun tests

brandon-b-miller · 2020-06-26T15:13:49Z

Hmm - still seeing 09:37:02 pandas 0.25.3 py36hb3f55d8_0 conda-forge here for some reason.

mike-wendt · 2020-06-26T16:25:09Z

> Hmm - still seeing 09:37:02 pandas 0.25.3 py36hb3f55d8_0 conda-forge here for some reason.

~~Where? Can you link a job?~~

You'll probably see that the containers have the old version to start, but then get updated when they pull the new version.

I see it with the containers having the older version. That said, the meta.yaml files are updated, so when the conda builds reach that step they pull the correct pandas version.

BUILD START: ['cudf-0.15.0a-py36_g78e2ed556_1954.tar.bz2']
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /opt/conda/envs/rapids/conda-bld/cudf_1593185621143/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh


The following NEW packages will be INSTALLED:

...
    pandas:            1.0.5-py36h830a2c2_0                conda-forge

For the GPU tests we do an explicit update, so the logs show the old version first but the correct one gets installed:

>>>> Activate conda env...

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /opt/conda/envs/gdf

  added / updated specs:
    - cudatoolkit=10.0
    - rapids-build-env=0.15
    - rapids-notebook-env=0.15
    - rmm=0.15


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    librmm-0.15.0a200626       |cuda10.0_g4ffb643_242          49 KB  rapidsai-nightly
    pandas-1.0.5               |   py37h0da4684_0        10.1 MB  conda-forge
    rapids-build-env-0.15.0a200625|cuda10.0_py37_g58e8b8f_73           9 KB  rapidsai-nightly
    rapids-notebook-env-0.15.0a200625|cuda10.0_py37_g58e8b8f_73           8 KB  rapidsai-nightly
    rmm-0.15.0a200626          |py37_g4ffb643_242         1.9 MB  rapidsai-nightly
    ------------------------------------------------------------
                                           Total:        12.1 MB

The following NEW packages will be INSTALLED:

  librmm             rapidsai-nightly/linux-64::librmm-0.15.0a200626-cuda10.0_g4ffb643_242
  rapids-build-env   rapidsai-nightly/linux-64::rapids-build-env-0.15.0a200625-cuda10.0_py37_g58e8b8f_73
  rapids-notebook-e~ rapidsai-nightly/linux-64::rapids-notebook-env-0.15.0a200625-cuda10.0_py37_g58e8b8f_73
  rmm                rapidsai-nightly/linux-64::rmm-0.15.0a200626-py37_g4ffb643_242

The following packages will be UPDATED:

  pandas                              0.25.3-py37hb3f55d8_0 --> 1.0.5-py37h0da4684_0
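One way to confirm from a CI log which pandas version ends up installed is to parse the `UPDATED` plan line rather than the first version that appears. The sketch below uses only the standard library; `final_version` and `UPDATE_LINE` are hypothetical helpers written for this illustration and are not part of conda or the gpuCI scripts.

```python
import re

# Matches a conda "UPDATED" plan line such as:
#   pandas    0.25.3-py37hb3f55d8_0 --> 1.0.5-py37h0da4684_0
UPDATE_LINE = re.compile(
    r"^\s*(?P<name>[\w.-]+)\s+(?P<old>\d[\w.]*)-\S+\s+-->\s+(?P<new>\d[\w.]*)-\S+\s*$"
)


def final_version(log_text, package):
    """Return the post-update version of `package` as a tuple of ints, or None."""
    for line in log_text.splitlines():
        match = UPDATE_LINE.match(line)
        if match and match.group("name") == package:
            # Keep only purely numeric components (drops suffixes like "0a200626").
            parts = match.group("new").split(".")
            return tuple(int(p) for p in parts if p.isdigit())
    return None


log = """
The following packages will be UPDATED:

  pandas                              0.25.3-py37hb3f55d8_0 --> 1.0.5-py37h0da4684_0
"""

version = final_version(log, "pandas")
print(version, version >= (1, 0))
```

Comparing tuples avoids the string-comparison pitfall where `"0.25.3"` would sort after `"1.0.5"` component by component only by accident.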

mike-wendt · 2020-06-26T16:36:33Z

These are the 24 tests of cudf.tests.test_index.test_integer_index_apis that are failing https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-gpu-build/28131/testReport/


brandon-b-miller · 2020-06-26T17:32:01Z

@mike-wendt thanks for clarifying the installation behavior, that makes sense.

brandon-b-miller added 7 commits March 16, 2020 09:24

Remove FrozenNDArray

7dbc000

test_categorical_binary_add raises a different error

094803b

allow multiindex codes to come from numpy array

944a992

no need to materialize categorical in transpose

86bce58

test_to_from_pandas casts before roundtripping

c7d7bea

pd.core.indexes.base.Index name is now a property

ab78ba4

circumvent pandas is_bool_dtype behavior

2717d32

brandon-b-miller added the 2 - In Progress, pandas, and Python labels Mar 17, 2020

brandon-b-miller requested a review from a team as a code owner March 17, 2020 17:56

brandon-b-miller added 19 commits March 17, 2020 11:07

cant delete a property

fc34749

astype() no longer accepts kwargs

f84078c

create basic test for nullable integer type

b0454f3

create basic test for nullable boolean type

7bb9f5e

implement _cudf_nullable_pd_dtypes

2acb6c2

rework NumericalColumn.to_pandas()

236db6b

create basic test for nullable string type

d5fcb3c

rework StringColumn.to_pandas()

5cf18b0

Merge branch 'branch-0.14' into fea-support-pandas-1

62d4cd8

test_avro_reader_basic: drop to pandas before cast

89a8155

only set index if not none in string.to_pandas()

73c016a

test_column_offset_and_size casts to string if object

155a4e7

test_repeat: hacky string special casing

37e7604

assert_eq normalizes string to object datatypes

f1f2880

move cast to after everything is pandas

581e0b5

assert_eq instead of use pandas testing directly

421f31e

test_dataframe_hash_partition_masked_value expects pd.NA not -1

ef4511a

fix more dataframe.py tests

187c07b

merge 0.14

2649b6b

This was referenced Jun 23, 2020

[REVIEW] Support Pandas 1.0+ rapidsai/cuml#2465

Merged

[REVIEW] Support Pandas 1.0+ rapidsai/cugraph#970

Closed

[BUG] Rolling count aggregations produce different results than Pandas 1.0+ #5580

Closed

kkraus14 added 5 - Merge After Dependencies, 5 - Ready to Merge and removed 3 - Ready for Review, 5 - Merge After Dependencies, 5 - Ready to Merge labels Jun 25, 2020

kkraus14 approved these changes Jun 25, 2020


kkraus14 requested changes Jun 25, 2020


kkraus14 added 2 commits June 25, 2020 18:54

Remove Pandas 1.0 installation from GPU build script

d237a3d

Forgot line

a824a3c

kkraus14 approved these changes Jun 25, 2020


kkraus14 mentioned this pull request Jun 26, 2020

[REVIEW] Add support for axis and other parameters to DataFrame.sort_index and fix other bunch of issues. #5582

Merged

rapidsai deleted a comment from kkraus14 Jun 26, 2020

rgsl888prabhu added 2 commits June 26, 2020 11:48

Merge branch 'branch-0.15' of https://github.com/rapidsai/cudf into f…

2a6b608

…ea-support-pandas-1

fix test case

e4d9493

mike-wendt merged commit 810e422 into rapidsai:branch-0.15 Jun 26, 2020

quasiben mentioned this pull request Jun 26, 2020

Inconsistent output in GroupBy.apply returning a DataFrame pandas-dev/pandas#34809

Closed
