
Support primitive types of pyarrow-backed pandas dataframe. #8653

Merged: 13 commits into dmlc:master, Jan 30, 2023

Conversation

@trivialfis (Member) commented Jan 10, 2023

Related: #8598

Categorical data (dictionary) is not supported at the moment as I can't get the correct ordering of the indices. Uncomment the code in tests to see the different outputs.
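For context, a pyarrow dictionary (categorical) column stores the categories and the per-row codes as separate pieces, which is the layout the PR would have to interpret; a minimal illustration with plain pyarrow (not code from this PR or its tests):

import pyarrow as pa

# Dictionary-encode a plain array: `dictionary` holds the unique values,
# `indices` holds the per-row codes pointing into it.
arr = pa.array([0, 2, None, 3]).dictionary_encode()
print(arr.dictionary)  # [0, 2, 3]
print(arr.indices)     # [0, 1, null, 2]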

Boolean type is not supported due to the error noted inline.

The PR simply converts the data to a numpy array, with pyarrow handling the data type conversion. I tried to inspect the underlying implementation of pyarrow arrays: each array has children and can be backed by multiple buffers, which is difficult for XGBoost to parse. We have an interface for arrow that doesn't need to merge arrow chunks, but it's only usable with DMatrix via a CSR copy; we need pandas dataframe support for other uses like inplace prediction and QuantileDMatrix.
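A minimal sketch of that conversion approach (a hypothetical helper, not the PR's actual code), assuming pandas' Series.to_numpy performs the arrow-to-numpy cast and maps nulls to NaN:

import numpy as np
import pandas as pd
import pyarrow as pa

def arrow_df_to_f32(df: pd.DataFrame) -> np.ndarray:
    # Cast each pyarrow-backed column to float32, turning nulls into NaN so
    # XGBoost can treat them as missing values.
    cols = [df[c].to_numpy(dtype=np.float32, na_value=np.nan) for c in df.columns]
    return np.stack(cols, axis=1)

dtype = pd.ArrowDtype(pa.int32())
df = pd.DataFrame({"f0": [0, 2, None, 3], "f1": [4, 3, None, 1]}, dtype=dtype)
print(arrow_df_to_f32(df))  # 4x2 float32 array with NaN for the nulls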

# Error:
# >>> df.astype("category")
# Function 'dictionary_encode' has no kernel matching input types
# (array[dictionary<values=int32, indices=int32, ordered=0>])
@trivialfis (Member, Author):

For some reason, ordered is changed to 0 here; maybe I should open an issue on pandas or arrow?

Reply:

This may have been an issue in an older pyarrow version? While debugging this call I'm getting

In [8]: df.astype("category")
> /Users/.../pandas/core/arrays/arrow/array.py(656)factorize()
-> null_encoding = "mask" if use_na_sentinel else "encode"
(Pdb) self
<ArrowExtensionArray>
[0, 2, <NA>, 3]
Length: 4, dtype: dictionary<values=int32, indices=int32, ordered=1>[pyarrow]
(Pdb) self._data
<pyarrow.lib.ChunkedArray object at 0x7f7c6a932a40>
[

  -- dictionary:
    [
      0,
      2,
      3
    ]
  -- indices:
    [
      0,
      1,
      null,
      2
    ]
]
(Pdb) n
> /Users/.../pandas/core/arrays/arrow/array.py(657)factorize()
-> encoded = self._data.dictionary_encode(null_encoding=null_encoding)
(Pdb) null_encoding
'mask'
(Pdb) self._data.dictionary_encode(null_encoding=null_encoding)
*** pyarrow.lib.ArrowNotImplementedError: Function 'dictionary_encode' has no kernel matching input types (dictionary<values=int32, indices=int32, ordered=1>)

@trivialfis (Member, Author) commented Jan 14, 2023:

I'm using pyarrow 10.0.1 from pypi:

import numpy as np
import pandas as pd
import pyarrow as pa

print(pd.__version__)
print(pa.__version__)

Null = np.nan
category = pd.ArrowDtype(pa.dictionary(pa.int32(), pa.int32(), ordered=True))
df = pd.DataFrame({"f0": [0, 2, Null, 3], "f1": [4, 3, Null, 1]}, dtype=category)
df.astype("category")

Note the ordered=0 at the end of the error below.

1.5.2
10.0.1
Traceback (most recent call last):
  File "/home/jiamingy/Workspace/XGBoost-dev/XGBoostUtils/pyarrow/test.py", line 11, in <module>
    df.astype("category")
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/generic.py", line 6233, in astype
    results = [
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/generic.py", line 6234, in <listcomp>
    self.iloc[:, i].astype(dtype, copy=copy)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/generic.py", line 6240, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 450, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 352, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 526, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 227, in astype_array
    values = values.astype(dtype, copy=copy)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/arrays/base.py", line 608, in astype
    return cls._from_sequence(self, dtype=dtype, copy=copy)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 495, in _from_sequence
    return Categorical(scalars, dtype=dtype, copy=copy)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 441, in __init__
    codes, categories = factorize(values, sort=True)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/algorithms.py", line 785, in factorize
    codes, uniques = values.factorize(  # type: ignore[call-arg]
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/arrays/arrow/array.py", line 603, in factorize
    encoded = self._data.dictionary_encode(null_encoding=null_encoding)
  File "pyarrow/table.pxi", line 586, in pyarrow.lib.ChunkedArray.dictionary_encode
  File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'dictionary_encode' has no kernel matching input types (dictionary<values=int32, indices=int32, ordered=0>)

Reply:

Ah okay. This might be a bug in pandas' astype in 1.5.2.

On the main branch (for pandas 2.0), I am getting ordered=1:

In [1]: import numpy as np
   ...: import pandas as pd
   ...: import pyarrow as pa
   ...:
   ...: print(pd.__version__)
   ...: print(pa.__version__)
   ...:
   ...: Null = np.nan
   ...: category = pd.ArrowDtype(pa.dictionary(pa.int32(), pa.int32(), ordered=True))
   ...: df = pd.DataFrame({"f0": [0, 2, Null, 3], "f1": [4, 3, Null, 1]}, dtype=category)
2.0.0.dev0+1220.gd72e244db5
10.0.1

In [2]: df.astype("category")
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
Cell In [2], line 1
----> 1 df.astype("category"
...
ArrowNotImplementedError: Function 'dictionary_encode' has no kernel matching input types (dictionary<values=int32, indices=int32, ordered=1>)

@trivialfis (Member, Author):

Looking forward to the new pandas!

@jrbourbeau (Contributor) left a comment:

Thanks for all your work here @trivialfis! I'm hoping to take this PR for a spin sometime next week.

Comment on lines +250 to +264

pandas_pyarrow_mapper = {
    "int8[pyarrow]": "i",
    "int16[pyarrow]": "i",
    "int32[pyarrow]": "i",
    "int64[pyarrow]": "i",
    "uint8[pyarrow]": "i",
    "uint16[pyarrow]": "i",
    "uint32[pyarrow]": "i",
    "uint64[pyarrow]": "i",
    "float[pyarrow]": "float",
    "float32[pyarrow]": "float",
    "double[pyarrow]": "float",
    "float64[pyarrow]": "float",
    "bool[pyarrow]": "i",
}
Contributor:

I could totally be missing something as I'm not familiar with how xgboost handles data types internally. Does this mapping somehow mean all the pyarrow int types are mapped to the same representation in xgboost? Similar question for float data

@trivialfis (Member, Author):

Yes, all of them are treated as float32 inside XGBoost. The strings "i" and "float" are used for visualization and for categorical splits (which are not yet available for pyarrow due to the error mentioned above).
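As a quick illustration of that point (an assumption about the end result, not XGBoost internals), differently typed pyarrow numeric columns all cast down to the same float32 representation:

import numpy as np
import pandas as pd

# Two pyarrow-backed columns with different widths...
df = pd.DataFrame(
    {
        "f0": pd.array([1, 2, 3], dtype="int8[pyarrow]"),
        "f1": pd.array([1.5, 2.5, 3.5], dtype="double[pyarrow]"),
    }
)
# ...both end up as plain float32 after the cast.
print(df.astype(np.float32).dtypes)  # f0: float32, f1: float32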

Comment on lines +172 to +179

orig = pd.DataFrame(
    {"f0": [True, False, pd.NA, True], "f1": [False, True, pd.NA, True]},
    dtype=pd.BooleanDtype(),
)
df = pd.DataFrame(
    {"f0": [True, False, pd.NA, True], "f1": [False, True, pd.NA, True]},
    dtype=pd.ArrowDtype(pa.bool_()),
)
Contributor:

Not meant as a suggestion, but just an FYI. You can also specify extension dtypes as strings if you prefer:

In [1]: import pandas as pd

In [2]: orig = pd.DataFrame(
   ...:     {"f0": [True, False, pd.NA, True], "f1": [False, True, pd.NA, True]},
   ...:     dtype="boolean",
   ...: )

In [3]: df = pd.DataFrame(
   ...:     {"f0": [True, False, pd.NA, True], "f1": [False, True, pd.NA, True]},
   ...:     dtype="boolean[pyarrow]",
   ...: )

In [4]: orig
Out[4]:
      f0     f1
0   True  False
1  False   True
2   <NA>   <NA>
3   True   True

In [5]: orig.dtypes
Out[5]:
f0    boolean
f1    boolean
dtype: object

In [6]: df.dtypes
Out[6]:
f0    bool[pyarrow]
f1    bool[pyarrow]
dtype: object

What you have here already is totally equivalent. Just a heads up in case you find this approach more convenient.

@trivialfis (Member, Author):

Thank you for sharing!

@trivialfis merged commit 1325ba9 into dmlc:master on Jan 30, 2023
@trivialfis deleted the pyarrow-pd-dtypes branch on January 30, 2023 09:53