
Support primitive types of pyarrow-backed pandas dataframe. #8653

Merged: 13 commits into dmlc:master, Jan 30, 2023

Conversation

@trivialfis (Member) commented Jan 10, 2023

Related: #8598

Categorical data (dictionary) is not supported at the moment as I can't get the correct ordering of the indices. Uncomment the code in tests to see the different outputs.
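For context, a pyarrow dictionary (categorical) column stores the categories and the per-row codes as separate pieces, which is the layout the PR would have to interpret; a minimal illustration with plain pyarrow (not code from this PR or its tests):

import pyarrow as pa

# Dictionary-encode a plain array: `dictionary` holds the unique values,
# `indices` holds the per-row codes pointing into it.
arr = pa.array([0, 2, None, 3]).dictionary_encode()
print(arr.dictionary)  # [0, 2, 3]
print(arr.indices)     # [0, 1, null, 2]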

Boolean type is not supported due to the error noted inline.

The PR simply converts the data to a numpy array, with pyarrow handling the data type conversion. I tried to inspect the underlying implementation of pyarrow arrays: each array has children and can be backed by multiple buffers, which is difficult for XGBoost to parse. We have an interface for arrow that doesn't need to merge arrow chunks, but it's only usable with DMatrix via a CSR copy; we need pandas dataframe support for other uses like inplace prediction and QuantileDMatrix.
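A minimal sketch of that conversion approach (a hypothetical helper, not the PR's actual code), assuming pandas' Series.to_numpy performs the arrow-to-numpy cast and maps nulls to NaN:

import numpy as np
import pandas as pd
import pyarrow as pa

def arrow_df_to_f32(df: pd.DataFrame) -> np.ndarray:
    # Cast each pyarrow-backed column to float32, turning nulls into NaN so
    # XGBoost can treat them as missing values.
    cols = [df[c].to_numpy(dtype=np.float32, na_value=np.nan) for c in df.columns]
    return np.stack(cols, axis=1)

dtype = pd.ArrowDtype(pa.int32())
df = pd.DataFrame({"f0": [0, 2, None, 3], "f1": [4, 3, None, 1]}, dtype=dtype)
print(arrow_df_to_f32(df))  # 4x2 float32 array with NaN for the nulls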

# Error:
# >>> df.astype("category")
# Function 'dictionary_encode' has no kernel matching input types
# (array[dictionary<values=int32, indices=int32, ordered=0>])
@trivialfis (Member, Author):

For some reason, ordered is changed to 0 here; maybe I should open an issue on pandas or arrow?

Reply:

This may have been an issue in an older pyarrow version? While debugging this call I'm getting

In [8]: df.astype("category")
> /Users/.../pandas/core/arrays/arrow/array.py(656)factorize()
-> null_encoding = "mask" if use_na_sentinel else "encode"
(Pdb) self
<ArrowExtensionArray>
[0, 2, <NA>, 3]
Length: 4, dtype: dictionary<values=int32, indices=int32, ordered=1>[pyarrow]
(Pdb) self._data
<pyarrow.lib.ChunkedArray object at 0x7f7c6a932a40>
[

  -- dictionary:
    [
      0,
      2,
      3
    ]
  -- indices:
    [
      0,
      1,
      null,
      2
    ]
]
(Pdb) n
> /Users/.../pandas/core/arrays/arrow/array.py(657)factorize()
-> encoded = self._data.dictionary_encode(null_encoding=null_encoding)
(Pdb) null_encoding
'mask'
(Pdb) self._data.dictionary_encode(null_encoding=null_encoding)
*** pyarrow.lib.ArrowNotImplementedError: Function 'dictionary_encode' has no kernel matching input types (dictionary<values=int32, indices=int32, ordered=1>)

@trivialfis (Member, Author) commented Jan 14, 2023:

I'm using pyarrow 10.0.1 from pypi:

import numpy as np
import pandas as pd
import pyarrow as pa

print(pd.__version__)
print(pa.__version__)

Null = np.nan
category = pd.ArrowDtype(pa.dictionary(pa.int32(), pa.int32(), ordered=True))
df = pd.DataFrame({"f0": [0, 2, Null, 3], "f1": [4, 3, Null, 1]}, dtype=category)
df.astype("category")

Note the ordered=0 at the end of the error below.

1.5.2
10.0.1
Traceback (most recent call last):
  File "/home/jiamingy/Workspace/XGBoost-dev/XGBoostUtils/pyarrow/test.py", line 11, in <module>
    df.astype("category")
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/generic.py", line 6233, in astype
    results = [
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/generic.py", line 6234, in <listcomp>
    self.iloc[:, i].astype(dtype, copy=copy)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/generic.py", line 6240, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 450, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 352, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 526, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 227, in astype_array
    values = values.astype(dtype, copy=copy)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/arrays/base.py", line 608, in astype
    return cls._from_sequence(self, dtype=dtype, copy=copy)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 495, in _from_sequence
    return Categorical(scalars, dtype=dtype, copy=copy)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 441, in __init__
    codes, categories = factorize(values, sort=True)
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/algorithms.py", line 785, in factorize
    codes, uniques = values.factorize(  # type: ignore[call-arg]
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/arrays/arrow/array.py", line 603, in factorize
    encoded = self._data.dictionary_encode(null_encoding=null_encoding)
  File "pyarrow/table.pxi", line 586, in pyarrow.lib.ChunkedArray.dictionary_encode
  File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'dictionary_encode' has no kernel matching input types (dictionary<values=int32, indices=int32, ordered=0>)

Reply:

Ah okay. This might be a bug in pandas' astype in 1.5.2.

On the main branch (for pandas 2.0), I am getting ordered=1:

In [1]: import numpy as np
   ...: import pandas as pd
   ...: import pyarrow as pa
   ...:
   ...: print(pd.__version__)
   ...: print(pa.__version__)
   ...:
   ...: Null = np.nan
   ...: category = pd.ArrowDtype(pa.dictionary(pa.int32(), pa.int32(), ordered=True))
   ...: df = pd.DataFrame({"f0": [0, 2, Null, 3], "f1": [4, 3, Null, 1]}, dtype=category)
2.0.0.dev0+1220.gd72e244db5
10.0.1

In [2]: df.astype("category")
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
Cell In [2], line 1
----> 1 df.astype("category"
...
ArrowNotImplementedError: Function 'dictionary_encode' has no kernel matching input types (dictionary<values=int32, indices=int32, ordered=1>)

@trivialfis (Member, Author):

Looking forward to the new pandas!

@jrbourbeau (Contributor) left a comment:

Thanks for all your work here @trivialfis! I'm hoping to take this PR for a spin sometime next week.

Comment on lines +250 to +264

pandas_pyarrow_mapper = {
    "int8[pyarrow]": "i",
    "int16[pyarrow]": "i",
    "int32[pyarrow]": "i",
    "int64[pyarrow]": "i",
    "uint8[pyarrow]": "i",
    "uint16[pyarrow]": "i",
    "uint32[pyarrow]": "i",
    "uint64[pyarrow]": "i",
    "float[pyarrow]": "float",
    "float32[pyarrow]": "float",
    "double[pyarrow]": "float",
    "float64[pyarrow]": "float",
    "bool[pyarrow]": "i",
}
Contributor:

I could totally be missing something as I'm not familiar with how xgboost handles data types internally. Does this mapping somehow mean all the pyarrow int types are mapped to the same representation in xgboost? Similar question for float data

@trivialfis (Member, Author):

Yes, all of them are treated as float32 inside XGBoost. The strings "i" and "float" are used for visualization and for categorical splits (which are not yet available for pyarrow due to the error mentioned above).
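As a quick illustration of that point (an assumption about the end result, not XGBoost internals), differently typed pyarrow numeric columns all cast down to the same float32 representation:

import numpy as np
import pandas as pd

# Two pyarrow-backed columns with different widths...
df = pd.DataFrame(
    {
        "f0": pd.array([1, 2, 3], dtype="int8[pyarrow]"),
        "f1": pd.array([1.5, 2.5, 3.5], dtype="double[pyarrow]"),
    }
)
# ...both end up as plain float32 after the cast.
print(df.astype(np.float32).dtypes)  # f0: float32, f1: float32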

Comment on lines +172 to +179

orig = pd.DataFrame(
    {"f0": [True, False, pd.NA, True], "f1": [False, True, pd.NA, True]},
    dtype=pd.BooleanDtype(),
)
df = pd.DataFrame(
    {"f0": [True, False, pd.NA, True], "f1": [False, True, pd.NA, True]},
    dtype=pd.ArrowDtype(pa.bool_()),
)
Contributor:

Not meant as a suggestion, but just an FYI. You can also specify extension dtypes as strings if you prefer:

In [1]: import pandas as pd

In [2]: orig = pd.DataFrame(
   ...:     {"f0": [True, False, pd.NA, True], "f1": [False, True, pd.NA, True]},
   ...:     dtype="boolean",
   ...: )

In [3]: df = pd.DataFrame(
   ...:     {"f0": [True, False, pd.NA, True], "f1": [False, True, pd.NA, True]},
   ...:     dtype="boolean[pyarrow]",
   ...: )

In [4]: orig
Out[4]:
      f0     f1
0   True  False
1  False   True
2   <NA>   <NA>
3   True   True

In [5]: orig.dtypes
Out[5]:
f0    boolean
f1    boolean
dtype: object

In [6]: df.dtypes
Out[6]:
f0    bool[pyarrow]
f1    bool[pyarrow]
dtype: object

What you have here already is totally equivalent. Just a heads up in case you find this approach more convenient.

@trivialfis (Member, Author):

Thank you for sharing!

@trivialfis merged commit 1325ba9 into dmlc:master on Jan 30, 2023
@trivialfis deleted the pyarrow-pd-dtypes branch on January 30, 2023 09:53