Support primitive types of pyarrow-backed pandas dataframe. #8653
Conversation
Force-pushed from c48e0c2 to f6627b2.
# Error:
# >>> df.astype("category")
# Function 'dictionary_encode' has no kernel matching input types
# (array[dictionary<values=int32, indices=int32, ordered=0>])
For some reason, ordered is changed to 0 here; maybe I should open an issue on pandas or arrow?
This may have been an issue in an older pyarrow version? While debugging this call I'm getting:
In [8]: df.astype("category")
> /Users/.../pandas/core/arrays/arrow/array.py(656)factorize()
-> null_encoding = "mask" if use_na_sentinel else "encode"
(Pdb) self
<ArrowExtensionArray>
[0, 2, <NA>, 3]
Length: 4, dtype: dictionary<values=int32, indices=int32, ordered=1>[pyarrow]
(Pdb) self._data
<pyarrow.lib.ChunkedArray object at 0x7f7c6a932a40>
[
-- dictionary:
[
0,
2,
3
]
-- indices:
[
0,
1,
null,
2
]
]
(Pdb) n
> /Users/.../pandas/core/arrays/arrow/array.py(657)factorize()
-> encoded = self._data.dictionary_encode(null_encoding=null_encoding)
(Pdb) null_encoding
'mask'
(Pdb) self._data.dictionary_encode(null_encoding=null_encoding)
*** pyarrow.lib.ArrowNotImplementedError: Function 'dictionary_encode' has no kernel matching input types (dictionary<values=int32, indices=int32, ordered=1>)
I'm using pyarrow 10.0.1 from pypi:
import numpy as np
import pandas as pd
import pyarrow as pa
print(pd.__version__)
print(pa.__version__)
Null = np.nan
category = pd.ArrowDtype(pa.dictionary(pa.int32(), pa.int32(), ordered=True))
df = pd.DataFrame({"f0": [0, 2, Null, 3], "f1": [4, 3, Null, 1]}, dtype=category)
df.astype("category")
This gives ordered=0 at the end:
1.5.2
10.0.1
Traceback (most recent call last):
File "/home/jiamingy/Workspace/XGBoost-dev/XGBoostUtils/pyarrow/test.py", line 11, in <module>
df.astype("category")
File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/generic.py", line 6233, in astype
results = [
File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/generic.py", line 6234, in <listcomp>
self.iloc[:, i].astype(dtype, copy=copy)
File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/generic.py", line 6240, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 450, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 352, in apply
applied = getattr(b, f)(**kwargs)
File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 526, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 227, in astype_array
values = values.astype(dtype, copy=copy)
File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/arrays/base.py", line 608, in astype
return cls._from_sequence(self, dtype=dtype, copy=copy)
File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 495, in _from_sequence
return Categorical(scalars, dtype=dtype, copy=copy)
File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 441, in __init__
codes, categories = factorize(values, sort=True)
File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/algorithms.py", line 785, in factorize
codes, uniques = values.factorize( # type: ignore[call-arg]
File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages/pandas/core/arrays/arrow/array.py", line 603, in factorize
encoded = self._data.dictionary_encode(null_encoding=null_encoding)
File "pyarrow/table.pxi", line 586, in pyarrow.lib.ChunkedArray.dictionary_encode
File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'dictionary_encode' has no kernel matching input types (dictionary<values=int32, indices=int32, ordered=0>)
Ah okay. This might be a bug in pandas astype in 1.5.2.
On the main branch (for pandas 2.0), I am getting ordered=1:
In [1]: import numpy as np
...: import pandas as pd
...: import pyarrow as pa
...:
...: print(pd.__version__)
...: print(pa.__version__)
...:
...: Null = np.nan
...: category = pd.ArrowDtype(pa.dictionary(pa.int32(), pa.int32(), ordered=True))
...: df = pd.DataFrame({"f0": [0, 2, Null, 3], "f1": [4, 3, Null, 1]}, dtype=category)
2.0.0.dev0+1220.gd72e244db5
10.0.1
In [2]: df.astype("category")
---------------------------------------------------------------------------
ArrowNotImplementedError Traceback (most recent call last)
Cell In [2], line 1
----> 1 df.astype("category")
...
ArrowNotImplementedError: Function 'dictionary_encode' has no kernel matching input types (dictionary<values=int32, indices=int32, ordered=1>)
Looking forward to the new pandas!
Force-pushed from c53b6e8 to 0fcc002.
Thanks for all your work here @trivialfis! I'm hoping to take this PR for a spin sometime next week.
pandas_pyarrow_mapper = {
    "int8[pyarrow]": "i",
    "int16[pyarrow]": "i",
    "int32[pyarrow]": "i",
    "int64[pyarrow]": "i",
    "uint8[pyarrow]": "i",
    "uint16[pyarrow]": "i",
    "uint32[pyarrow]": "i",
    "uint64[pyarrow]": "i",
    "float[pyarrow]": "float",
    "float32[pyarrow]": "float",
    "double[pyarrow]": "float",
    "float64[pyarrow]": "float",
    "bool[pyarrow]": "i",
}
I could totally be missing something, as I'm not familiar with how xgboost handles data types internally. Does this mapping mean all the pyarrow int types are mapped to the same representation in xgboost? Similar question for the float data.
Yes, all of them are treated as float32 inside XGBoost. The strings "i" and "float" are used for visualization and for categorical splits (which are not yet available for pyarrow due to the error mentioned above).
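(A rough, hypothetical sketch of how such a mapper could be applied; the shortened dict and the lookup loop below are illustrative only, not the PR's actual code. Each column's dtype name is looked up to build the feature-type strings, while the values themselves are handled as float32 separately.)

import pandas as pd
import pyarrow as pa

# Simplified stand-in for the mapper shown in the diff above (illustration only).
pandas_pyarrow_mapper = {
    "int32[pyarrow]": "i",
    "int64[pyarrow]": "i",
    "double[pyarrow]": "float",
    "bool[pyarrow]": "i",
}

df = pd.DataFrame(
    {
        "f0": pd.array([1, 2, 3], dtype=pd.ArrowDtype(pa.int64())),
        "f1": pd.array([0.5, 1.5, 2.5], dtype=pd.ArrowDtype(pa.float64())),
    }
)

# Look up each column's dtype name to get its feature-type string.
feature_types = [pandas_pyarrow_mapper[str(dtype)] for dtype in df.dtypes]
print(feature_types)  # ['i', 'float']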
orig = pd.DataFrame(
    {"f0": [True, False, pd.NA, True], "f1": [False, True, pd.NA, True]},
    dtype=pd.BooleanDtype(),
)
df = pd.DataFrame(
    {"f0": [True, False, pd.NA, True], "f1": [False, True, pd.NA, True]},
    dtype=pd.ArrowDtype(pa.bool_()),
)
Not meant as a suggestion, but just an FYI. You can also specify extension dtypes as strings if you prefer:
In [1]: import pandas as pd
In [2]: orig = pd.DataFrame(
...: {"f0": [True, False, pd.NA, True], "f1": [False, True, pd.NA, True]},
...: dtype="boolean",
...: )
In [3]: df = pd.DataFrame(
...: {"f0": [True, False, pd.NA, True], "f1": [False, True, pd.NA, True]},
...: dtype="boolean[pyarrow]",
...: )
In [4]: orig
Out[4]:
f0 f1
0 True False
1 False True
2 <NA> <NA>
3 True True
In [5]: orig.dtypes
Out[5]:
f0 boolean
f1 boolean
dtype: object
In [6]: df.dtypes
Out[6]:
f0 bool[pyarrow]
f1 bool[pyarrow]
dtype: object
What you have here already is totally equivalent. Just a heads up in case you find this approach more convenient.
Thank you for sharing!
Related: #8598
Categorical data (dictionary) is not supported at the moment as I can't get the correct ordering of the indices. Uncomment the code in tests to see the different outputs.
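(For context, a guess at one way the two orderings can diverge, not necessarily the exact issue hit here: pyarrow's dictionary keeps values in encounter order, while pandas' Categorical constructor factorizes with sort=True, as seen in the traceback above, so the categories are sorted and the codes no longer line up. A minimal sketch:)

import pandas as pd
import pyarrow as pa

# pyarrow keeps dictionary values in encounter order.
arr = pa.array(["b", "a", "b", None, "c"]).dictionary_encode()
print(arr.dictionary.to_pylist())  # ['b', 'a', 'c']  (encounter order)
print(arr.indices.to_pylist())     # [0, 1, 0, None, 2]

# pandas sorts the categories, so the codes differ from the arrow indices.
cat = pd.Categorical(["b", "a", "b", None, "c"])
print(list(cat.categories))  # ['a', 'b', 'c']
print(list(cat.codes))       # [1, 0, 1, -1, 2]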
Boolean type is not supported due to the error noted inline. The PR simply converts the data to a numpy array, with pyarrow handling the data type conversion. I tried to inspect the underlying implementation of pyarrow arrays: each array has children and can be backed by multiple buffers, which is difficult for XGBoost to parse. We have an arrow interface that doesn't need to merge arrow chunks, but it's only usable with DMatrix via a CSR copy; we need a pandas dataframe for other uses like inplace prediction and quantile DMatrix.
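(A minimal sketch of the kind of conversion described above, assuming pandas' DataFrame.to_numpy materializes the pyarrow-backed columns with the requested dtype and na_value; exact behavior can differ between pandas versions, and this is not the PR's actual code.)

import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {"f0": [0, 2, None, 3], "f1": [4, 3, None, 1]},
    dtype=pd.ArrowDtype(pa.int32()),
)

# Materialize a plain float32 array, mapping missing values to NaN,
# which is the representation XGBoost consumes.
values = df.to_numpy(dtype=np.float32, na_value=np.nan)
print(values)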