Documentation for creating DMatrix from Apache Arrow #8111
Comments
I also noticed that pyarrow 8.x is not supported. One has to use 7.0.0 in order to avoid the following error:
Traceback (most recent call last):
  File "/itf-fi-ml/home/thomasaar/kaggle/convert.py", line 18, in <module>
    dtrain = xgb.DMatrix(
  File "/itf-fi-ml/home/thomasaar/.conda/envs/kaggle/lib/python3.10/site-packages/xgboost/core.py", line 532, in inner_f
    return f(**kwargs)
  File "/itf-fi-ml/home/thomasaar/.conda/envs/kaggle/lib/python3.10/site-packages/xgboost/core.py", line 643, in __init__
    handle, feature_names, feature_types = dispatch_data_backend(
  File "/itf-fi-ml/home/thomasaar/.conda/envs/kaggle/lib/python3.10/site-packages/xgboost/data.py", line 930, in dispatch_data_backend
    return _from_arrow(
  File "/itf-fi-ml/home/thomasaar/.conda/envs/kaggle/lib/python3.10/site-packages/xgboost/data.py", line 567, in _from_arrow
    rb_iter = iter(data.to_batches(use_async=True))
  File "pyarrow/table.pxi", line 3759, in pyarrow.lib.Table.to_batches
TypeError: to_batches() got an unexpected keyword argument 'use_async'
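For anyone hitting the same traceback, a small pre-flight check can flag the problem before `xgb.DMatrix` is called. This is only a sketch using the standard library: the helper names are made up for illustration, and the rule "major version 8 or later is affected" is an assumption extrapolated from this comment (8.x fails, 7.0.0 works). It applies only to XGBoost releases that still pass `use_async` to `to_batches()`.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading numeric component of a version string like '8.0.1'."""
    return int(version_string.split(".")[0])


def likely_affected(pyarrow_version: str) -> bool:
    """Assumption from this thread: pyarrow 7.0.0 works, 8.x raises the TypeError.

    Treating every major version >= 8 as affected is an extrapolation,
    not something confirmed by the maintainers.
    """
    return major_version(pyarrow_version) >= 8


def installed_pyarrow_version():
    """Return the installed pyarrow version string, or None if it is not installed."""
    try:
        return version("pyarrow")
    except PackageNotFoundError:
        return None


found = installed_pyarrow_version()
if found is None:
    print("pyarrow is not installed")
elif likely_affected(found):
    print(f"pyarrow {found}: may hit the 'use_async' TypeError on affected XGBoost releases")
else:
    print(f"pyarrow {found}: matches the version reported to work in this thread")
```

Pinning pyarrow to 7.0.0, as described above, remains the workaround until a release containing the fix is available.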
That error is fixed in the latest branch, but we haven't made a release for it yet. Apologies for the inconvenience. As for categorical data, support is experimental, and we are still working on better coverage. I will update the documentation to reflect the support status.
If I may add, it would be very nice to have a better technical documentation of |
Thank you for the comment. Let's document every possible input.
Closing in favor of #8541.
Following #7512, it would be great to see some documentation on Apache Arrow usage in DMatrix. I had to look at the tests to figure out how to use it with Python, but I'm not sure what limitations there are.
For instance, I discovered that categorical types are not yet supported. Are there other disadvantages? The main advantage I have found (which should also be documented) is zero-copy conversion from other Arrow-based libraries, like polars.
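To make the request concrete, here is roughly the kind of example the documentation could include. It is a minimal sketch rather than the official usage: the helper name `dmatrix_from_arrow` and the column names are invented for illustration, only numeric columns are used because categorical types are not supported, and whether the call succeeds depends on having compatible XGBoost and pyarrow versions installed.

```python
def dmatrix_from_arrow(columns, label):
    """Build an xgboost.DMatrix from a dict of numeric columns via a pyarrow Table.

    `columns` maps column names to equal-length lists of numbers.
    `label` is an array-like with one target value per row.
    """
    # Imported inside the function so the optional dependencies are only
    # required when the helper is actually called.
    import pyarrow as pa
    import xgboost as xgb

    # pa.table accepts a mapping of column name -> values.
    table = pa.table(columns)

    # Pass the Arrow table directly as the data argument.
    return xgb.DMatrix(table, label=label)


# Example call (requires xgboost and pyarrow):
# dtrain = dmatrix_from_arrow(
#     {"f0": [1.0, 2.0, 3.0], "f1": [0.5, 0.25, 0.125]},
#     label=[0, 1, 0],
# )
# print(dtrain.num_row(), dtrain.num_col())
```

A table obtained from another Arrow-based library could be passed to `xgb.DMatrix` in the same way. Whether that path is truly zero-copy end to end is exactly the kind of detail the documentation should state.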