Documentation for creating DMatrix from Apache Arrow #8111

Closed
thomasaarholt opened this issue Jul 22, 2022 · 5 comments


thomasaarholt commented Jul 22, 2022

Following #7512, it would be great to see some documentation on Apache Arrow usage in DMatrix. I had to look at the tests to figure out how to use it with Python, and I'm not sure what limitations there are.

For instance, I discovered that categorical types are not yet supported. Are there other disadvantages? The main advantage I have found (which should also be documented) is zero-copy input from other Arrow-based libraries, like polars.

thomasaarholt changed the title from "Documentation for building DMatrix from Apache Arrow" to "Documentation for creating DMatrix from Apache Arrow" on Jul 22, 2022
@thomasaarholt
Author

I also noticed that pyarrow 8.x is not supported. One has to use 7.0.0 to avoid the following error:

Traceback (most recent call last):
  File "/itf-fi-ml/home/thomasaar/kaggle/convert.py", line 18, in <module>
    dtrain = xgb.DMatrix(
  File "/itf-fi-ml/home/thomasaar/.conda/envs/kaggle/lib/python3.10/site-packages/xgboost/core.py", line 532, in inner_f
    return f(**kwargs)
  File "/itf-fi-ml/home/thomasaar/.conda/envs/kaggle/lib/python3.10/site-packages/xgboost/core.py", line 643, in __init__
    handle, feature_names, feature_types = dispatch_data_backend(
  File "/itf-fi-ml/home/thomasaar/.conda/envs/kaggle/lib/python3.10/site-packages/xgboost/data.py", line 930, in dispatch_data_backend
    return _from_arrow(
  File "/itf-fi-ml/home/thomasaar/.conda/envs/kaggle/lib/python3.10/site-packages/xgboost/data.py", line 567, in _from_arrow
    rb_iter = iter(data.to_batches(use_async=True))
  File "pyarrow/table.pxi", line 3759, in pyarrow.lib.Table.to_batches
TypeError: to_batches() got an unexpected keyword argument 'use_async'

@trivialfis
Member

That error is fixed in the latest branch but we haven't made a release for it yet. Apologies for the inconvenience.

As for categorical data, support is experimental. We are still working on it for better coverage. I will update the documentation to reflect the support status.

@lorentzenchr
Contributor

If I may add, it would be very nice to have better technical documentation of DMatrix in general. I think it is a very central "piece" of xgboost (the most detailed information is perhaps found in the C API Reference for DMatrix).

@trivialfis
Member

Thank you for the comment. Let's document every possible input.

trivialfis added the doc label on Oct 29, 2022
@trivialfis
Member

Closing in favor of #8541.
