New feature? Java binding for categorical feature support #8727

shadyelgewily-slimstock · 2023-01-26T13:06:53Z

We are using XGBoost through the Java binding (outside of Spark), and we have a strong appetite for categorical feature support in which splits partition the categories into subsets, as opposed to one-hot encoding, where XGBoost considers each category separately. The release notes for v1.6 state:

"In the future, we will continue to improve categorical data support with new features and
optimizations. Also, we are looking forward to bringing the feature beyond Python binding,
contributions and feedback are welcomed! Lastly, as a result of experimental status, the
behavior might be subject to change, especially the default value of related
hyper-parameters."
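To make the distinction between one-hot splits and subset partitioning concrete, here is a small, self-contained sketch. It is not XGBoost's implementation: the class and method names are hypothetical, the per-category gradient and hessian sums are made-up numbers, and the "sort categories by gradient/hessian ratio, then scan prefixes" heuristic is a common simplification assumed here for illustration. The gain formula is the usual second-order one with an L2 term `lambda`.

```java
import java.util.Arrays;

public class CategoricalSplitSketch {
    // Structure score of a node with gradient sum g and hessian sum h.
    public static double score(double g, double h, double lambda) {
        return g * g / (h + lambda);
    }

    // Gain of splitting a node (gTotal, hTotal) into a left part (gL, hL) and the rest.
    public static double gain(double gL, double hL, double gTotal, double hTotal, double lambda) {
        return score(gL, hL, lambda)
                + score(gTotal - gL, hTotal - hL, lambda)
                - score(gTotal, hTotal, lambda);
    }

    // One-hot style: every candidate split isolates exactly one category.
    public static double bestOneHotGain(double[] g, double[] h, double lambda) {
        double gT = Arrays.stream(g).sum();
        double hT = Arrays.stream(h).sum();
        double best = 0.0;
        for (int c = 0; c < g.length; c++) {
            best = Math.max(best, gain(g[c], h[c], gT, hT, lambda));
        }
        return best;
    }

    // Partition style: order categories by g/h, then try each prefix as the left subset.
    public static double bestPartitionGain(double[] g, double[] h, double lambda) {
        double gT = Arrays.stream(g).sum();
        double hT = Arrays.stream(h).sum();
        Integer[] order = new Integer[g.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (Integer a, Integer b) -> Double.compare(g[a] / h[a], g[b] / h[b]));
        double best = 0.0;
        double gL = 0.0;
        double hL = 0.0;
        // Stop before the last category so the right subset is never empty.
        for (int i = 0; i < order.length - 1; i++) {
            gL += g[order[i]];
            hL += h[order[i]];
            best = Math.max(best, gain(gL, hL, gT, hT, lambda));
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up statistics for four categories of a single feature.
        double[] g = {-4.0, 3.0, -3.5, 4.5};
        double[] h = {2.0, 2.0, 2.0, 2.0};
        System.out.println("best one-hot gain:   " + bestOneHotGain(g, h, 1.0));
        System.out.println("best partition gain: " + bestPartitionGain(g, h, 1.0));
    }
}
```

On this toy data the partition search can group categories 0 and 2 against 1 and 3 in a single split, which a one-category-versus-rest split cannot express.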

I'm raising this issue because I'm wondering what the status is of the Java binding for the experimental parameters related to categorical features. Concretely:

1. Is there already a way to communicate to the native C code which columns in the DMatrix should be considered categorical and which numeric?
2. Provided that we have some way to encode the feature type in the DMatrix or elsewhere, how do we communicate that to the C binding? (There has to be some way to achieve this, since the Python binding already exists.)
3. Is there an appetite among the XGBoost maintainers to release such a Java binding in a stable version within the next 3-6 months, say, provided we contribute a PR that satisfies the general requirements for an XGBoost PR?

It seems that some work has already been done on the first two items in #7966, so perhaps the more general questions are:

1. Which components are still required to start using categorical features (based on subset partitioning) in Java?
2. How can we help get this feature into XGBoost faster (e.g., by contributing), provided that it is on the roadmap ([jvm-packages] bridge the gaps between jvm package and native xgboost #7802)?

I see that this feature request is on the roadmap, and we could contribute to help the process move forward.

trivialfis · 2023-01-28T17:46:49Z

> Is there already a way to communicate to the native C code which columns in the DMatrix should be considered as categorical, and which as numeric

Yes, as referenced in your description: #7966.

> Is there an appetite among the XGBoost maintainers to release such a Java binding in a stable version any time in the next 3-6 months say

Yes, that would be 2.0 if all goes well.

> Which components are still required to start using categorical features

For the Java interface, I think we can already get some small examples running, but we haven't been able to prioritize it yet. The feature_type and a supported tree_method are all it needs. However, my understanding is that most users prefer the Scala binding over the Java binding, so we need to extend the feature info setter/getter to Scala and integrate it appropriately with the Spark estimator interface.
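As a rough illustration of the two ingredients mentioned above (per-column feature types and a supported tree method), the sketch below builds both using only the Java standard library. Several things are assumptions rather than confirmed API: the class and helper names are hypothetical, the type tags `"c"` (categorical) and `"q"` (numeric) follow the convention used by the Python binding, and the xgboost4j calls shown in comments, in particular `DMatrix.setFeatureTypes`, are assumed from #7966 and should be checked against the version in use.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class CategoricalSetupSketch {
    // One type tag per column: "c" for categorical columns, "q" for everything else.
    public static String[] featureTypes(int numFeatures, int[] categoricalColumns) {
        String[] types = new String[numFeatures];
        Arrays.fill(types, "q");
        for (int col : categoricalColumns) {
            types[col] = "c";
        }
        return types;
    }

    // Training parameters; the values are placeholders chosen for illustration.
    public static Map<String, Object> trainingParams() {
        Map<String, Object> params = new HashMap<>();
        params.put("tree_method", "hist");       // a tree method with categorical support
        params.put("max_cat_to_onehot", 4);      // threshold between one-hot and partition splits
        params.put("objective", "reg:squarederror");
        return params;
    }

    public static void main(String[] args) {
        String[] types = featureTypes(4, new int[] {1, 3});
        Map<String, Object> params = trainingParams();
        System.out.println(String.join(",", types));
        System.out.println(params.get("tree_method"));

        // With xgboost4j on the classpath, the remaining steps would presumably be
        // (assumed API, not verified here):
        //   DMatrix dtrain = new DMatrix(data, nRows, nCols, Float.NaN);
        //   dtrain.setLabel(labels);
        //   dtrain.setFeatureTypes(types);
        //   Booster booster = XGBoost.train(dtrain, params, 10, new HashMap<>(), null, null);
    }
}
```

Categorical columns are expected to hold integer category codes stored as floats, so any string categories would need to be encoded before the matrix is built.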

wbo4958 · 2023-01-30T00:46:35Z

Please see this comment: #7802 (comment)

trivialfis added the feature-request label Jan 29, 2023

wbo4958 mentioned this issue Jan 30, 2023

[jvm-packages] bridge the gaps between jvm package and native xgboost #7802

