We use XGBoost through the Java binding (outside of Spark), and we have a strong appetite for categorical feature support in which splits partition the categories of a feature into subsets, as opposed to one-hot encoding the feature and having XGBoost consider each category separately. The release notes for v1.6 state:
"In the future, we will continue to improve categorical data support with new features and
optimizations. Also, we are looking forward to bringing the feature beyond Python binding,
contributions and feedback are welcomed! Lastly, as a result of experimental status, the
behavior might be subject to change, especially the default value of related
hyper-parameters."
I'm raising this issue because I'm wondering what the status is of the Java binding for the experimental parameters related to categorical features. Concretely:
1. Is there already a way to tell the native C code which columns in the DMatrix should be treated as categorical and which as numeric?
2. Provided that we have some way to encode the feature type in the DMatrix or elsewhere, how do we communicate that to the C binding? (There has to be some way to achieve this, since the Python binding already exists.)
3. Is there an appetite among the XGBoost maintainers to release such a Java binding in a stable version within the next 3-6 months, say, provided we contribute a PR that satisfies the general requirements for an XGBoost PR?
It seems that some work has already been done on the first two items in (#7966), so perhaps the more general question is:
Which components are still required to start using categorical features (based on subset partitioning) in Java?
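To make the distinction in the question concrete, here is a minimal plain-Java sketch comparing one-hot style splits (one category against all the rest) with subset-partition splits on a single categorical feature. It does not call XGBoost. The class name, the toy gradient statistics, and the exhaustive enumeration of subsets are all assumptions made for illustration, and the gain is the usual regularised second-order gain up to constant factors; XGBoost's real split search is not reproduced here.

```java
public class CategoricalSplitSketch {
    // L2 regularisation on leaf weights (illustrative value, not a claimed XGBoost default).
    static final double LAMBDA = 1.0;

    // Structure score of a node holding gradient sum g and hessian sum h.
    static double score(double g, double h) {
        return g * g / (h + LAMBDA);
    }

    // Gain of sending the categories whose bit is set in leftMask to the left child
    // and every other category to the right child.
    public static double gain(double[] grad, double[] hess, int leftMask) {
        double gl = 0, hl = 0, gr = 0, hr = 0;
        for (int k = 0; k < grad.length; k++) {
            if ((leftMask & (1 << k)) != 0) {
                gl += grad[k];
                hl += hess[k];
            } else {
                gr += grad[k];
                hr += hess[k];
            }
        }
        return score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr);
    }

    // One-hot style: only "category k versus all others" splits are candidates.
    public static double bestOneHotGain(double[] grad, double[] hess) {
        double best = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < grad.length; k++) {
            best = Math.max(best, gain(grad, hess, 1 << k));
        }
        return best;
    }

    // Subset partitioning: every non-empty proper subset is a candidate.
    // Exhaustive enumeration is only practical for a handful of categories.
    public static double bestPartitionGain(double[] grad, double[] hess) {
        double best = Double.NEGATIVE_INFINITY;
        int full = (1 << grad.length) - 1;
        for (int mask = 1; mask < full; mask++) {
            best = Math.max(best, gain(grad, hess, mask));
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical per-category gradient and hessian sums for four categories.
        double[] grad = {-4.0, 3.0, -3.5, 2.5};
        double[] hess = {2.0, 2.0, 2.0, 2.0};
        System.out.println("best one-hot gain:   " + bestOneHotGain(grad, hess));
        System.out.println("best partition gain: " + bestPartitionGain(grad, hess));
    }
}
```

Every one-hot split is also a valid partition, so the partition search can never do worse on the same statistics. In this toy data it does strictly better by grouping the two categories with negative gradients together.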
> Is there an appetite at XGboost maintainers to release such a Java binding in a stable version any time in the next 3-6 months say

Yes, that would be 2.0 if all goes well.

> Which components are still required to start using categorical features
For the Java interface, I think we can already get some small examples running, but we haven't been able to prioritize it yet. Setting the feature_type and using one of the supported tree_method values are all it needs. However, my understanding is that most users prefer the Scala binding over the Java binding, and we need to extend the feature info setter/getter to Scala and integrate it appropriately with the Spark estimator interface.
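As a rough illustration of the two pieces mentioned above, the following plain-Java sketch builds a per-column feature-type array and a parameter map. It deliberately does not call XGBoost4J, so it runs without the library. The class and method names are hypothetical. The "c" (categorical) and "q" (quantitative) codes and the `tree_method` and `max_cat_to_onehot` parameters follow the XGBoost documentation as I understand it, but the way these values are handed to the Java binding is an assumption and should be checked against the installed version.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class CategoricalConfigSketch {
    // Builds one feature-type code per column: "c" for categorical, "q" for numeric.
    public static String[] featureTypes(int numColumns, Set<Integer> categoricalColumns) {
        String[] types = new String[numColumns];
        for (int col = 0; col < numColumns; col++) {
            types[col] = categoricalColumns.contains(col) ? "c" : "q";
        }
        return types;
    }

    // Training parameters that select a tree method with categorical support.
    public static Map<String, Object> trainingParams() {
        Map<String, Object> params = new LinkedHashMap<>();
        params.put("tree_method", "hist");
        // Assumption: a threshold of 1 keeps XGBoost from falling back to
        // one-hot style splits for features with few categories.
        params.put("max_cat_to_onehot", 1);
        return params;
    }

    public static void main(String[] args) {
        // Hypothetical layout: columns 1 and 3 hold integer-coded categories.
        String[] types = featureTypes(4, Set.of(1, 3));
        System.out.println(String.join(",", types));
        System.out.println(trainingParams());
        // With XGBoost4J on the classpath, the array would presumably be passed to
        // the DMatrix feature-type setter discussed in #7966 and the map to the
        // training call. Those calls are not shown because they are assumptions here.
    }
}
```

Keeping the feature types as a plain array alongside the parameter map mirrors what the reply describes: the column types travel with the data, and the tree method travels with the training parameters.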
I see that this feature request is on the roadmap, and we could contribute to help the process move forward.