Description
Hi Community!
I'm seeing strange behavior in the XGBoost module when I use it with categorical data. I have attached a sample file here with comments and results to reproduce the issue (link).
During training I enabled categorical support and followed the guidelines in the documentation: I encoded the categories to lie in the range [0, num_categories). In the training phase, the data in the DMatrix is transformed as expected.
But during the test phase, when I apply the same transformations and enable categorical support, the transformation in the DMatrix for a big (>2 rows) dataframe doesn't follow the encoding used during training. Also, if I make predictions row by row, every value of the categorical feature gets encoded as 0.
Strangely, if I treat those categorical features as integers, I get the expected transformation in the DMatrix. But the predictions then differ from the ones I get when treating the features as categorical, and I'm not sure which result (if either) is correct.
I think there is a problem with how categories are transformed during the test phase on a completely new dataset. XGBoost seems to re-encode the provided categories to lie in the range [0, num_categories) of the new data, regardless of how they were encoded during training. For example, suppose that during training I had 100 unique categories, each within the range [0, 100). If during the test phase I provide the same categorical column with original cat_id values 89 and 98, XGBoost transforms them to cat_id 0 and 1, which I think it shouldn't.
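To make the suspected mechanism concrete, here is a minimal pure-Python sketch. It does not call XGBoost or pandas; it only imitates the assumption that category codes are derived from the values present in each frame, and contrasts that with codes derived from a fixed list of training categories. The helper names `per_frame_codes` and `shared_codes` are hypothetical and exist only for this illustration.

```python
def per_frame_codes(values):
    """Assign codes from the sorted unique values found in this frame only."""
    categories = sorted(set(values))
    lookup = {cat: code for code, cat in enumerate(categories)}
    return [lookup[v] for v in values]


def shared_codes(values, categories):
    """Assign codes from a fixed category list; unseen values become -1."""
    lookup = {cat: code for code, cat in enumerate(categories)}
    return [lookup.get(v, -1) for v in values]


# Assumption: 100 categories with ids 0..99 were seen during training.
train_categories = list(range(100))
test_values = [89, 98]

# Codes recomputed from the test frame alone lose the training encoding.
print(per_frame_codes(test_values))                 # [0, 1]

# Codes looked up in the training category list keep the original ids.
print(shared_codes(test_values, train_categories))  # [89, 98]

# A single-row frame always maps its only value to code 0.
print(per_frame_codes([98]))                        # [0]
```

If this is indeed what happens, it would explain both symptoms: 89 and 98 becoming 0 and 1, and every row-by-row prediction seeing code 0. With pandas, the analogous workaround would be to build one `pandas.CategoricalDtype` from the training categories and apply it to both the training and the test frame, so that both share the same codes.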
I would like to hear from the community whether I'm missing something or whether this is unexpected behavior.
Edit (14 Oct 2023): I'm also not sure whether there should be a parameter to validate that the DMatrix passed to the model for prediction has the same feature data types as the ones used during training. I mean something similar to the "validate_features" parameter of the predict method, but for validating the types of the supplied data.
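As a rough sketch of the kind of check I have in mind, the hypothetical helper below compares two lists of feature type strings and raises on any mismatch. It is not an existing XGBoost function. I am assuming the lists could be taken from the `feature_types` attributes of the trained `Booster` and of the prediction `DMatrix`, where `"c"` denotes a categorical feature.

```python
def check_feature_types(trained_types, supplied_types):
    """Raise ValueError if the supplied feature types differ from training."""
    if len(trained_types) != len(supplied_types):
        raise ValueError(
            f"expected {len(trained_types)} features, got {len(supplied_types)}"
        )
    # Collect every position where the declared type differs.
    mismatches = [
        (index, trained, supplied)
        for index, (trained, supplied) in enumerate(
            zip(trained_types, supplied_types)
        )
        if trained != supplied
    ]
    if mismatches:
        details = ", ".join(
            f"feature {index}: trained as {trained!r}, got {supplied!r}"
            for index, trained, supplied in mismatches
        )
        raise ValueError("feature type mismatch: " + details)


# Same types as in training: passes silently.
check_feature_types(["c", "float"], ["c", "float"])

# Categorical feature supplied as integer: reported as a mismatch.
try:
    check_feature_types(["c", "float"], ["int", "float"])
except ValueError as err:
    print(err)
```

A check like this would only catch a type mismatch such as categorical versus integer. It would not catch the re-encoding described above, since the feature is declared categorical in both cases.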