Ensure number of features in test matrix equal to model #4594
Conversation
Having this check in place causes the Spark tests to fail, which is very worrying: this condition should be absolutely necessary for correct results. @CodingCat do you have any ideas? |
I think I've seen this before. It's probably some issue in the prediction batching code. My workaround was to do something like model.setInferBatchSize(10 << 20).transform(testDF) so that the batch size is big enough to cover the whole test dataset. I didn't debug this too deeply, but I have a couple of theories:
|
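For concreteness, a minimal sketch of that workaround, assuming an already-fitted xgboost4j-spark model and a test DataFrame (the setInferBatchSize call is taken from the comment above and may not be available in all versions):

```scala
// Workaround sketch: make the prediction batch large enough to cover the
// whole test set, so batched inference never operates on a partial view
// of the data. `model` is assumed to be an already-fitted
// XGBoostClassificationModel and `testDF` the test DataFrame;
// 10 << 20 (~10.5M rows) just means "bigger than the whole dataset".
val predictions = model
  .setInferBatchSize(10 << 20)
  .transform(testDF)
```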
@RAMitchell @CodingCat I think the test that's failing is "infrequent features" (https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifierSuite.scala#L287). It partitions 4 rows into two partitions. One partition only contains rows with 2 features, while the other only has rows with 3 features. This mismatch causes the check to fail. In this particular test we can muck with the test data to make it pass, but in general it seems quite possible for different partitions to end up with different numbers of features. I'm not sure what the proper fix is. Perhaps we need to keep track of the number of features for the whole dataset, and not just infer it from each partition? @CodingCat what do you think? |
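To make the failure mode concrete, here is a contrived sketch (not the actual test data) of how two partitions of sparse rows can imply different feature counts when dimensionality is inferred per partition:

```scala
import org.apache.spark.ml.linalg.Vectors

// Partition A only holds rows whose highest active feature index is 1,
// while partition B holds rows that use feature index 2. Inferring the
// feature count as max(active index) + 1 per partition yields 2 vs. 3,
// even though both partitions come from the same 3-feature dataset.
val partitionA = Seq(Vectors.sparse(3, Array(0, 1), Array(1.0, 2.0)))
val partitionB = Seq(Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

val inferredA = partitionA.map(_.toSparse.indices.max).max + 1 // 2
val inferredB = partitionB.map(_.toSparse.indices.max).max + 1 // 3
```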
Maybe one way to fix this is to add a check in the Java/Scala code to make sure the eval data and the booster have identical features. This is what's done on the Python side: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L1668 As it currently stands, this seems like a pretty bad bug that causes incorrect prediction results on the CPU side. |
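A sketch of the kind of guard being proposed on the JVM side, mirroring the Python validation linked above; the feature-count arguments here are placeholders, since how to obtain both counts is exactly what is under discussion:

```scala
// Hypothetical guard: reject eval data whose width differs from the model's.
// `boosterNumFeatures` and `evalNumFeatures` must be supplied by the caller;
// xgboost4j does not necessarily expose them directly.
def validateFeatures(boosterNumFeatures: Long, evalNumFeatures: Long): Unit =
  require(
    boosterNumFeatures == evalNumFeatures,
    s"Feature size mismatch: model expects $boosterNumFeatures features " +
      s"but eval data has $evalNumFeatures"
  )
```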
Hmm, the CLI version seems to be happy with mismatched datasets (I removed the last feature from the test file):

```
$ ../../xgboost mushroom.conf tree_method=gpu_hist
[10:31:18] 6541x126 matrix with 143902 entries loaded from agaricus.txt.train
[10:31:18] 1583x125 matrix with 34217 entries loaded from agaricus.txt.test
[10:31:19] [0] test-error:0.015793 train-error:0.014524
[10:31:19] [1] test-error:0.000000 train-error:0.001223
```

Need to debug this more. |
looking |
(Code reference: line 48 in commit 797ba8e) |
Hmmm... the Mac build failed differently |
My 2 cents: if different partitions get different numbers of features, it will fail for versions <= 0.81. After that we have https://github.com/dmlc/xgboost/blob/master/include/xgboost/data.h#L174, so the failing Spark test essentially comes down to this piece of C++ code. But it does not look like the right fix to me; it may cause offset_vec to shift for features. |
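For readers following the offset_vec concern: in the CSR layout the C++ code manipulates, row offsets index into the flat nonzero arrays, not into columns. A toy Scala rendering of two sparse rows, under that assumption:

```scala
// Toy CSR layout for rows (f0=1, f1=2) and (f0=1, f2=3):
val values  = Array(1.0f, 2.0f, 1.0f, 3.0f) // all nonzeros, back to back
val indices = Array(0, 1, 0, 2)             // column index of each nonzero
val offsets = Array(0L, 2L, 4L)             // row i spans [offsets(i), offsets(i + 1))

// Widening the inferred column count is metadata-only: values, indices and
// offsets stay put. The worry above is a "fix" that instead moves offsets.
```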
A bug occurred on Spark where the user was able to pass a test matrix with more features than the training matrix, leading to a difficult-to-interpret crash during inference.
This is already prevented via feature validation in Python. I have added this check at the native level to prevent it from occurring in all language bindings.
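A sketch of the scenario the new check covers, using the xgboost4j Scala API on libsvm files like those in the CLI experiment above (file paths and the truncated test file are illustrative):

```scala
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

// Train on the full 126-feature matrix, then predict on a test file with
// the last feature stripped out (125 columns), as in the CLI run quoted
// earlier in this thread.
val dtrain = new DMatrix("agaricus.txt.train")
val dtest  = new DMatrix("agaricus.txt.test") // last feature removed

val params  = Map("objective" -> "binary:logistic", "max_depth" -> 2)
val booster = XGBoost.train(dtrain, params, round = 2)

// With the native check in place, a column-count mismatch between `dtest`
// and the model should now surface as an XGBoostError at predict time
// rather than a crash or silently wrong predictions.
val preds = booster.predict(dtest)
```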