
Ensure number of features in test matrix equal to model #4594

Closed
wants to merge 1 commit

Conversation

RAMitchell
Member

A bug occurred on Spark where the user was able to pass a test matrix with more features than the training matrix, leading to a difficult-to-interpret crash during inference.

This is already prevented via feature validation in Python. I have added this check at the native level so that it applies to all language bindings.
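As a rough illustration of the kind of check described above, here is a hedged Python sketch (the name `validate_num_features` and the exact message are hypothetical; the actual fix lives in the C++ code): a prediction matrix wider than the model was trained on would index past the learned trees, so it must be rejected up front.

```python
def validate_num_features(booster_num_features: int, dmatrix_num_cols: int) -> None:
    """Reject a prediction matrix wider than the model's training matrix.

    Hypothetical sketch of the native-level check this PR adds: trees index
    features by column position, so extra columns in the test matrix cannot
    be interpreted. A narrower matrix is tolerated, since absent trailing
    features can be treated as missing.
    """
    if dmatrix_num_cols > booster_num_features:
        raise ValueError(
            f"Number of columns ({dmatrix_num_cols}) exceeds the number of "
            f"features in the booster ({booster_num_features})"
        )

validate_num_features(126, 125)  # narrower matrix: accepted
try:
    validate_num_features(126, 127)  # wider matrix: rejected
except ValueError:
    print("rejected")
```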

@RAMitchell
Member Author

Having this check causes the Spark tests to fail, which is very worrying. This condition should be absolutely necessary for correct results. @CodingCat do you have any ideas?

@rongou
Contributor

rongou commented Jun 21, 2019

I think I've seen this before. Probably some issue in the prediction batching code. My workaround was to do something like

model.setInferBatchSize(10 << 20).transform(testDF)

so that the batch size is big enough to cover the whole test dataset.

I didn't debug this too deeply, but a couple of theories:

  • The infer batch size is too small; depending on the data, some batches might be missing a feature completely.
  • There is an off-by-one error somewhere. :)
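The first theory above can be made concrete with a small sketch (pure illustration; the function name and sparse-row representation are hypothetical, not XGBoost's actual code): if a matrix's width is inferred per batch from the maximum column index seen in its data, a batch in which no row touches the last feature appears narrower than the full dataset.

```python
def infer_num_cols(rows):
    """Infer matrix width from sparse rows given as (col_index, value) pairs.

    Hypothetical illustration of the failure mode: width inferred per batch
    from the data shrinks whenever a batch happens to miss trailing features.
    """
    return 1 + max((col for row in rows for col, _ in row), default=-1)

dataset = [
    [(0, 1.0), (2, 3.0)],   # touches feature 2
    [(0, 2.0), (1, 5.0)],   # only features 0 and 1
]

print(infer_num_cols(dataset))       # full dataset: 3 columns
print(infer_num_cols(dataset[1:]))   # batch missing feature 2: looks like 2 columns
```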

@rongou
Contributor

rongou commented Jun 21, 2019

@RAMitchell @CodingCat I think the test that's failing is "infrequent features" (https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifierSuite.scala#L287). It partitions 4 rows into two partitions. One partition only contains rows with 2 features, while the other only has rows with 3 features. This mismatch causes the check to fail.

In this particular test we can muck with the test data to make it pass, but in general it seems quite possible that different partitions might end up with different numbers of features.

I'm not sure what the proper fix is. Perhaps we need to keep track of the number of features for the whole dataset, and not just infer it from each partition? @CodingCat what do you think?
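The suggestion above — tracking the feature count for the whole dataset rather than inferring it per partition — can be sketched as follows (a hypothetical illustration; these helper names are not XGBoost4J-Spark APIs). Per-partition inference gives each partition its own width, and the dataset-level width is simply the maximum over partitions:

```python
def num_features_per_partition(partitions):
    # Per-partition inference: each partition only sees its own max column index.
    return [1 + max(col for row in part for col, _ in row) for part in partitions]

def num_features_global(partitions):
    # Dataset-level inference: take the maximum width over all partitions.
    return max(num_features_per_partition(partitions))

partitions = [
    [[(0, 1.0), (1, 2.0)]],             # partition whose rows have 2 features
    [[(0, 1.0), (1, 2.0), (2, 3.0)]],   # partition whose rows have 3 features
]
print(num_features_per_partition(partitions))  # [2, 3] -- the mismatch that trips the check
print(num_features_global(partitions))         # 3
```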

@rongou
Contributor

rongou commented Jun 24, 2019

Maybe one way to fix this is to add a check in the Java/Scala code to make sure the eval data and the booster have identical features. This is what's done on the Python side:

https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L1668

As it currently stands, this seems like a pretty bad bug that causes incorrect prediction results on the CPU side.
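For context, the Python-side validation linked above compares the booster's recorded feature names against the incoming data's. A simplified, hedged sketch of that idea (not the actual core.py implementation, which handles more cases):

```python
def validate_features(model_feature_names, data_feature_names):
    """Simplified sketch of Python-side feature validation: the booster's
    recorded feature names must match the eval data's exactly; otherwise
    report what is missing and what is extra.
    """
    if model_feature_names != data_feature_names:
        missing = set(model_feature_names) - set(data_feature_names)
        extra = set(data_feature_names) - set(model_feature_names)
        raise ValueError(
            f"feature_names mismatch: missing={sorted(missing)} extra={sorted(extra)}"
        )

validate_features(["f0", "f1", "f2"], ["f0", "f1", "f2"])  # OK
try:
    validate_features(["f0", "f1", "f2"], ["f0", "f1"])
except ValueError:
    print("mismatch detected")
```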

@CodingCat

@rongou
Contributor

rongou commented Jun 26, 2019

Hmm, the CLI version seems to be happy with mismatched datasets (I removed the last feature from the test file):

$ ../../xgboost mushroom.conf tree_method=gpu_hist
[10:31:18] 6541x126 matrix with 143902 entries loaded from agaricus.txt.train
[10:31:18] 1583x125 matrix with 34217 entries loaded from agaricus.txt.test
[10:31:19] [0]	test-error:0.015793	train-error:0.014524
[10:31:19] [1]	test-error:0.000000	train-error:0.001223

Need to debug this more.

@CodingCat
Member

looking

@CodingCat
Member

[screenshot of CI results omitted]

according to this screenshot, isn't … the failed one?

@CodingCat
Member

hmmm... the Mac build failed differently

@CodingCat
Member

CodingCat commented Jun 26, 2019

my 2 cents

  1. block differing feature counts between the eval and training datasets

  2. have some way to enforce the number of columns in the prediction batching code

  3. add https://github.com/dmlc/xgboost/blob/master/src/data/data.cc#L229 to the code path starting from https://github.com/dmlc/xgboost/blob/master/src/c_api/c_api.cc#L253
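Option 2 above can be sketched as follows (a hypothetical illustration, not the actual prediction batching code): compute the column count once from the whole dataset and carry it with every batch, rather than letting each batch re-infer its own width.

```python
def batches_with_fixed_width(rows, batch_size, num_cols):
    """Slice prediction input into batches, attaching the dataset-level
    column count to each batch so no batch re-infers its own width.
    (Hypothetical sketch of enforcing num_cols in batching code.)
    """
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size], num_cols

rows = [[(0, 1.0)], [(2, 3.0)], [(1, 2.0)]]
num_cols = 1 + max(col for row in rows for col, _ in row)  # computed once: 3
for batch, width in batches_with_fixed_width(rows, 2, num_cols):
    print(len(batch), width)   # every batch reports width 3
```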

@CodingCat
Member

If different partitions get different numbers of features, it will fail for versions <= 0.81; after that we have https://github.com/dmlc/xgboost/blob/master/include/xgboost/data.h#L174, so the failing Spark test essentially comes down to this piece of C++ code.

But now it looks to me like this is not the right fix; it may cause the offset_vec to shift for features.
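To see why a shifted offset_vec is dangerous, here is a minimal sketch of the CSR layout XGBoost uses for sparse data (illustrative only): `offsets[i]..offsets[i+1]` delimit row i's entries, so any fix that pads entries into the data arrays without updating every subsequent offset makes later rows pick up the wrong slice of values.

```python
# CSR layout: row i owns entries in the half-open range offsets[i]..offsets[i+1].
values  = [1.0, 3.0, 2.0, 5.0]
cols    = [0, 2, 0, 1]
offsets = [0, 2, 4]   # row 0 -> entries 0..2, row 1 -> entries 2..4

# Reading row 1 relies entirely on offsets being consistent with the arrays.
row1 = list(zip(cols[offsets[1]:offsets[2]], values[offsets[1]:offsets[2]]))
print(row1)   # [(0, 2.0), (1, 5.0)]
```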
