
Ensure number of features in test matrix equal to model #4594

Closed
wants to merge 1 commit

Conversation

RAMitchell
Member

A bug occurred on Spark where the user was able to pass a test matrix with more features than the training matrix, leading to a difficult-to-interpret crash during inference.

This is already prevented via feature validation in Python. I have added this check at the native level so that it applies to all language bindings.
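As a rough illustration of the kind of check described above, here is a hedged Python sketch (the name `validate_num_features` and the exact message are hypothetical; the actual fix lives in the C++ code): a prediction matrix wider than the model was trained on would index past the learned trees, so it must be rejected up front.

```python
def validate_num_features(booster_num_features: int, dmatrix_num_cols: int) -> None:
    """Reject a prediction matrix wider than the model's training matrix.

    Hypothetical sketch of the native-level check this PR adds: trees index
    features by column position, so extra columns in the test matrix cannot
    be interpreted. A narrower matrix is tolerated, since absent trailing
    features can be treated as missing.
    """
    if dmatrix_num_cols > booster_num_features:
        raise ValueError(
            f"Number of columns ({dmatrix_num_cols}) exceeds the number of "
            f"features in the booster ({booster_num_features})"
        )

validate_num_features(126, 125)  # narrower matrix: accepted
try:
    validate_num_features(126, 127)  # wider matrix: rejected
except ValueError:
    print("rejected")
```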

@RAMitchell
Member Author

Having this check causes the Spark tests to fail, which is very worrying. This condition should be absolutely necessary for correct results. @CodingCat do you have any ideas?

@rongou
Contributor

rongou commented Jun 21, 2019

I think I've seen this before. Probably some issue in the prediction batching code. My workaround was to do something like

model.setInferBatchSize(10 << 20).transform(testDF)

so that the batch size is big enough to cover the whole test dataset.

I didn't debug this too deeply, but a couple of theories:

  • The infer batch size is too small; depending on the data, some batches might be missing a feature completely.
  • There is an off-by-one error somewhere. :)
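The first theory above can be made concrete with a small sketch (pure illustration; the function name and sparse-row representation are hypothetical, not XGBoost's actual code): if a matrix's width is inferred per batch from the maximum column index seen in its data, a batch in which no row touches the last feature appears narrower than the full dataset.

```python
def infer_num_cols(rows):
    """Infer matrix width from sparse rows given as (col_index, value) pairs.

    Hypothetical illustration of the failure mode: width inferred per batch
    from the data shrinks whenever a batch happens to miss trailing features.
    """
    return 1 + max((col for row in rows for col, _ in row), default=-1)

dataset = [
    [(0, 1.0), (2, 3.0)],   # touches feature 2
    [(0, 2.0), (1, 5.0)],   # only features 0 and 1
]

print(infer_num_cols(dataset))       # full dataset: 3 columns
print(infer_num_cols(dataset[1:]))   # batch missing feature 2: looks like 2 columns
```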

@rongou
Contributor

rongou commented Jun 21, 2019

@RAMitchell @CodingCat I think the test that's failing is "infrequent features" (https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifierSuite.scala#L287). It partitions 4 rows into two partitions. One partition only contains rows with 2 features, while the other only has rows with 3 features. This mismatch causes the check to fail.

In this particular test we can muck with the test data to make it pass, but in general it seems quite possible that different partitions might end up with different numbers of features.

I'm not sure what the proper fix is. Perhaps we need to keep track of the number of features for the whole dataset, and not just infer it from each partition? @CodingCat what do you think?
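The suggestion above — tracking the feature count for the whole dataset rather than inferring it per partition — can be sketched as follows (a hypothetical illustration; these helper names are not XGBoost4J-Spark APIs). Per-partition inference gives each partition its own width, and the dataset-level width is simply the maximum over partitions:

```python
def num_features_per_partition(partitions):
    # Per-partition inference: each partition only sees its own max column index.
    return [1 + max(col for row in part for col, _ in row) for part in partitions]

def num_features_global(partitions):
    # Dataset-level inference: take the maximum width over all partitions.
    return max(num_features_per_partition(partitions))

partitions = [
    [[(0, 1.0), (1, 2.0)]],             # partition whose rows have 2 features
    [[(0, 1.0), (1, 2.0), (2, 3.0)]],   # partition whose rows have 3 features
]
print(num_features_per_partition(partitions))  # [2, 3] -- the mismatch that trips the check
print(num_features_global(partitions))         # 3
```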

@rongou
Contributor

rongou commented Jun 24, 2019

Maybe one way to fix this is to add a check in the Java/Scala code to make sure the eval data and the booster have identical features. This is what's done on the Python side:

https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L1668

As it currently stands, this seems like a pretty bad bug that causes incorrect prediction results on the CPU side.
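For context, the Python-side validation linked above compares the booster's recorded feature names against the incoming data's. A simplified, hedged sketch of that idea (not the actual core.py implementation, which handles more cases):

```python
def validate_features(model_feature_names, data_feature_names):
    """Simplified sketch of Python-side feature validation: the booster's
    recorded feature names must match the eval data's exactly; otherwise
    report what is missing and what is extra.
    """
    if model_feature_names != data_feature_names:
        missing = set(model_feature_names) - set(data_feature_names)
        extra = set(data_feature_names) - set(model_feature_names)
        raise ValueError(
            f"feature_names mismatch: missing={sorted(missing)} extra={sorted(extra)}"
        )

validate_features(["f0", "f1", "f2"], ["f0", "f1", "f2"])  # OK
try:
    validate_features(["f0", "f1", "f2"], ["f0", "f1"])
except ValueError:
    print("mismatch detected")
```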

@CodingCat

@rongou
Contributor

rongou commented Jun 26, 2019

Hmm, the CLI version seems to be happy with mismatched datasets (I removed the last feature from the test file):

$ ../../xgboost mushroom.conf tree_method=gpu_hist
[10:31:18] 6541x126 matrix with 143902 entries loaded from agaricus.txt.train
[10:31:18] 1583x125 matrix with 34217 entries loaded from agaricus.txt.test
[10:31:19] [0]	test-error:0.015793	train-error:0.014524
[10:31:19] [1]	test-error:0.000000	train-error:0.001223

Need to debug this more.

@CodingCat
Member

looking

@CodingCat
Member

[screenshot of CI results omitted]

according to this screenshot, isn't … the failed one?

@CodingCat
Member

hmmm... the Mac build failed differently

@CodingCat
Member

CodingCat commented Jun 26, 2019

my 2 cents

  1. block differing feature counts between the eval and training datasets

  2. have some way to enforce the number of columns in the prediction batching code

  3. add https://github.com/dmlc/xgboost/blob/master/src/data/data.cc#L229 to the code path starting from https://github.com/dmlc/xgboost/blob/master/src/c_api/c_api.cc#L253
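Option 2 above can be sketched as follows (a hypothetical illustration, not the actual prediction batching code): compute the column count once from the whole dataset and carry it with every batch, rather than letting each batch re-infer its own width.

```python
def batches_with_fixed_width(rows, batch_size, num_cols):
    """Slice prediction input into batches, attaching the dataset-level
    column count to each batch so no batch re-infers its own width.
    (Hypothetical sketch of enforcing num_cols in batching code.)
    """
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size], num_cols

rows = [[(0, 1.0)], [(2, 3.0)], [(1, 2.0)]]
num_cols = 1 + max(col for row in rows for col, _ in row)  # computed once: 3
for batch, width in batches_with_fixed_width(rows, 2, num_cols):
    print(len(batch), width)   # every batch reports width 3
```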

@CodingCat
Member

If different partitions get different numbers of features, it will fail for versions <= 0.81; after that we have https://github.com/dmlc/xgboost/blob/master/include/xgboost/data.h#L174, so the failing Spark test essentially comes down to this piece of C++ code.

But now it looks to me like this is not the right fix; it may cause the offset_vec to shift for features.
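To see why a shifted offset_vec is dangerous, here is a minimal sketch of the CSR layout XGBoost uses for sparse data (illustrative only): `offsets[i]..offsets[i+1]` delimit row i's entries, so any fix that pads entries into the data arrays without updating every subsequent offset makes later rows pick up the wrong slice of values.

```python
# CSR layout: row i owns entries in the half-open range offsets[i]..offsets[i+1].
values  = [1.0, 3.0, 2.0, 5.0]
cols    = [0, 2, 0, 1]
offsets = [0, 2, 4]   # row 0 -> entries 0..2, row 1 -> entries 2..4

# Reading row 1 relies entirely on offsets being consistent with the arrays.
row1 = list(zip(cols[offsets[1]:offsets[2]], values[offsets[1]:offsets[2]]))
print(row1)   # [(0, 2.0), (1, 5.0)]
```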
