Revert hist init optimization. #4502

trivialfis · 2019-05-25T22:56:00Z

@SmirnovEgorRu @hcho3 @RAMitchell

See #4433 (comment) for details.

hcho3 · 2019-05-25T23:29:08Z

I will review this shortly.

In general, I think we should pick a few datasets with different characteristics and run benchmarks:

Dense dataset with small number of columns
Sparse dataset with high number of columns
Sparse dataset whose columns are generated by high-cardinality categorical variables

We should also have Jenkins run the benchmarks every 2 weeks or so to check for performance regressions.
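
A minimal sketch of what such a benchmark harness might look like, using only the Python standard library. Everything here is hypothetical and not part of XGBoost or this PR: the generator names, the `benchmark` and `regressions` helpers, and the 10% tolerance are assumptions for illustration. Rows are represented as `{column_index: value}` dictionaries, and `train_fn` is a stand-in for whatever training call is being timed.

```python
import random
import statistics
import time


def make_dense(n_rows, n_cols, seed=0):
    """Dense data: every row has a value for every column."""
    rng = random.Random(seed)
    return [{j: rng.random() for j in range(n_cols)} for _ in range(n_rows)]


def make_sparse(n_rows, n_cols, nnz_per_row, seed=0):
    """Sparse data: each row has only a few non-zero columns out of many."""
    rng = random.Random(seed)
    return [
        {j: rng.random() for j in rng.sample(range(n_cols), nnz_per_row)}
        for _ in range(n_rows)
    ]


def make_one_hot(n_rows, cardinalities, seed=0):
    """Sparse data from one-hot encoding high-cardinality categoricals."""
    rng = random.Random(seed)
    # Each categorical variable owns a contiguous block of columns.
    offsets = [sum(cardinalities[:i]) for i in range(len(cardinalities))]
    return [
        {off + rng.randrange(card): 1.0 for off, card in zip(offsets, cardinalities)}
        for _ in range(n_rows)
    ]


def benchmark(train_fn, datasets, repeats=3):
    """Return the median wall-clock time of train_fn on each named dataset."""
    results = {}
    for name, rows in datasets.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            train_fn(rows)
            timings.append(time.perf_counter() - start)
        results[name] = statistics.median(timings)
    return results


def regressions(current, baseline, tolerance=0.10):
    """Names whose current time exceeds the baseline by more than tolerance."""
    return sorted(
        name for name, t in current.items()
        if name in baseline and t > baseline[name] * (1.0 + tolerance)
    )
```

A periodic CI job could store the output of `benchmark` from a known-good commit as the baseline and fail when `regressions` returns a non-empty list. Synthetic data is only a rough proxy; real datasets would be preferable where available.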

trivialfis · 2019-05-26T00:58:31Z

That would be a very good idea.

thvasilo · 2019-06-03T09:43:48Z

@hcho3 The datasets we used in our paper would be good candidates for that.

Bosch: Dense with ~970 features: https://www.kaggle.com/c/bosch-production-line-performance/data
URL: Sparse with 3.2M features, generated from text as well as numerical features: http://www.sysnet.ucsd.edu/projects/url/
Avazu: Sparse with 1M features: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html and https://www.kaggle.com/c/avazu-ctr-prediction/data
RCV1: Sparse with ~47k features, also available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

I'm not aware of high-dimensional sparse datasets that were not generated by high-cardinality categorical variables.

One thing to note about URL vs. Avazu is that Avazu, strangely, uses the value 0.258 (instead of 1) to indicate the presence of a variable. I'm not sure whether that affects parsing (i.e. whether it gets treated as a dense variable when it isn't).
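
One way to check which values a dataset actually uses for its indicators is to scan the file directly. The following is a minimal sketch, assuming the data is in the plain-text LIBSVM format (`label index:value index:value ...`); the helper name `distinct_feature_values` is hypothetical and is not part of XGBoost or LIBSVM.

```python
def distinct_feature_values(lines):
    """Collect the distinct feature values found in LIBSVM-format lines."""
    values = set()
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        for token in line.split()[1:]:  # the first token is the label
            if token.startswith("qid:"):  # ranking group id, not a feature
                continue
            _, value = token.split(":", 1)
            values.add(float(value))
    return values
```

For example, `distinct_feature_values(open(path))` on a file whose indicators all share one constant would return a single-element set. This only shows what is stored in the file; it says nothing about how a given parser treats those values.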

Revert hist init optimization.

81985cd

trivialfis requested review from hcho3 May 25, 2019 22:57

hcho3 approved these changes May 26, 2019


trivialfis merged commit 55e645c into dmlc:master May 26, 2019

trivialfis deleted the revert-hist branch May 26, 2019 10:23

lock bot locked as resolved and limited conversation to collaborators Sep 1, 2019
