Trained distributed XGBoost model is much bigger than Python single-node XGBoost model with the same hyper-parameters #6044
Can you dump out the trees and see the difference? If you can use the 1.2 RC (#5970), you can also try the JSON output and compare, for example, the total number of trees.
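A minimal Python sketch of that kind of inspection, assuming the model has been exported in XGBoost's native format (file names below are hypothetical):

```python
import xgboost as xgb

# Hypothetical file name; point this at the exported model.
bst = xgb.Booster()
bst.load_model("distributed_model.bin")

# Text dump of every tree; the list length is the total number of trees.
dump = bst.get_dump(with_stats=True)
print("number of trees:", len(dump))
print(dump[0])  # inspect the first tree

# With XGBoost 1.2 (as suggested above) the model can also be saved as JSON,
# which makes diffing the two models much easier.
bst.save_model("distributed_model.json")
```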
Unrelated: I'm not sure why you would set max_depth to 50; that allows up to 2^50 = 1,125,899,906,842,624 leaves. I don't think it makes sense to have something that huge for a dataset of 6 million records.
I think there might be integer overflow during training, as tree nodes are indexed with 32-bit integers...
Ping @ShvetsKS @SmirnovEgorRu, would you please help take a look?
We have identified the root cause and resolved the issue. There are two reasons why the Spark model is so different from the single-node model:
A large max_depth may lead to a large model, but the depth the model actually reaches is determined by the dataset itself; e.g. we eventually set max_depth to 100, yet in the end the model's actual maximum depth was only 30~40.
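To check the depth the trees actually reach (as opposed to the configured max_depth), a small Python sketch along these lines can be used; the file name is hypothetical and the model is assumed to be in XGBoost's native format:

```python
import xgboost as xgb

bst = xgb.Booster()
bst.load_model("distributed_model.bin")  # hypothetical file name

def tree_depth(tree_text: str) -> int:
    # In the text dump, each node line is indented with one tab per level.
    levels = [len(line) - len(line.lstrip("\t"))
              for line in tree_text.splitlines() if line.strip()]
    return max(levels) if levels else 0

depths = [tree_depth(tree) for tree in bst.get_dump()]
print("deepest tree:", max(depths))
print("average depth:", sum(depths) / len(depths))
```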
Thank you for the investigation. I have increased maxBins and repartitioned the data frame before feeding it into the distributed XGBoost trainer, and the model size now matches the single-node Python model.
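For the data-side part of that fix, a minimal PySpark sketch (the input path and worker count are placeholders; the maxBins increase itself is configured on the XGBoost4J-Spark estimator and is not shown here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input path.
train_df = spark.read.parquet("hdfs:///path/to/training_data")

# Repartition so each XGBoost worker gets a similarly sized, shuffled slice
# of the data before the DataFrame is handed to the distributed trainer.
num_workers = 16  # assumption: match the num_workers passed to XGBoost4J-Spark
train_df = train_df.repartition(num_workers)
```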
Trained a distributed XGBRegressor with XGBoost4J-Spark, and also trained a single-node Python XGBRegressor with the same hyper-parameters.
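For reference, a minimal single-node setup of this shape (all hyper-parameters, column names, and paths below are placeholders, not the values used in this report) might look like:

```python
import pandas as pd
import xgboost as xgb

# Placeholder data loading; the real feature pipeline is not shown here.
train = pd.read_parquet("training_data.parquet")
X, y = train.drop(columns=["label"]), train["label"]

# Placeholder hyper-parameters.
model = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=10,
    learning_rate=0.1,
)
model.fit(X, y)

# Save the underlying booster in XGBoost's native format.
model.get_booster().save_model("single_node_model.bin")
```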
The two models were trained on the same training dataset, which has more than 6 million records. After saving both models to disk, we found the model sizes were very different: 2 GB for the distributed XGBoost model versus 350 MB for the single-node Python XGBoost model.
xgboost version is 0.9