Trained distributed XGBoost model is much bigger than Python single-node XGBoost model with the same hyper-parameters #6044
Can you dump out the trees and see the difference? If you can use the 1.2 RC (#5970), you can also try the JSON output and compare, for example, the total number of trees.
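A minimal Python sketch of that kind of inspection, assuming the model has been exported in XGBoost's native format (file names below are hypothetical):

```python
import xgboost as xgb

# Hypothetical file name; point this at the exported model.
bst = xgb.Booster()
bst.load_model("distributed_model.bin")

# Text dump of every tree; the list length is the total number of trees.
dump = bst.get_dump(with_stats=True)
print("number of trees:", len(dump))
print(dump[0])  # inspect the first tree

# With XGBoost 1.2 (as suggested above) the model can also be saved as JSON,
# which makes diffing the two models much easier.
bst.save_model("distributed_model.json")
```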
Unrelated: I'm not sure why you would set max_depth to 50; that allows up to 2^50 = 1,125,899,906,842,624 leaves. I don't think it makes sense to have something that huge for a dataset of 6 million records.
I think there might be integer overflow during training, as tree nodes are indexed with 32-bit integers...
Ping @ShvetsKS @SmirnovEgorRu, would you please help take a look?
We have identified the root cause and resolved the issue. There are two reasons why the Spark model is so different from the single-node model:
A large max_depth may lead to a large model, but the depth the model actually reaches is determined by the dataset itself; e.g. we eventually set max_depth to 100, yet in the end the model's actual maximum depth was only 30~40.
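To check the depth the trees actually reach (as opposed to the configured max_depth), a small Python sketch along these lines can be used; the file name is hypothetical and the model is assumed to be in XGBoost's native format:

```python
import xgboost as xgb

bst = xgb.Booster()
bst.load_model("distributed_model.bin")  # hypothetical file name

def tree_depth(tree_text: str) -> int:
    # In the text dump, each node line is indented with one tab per level.
    levels = [len(line) - len(line.lstrip("\t"))
              for line in tree_text.splitlines() if line.strip()]
    return max(levels) if levels else 0

depths = [tree_depth(tree) for tree in bst.get_dump()]
print("deepest tree:", max(depths))
print("average depth:", sum(depths) / len(depths))
```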
Thank you for the investigation. I have increased maxBins and repartitioned the data frame before feeding it into the distributed XGBoost trainer, and the model size now matches the single-node Python model.
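For the data-side part of that fix, a minimal PySpark sketch (the input path and worker count are placeholders; the maxBins increase itself is configured on the XGBoost4J-Spark estimator and is not shown here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input path.
train_df = spark.read.parquet("hdfs:///path/to/training_data")

# Repartition so each XGBoost worker gets a similarly sized, shuffled slice
# of the data before the DataFrame is handed to the distributed trainer.
num_workers = 16  # assumption: match the num_workers passed to XGBoost4J-Spark
train_df = train_df.repartition(num_workers)
```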
Trained a distributed XGBRegressor with XGBoost4J-Spark, and also trained a single-node Python XGBRegressor with the same hyper-parameters.
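For reference, a minimal single-node setup of this shape (all hyper-parameters, column names, and paths below are placeholders, not the values used in this report) might look like:

```python
import pandas as pd
import xgboost as xgb

# Placeholder data loading; the real feature pipeline is not shown here.
train = pd.read_parquet("training_data.parquet")
X, y = train.drop(columns=["label"]), train["label"]

# Placeholder hyper-parameters.
model = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=10,
    learning_rate=0.1,
)
model.fit(X, y)

# Save the underlying booster in XGBoost's native format.
model.get_booster().save_model("single_node_model.bin")
```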
The two models were trained on the same training dataset, which has more than 6 million records. After saving both models to disk, we found the model sizes were very different: 2 GB for the distributed XGBoost model versus 350 MB for the single-node Python XGBoost model.
xgboost version is 0.9