MAE different between jvm-package on spark and python package #5520
Comments
Could you please try the latest XGBoost and not set the silent parameter? |
Thanks for the reply! I have set tree_method to 'approx' and updater to 'grow_histmaker,prune' and get nearly the same MAE. |
@songbiu The jvm package and Python package have slightly different ways of specifying parameters. For example |
Using the latest XGBoost on Python will give you a warning when some parameters are not used, provided you don't set the silent parameter. |
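For illustration, a minimal sketch of how such a warning can be triggered (synthetic data; the misspelled 'lamda' key is a hypothetical typo, and validate_parameters is the general parameter that enables the check):

import numpy as np
import xgboost as xgb

X = np.random.randn(100, 10)
y = np.random.randn(100)
dtrain = xgb.DMatrix(X, label=y)

# 'lamda' is a deliberate typo for 'lambda'; with validate_parameters
# enabled, XGBoost warns that the parameter is unknown/unused.
params = {
    'objective': 'reg:squarederror',
    'validate_parameters': True,
    'lamda': 1.0,
}
bst = xgb.train(params, dtrain, num_boost_round=2)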
I will try, thanks! |
[12:21:36] WARNING: /workspace/src/objective/regression_obj.cu:167: reg:linear is now deprecated in favor of reg:squarederror. |
I would like to know which updaters I can use when I train a model with the Python package and the CLI on YARN. |
Updaters are the same across bindings. |
Just that we recommend using |
The |
Thanks! |
You can call save_config to inspect the parameters that are actually used. |
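For example, a small sketch of checking the effective updater sequence after training (synthetic data, just to show the call):

import json
import numpy as np
import xgboost as xgb

X = np.random.randn(100, 10)
y = np.random.randn(100)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'updater': 'grow_histmaker,prune'}, dtrain, num_boost_round=1)

# save_config() returns the internal configuration as a JSON string.
config = json.loads(bst.save_config())
print(config['learner']['gradient_booster']['gbtree_train_param']['updater_seq'])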
I train a model using param = {
'booster': 'gbtree',
'objective': 'reg:linear',
'eval_metric': 'mae',
'updater': 'grow_histmaker,prune',
'eta': 0.020209749520997883,
'num_round': 1,
'min_child_weight' : 4.016289581092819,
'gamma' : 0.4275434126320313,
'lambda' : 2.9996223584471635,
'colsample_bytree': 0.7078453564565381,
'subsample': 0.7485739182368198,
'max_depth': 10,
'seed': 20190129,
'silent': 1,
'missing':np.NaN,
'nthread':4
}
bst = xgb.train(param, dtrain, param['num_round'], watchlist, verbose_eval=1, early_stopping_rounds=10) But the result that is printed when I use save_config is {
"learner": {
"generic_param": {
"enable_experimental_json_serialization": "0",
"gpu_id": "-1",
"gpu_page_size": "0",
"n_gpus": "0",
"n_jobs": "0",
"nthread": "0",
"random_state": "0",
"seed": "0",
"seed_per_iteration": "0",
"validate_features": "0",
"validate_parameters": "0"
},
"gradient_booster": {
"gbtree_train_param": {
"num_parallel_tree": "1",
"predictor": "auto",
"process_type": "default",
"tree_method": "auto",
"updater": "grow_colmaker,prune",
"updater_seq": "grow_colmaker,prune"
},
"name": "gbtree",
"specified_updater": false,
"updater": {
"grow_colmaker": {
"colmaker_train_param": {
"opt_dense_col": "1"
},
"train_param": {
"alpha": "0",
"cache_opt": "1",
"colsample_bylevel": "1",
"colsample_bynode": "1",
"colsample_bytree": "1",
"default_direction": "learn",
"enable_feature_grouping": "0",
"eta": "0.300000012",
"gamma": "0",
"grow_policy": "depthwise",
"interaction_constraints": "",
"lambda": "1",
"learning_rate": "0.300000012",
"max_bin": "256",
"max_conflict_rate": "0",
"max_delta_step": "0",
"max_depth": "6",
"max_leaves": "0",
"max_search_group": "100",
"min_child_weight": "1",
"min_split_loss": "0",
"monotone_constraints": "()",
"refresh_leaf": "1",
"reg_alpha": "0",
"reg_lambda": "1",
"sketch_eps": "0.0299999993",
"sketch_ratio": "2",
"sparse_threshold": "0.20000000000000001",
"split_evaluator": "elastic_net,monotonic",
"subsample": "1"
}
},
"prune": {
"train_param": {
"alpha": "0",
"cache_opt": "1",
"colsample_bylevel": "1",
"colsample_bynode": "1",
"colsample_bytree": "1",
"default_direction": "learn",
"enable_feature_grouping": "0",
"eta": "0.300000012",
"gamma": "0",
"grow_policy": "depthwise",
"interaction_constraints": "",
"lambda": "1",
"learning_rate": "0.300000012",
"max_bin": "256",
"max_conflict_rate": "0",
"max_delta_step": "0",
"max_depth": "6",
"max_leaves": "0",
"max_search_group": "100",
"min_child_weight": "1",
"min_split_loss": "0",
"monotone_constraints": "()",
"refresh_leaf": "1",
"reg_alpha": "0",
"reg_lambda": "1",
"sketch_eps": "0.0299999993",
"sketch_ratio": "2",
"sparse_threshold": "0.20000000000000001",
"split_evaluator": "elastic_net,monotonic",
"subsample": "1"
}
}
}
},
"learner_model_param": {
"base_score": "0.500000",
"num_class": "0",
"num_feature": "164"
},
"learner_train_param": {
"booster": "gbtree",
"disable_default_eval_metric": "0",
"dsplit": "auto",
"objective": "reg:linear"
},
"metrics": [
"mae"
],
"objective": {
"name": "reg:squarederror",
"reg_loss_param": {
"scale_pos_weight": "1"
}
}
},
"version": [
1,
0,
2
]
} Why are the params different? |
Could you provide a sample that I can run? I just tried the 1.0.2 branch and it works correctly: import xgboost as xgb
import numpy as np
kRows = 1000
kCols = 100
X = np.random.randn(kRows, kCols)
y = np.random.randn(kRows)
param = {
'booster': 'gbtree',
'objective': 'reg:linear',
'eval_metric': 'mae',
'updater': 'grow_histmaker,prune',
'eta': 0.020209749520997883,
'num_round': 1,
'min_child_weight': 4.016289581092819,
'gamma': 0.4275434126320313,
'lambda': 2.9996223584471635,
'colsample_bytree': 0.7078453564565381,
'subsample': 0.7485739182368198,
'max_depth': 10,
'seed': 20190129,
'silent': 1,
'missing': np.NaN,
'nthread': 4
}
dtrain = xgb.DMatrix(X, y)
watchlist = [(dtrain, 'train')]
bst = xgb.train(param,
dtrain,
param['num_round'],
watchlist,
verbose_eval=1,
early_stopping_rounds=10)
print(bst.save_config()) {
"version": [
1,
0,
2
],
"learner": {
"objective": {
"reg_loss_param": {
"scale_pos_weight": "1"
},
"name": "reg:squarederror"
},
"metrics": [
"mae"
],
"learner_train_param": {
"objective": "reg:linear",
"dsplit": "auto",
"disable_default_eval_metric": "0",
"booster": "gbtree"
},
"learner_model_param": {
"num_feature": "100",
"num_class": "0",
"base_score": "0.500000"
},
"gradient_booster": {
"updater": {
"prune": {
"train_param": {
"subsample": "0.748573899",
"split_evaluator": "elastic_net,monotonic",
"sparse_threshold": "0.20000000000000001",
"sketch_ratio": "2",
"sketch_eps": "0.0299999993",
"reg_lambda": "2.99962234",
"reg_alpha": "0",
"refresh_leaf": "1",
"monotone_constraints": "()",
"min_split_loss": "0.427543402",
"min_child_weight": "4.01628971",
"max_search_group": "100",
"max_leaves": "0",
"max_depth": "10",
"max_delta_step": "0",
"max_conflict_rate": "0",
"max_bin": "256",
"learning_rate": "0.0202097502",
"lambda": "2.99962234",
"interaction_constraints": "",
"grow_policy": "depthwise",
"gamma": "0.427543402",
"eta": "0.0202097502",
"enable_feature_grouping": "0",
"default_direction": "learn",
"colsample_bytree": "0.70784533",
"colsample_bynode": "1",
"colsample_bylevel": "1",
"cache_opt": "1",
"alpha": "0"
}
},
"grow_histmaker": {
"train_param": {
"subsample": "0.748573899",
"split_evaluator": "elastic_net,monotonic",
"sparse_threshold": "0.20000000000000001",
"sketch_ratio": "2",
"sketch_eps": "0.0299999993",
"reg_lambda": "2.99962234",
"reg_alpha": "0",
"refresh_leaf": "1",
"monotone_constraints": "()",
"min_split_loss": "0.427543402",
"min_child_weight": "4.01628971",
"max_search_group": "100",
"max_leaves": "0",
"max_depth": "10",
"max_delta_step": "0",
"max_conflict_rate": "0",
"max_bin": "256",
"learning_rate": "0.0202097502",
"lambda": "2.99962234",
"interaction_constraints": "",
"grow_policy": "depthwise",
"gamma": "0.427543402",
"eta": "0.0202097502",
"enable_feature_grouping": "0",
"default_direction": "learn",
"colsample_bytree": "0.70784533",
"colsample_bynode": "1",
"colsample_bylevel": "1",
"cache_opt": "1",
"alpha": "0"
}
}
},
"specified_updater": true,
"name": "gbtree",
"gbtree_train_param": {
"updater_seq": "grow_histmaker,prune",
"updater": "grow_histmaker,prune",
"tree_method": "auto",
"process_type": "default",
"predictor": "auto",
"num_parallel_tree": "1"
}
},
"generic_param": {
"validate_parameters": "1",
"validate_features": "0",
"seed_per_iteration": "0",
"seed": "20190129",
"random_state": "20190129",
"nthread": "4",
"n_jobs": "4",
"n_gpus": "0",
"gpu_page_size": "0",
"gpu_id": "-1",
"enable_experimental_json_serialization": "0"
}
}
} |
My result was incorrect because I loaded the model after training. |
@songbiu Those parameters are gone once the model is saved to disk. See https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html for details about the model persistence implementation. |
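A short sketch of the behaviour described above (synthetic data, and the file name is just an example): after reloading a model saved to disk, save_config() reports default training parameters rather than the values used during training.

import json
import numpy as np
import xgboost as xgb

X = np.random.randn(100, 10)
y = np.random.randn(100)
dtrain = xgb.DMatrix(X, label=y)

bst = xgb.train({'eta': 0.02, 'max_depth': 10}, dtrain, num_boost_round=1)
bst.save_model('model.bin')

loaded = xgb.Booster(model_file='model.bin')
# The trained booster still carries eta/max_depth in its configuration;
# the reloaded one falls back to defaults because hyperparameters are
# not part of the persisted model, so the two configs generally differ.
print(json.loads(bst.save_config()) == json.loads(loaded.save_config()))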
Thanks, I will try dumping the config from the Scala binding. |
Finally, I want to know whether it is expected that the models produced by the Python package and the JVM package are different. |
@songbiu I believe the output should be the same when the parameters are matched, and Python uses only the default parameters defined in C++. But I don't have a lot of experience with the JVM packages, so pinging @CodingCat. |
I produced a model with the XGBoost CLI running on YARN and got the same output as the JVM package; both of them differ from the Python package. |
Great! CLI is something I can work on. Will look into this tomorrow or so. |
The params I use with the CLI are booster=gbtree \
objective=reg:linear \
eval_metric=mae \
eta=0.020209749520997883 \
num_round=1200 \
min_child_weight=4.016289581092819 \
gamma=0.4275434126320313 \
lambda=2.9996223584471635 \
colsample_bytree=0.7078453564565381 \
subsample=0.7485739182368198 \
max_depth=10 \
seed=20190129 \
updater=grow_colmaker,prune and the Python params are param = {
'booster': 'gbtree',
'objective': 'reg:linear',
'updater':'grow_colmaker,prune',
'eval_metric': 'mae',
'num_round': 1200,
'learning_rate': 0.020209749520997883,
'min_child_weight' : 4.016289581092819,
'gamma' : 0.4275434126320313,
'eta' : 2.9996223584471635,
'colsample_bytree': 0.7078453564565381,
'subsample': 0.7485739182368198,
'max_depth': 10,
'seed': 20190129,
'nthread':4,
} |
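One way to see whether two runs really picked up the same hyperparameters is to flatten and diff their save_config() output. A sketch with synthetic data and made-up parameter values:

import json
import numpy as np
import xgboost as xgb

X = np.random.randn(100, 10)
y = np.random.randn(100)
dtrain = xgb.DMatrix(X, label=y)

# Two example runs whose effective parameters we want to compare.
bst_a = xgb.train({'eta': 0.02, 'max_depth': 10}, dtrain, num_boost_round=1)
bst_b = xgb.train({'eta': 0.3, 'max_depth': 6}, dtrain, num_boost_round=1)

def flatten(cfg, prefix=''):
    # Flatten the nested config dict into {'a.b.c': value} pairs.
    out = {}
    for key, value in cfg.items():
        name = prefix + key
        if isinstance(value, dict):
            out.update(flatten(value, name + '.'))
        else:
            out[name] = value
    return out

cfg_a = flatten(json.loads(bst_a.save_config()))
cfg_b = flatten(json.loads(bst_b.save_config()))
for name in sorted(set(cfg_a) | set(cfg_b)):
    if cfg_a.get(name) != cfg_b.get(name):
        print(name, cfg_a.get(name), cfg_b.get(name))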
And do you know how I can control the way the CLI fills missing values? |
My datasets are libsvm files with missing values. In Python I use dtrain = xgb.DMatrix("train.libsvm", missing=np.nan), while the CLI has no option to configure the missing-value fill. |
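To illustrate the difference (a minimal sketch with made-up data): for dense input the missing argument tells XGBoost which sentinel value marks an absent entry, whereas in libsvm/CSR input absent entries are simply not stored, so there is nothing to fill.

import numpy as np
import scipy.sparse as sp
import xgboost as xgb

dense = np.array([[1.0, np.nan],
                  [np.nan, 2.0]])
# Dense input: entries equal to `missing` are treated as absent.
dmat_dense = xgb.DMatrix(dense, missing=np.nan)

# CSR input (what a libsvm file is parsed into): missing entries are
# simply not stored, so the `missing` argument has nothing to act on.
sparse = sp.csr_matrix(([1.0, 2.0], ([0, 1], [0, 1])), shape=(2, 2))
dmat_sparse = xgb.DMatrix(sparse)

print(dmat_dense.num_row(), dmat_dense.num_col())
print(dmat_sparse.num_row(), dmat_sparse.num_col())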
I have read the code of the Python binding and found that it would update learning_rate after each iteration. |
@songbiu I don't think the Python binding does something like that unless you explicitly instruct it to do so via a callback function. I just added a test showing that the CLI produces the same model output as Python: #5535. So I suspect it's due to missing value handling. When the data is loaded from an svm or csr file, DMatrix doesn't handle missing values, as these two are sparse formats that leave missing values as empty entries. But it's a feature we can add to maintain consistency. |
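For reference, a sketch of what explicitly changing the learning rate per iteration would look like, assuming an XGBoost version that ships xgb.callback.LearningRateScheduler (older releases exposed a similar reset_learning_rate callback):

import numpy as np
import xgboost as xgb

X = np.random.randn(100, 10)
y = np.random.randn(100)
dtrain = xgb.DMatrix(X, label=y)

# Without a scheduler callback, eta stays fixed across iterations.
scheduler = xgb.callback.LearningRateScheduler(lambda epoch: 0.02 * 0.99 ** epoch)
bst = xgb.train({'objective': 'reg:squarederror', 'eta': 0.02},
                dtrain,
                num_boost_round=10,
                callbacks=[scheduler])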
@trivialfis The input files of both bindings are libsvm files with missing values. |
@trivialfis param = {
'booster': 'gbtree',
'objective': 'reg:linear',
'eval_metric': 'mae',
'updater': 'grow_colmaker,prune',
'learning_rate': 0.020209749520997883,
'num_round': 1,
'min_child_weight' : 4.016289581092819,
'gamma' : 0.4275434126320313,
'reg_lambda' : 2.9996223584471635,
'colsample_bytree': 0.7078453564565381,
'subsample': 0.7485739182368198,
'max_depth': 10,
'seed': 20190129,
'silent': True,
'missing':np.NaN,
'nthread':4
} https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/cpusmall |
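A sketch of reproducing this locally with the file linked above (assuming the cpusmall libsvm file has been downloaded into the working directory under that name):

import xgboost as xgb

# 'cpusmall' is assumed to be the downloaded libsvm file from the link above.
dtrain = xgb.DMatrix('cpusmall')

param = {
    'booster': 'gbtree',
    'objective': 'reg:linear',
    'eval_metric': 'mae',
    'updater': 'grow_colmaker,prune',
    'learning_rate': 0.020209749520997883,
    'max_depth': 10,
    'seed': 20190129,
    'nthread': 4,
}
bst = xgb.train(param, dtrain, num_boost_round=1, evals=[(dtrain, 'train')])
print(bst.save_config())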
@trivialfis So I am wondering whether the distributed CLI only supports grow_histmaker,prune? |
I don't think |
I trained two XGBoost models with the Spark and Python packages (both 0.80), running on the same train and test datasets, and I provided them with the same params.
param = {
'booster': 'gbtree',
'objective': 'reg:linear',
'eval_metric': 'mae',
'eta': 0.020209749520997883,
'num_round': 1200,
'min_child_weight' : 4.016289581092819,
'gamma' : 0.4275434126320313,
'lambda' : 2.9996223584471635,
'colsample_bytree': 0.7078453564565381,
'subsample': 0.7485739182368198,
'max_depth': 10,
'seed': 20190129,
'silent': 1,
'missing':np.NaN,
'nthread':4
}
But their MAE values are different:
python: 0.20
jvm-package: 0.22
Could you tell me why?