MAE different between jvm-package on spark and python package #5520

Closed
chy-crypto opened this issue Apr 13, 2020 · 30 comments

@chy-crypto

I trained two XGBoost models, one with the Spark package and one with the Python package (both 0.80), on the same training and test datasets and with the same params:
param = {
'booster': 'gbtree',
'objective': 'reg:linear',
'eval_metric': 'mae',
'eta': 0.020209749520997883,
'num_round': 1200,
'min_child_weight' : 4.016289581092819,
'gamma' : 0.4275434126320313,
'lambda' : 2.9996223584471635,
'colsample_bytree': 0.7078453564565381,
'subsample': 0.7485739182368198,
'max_depth': 10,
'seed': 20190129,
'silent': 1,
'missing':np.NaN,
'nthread':4
}

But their MAEs differ:
python: 0.20
jvm-package: 0.22

Could you tell me why?

@trivialfis
Member

Could you please try the latest XGBoost and not set the silent parameter?

@chy-crypto
Author

Could you please try the latest XGBoost and not set the silent parameter?

Thanks for the reply!

I set tree_method='approx' and updater='grow_histmaker,prune' and got nearly the same MAE.

@trivialfis
Member

@songbiu The jvm package and Python package have slightly different ways of specifying parameters. For example, num_round is not a parameter in Python, nor is missing.

@trivialfis
Member

Using the latest XGBoost on Python will give you a warning when some parameters are not used, provided you don't set the silent parameter.

@chy-crypto
Author

@songbiu The jvm package and Python package have slightly different ways of specifying parameters. For example, num_round is not a parameter in Python, nor is missing.

Using the latest XGBoost on Python will give you a warning when some parameters are not used, provided you don't set the silent parameter.

I will try. Thanks!

@chy-crypto
Author

Using the latest XGBoost on Python will give you a warning when some parameters are not used, provided you don't set the silent parameter.

[12:21:36] WARNING: /workspace/src/objective/regression_obj.cu:167: reg:linear is now deprecated in favor of reg:squarederror.
[12:21:36] WARNING: /workspace/src/gbm/gbtree.cc:72: DANGER AHEAD: You have manually specified updater parameter. The tree_method parameter will be ignored. Incorrect sequence of updaters will produce undefined behavior. For common uses, we recommend using tree_method parameter instead.
[12:21:36] WARNING: /workspace/src/learner.cc:328:
Parameters: { missing, num_round } might not be used.

@chy-crypto
Author

Using the latest XGBoost on Python will give you a warning when some parameters are not used, provided you don't set the silent parameter.

I would like to know which updaters I can use when I train a model with the Python package and the CLI on YARN.

@trivialfis
Member

Updaters are the same across bindings.

@trivialfis
Member

trivialfis commented Apr 14, 2020

Just that we recommend using tree_method instead of updater. tree_method is a preconfigured combination of updaters.
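For example (a sketch; as observed above, tree_method='approx' and updater='grow_histmaker,prune' gave nearly the same MAE in this version, so the mapping below is an assumption based on that observation):

# Recommended: choose a tree_method and let XGBoost pick the updater sequence.
params_a = {'objective': 'reg:squarederror', 'tree_method': 'approx'}

# Low-level form: spell out the updater sequence yourself.
params_b = {'objective': 'reg:squarederror', 'updater': 'grow_histmaker,prune'}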

@trivialfis
Member

missing is a DMatrix parameter.
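A minimal sketch of where missing goes: pass it to the DMatrix constructor rather than to the params dict given to xgb.train:

import numpy as np
import xgboost as xgb

X = np.array([[1.0, np.nan], [3.0, 4.0]])
y = np.array([0.5, 1.5])

# 'missing' is specified on the DMatrix, not in the training params:
dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
bst = xgb.train({'objective': 'reg:squarederror'}, dtrain, num_boost_round=1)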

@chy-crypto
Author

Updaters are the same across bindings.

Thanks!
Since the models produced by the two bindings differ with updater='grow_colmaker,prune', I wonder whether there are other params whose default values differ across bindings, besides updater and tree_method?

@trivialfis
Member

You can call Booster.save_config on Python to get a JSON doc of the internal parameters. I'm not sure about the Scala binding.
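For example (a minimal sketch on random data), the returned string parses with the standard json module:

import json
import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix(np.random.randn(50, 4), np.random.randn(50))
bst = xgb.train({'updater': 'grow_histmaker,prune'}, dtrain, num_boost_round=1)

config = json.loads(bst.save_config())
print(config['learner']['gradient_booster']['gbtree_train_param']['updater'])
# grow_histmaker,prune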

@chy-crypto
Author

You can call Booster.save_config on Python to get a JSON doc of the internal parameters. I'm not sure about the Scala binding.

I trained a model using

param = {
        'booster': 'gbtree',
        'objective': 'reg:linear',
        'eval_metric': 'mae',
        'updater': 'grow_histmaker,prune',
        'eta': 0.020209749520997883,
        'num_round': 1,
        'min_child_weight' : 4.016289581092819,
        'gamma' : 0.4275434126320313,
        'lambda' : 2.9996223584471635,
        'colsample_bytree': 0.7078453564565381,
        'subsample': 0.7485739182368198,
        'max_depth': 10,
        'seed': 20190129,
        'silent': 1,
        'missing':np.NaN,
        'nthread':4
}
bst = xgb.train(param, dtrain, param['num_round'], watchlist, verbose_eval=1,early_stopping_rounds=10)

But the result printed when I call save_config is

{
    "learner": {
        "generic_param": {
            "enable_experimental_json_serialization": "0",
            "gpu_id": "-1",
            "gpu_page_size": "0",
            "n_gpus": "0",
            "n_jobs": "0",
            "nthread": "0",
            "random_state": "0",
            "seed": "0",
            "seed_per_iteration": "0",
            "validate_features": "0",
            "validate_parameters": "0"
        },
        "gradient_booster": {
            "gbtree_train_param": {
                "num_parallel_tree": "1",
                "predictor": "auto",
                "process_type": "default",
                "tree_method": "auto",
                "updater": "grow_colmaker,prune",
                "updater_seq": "grow_colmaker,prune"
            },
            "name": "gbtree",
            "specified_updater": false,
            "updater": {
                "grow_colmaker": {
                    "colmaker_train_param": {
                        "opt_dense_col": "1"
                    },
                    "train_param": {
                        "alpha": "0",
                        "cache_opt": "1",
                        "colsample_bylevel": "1",
                        "colsample_bynode": "1",
                        "colsample_bytree": "1",
                        "default_direction": "learn",
                        "enable_feature_grouping": "0",
                        "eta": "0.300000012",
                        "gamma": "0",
                        "grow_policy": "depthwise",
                        "interaction_constraints": "",
                        "lambda": "1",
                        "learning_rate": "0.300000012",
                        "max_bin": "256",
                        "max_conflict_rate": "0",
                        "max_delta_step": "0",
                        "max_depth": "6",
                        "max_leaves": "0",
                        "max_search_group": "100",
                        "min_child_weight": "1",
                        "min_split_loss": "0",
                        "monotone_constraints": "()",
                        "refresh_leaf": "1",
                        "reg_alpha": "0",
                        "reg_lambda": "1",
                        "sketch_eps": "0.0299999993",
                        "sketch_ratio": "2",
                        "sparse_threshold": "0.20000000000000001",
                        "split_evaluator": "elastic_net,monotonic",
                        "subsample": "1"
                    }
                },
                "prune": {
                    "train_param": {
                        "alpha": "0",
                        "cache_opt": "1",
                        "colsample_bylevel": "1",
                        "colsample_bynode": "1",
                        "colsample_bytree": "1",
                        "default_direction": "learn",
                        "enable_feature_grouping": "0",
                        "eta": "0.300000012",
                        "gamma": "0",
                        "grow_policy": "depthwise",
                        "interaction_constraints": "",
                        "lambda": "1",
                        "learning_rate": "0.300000012",
                        "max_bin": "256",
                        "max_conflict_rate": "0",
                        "max_delta_step": "0",
                        "max_depth": "6",
                        "max_leaves": "0",
                        "max_search_group": "100",
                        "min_child_weight": "1",
                        "min_split_loss": "0",
                        "monotone_constraints": "()",
                        "refresh_leaf": "1",
                        "reg_alpha": "0",
                        "reg_lambda": "1",
                        "sketch_eps": "0.0299999993",
                        "sketch_ratio": "2",
                        "sparse_threshold": "0.20000000000000001",
                        "split_evaluator": "elastic_net,monotonic",
                        "subsample": "1"
                    }
                }
            }
        },
        "learner_model_param": {
            "base_score": "0.500000",
            "num_class": "0",
            "num_feature": "164"
        },
        "learner_train_param": {
            "booster": "gbtree",
            "disable_default_eval_metric": "0",
            "dsplit": "auto",
            "objective": "reg:linear"
        },
        "metrics": [
            "mae"
        ],
        "objective": {
            "name": "reg:squarederror",
            "reg_loss_param": {
                "scale_pos_weight": "1"
            }
        }
    },
    "version": [
        1,
        0,
        2
    ]
}

Why are the params different?

@trivialfis
Member

trivialfis commented Apr 14, 2020

Could you provide a sample that I can run? I just tried the 1.0.2 branch and it works correctly:

import xgboost as xgb
import numpy as np

kRows = 1000
kCols = 100

X = np.random.randn(kRows, kCols)
y = np.random.randn(kRows)

param = {
    'booster': 'gbtree',
    'objective': 'reg:linear',
    'eval_metric': 'mae',
    'updater': 'grow_histmaker,prune',
    'eta': 0.020209749520997883,
    'num_round': 1,
    'min_child_weight': 4.016289581092819,
    'gamma': 0.4275434126320313,
    'lambda': 2.9996223584471635,
    'colsample_bytree': 0.7078453564565381,
    'subsample': 0.7485739182368198,
    'max_depth': 10,
    'seed': 20190129,
    'silent': 1,
    'missing': np.NaN,
    'nthread': 4
}

dtrain = xgb.DMatrix(X, y)
watchlist = [(dtrain, 'train')]

bst = xgb.train(param,
                dtrain,
                param['num_round'],
                watchlist,
                verbose_eval=1,
                early_stopping_rounds=10)

print(bst.save_config())

Output:
{
    "version": [
        1,
        0,
        2
    ],
    "learner": {
        "objective": {
            "reg_loss_param": {
                "scale_pos_weight": "1"
            },
            "name": "reg:squarederror"
        },
        "metrics": [
            "mae"
        ],
        "learner_train_param": {
            "objective": "reg:linear",
            "dsplit": "auto",
            "disable_default_eval_metric": "0",
            "booster": "gbtree"
        },
        "learner_model_param": {
            "num_feature": "100",
            "num_class": "0",
            "base_score": "0.500000"
        },
        "gradient_booster": {
            "updater": {
                "prune": {
                    "train_param": {
                        "subsample": "0.748573899",
                        "split_evaluator": "elastic_net,monotonic",
                        "sparse_threshold": "0.20000000000000001",
                        "sketch_ratio": "2",
                        "sketch_eps": "0.0299999993",
                        "reg_lambda": "2.99962234",
                        "reg_alpha": "0",
                        "refresh_leaf": "1",
                        "monotone_constraints": "()",
                        "min_split_loss": "0.427543402",
                        "min_child_weight": "4.01628971",
                        "max_search_group": "100",
                        "max_leaves": "0",
                        "max_depth": "10",
                        "max_delta_step": "0",
                        "max_conflict_rate": "0",
                        "max_bin": "256",
                        "learning_rate": "0.0202097502",
                        "lambda": "2.99962234",
                        "interaction_constraints": "",
                        "grow_policy": "depthwise",
                        "gamma": "0.427543402",
                        "eta": "0.0202097502",
                        "enable_feature_grouping": "0",
                        "default_direction": "learn",
                        "colsample_bytree": "0.70784533",
                        "colsample_bynode": "1",
                        "colsample_bylevel": "1",
                        "cache_opt": "1",
                        "alpha": "0"
                    }
                },
                "grow_histmaker": {
                    "train_param": {
                        "subsample": "0.748573899",
                        "split_evaluator": "elastic_net,monotonic",
                        "sparse_threshold": "0.20000000000000001",
                        "sketch_ratio": "2",
                        "sketch_eps": "0.0299999993",
                        "reg_lambda": "2.99962234",
                        "reg_alpha": "0",
                        "refresh_leaf": "1",
                        "monotone_constraints": "()",
                        "min_split_loss": "0.427543402",
                        "min_child_weight": "4.01628971",
                        "max_search_group": "100",
                        "max_leaves": "0",
                        "max_depth": "10",
                        "max_delta_step": "0",
                        "max_conflict_rate": "0",
                        "max_bin": "256",
                        "learning_rate": "0.0202097502",
                        "lambda": "2.99962234",
                        "interaction_constraints": "",
                        "grow_policy": "depthwise",
                        "gamma": "0.427543402",
                        "eta": "0.0202097502",
                        "enable_feature_grouping": "0",
                        "default_direction": "learn",
                        "colsample_bytree": "0.70784533",
                        "colsample_bynode": "1",
                        "colsample_bylevel": "1",
                        "cache_opt": "1",
                        "alpha": "0"
                    }
                }
            },
            "specified_updater": true,
            "name": "gbtree",
            "gbtree_train_param": {
                "updater_seq": "grow_histmaker,prune",
                "updater": "grow_histmaker,prune",
                "tree_method": "auto",
                "process_type": "default",
                "predictor": "auto",
                "num_parallel_tree": "1"
            }
        },
        "generic_param": {
            "validate_parameters": "1",
            "validate_features": "0",
            "seed_per_iteration": "0",
            "seed": "20190129",
            "random_state": "20190129",
            "nthread": "4",
            "n_jobs": "4",
            "n_gpus": "0",
            "gpu_page_size": "0",
            "gpu_id": "-1",
            "enable_experimental_json_serialization": "0"
        }
    }
}

@chy-crypto
Author

Could you provide a sample that I can run? I just tried the 1.0.2 branch and it works correctly:

import xgboost as xgb
import numpy as np

kRows = 1000
kCols = 100

X = np.random.randn(kRows, kCols)
y = np.random.randn(kRows)

param = {
    'booster': 'gbtree',
    'objective': 'reg:linear',
    'eval_metric': 'mae',
    'updater': 'grow_histmaker,prune',
    'eta': 0.020209749520997883,
    'num_round': 1,
    'min_child_weight': 4.016289581092819,
    'gamma': 0.4275434126320313,
    'lambda': 2.9996223584471635,
    'colsample_bytree': 0.7078453564565381,
    'subsample': 0.7485739182368198,
    'max_depth': 10,
    'seed': 20190129,
    'silent': 1,
    'missing': np.NaN,
    'nthread': 4
}

dtrain = xgb.DMatrix(X, y)
watchlist = [(dtrain, 'train')]

bst = xgb.train(param,
                dtrain,
                param['num_round'],
                watchlist,
                verbose_eval=1,
                early_stopping_rounds=10)

print(bst.save_config())

[0] train-mae:0.87545
Will train until train-mae hasn't improved in 10 rounds.
{"learner":{"generic_param":{"enable_experimental_json_serialization":"0","gpu_id":"-1","gpu_page_size":"0","n_gpus":"0","n_jobs":"4","nthread":"4","random_state":"20190129","seed":"20190129","seed_per_iteration":"0","validate_features":"0","validate_parameters":"1"},"gradient_booster":{"gbtree_train_param":{"num_parallel_tree":"1","predictor":"auto","process_type":"default","tree_method":"auto","updater":"grow_histmaker,prune","updater_seq":"grow_histmaker,prune"},"name":"gbtree","specified_updater":true,"updater":{"grow_histmaker":{"train_param":{"alpha":"0","cache_opt":"1","colsample_bylevel":"1","colsample_bynode":"1","colsample_bytree":"0.70784533","default_direction":"learn","enable_feature_grouping":"0","eta":"0.0202097502","gamma":"0.427543402","grow_policy":"depthwise","interaction_constraints":"","lambda":"2.99962234","learning_rate":"0.0202097502","max_bin":"256","max_conflict_rate":"0","max_delta_step":"0","max_depth":"10","max_leaves":"0","max_search_group":"100","min_child_weight":"4.01628971","min_split_loss":"0.427543402","monotone_constraints":"()","refresh_leaf":"1","reg_alpha":"0","reg_lambda":"2.99962234","sketch_eps":"0.0299999993","sketch_ratio":"2","sparse_threshold":"0.20000000000000001","split_evaluator":"elastic_net,monotonic","subsample":"0.748573899"}},"prune":{"train_param":{"alpha":"0","cache_opt":"1","colsample_bylevel":"1","colsample_bynode":"1","colsample_bytree":"0.70784533","default_direction":"learn","enable_feature_grouping":"0","eta":"0.0202097502","gamma":"0.427543402","grow_policy":"depthwise","interaction_constraints":"","lambda":"2.99962234","learning_rate":"0.0202097502","max_bin":"256","max_conflict_rate":"0","max_delta_step":"0","max_depth":"10","max_leaves":"0","max_search_group":"100","min_child_weight":"4.01628971","min_split_loss":"0.427543402","monotone_constraints":"()","refresh_leaf":"1","reg_alpha":"0","reg_lambda":"2.99962234","sketch_eps":"0.0299999993","sketch_ratio":"2","sparse_threshold":"0.20000000000000001","split_evaluator":"elastic_net,monotonic","subsample":"0.748573899"}}}},"learner_model_param":{"base_score":"0.500000","num_class":"0","num_feature":"100"},"learner_train_param":{"booster":"gbtree","disable_default_eval_metric":"0","dsplit":"auto","objective":"reg:linear"},"metrics":["mae"],"objective":{"name":"reg:squarederror","reg_loss_param":{"scale_pos_weight":"1"}}},"version":[1,0,2]}

My result was incorrect because I loaded the model after training.

@trivialfis
Member

@songbiu Those parameters are gone once the model is saved to disk. See https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html for details about the model persistence implementation.
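A minimal sketch of that behavior on random data; after a save/load round trip, save_config reports default training parameters again:

import json
import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix(np.random.randn(100, 5), np.random.randn(100))
bst = xgb.train({'eta': 0.02, 'max_depth': 10}, dtrain, num_boost_round=2)

bst.save_model('model.bin')  # persists the trees, not the training parameters
bst2 = xgb.Booster()
bst2.load_model('model.bin')

# The reloaded Booster reports defaults again (eta=0.3, max_depth=6, ...):
print(json.loads(bst2.save_config())['learner']['gradient_booster'])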

@chy-crypto
Author

@songbiu Those parameters are gone once the model is saved to disk. See https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html for details about the model persistence implementation.

Thanks, I will try dumping the config from the Scala binding.

@chy-crypto
Author

You can call Booster.save_config on Python to get a JSON doc of the internal parameters. I'm not sure about the Scala binding.

Finally, I want to know whether it's OK that the models produced by the Python package and the JVM package differ?

@trivialfis
Member

@songbiu I believe the output should be the same when the parameters match, and Python uses only the default parameters defined in C++. But I don't have a lot of experience with the JVM packages, so pinging @CodingCat .

@chy-crypto
Author

@songbiu I believe the output should be the same when the parameters match, and Python uses only the default parameters defined in C++. But I don't have a lot of experience with the JVM packages, so pinging @CodingCat .

I produced a model with the XGBoost CLI running on YARN and got the same output as the JVM package; both differ from the Python package.

@trivialfis
Member

I produced a model with the XGBoost CLI running on YARN

Great! CLI is something I can work on. Will look into this tomorrow or so.

@chy-crypto
Author

chy-crypto commented Apr 14, 2020

The params I use with the CLI are

booster=gbtree \
objective=reg:linear \
eval_metric=mae \
eta=0.020209749520997883 \
num_round=1200 \
min_child_weight=4.016289581092819 \
gamma=0.4275434126320313 \
lambda=2.9996223584471635 \
colsample_bytree=0.7078453564565381 \
subsample=0.7485739182368198 \
max_depth=10 \
seed=20190129 \
updater=grow_colmaker,prune

and the Python params are

param = {
        'booster': 'gbtree',
        'objective': 'reg:linear',
        'updater':'grow_colmaker,prune',
        'eval_metric': 'mae',
        'num_round': 1200,
        'learning_rate': 0.020209749520997883,
        'min_child_weight' : 4.016289581092819,
        'gamma' : 0.4275434126320313,
        'eta' : 2.9996223584471635,
        'colsample_bytree': 0.7078453564565381,
        'subsample': 0.7485739182368198,
        'max_depth': 10,
        'seed': 20190129,
        'nthread':4,
        }

@chy-crypto
Author

I produced a model with the XGBoost CLI running on YARN

Great! CLI is something I can work on. Will look into this tomorrow or so.

And do you know how I can control the way the CLI fills missing values?

@chy-crypto
Author

I produced a model with the XGBoost CLI running on YARN

Great! CLI is something I can work on. Will look into this tomorrow or so.

My datasets are libsvm files with missing values.
Python loads the files using

dtrain = xgb.DMatrix("train.libsvm", missing=np.nan)

and the CLI doesn't configure any missing-value fill.

@chy-crypto
Author

I produced a model with the XGBoost CLI running on YARN

Great! CLI is something I can work on. Will look into this tomorrow or so.

I have read the code of the Python bindings and found that it would update learning_rate after each iteration.
But I don't find this in the CLI code; do you know whether the CLI updates learning_rate (eta)?

@trivialfis
Member

@songbiu I don't think the Python binding does something like that unless you explicitly instruct it to do so via a callback function (a sketch follows below). I just added a test to show the CLI having the same model output as Python: #5535 .
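A minimal sketch of such a callback, using the reset_learning_rate helper from the 1.0.x-era Python API (later releases replaced it with LearningRateScheduler; the schedule values here are illustrative):

import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix(np.random.randn(100, 5), np.random.randn(100))

num_round = 10
rates = [0.1 * (0.99 ** i) for i in range(num_round)]  # one explicit eta per round

bst = xgb.train({'objective': 'reg:squarederror'}, dtrain, num_round,
                callbacks=[xgb.callback.reset_learning_rate(rates)])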

So I'm suspecting it's due to missing value handling. When the data is loaded from an svm or csr file, DMatrix doesn't handle missing values, as these two formats are sparse formats that leave missing values as empty entries. But it's a feature we could add to maintain consistency.
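A minimal sketch of the contrast, assuming scipy is available: with dense input, entries equal to missing are dropped; with sparse input, absent entries are never stored, so missing plays no role:

import numpy as np
import scipy.sparse as sp
import xgboost as xgb

dense = np.array([[1.0, np.nan],
                  [np.nan, 2.0]])
d_dense = xgb.DMatrix(dense, missing=np.nan)  # NaN entries are treated as missing

# CSR stores only the present entries; everything absent is missing by construction.
d_sparse = xgb.DMatrix(sp.csr_matrix([[1.0, 0.0], [0.0, 2.0]]))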

@chy-crypto
Author

@trivialfis The input files for both bindings are libsvm files with missing values.
Both the Python package and the CLI call DMatrix::Load to load the file. Are they different?

@chy-crypto
Author

chy-crypto commented Apr 16, 2020

@trivialfis
I use the following dataset, and the training-set MAE after 1200 rounds is:
python: 0.394748
CLI: 1.653196

param = {
        'booster': 'gbtree',
        'objective': 'reg:linear',
        'eval_metric': 'mae',
        'updater': 'grow_colmaker,prune',
        'learning_rate': 0.020209749520997883,
        'num_round': 1,
        'min_child_weight' : 4.016289581092819,
        'gamma' : 0.4275434126320313,
        'reg_lambda' : 2.9996223584471635,
        'colsample_bytree': 0.7078453564565381,
        'subsample': 0.7485739182368198,
        'max_depth': 10,
        'seed': 20190129,
        'silent': True,
        'missing':np.NaN,
        'nthread':4
}

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/cpusmall

@chy-crypto
Author

@trivialfis
When I train on the cpusmall dataset with both bindings set to updater=grow_histmaker,prune, they produce the same model. But if I set updater=grow_colmaker,prune, the model produced by the CLI differs from the one Python produces.

So I wonder: does the distributed CLI only support grow_histmaker,prune?

@hcho3
Collaborator

hcho3 commented Sep 8, 2020

I don't think the grow_colmaker updater supports distributed training. So when you set updater=grow_colmaker,prune, the distributed CLI will silently change the updater to grow_histmaker,prune.

@hcho3 closed this as completed Sep 8, 2020