
[python] save all param values into model file #2589

Merged 11 commits into master on Mar 6, 2020

Conversation

StrikerRUS (Collaborator):

Fix for Python part of #2208.

@StrikerRUS StrikerRUS changed the title [python] save all param values into model file [WIP][python] save all param values into model file Nov 24, 2019
Comment on lines 2410 to 2416
params_to_update = copy.deepcopy(self.params)
params_to_update.update(dict(kwargs,
                             predict_raw_score=raw_score,
                             predict_leaf_index=pred_leaf,
                             predict_contrib=pred_contrib,
                             num_iteration_predict=num_iteration))
self.reset_parameter(params_to_update)
StrikerRUS (Collaborator, Author):

@guolinke This seems to be quite a computationally expensive part, and it doesn't work when params contains any "core" parameters, e.g.

[LightGBM] [Fatal] Cannot change metric during training
[LightGBM] [Fatal] Cannot change num_class during training
[LightGBM] [Fatal] Cannot change boosting during training

Maybe we can put the predict parameters into the config directly on the cpp side during prediction itself?

guolinke (Collaborator):

Yeah, using reset_parameter is not a good idea. BTW, do we need to save predict parameters?

StrikerRUS (Collaborator, Author):

Do you mean removing predict params from the model file completely? I think it makes sense, because they are not needed to restore a model and seem to be more "logging" stuff (or at least something different from the params used to train the model). I think we can introduce a naming schema like predict_* for them and filter them out later, so that params starting with predict_ are not written to the model file.
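
As a hedged sketch of this proposed schema (the function name is hypothetical, not actual LightGBM internals), the filtering step could look like this:

```python
# Illustrative sketch of the proposed predict_* naming schema: any
# parameter whose name starts with "predict_" is dropped before the
# model file is written. filter_params_for_model_file is invented here.

def filter_params_for_model_file(params):
    """Return a copy of params without prediction-time parameters."""
    return {name: value for name, value in params.items()
            if not name.startswith("predict_")}

params = {
    "num_leaves": 31,
    "predict_raw_score": False,
    "predict_leaf_index": False,
}
print(filter_params_for_model_file(params))  # {'num_leaves': 31}
```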

guolinke (Collaborator):

Yeah, it makes sense.

StrikerRUS (Collaborator, Author):

OK, then I'll remove my attempts to record prediction-time params from here. Can you please help with the cpp part so they are not stored at all (here or in a separate PR)?

StrikerRUS (Collaborator, Author):

Cool idea!

StrikerRUS (Collaborator, Author):

Hmmm, it seems that we already have a [doc-only] directive:

if "[doc-only]" in y:
continue

BTW, why are, for example, boosting and objective already doc-only?

guolinke (Collaborator):

Because they are not automatically generated; we write their code manually.

StrikerRUS (Collaborator, Author):

@guolinke

Maybe we can add a tag, e.g. [no-save], to these parameters, and skip them in to_string.

I've added the [no-save] tag in the latest commit and applied it to the predict and convert_model tasks' params.
Maybe we can apply it to "output" training values as well? I mean, params like

// alias = model_output, model_out
// desc = filename of output model in training
// desc = **Note**: can be used only in CLI version
std::string output_model = "LightGBM_model.txt";

// alias = is_save_binary, is_save_binary_file
// desc = if ``true``, LightGBM will save the dataset (including validation data) to a binary file. This speed ups the data loading for the next time
// desc = **Note**: can be used only in CLI version; for language-specific packages you can use the correspondent function
bool save_binary = false;

// check = >0
// alias = output_freq
// desc = frequency for metric output
int metric_freq = 1;
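
To make the idea concrete, here is a minimal Python sketch (not the real C++ to_string implementation; the tag table below is invented for illustration, mimicking the comment annotations in config.h) of how a [no-save] tag could suppress parameters during serialization:

```python
# Hypothetical tag table standing in for the annotations parsed out of
# config.h; only the [no-save] marker matters for this sketch.
PARAM_TAGS = {
    "num_leaves": [],
    "output_model": ["[no-save]"],
    "save_binary": ["[no-save]"],
    "metric_freq": ["[no-save]"],
}

def params_to_string(params):
    """Serialize params in model-file style, skipping [no-save] ones."""
    lines = []
    for name, value in params.items():
        if "[no-save]" in PARAM_TAGS.get(name, []):
            continue  # tagged params never reach the model file
        lines.append("[{}: {}]".format(name, value))
    return "\n".join(lines)

print(params_to_string({"num_leaves": 31, "output_model": "m.txt"}))
# [num_leaves: 31]
```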

StrikerRUS (Collaborator, Author):

@guolinke WDYT about the list of candidate params for the tag?

@StrikerRUS StrikerRUS changed the title [WIP][python] save all param values into model file [python] save all param values into model file Nov 25, 2019
@StrikerRUS StrikerRUS marked this pull request as ready for review November 25, 2019 14:00
StrikerRUS (Collaborator, Author):

@guolinke Just out of curiosity, what is the reason for keeping the text and JSON representations different?

StrikerRUS (Collaborator, Author):

@jameslamb @Laurae2 Would you mind fixing the R package right here, or would you prefer to create a separate PR later?

StrikerRUS (Collaborator, Author):

@guolinke One more problem is that params for Dataset seem not to be saved at all:

import numpy as np
import lightgbm as lgb

X = np.random.random((100, 2))
y = np.random.random(100)
lgb_data = lgb.Dataset(X, y, categorical_feature=[0, 1], params={"max_bin": 100})
bst = lgb.train({}, lgb_data, num_boost_round=5)
bst.save_model('model.txt')
...
[max_bin: 255]
...
[categorical_feature: ]
...


guolinke commented Dec 1, 2019

@StrikerRUS yes, it is a problem. Currently, we only copy the params in lgb.train/... to the Dataset, but we don't copy them from the Dataset to the Booster.


guolinke commented Dec 1, 2019

@StrikerRUS

@guolinke Just out of curiosity, what is the reason to keep text and JSON representations different?

Which part do you refer to?

StrikerRUS (Collaborator, Author):

@guolinke

yes, it is a problem. Currently, we only copy the params in lgb.train/... to the Dataset, but we don't copy them from the Dataset to the Booster.

Do you have any thoughts on how to fix it?

Which part do you refer to?

I mean, some fields that are present in the text format are not included in the JSON format, and vice versa. For example, parameters and feature importances.

import json

import numpy as np
import lightgbm as lgb

X = np.random.random((100, 2))
y = np.random.random(100)
lgb_data = lgb.Dataset(X, y)
bst = lgb.train({}, lgb_data, num_boost_round=3)
bst.save_model('save_model.txt')
with open('dump_model.json', 'w') as json_dump:
    json.dump(bst.dump_model(), json_dump, indent=2)
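
One way to surface the mismatch programmatically (the two stand-ins below are abbreviated, invented fragments, not real LightGBM output):

```python
import re

# Abbreviated stand-in for the text model file: parameters appear
# as "[name: value]" lines near the end of the file.
text_model = "\n".join([
    "tree",
    "[num_leaves: 31]",
    "[metric: l2]",
    "feature_importances:",
    "Column_0=5",
])
# Abbreviated stand-in for the dump_model() dict.
json_model = {"name": "tree", "tree_info": [], "feature_names": ["Column_0"]}

# Parameter names present in the text dump but absent from the JSON keys.
text_params = set(re.findall(r"\[(\w+):", text_model))
print(sorted(text_params - set(json_model)))  # ['metric', 'num_leaves']
```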



guolinke commented Dec 2, 2019

maybe in Booster Construct:

_safe_call(_LIB.LGBM_BoosterCreate(
    train_set.construct().handle,
    c_str(params_str),
    ctypes.byref(self.handle)))

we can construct the dataset first, then copy its params to the booster, and then construct the booster.
BTW, maybe this line in Dataset Construct is needed: https://github.com/microsoft/LightGBM/pull/2594/files#diff-732a5a5220860efcac575e9e956bbaeaR855
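
A simplified Python sketch of that ordering (plain toy classes standing in for the real ctypes-based Dataset and Booster; the merge precedence chosen here is an assumption, not something the thread settled):

```python
# Toy stand-ins for lightgbm.Dataset / lightgbm.Booster, illustrating
# the suggested order: construct the dataset first, copy its params
# into the booster's params, then construct the booster.

class ToyDataset:
    def __init__(self, params):
        self.params = params
        self.constructed = False

    def construct(self):
        # The real code creates the underlying C++ dataset handle here.
        self.constructed = True
        return self

class ToyBooster:
    def __init__(self, params, train_set):
        train_set.construct()            # 1. construct the dataset first
        merged = dict(train_set.params)  # 2. copy its resolved params over ...
        merged.update(params)            # ... booster params win on conflict (assumption)
        self.params = merged             # 3. then construct the booster with merged params

ds = ToyDataset({"max_bin": 100, "categorical_feature": [0, 1]})
bst = ToyBooster({"num_leaves": 31}, ds)
print(bst.params)
# {'max_bin': 100, 'categorical_feature': [0, 1], 'num_leaves': 31}
```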

For the field mismatch, I think it was caused by multiple contributors; we can unify them.

StrikerRUS (Collaborator, Author):

@guolinke

we can construct the dataset first, then copy its params to the booster, and then construct the booster.

OK, I see. Let's work on that within #2208 after merging this PR and #2594.

For the field mismatch, I think it is caused by multiple contributors, we can unify them.

OK, I think it'll be a good refactoring to have, and it will help a lot of third-party libraries that work with LightGBM model dumps. I'll create a separate issue for this.


guolinke commented Dec 3, 2019

@StrikerRUS actually, I think the parameter write-back should be in #2594; otherwise, the reset-config checking for the dataset may fail.

StrikerRUS (Collaborator, Author):

Blocked by #2594.

StrikerRUS (Collaborator, Author):

@guolinke
As #2594 has been merged, I think we can get back to this PR. I'm copying my old comments from the thread above to make it easier to follow the discussion.

Maybe we can add a tag, e.g. [no-save], to these parameters, and skip them in to_string.

I've added the [no-save] tag in the latest commit and applied it to the predict and convert_model tasks' params.
Maybe we can apply it to "output" training values as well? I mean, params like

// alias = model_output, model_out
// desc = filename of output model in training
// desc = **Note**: can be used only in CLI version
std::string output_model = "LightGBM_model.txt";

// alias = is_save_binary, is_save_binary_file
// desc = if ``true``, LightGBM will save the dataset (including validation data) to a binary file. This speed ups the data loading for the next time
// desc = **Note**: can be used only in CLI version; for language-specific packages you can use the correspondent function
bool save_binary = false;

// check = >0
// alias = output_freq
// desc = frequency for metric output
int metric_freq = 1;

WDYT about the list of candidate params for the tag?

guolinke (Collaborator):

Thanks @StrikerRUS, I agree with you; some of the parameters don't need to be saved:

output_model
verbosity
metric_freq
save_binary 

StrikerRUS (Collaborator, Author):

@guolinke I added some more params in the latest commit, please check. Maybe we need to ignore more params, e.g. data, valid?

@StrikerRUS StrikerRUS requested a review from guolinke March 4, 2020 21:26

guolinke commented Mar 5, 2020

@StrikerRUS I think data and valid need to be saved, for the CLI users.

guolinke (Collaborator) left a review:

LGTM

@StrikerRUS StrikerRUS merged commit ba15a16 into master Mar 6, 2020
@StrikerRUS StrikerRUS deleted the save_params branch March 6, 2020 12:48
@lock lock bot locked as resolved and limited conversation to collaborators May 5, 2020