
[RFC] Distinguish model, booster and runtime when performing IO operations. #4855

Closed
trivialfis opened this issue Sep 14, 2019 · 16 comments

@trivialfis
Member

trivialfis commented Sep 14, 2019

Background

I will use the Python package as the example for the discussion below, but it should be equally applicable to other language bindings. Right now, when XGBoost saves its "model", it saves the tree_model or linear_model, plus some attributes like max_delta_step and predictor, which are not strictly related to the "model" itself. Take a neural network as an example: the "model" is its weights and structure, and should have nothing to do with the objective function or the optimizer. Such a boundary also exists in XGBoost, where the concept of model needs to be restricted to the trees or linear weights only. I believe this is how the current IO interface was designed.

But now we have some more complicated situations, like the Booster being used in pickle, transferred between workers for distributed learning, used as a checkpoint for model recovery, etc. In some of these scenarios, the optimizer (or, in XGBoost terms, tree_method) and some objective parameters are crucial components of the saved file (or memory buffer, you know what I mean). So over time we added some of the training parameters (like the max_delta_step hyper-parameter) to the output file, and some more are coming with the excellent work by @chenqin on better rabit recovery procedures.

Also, there are generic parameters that define the runtime configuration, like gpu_id, nthread and verbosity; these are even less correlated with the concept of a "model". Saving them to disk does not really make sense. But it's sometimes necessary to serialize them in a distributed setting. For instance, right now the distributed monitor doesn't actually work as expected in a distributed environment, because verbosity is lost in the transferred booster.

Old Proposal

I previously proposed saving a "complete snapshot" of XGBoost while trying to introduce the JSON serialization format, which should mitigate the above issues. But as described, most of those parameters are not strictly related to the concept of a "model". When users only want to save the resulting model and use it for prediction, or want to share it over the internet, those parameters are not only useless but also add extra complexity.

New Proposal

So here I suggest we split the model IO into two different logics: one for the model itself, like the trees, and the other for a complete snapshot.

The first logic should be used for normal model IO, like Booster.save_model. We output only the trees and strictly related parameters like num_features: no max_delta_step, no predictor, no gpu_id. So it cannot be used for model recovery or for transferring the booster between workers during distributed training. When users want to continue training from a saved model, they need to set the parameters again via booster.set_param, since a tree is just a model and doesn't know how to train itself. And we promise to keep this output stable.

For transferring the model between workers, or doing model checkpointing, we save the "complete snapshot", including both the model (trees) and training parameters like max_delta_step, tree_method=gpu_hist, objective=softmax, etc. This output contains all the information needed to carry out training or whatever operations the booster is doing. But it should only be used to support distributed training, not model sharing or anything related, so we don't have to maintain model compatibility for it.
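
To make the split concrete, here is a rough sketch of how the two paths could look from the Python side. The file names and parameter values are made up, and the complete-snapshot path is illustrated with pickle (the mechanism mentioned above for moving boosters between workers); none of this is a committed API.

```python
import pickle

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "tree_method": "hist", "max_delta_step": 0.7}
bst = xgb.train(params, dtrain, num_boost_round=5)

# Path 1: model-only IO.  Only the trees and strictly related parameters
# (e.g. num_features) are persisted; no training or runtime parameters.
bst.save_model("xgb.model")
bst2 = xgb.Booster(model_file="xgb.model")
# To continue training, the user supplies the parameters again.
bst2 = xgb.train(params, dtrain, num_boost_round=5, xgb_model=bst2)

# Path 2: complete snapshot for checkpointing / worker-to-worker transfer.
# Sketched here with pickle; it carries the trees plus training parameters,
# and its format stays internal, so no compatibility promise is needed.
checkpoint = pickle.dumps(bst)
restored = pickle.loads(checkpoint)
```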

Problem

I'm not an expert in distributed computation, so my thoughts could be naive. Looking forward to your feedback.

@trivialfis trivialfis changed the title [RFC] Distinguish model, booster and runtime when performing IO actions. [RFC] Distinguish model, booster and runtime when performing IO operations. Sep 14, 2019

@CodingCat
Member

CodingCat commented Sep 14, 2019

I don’t have any strong opinion on this, but I was actually not aware that we are adding training parameters to the booster directly.

Before we materialize the proposal here, if we want to, can we stop doing that? It breaks booster compatibility across versions.

For xgb-spark, the way I handled this is by keeping everything in the Spark layer.

If you have special requirements for the bindings you work on (GPU, Dask) or for the rabit work, have your own param system; and if we are adding anything related to rabit to the xgboost params, we should definitely say no.

@trivialfis
Member Author

@CodingCat See IO logic here:

auto it = cfg_.find("max_delta_step");

extra_attr.emplace_back("SAVED_PARAM_" + key, it->second);

and many more in the Save and Load logic. Also the changes in #4808. It's quite a headache.
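
For readers following along, here is a simplified Python rendering (not the actual C++ code) of what the SAVED_PARAM_ workaround does: a few selected training parameters are copied into the attribute section of the saved model under a prefixed key and recovered from there on load. The SAVED_PARAMS list below is only an illustrative subset.

```python
SAVED_PARAMS = ["max_delta_step", "predictor"]  # illustrative subset

def save_extra_attrs(cfg):
    """Stash selected training params alongside the model attributes."""
    return {"SAVED_PARAM_" + k: cfg[k] for k in SAVED_PARAMS if k in cfg}

def load_extra_attrs(attrs):
    """Recover the stashed params so a loaded booster can keep training."""
    prefix = "SAVED_PARAM_"
    return {k[len(prefix):]: v for k, v in attrs.items() if k.startswith(prefix)}

attrs = save_extra_attrs({"max_delta_step": "0.7", "predictor": "gpu_predictor"})
assert load_extra_attrs(attrs) == {"max_delta_step": "0.7", "predictor": "gpu_predictor"}
```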

@trivialfis
Member Author

@CodingCat I traced down some of those commits with git blame. Right now I don't have a better solution for working with these parameters. Are you suggesting that we should make a copy of all parameters in the language bindings?

@CodingCat
Member

CodingCat commented Sep 14, 2019

My idea is as follows (forget about my comments on GPU; it's different from Dask, Spark, etc.):

(1) we should keep what is output from the code you referred to general to all bindings

(2) each binding will have some special requirements, e.g. Spark needs to decide whether to cache the input dataset; none of those params should go into the serialized booster

(3) to interact with the “core” params in xgb, yes, you need a way to pass params from your binding’s param system to the “core” xgb params

@trivialfis
Member Author

we should keep what is output from the code you referred to general to all bindings

Definitely.

The problem is whether we should split the model IO into two separate code paths: one saves only the model (trees), while the other also saves all those core parameters, to meet the requirements of the distributed setting.

@CodingCat
Member

I was actually not aware that we are adding training parameters to the booster directly

I mean I was unaware that we are adding new params to that (yes, I know we already have a bunch of params there...).

@CodingCat
Member

The problem is whether we should split the model IO into two separate code paths: one saves only the model (trees), while the other also saves all those core parameters, to meet the requirements of the distributed setting.

For this, my personal suggestion is:

(1) scope how many special parameters (for Dask, Spark, rabit, etc.) have already been added, and see if we can clean them up

(2) stop adding more

(3) to minimize the workload and the changes to the model format, keep all the parameters remaining after a potential cleanup in (1), and ensure that everything there is general to all bindings (e.g. everyone needs max_delta_step, but not everyone needs a rabit thing or a Spark thing, etc.)

@chenqin
Contributor

chenqin commented Sep 14, 2019

I recall both R and Python make some assumptions about how sequential IO is performed in the C++ layer.
The challenges of cleaning this up are:

  • parameters sometimes get overwritten based on the run-time environment
  • parameters are stored with the booster without clear section and offset information

We might consider adding a header section from which all bindings can fetch a given section's payload, starting from the offsets recorded in the metadata, similar to
https://en.wikipedia.org/wiki/Executable_and_Linkable_Format
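
Purely as an illustration of the header-plus-offsets idea (this is not an actual or proposed XGBoost format), here is a sketch of a file where a small JSON header records each section's offset and length, so a binding can seek straight to the payload it needs:

```python
import json
import struct

def write_with_header(path, sections):
    """sections: mapping of section name -> bytes payload."""
    header, offset = {}, 0
    for name, payload in sections.items():
        header[name] = {"offset": offset, "length": len(payload)}
        offset += len(payload)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # header size comes first
        f.write(header_bytes)
        f.write(b"".join(sections.values()))

def read_section(path, name):
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
        f.seek(8 + header_len + header[name]["offset"])
        return f.read(header[name]["length"])

write_with_header("booster.bin", {"model": b"<trees>", "train_params": b"<params>"})
assert read_section("booster.bin", "model") == b"<trees>"
```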

@trivialfis
Member Author

trivialfis commented Sep 14, 2019

@CodingCat

scope how many special parameters (for Dask, Spark, rabit, etc.) have already been added, and see if we can clean them up

We don't have any binding-specific parameters in the C++ code. No need to worry about that. ;-)

Let me clarify a little bit more. I want to split the interface into two separate IO logics: one saves only the "trees", and the other saves everything, i.e. the core parameters along with the trees.

For the first IO interface, say when I call bst.save_model, only a bunch of binary trees is saved. It cannot be used to resume training without the user providing the appropriate parameters again, since it's just the model: no max_delta_step, no predictor, no gpu_id, etc., only the trained trees.

The other one saves everything, including predictor, max_delta_step, objective=softmax and, of course, the trees. It can be used for model recovery since it contains all the information needed to carry out training. But normally you don't share this one on the internet...
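
As a rough illustration of how the two outputs could differ (the key names and values below are made up, not a committed schema):

```python
# Shareable, stability-guaranteed output: just the model.
model_only = {
    "num_features": 127,
    "trees": ["<tree 0>", "<tree 1>", "..."],
}

# Internal "complete snapshot": the model plus whatever is needed to resume
# training, with no compatibility promise across versions.
complete_snapshot = {
    "model": model_only,
    "training_params": {
        "max_delta_step": 0.7,
        "tree_method": "gpu_hist",
        "objective": "multi:softmax",
        "predictor": "gpu_predictor",
    },
    "runtime": {"gpu_id": 0, "nthread": 8, "verbosity": 2},
}
```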

@trivialfis
Member Author

@chenqin I'm definitely going with JSON for this. If a binary model is preferred, I would go with binary JSON; search for the term "BSON".

@trivialfis
Member Author

trivialfis commented Sep 14, 2019

One extra benefit of doing this is that we only need to keep model compatibility for the bare tree model, since that's the format actually saved to disk and presented to users. The "save everything" option should only be used for model recovery or debugging, so we can keep it internal and never worry about its compatibility.

@hcho3
Collaborator

hcho3 commented Sep 15, 2019

@trivialfis +1 for the proposal. I introduced the SAVED_PARAM_ hack a while ago to preserve the state of the predictor param. Your new proposal will reduce the complexity of the IO logic while enabling the "complete snapshot", thus killing two birds with one stone.

@CodingCat I don't think the proposal has anything to do with binding-specific parameters. Even when we only look at the C++ code, the current mix of model and training parameters presents an issue.

@thvasilo
Contributor

+1 from me too. One situation I've encountered in the past has to do with model management or "provenance": I find a model artifact I trained in the past that is still somehow useful, but I can't be sure how it was trained.

These days I'll usually dump the entire training parameter dictionary to a JSON file so that in the future I can make sense of the context around the model artifact. I think this proposal would help with that situation too, which I think is becoming common in ML pipelines.
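
For reference, that workaround might look roughly like the following (file names are arbitrary):

```python
import json

import numpy as np
import xgboost as xgb

params = {"objective": "binary:logistic", "max_depth": 6, "eta": 0.1}
dtrain = xgb.DMatrix(np.random.rand(50, 3), label=np.random.randint(0, 2, size=50))
bst = xgb.train(params, dtrain, num_boost_round=10)

bst.save_model("model.bin")
# Keep the training context next to the artifact so its provenance is recoverable.
with open("model.params.json", "w") as f:
    json.dump(params, f, indent=2)
```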

@chenqin
Contributor

chenqin commented Sep 16, 2019

+1 for putting in additional information to make the snapshot self-contained. One possible application we might consider is debugging a retrained model: we could load a previous snapshot, "retrain", and compare side by side how the new trees are shaped versus the old trees.
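
A hypothetical debugging flow along those lines, assuming both snapshots can be loaded as boosters (the file names below are made up), could diff the text dumps tree by tree:

```python
import xgboost as xgb

old = xgb.Booster(model_file="snapshot_old.bin")
new = xgb.Booster(model_file="snapshot_retrained.bin")

# Booster.get_dump() returns one text representation per tree.
for i, (a, b) in enumerate(zip(old.get_dump(), new.get_dump())):
    if a != b:
        print(f"tree {i} changed:\n--- old ---\n{a}\n--- new ---\n{b}")
```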

@trivialfis
Member Author

Working on it. Might take some time due to the massive changes.

trivialfis added a commit to trivialfis/xgboost that referenced this issue Oct 14, 2019
* Apply Configurable to objective functions.
* Apply Model to Learner and Regtree, gbm.
* Add Load/SaveConfig to objs.
* Refactor obj tests to use smart pointer.
* Dummy methods for Save/Load Model.

A small portion of dmlc#4732.  Also related to dmlc#4855.
@lock lock bot locked as resolved and limited conversation to collaborators Mar 17, 2020