
[RFC] Distinguish model, booster and runtime when performing IO operations. #4855

Closed
trivialfis opened this issue Sep 14, 2019 · 16 comments

@trivialfis
Member

trivialfis commented Sep 14, 2019

Background

I will use the Python package as the example for the discussion below, but it should be equally applicable to other language bindings. Right now, when XGBoost saves its "model", it saves the tree_model or linear_model, plus some attributes like max_delta_step and predictor, which are not strictly related to the "model" itself. Take a neural network as an example: the "model" is its weights and structure, and should have nothing to do with the objective function or the optimizer. Such a boundary also exists in XGBoost, where the concept of model needs to be restricted to the trees or linear weights only. I believe this is how the current IO interface was designed.

But now we have some more complicated situations, like the Booster being used in pickle, transferred between workers for distributed learning, used as a checkpoint for model recovery, etc. In some of these scenarios, the optimizer (or, in XGBoost terms, tree_method) and some objective parameters are crucial components of the saved file (or memory buffer, you know what I mean). So over time we added some of the training parameters (like the max_delta_step hyper-parameter) to the output file, and some more are coming with the excellent work by @chenqin on better rabit recovery procedures.

Also, there are generic parameters that define the runtime configuration, like gpu_id, nthread and verbosity; these are even less correlated with the concept of a "model". Saving them to disk does not really make sense. But it's sometimes necessary to serialize them in a distributed setting. For instance, right now the distributed monitor doesn't actually work as expected in a distributed environment, because verbosity is lost in the transferred booster.

Old Proposal

I previously proposed saving a "complete snapshot" of XGBoost while trying to introduce the JSON serialization format, which should mitigate the above issues. But as described, most of those parameters are not strictly related to the concept of a "model". When users only want to save the resulting model and use it for prediction, or want to share it over the internet, those parameters are not only useless but also add extra complexity.

New Proposal

So here I suggest we split the model IO into two different logics: one for the model itself, like the trees, and the other for a complete snapshot.

The first logic should be used for normal model IO, like Booster.save_model. We output only the trees and strictly related parameters like num_features: no max_delta_step, no predictor, no gpu_id. So it cannot be used for model recovery or for transferring the booster between workers during distributed training. When users want to continue training from a saved model, they need to set the parameters again via booster.set_param, since a tree is just a model and doesn't know how to train itself. And we promise to keep this output stable.

For transferring the model between workers, or doing model checkpointing, we save the "complete snapshot", including both the model (trees) and training parameters like max_delta_step, tree_method=gpu_hist, objective=softmax, etc. This output contains all the information needed to carry out training or whatever operations the booster is doing. But it should only be used to support distributed training, not model sharing or anything related, so we don't have to maintain model compatibility for it.
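
To make the split concrete, here is a rough sketch of how the two paths could look from the Python side. The file names and parameter values are made up, and the complete-snapshot path is illustrated with pickle (the mechanism mentioned above for moving boosters between workers); none of this is a committed API.

```python
import pickle

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "tree_method": "hist", "max_delta_step": 0.7}
bst = xgb.train(params, dtrain, num_boost_round=5)

# Path 1: model-only IO.  Only the trees and strictly related parameters
# (e.g. num_features) are persisted; no training or runtime parameters.
bst.save_model("xgb.model")
bst2 = xgb.Booster(model_file="xgb.model")
# To continue training, the user supplies the parameters again.
bst2 = xgb.train(params, dtrain, num_boost_round=5, xgb_model=bst2)

# Path 2: complete snapshot for checkpointing / worker-to-worker transfer.
# Sketched here with pickle; it carries the trees plus training parameters,
# and its format stays internal, so no compatibility promise is needed.
checkpoint = pickle.dumps(bst)
restored = pickle.loads(checkpoint)
```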

Problem

I'm not an expert in distributed computation, so my thoughts could be naive. Looking forward to your feedback.

@trivialfis trivialfis changed the title [RFC] Distinguish model, booster and runtime when performing IO actions. [RFC] Distinguish model, booster and runtime when performing IO operations. Sep 14, 2019

@CodingCat
Member

CodingCat commented Sep 14, 2019

I don’t have any strong opinion on this, but I was actually not aware that we are adding training parameters to the booster directly.

Before we materialize the proposal here, if we want to, can we stop doing that? It breaks booster compatibility across versions.

For xgb-spark, the way I handled this is by keeping everything in the Spark layer.

If you have special requirements for the bindings you work on (GPU, Dask) or for the rabit work, have your own param system; and if we are adding anything related to rabit to the xgboost params, we should definitely say no.

@trivialfis
Member Author

@CodingCat See IO logic here:

auto it = cfg_.find("max_delta_step");

extra_attr.emplace_back("SAVED_PARAM_" + key, it->second);

and many more in the Save and Load logic. Also the changes in #4808. It's quite a headache.
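
For readers following along, here is a simplified Python rendering (not the actual C++ code) of what the SAVED_PARAM_ workaround does: a few selected training parameters are copied into the attribute section of the saved model under a prefixed key and recovered from there on load. The SAVED_PARAMS list below is only an illustrative subset.

```python
SAVED_PARAMS = ["max_delta_step", "predictor"]  # illustrative subset

def save_extra_attrs(cfg):
    """Stash selected training params alongside the model attributes."""
    return {"SAVED_PARAM_" + k: cfg[k] for k in SAVED_PARAMS if k in cfg}

def load_extra_attrs(attrs):
    """Recover the stashed params so a loaded booster can keep training."""
    prefix = "SAVED_PARAM_"
    return {k[len(prefix):]: v for k, v in attrs.items() if k.startswith(prefix)}

attrs = save_extra_attrs({"max_delta_step": "0.7", "predictor": "gpu_predictor"})
assert load_extra_attrs(attrs) == {"max_delta_step": "0.7", "predictor": "gpu_predictor"}
```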

@trivialfis
Member Author

@CodingCat I traced down some of those commits with git blame. Right now I don't have a better solution for working with these parameters. Are you suggesting that we should make a copy of all parameters in the language bindings?

@CodingCat
Member

CodingCat commented Sep 14, 2019

My idea is as follows (forget about my comments on GPU; it's different from Dask, Spark, etc.):

(1) we should keep what is output from the code you referred to general to all bindings

(2) each binding will have some special requirements, e.g. Spark needs to decide whether to cache the input dataset; none of those params should go into the serialized booster

(3) to interact with the “core” params in xgb, yes, you need a way to pass params from your binding’s param system to the “core” xgb params

@trivialfis
Member Author

we should keep what is output from the code you referred to general to all bindings

Definitely.

The problem is whether we should split the model IO into two separate code paths: one saves only the model (trees), while the other also saves all those core parameters, to meet the requirements of the distributed setting.

@CodingCat
Member

I was actually not aware that we are adding training parameters to the booster directly

I mean I was unaware that we are adding new params to that (yes, I know we already have a bunch of params there...).

@CodingCat
Member

The problem is whether we should split the model IO into two separate code paths: one saves only the model (trees), while the other also saves all those core parameters, to meet the requirements of the distributed setting.

For this, my personal suggestion is:

(1) scope how many special parameters (for Dask, Spark, rabit, etc.) have already been added, and see if we can clean them up

(2) stop adding more

(3) to minimize the workload and the changes to the model format, keep all the parameters remaining after a potential cleanup in (1), and ensure that everything there is general to all bindings (e.g. everyone needs max_delta_step, but not everyone needs a rabit thing or a Spark thing, etc.)

@chenqin
Contributor

chenqin commented Sep 14, 2019

I recall both R and Python make some assumptions about how sequential IO is performed in the C++ layer.
The challenges of cleaning this up are:

  • parameters sometimes get overwritten based on the run-time environment
  • parameters are stored with the booster without clear section and offset information

We might consider adding a header section from which all bindings can fetch a given section's payload, starting from the offsets recorded in the metadata, similar to
https://en.wikipedia.org/wiki/Executable_and_Linkable_Format
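
Purely as an illustration of the header-plus-offsets idea (this is not an actual or proposed XGBoost format), here is a sketch of a file where a small JSON header records each section's offset and length, so a binding can seek straight to the payload it needs:

```python
import json
import struct

def write_with_header(path, sections):
    """sections: mapping of section name -> bytes payload."""
    header, offset = {}, 0
    for name, payload in sections.items():
        header[name] = {"offset": offset, "length": len(payload)}
        offset += len(payload)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # header size comes first
        f.write(header_bytes)
        f.write(b"".join(sections.values()))

def read_section(path, name):
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
        f.seek(8 + header_len + header[name]["offset"])
        return f.read(header[name]["length"])

write_with_header("booster.bin", {"model": b"<trees>", "train_params": b"<params>"})
assert read_section("booster.bin", "model") == b"<trees>"
```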

@trivialfis
Member Author

trivialfis commented Sep 14, 2019

@CodingCat

scope how many special parameters (for Dask, Spark, rabit, etc.) have already been added, and see if we can clean them up

We don't have any binding-specific parameters in the C++ code. No need to worry about that. ;-)

Let me clarify a little bit more. I want to split the interface into two separate IO logics: one saves only the "trees", and the other saves everything, i.e. the core parameters along with the trees.

For the first IO interface, say when I call bst.save_model, only a bunch of binary trees is saved. It cannot be used to resume training without the user providing the appropriate parameters again, since it's just the model: no max_delta_step, no predictor, no gpu_id, etc., only the trained trees.

The other one saves everything, including predictor, max_delta_step, objective=softmax and, of course, the trees. It can be used for model recovery since it contains all the information needed to carry out training. But normally you don't share this one on the internet...
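
As a rough illustration of how the two outputs could differ (the key names and values below are made up, not a committed schema):

```python
# Shareable, stability-guaranteed output: just the model.
model_only = {
    "num_features": 127,
    "trees": ["<tree 0>", "<tree 1>", "..."],
}

# Internal "complete snapshot": the model plus whatever is needed to resume
# training, with no compatibility promise across versions.
complete_snapshot = {
    "model": model_only,
    "training_params": {
        "max_delta_step": 0.7,
        "tree_method": "gpu_hist",
        "objective": "multi:softmax",
        "predictor": "gpu_predictor",
    },
    "runtime": {"gpu_id": 0, "nthread": 8, "verbosity": 2},
}
```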

@trivialfis
Member Author

@chenqin I'm definitely going with JSON for this. If a binary model is preferred, I would go with binary JSON; search for the term "BSON".

@trivialfis
Member Author

trivialfis commented Sep 14, 2019

One extra benefit of doing this is that we only need to keep model compatibility for the bare tree model, since that's the format actually saved to disk and presented to users. The "save everything" option should only be used for model recovery or debugging, so we can keep it internal and never worry about its compatibility.

@hcho3
Collaborator

hcho3 commented Sep 15, 2019

@trivialfis +1 for the proposal. I introduced the SAVED_PARAM_ hack a while ago to preserve the state of the predictor param. Your new proposal will reduce the complexity of the IO logic while enabling the "complete snapshot", thus killing two birds with one stone.

@CodingCat I don't think the proposal has anything to do with binding-specific parameters. Even when we only look at the C++ code, the current mix of model and training parameters presents an issue.

@thvasilo
Contributor

+1 from me too. One situation I've encountered in the past has to do with model management or "provenance": I find a model artifact I trained in the past that is still somehow useful, but I can't be sure how it was trained.

These days I'll usually dump the entire training parameter dictionary to a JSON file so that in the future I can make sense of the context around the model artifact. I think this proposal would help with that situation too, which I think is becoming common in ML pipelines.
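
For reference, that workaround might look roughly like the following (file names are arbitrary):

```python
import json

import numpy as np
import xgboost as xgb

params = {"objective": "binary:logistic", "max_depth": 6, "eta": 0.1}
dtrain = xgb.DMatrix(np.random.rand(50, 3), label=np.random.randint(0, 2, size=50))
bst = xgb.train(params, dtrain, num_boost_round=10)

bst.save_model("model.bin")
# Keep the training context next to the artifact so its provenance is recoverable.
with open("model.params.json", "w") as f:
    json.dump(params, f, indent=2)
```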

@chenqin
Contributor

chenqin commented Sep 16, 2019

+1 for putting in additional information to make the snapshot self-contained. One possible application we might consider is debugging a retrained model: we could load a previous snapshot, "retrain", and compare side by side how the new trees are shaped versus the old trees.
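
A hypothetical debugging flow along those lines, assuming both snapshots can be loaded as boosters (the file names below are made up), could diff the text dumps tree by tree:

```python
import xgboost as xgb

old = xgb.Booster(model_file="snapshot_old.bin")
new = xgb.Booster(model_file="snapshot_retrained.bin")

# Booster.get_dump() returns one text representation per tree.
for i, (a, b) in enumerate(zip(old.get_dump(), new.get_dump())):
    if a != b:
        print(f"tree {i} changed:\n--- old ---\n{a}\n--- new ---\n{b}")
```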

@trivialfis
Member Author

Working on it. Might take some time due to the massive changes.

trivialfis added a commit to trivialfis/xgboost that referenced this issue Oct 14, 2019
* Apply Configurable to objective functions.
* Apply Model to Learner and Regtree, gbm.
* Add Load/SaveConfig to objs.
* Refactor obj tests to use smart pointer.
* Dummy methods for Save/Load Model.

A small portion of dmlc#4732.  Also related to dmlc#4855.
@lock lock bot locked as resolved and limited conversation to collaborators Mar 17, 2020