[RFC] Distinguish model, booster and runtime when performing IO operations. #4855
Comments
I don't have any strong opinion on this, but I am actually not aware that we are adding training parameters to the booster directly. Before we materialize the proposal here, can we stop doing that? It is breaking compatibility of boosters across versions. For xgb-spark, the way I handled this is by keeping everything in the Spark layer. If you have special requirements for the bindings you work on (gpu, dask, and rabit work; and if we are adding anything related to rabit to the xgboost params, we should definitely say no), then have your own param system.
@CodingCat See the IO logic here: Line 329 in c89bcc4, Line 342 in c89bcc4, and many more in the …
@CodingCat I tracked down some of those commits with git blame. Right now I don't have a better solution for working with these parameters. Are you suggesting that we should make a copy of all parameters in the language bindings?
My idea is this (forget about my comments on gpu; it's different from dask, spark, etc.):
(1) whatever is output from the code you referred to should be kept general to all bindings;
(2) each binding will have some special requirements, e.g. Spark needs to decide whether to cache the input dataset; none of those params should go into the booster's serialized form;
(3) to interact with the "core" params in xgb, yes, you need a way to pass params from your binding's param system to the "core" xgb params.
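A minimal sketch of what (3) could look like from the Python side, assuming plain xgboost APIs; the binding-level parameter names here are made up for illustration:

```python
import numpy as np
import xgboost as xgb

# Binding-specific parameters (illustrative names only): e.g. whether the
# Spark layer caches the input dataset. These never reach the booster and
# are never serialized with it.
binding_params = {"cache_training_dataset": True}

# "Core" XGBoost parameters that are meaningful to every binding.
core_params = {"max_depth": 6, "eta": 0.1, "max_delta_step": 1}

X, y = np.random.rand(100, 10), np.random.randint(2, size=100)
dtrain = xgb.DMatrix(X, label=y)

# Only the core parameters are handed down to the C++ layer.
booster = xgb.train(core_params, dtrain, num_boost_round=10)

# Later adjustments still go through the core parameter interface.
booster.set_param("max_delta_step", 2)
```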
Definitely. The problem is whether we should split model IO into two separate code paths: one saves the model (trees) only, while the other also saves all the core parameters to meet the requirements of the distributed setting.
I mean I am unaware that we are adding new params to that (yes, I know we already have a bunch of params there...).
For this, my personal suggestion is:
(1) scope how many special parameters (for dask, spark, rabit, etc.) have already been added, and see if we can clean them up;
(2) stop adding more;
(3) to minimize the workload and the changes to the model format, keep whatever parameters remain after the potential cleanup in (1), and ensure that everything there is general to all bindings (e.g. everyone needs a max_delta_step, but not everyone needs a rabit thing or a spark thing).
I recall both R and Python have some assumptions about how sequential IO is performed in the C++ layer. We might consider adding a header section so that all bindings can fetch specific sections; each section's payload would start from an offset recorded in the metadata.
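One possible shape for such a header, sketched with Python's struct module; the magic string, section names, and field widths are all made up for illustration:

```python
import struct

# Hypothetical layout: a small metadata header listing each section's name,
# byte offset and length, so a binding can seek straight to the payload it
# cares about.
MAGIC = b"XGBM"
sections = [
    (b"model",   128, 4096),   # trees / linear weights
    (b"params",  4224, 512),   # training parameters
    (b"runtime", 4736, 64),    # runtime configuration
]

header = struct.pack("<4sI", MAGIC, len(sections))
for name, offset, length in sections:
    # 16-byte name field (null padded), 64-bit offset, 64-bit length.
    header += struct.pack("<16sQQ", name, offset, length)

# A reader unpacks the header first, then seeks to the offset of the one
# section it needs without parsing the rest of the file.
magic, n = struct.unpack_from("<4sI", header, 0)
entries = [struct.unpack_from("<16sQQ", header, 8 + i * 32) for i in range(n)]
```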
We don't have any binding-specific parameters in the C++ code. No need to worry about that. ;-) Let me clarify a little bit more. I want to split the interface into two separate IO logics: one saves the "trees" only, and the other saves everything, i.e. the core parameters along with the trees. The first IO interface is used when I perform Booster.save_model, for example. The other one saves everything, including the training parameters.
@chenqin I'm definitely going with JSON for this. If a binary format is preferred, I would go with binary JSON; search for the term "bson".
One extra benefit of doing this is that we only need to keep model compatibility for the bare tree model, since that's the format actually saved to disk and presented to users. The "save everything" option should only be used for model recovery or debugging, so we can keep it internal and never worry about its compatibility.
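To make the distinction concrete, a rough sketch of what the two payloads could contain; every field name here is hypothetical, not an agreed format:

```python
# Hypothetical contents of the two outputs; all field names are illustrative.

# 1. Bare model: what Booster.save_model would emit. Kept stable across
#    versions, safe to share and use for prediction.
bare_model = {
    "version": 1,
    "num_features": 127,
    "trees": ["<tree dump>", "<tree dump>"],
}

# 2. Complete snapshot: model plus everything needed to resume training.
#    Internal format for checkpointing / worker transfer; no compatibility
#    promise.
complete_snapshot = {
    "model": bare_model,
    "training_params": {"tree_method": "gpu_hist", "max_delta_step": 1,
                        "objective": "multi:softmax"},
    "runtime_params": {"gpu_id": 0, "nthread": 8, "verbosity": 2},
}
```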
@trivialfis +1 for the proposal. I introduced … @CodingCat I don't think the proposal has anything to do with binding-specific parameters. Even when we only look at the C++ code, the current mix of model and training parameters presents an issue.
+1 from me too. One situation I've encountered in the past had to do with model management or "provenance": I find a model artifact I trained in the past that is still somehow useful, but I can't be sure how it was trained. These days I'll usually dump the entire training parameter dictionary to a JSON file so that in the future I can make sense of the context around the model artifact. I think this proposal would help with that situation too, which I think is becoming common in ML pipelines.
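For reference, that workaround might look roughly like this (file names are arbitrary):

```python
import json

import numpy as np
import xgboost as xgb

params = {"max_depth": 6, "eta": 0.1, "objective": "binary:logistic"}
X, y = np.random.rand(100, 10), np.random.randint(2, size=100)
booster = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=10)

# Save the model artifact itself...
booster.save_model("model.json")

# ...and dump the training parameters next to it, so the context of how the
# model was trained is not lost when the artifact resurfaces later.
with open("model.params.json", "w") as fd:
    json.dump(params, fd, indent=2)
```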
+1 for putting in additional information to make the snapshot self-contained. One possible application we might consider is debugging a retrained model: we could load a previous snapshot, retrain, and compare side by side how the new trees were shaped relative to the old ones.
Working on it. Might take some time due to the massive changes.
Background
I will use the Python package as the example for the discussion below, but it should be equally applicable to other language bindings.

Right now when XGBoost saves its "model", it saves the tree_model or linear_model, plus some attributes like max_delta_step and predictor, which are not strictly related to the "model" itself. Take a neural network as an example: the "model" should be its weights and structure, and should have nothing to do with the objective function or the optimizer. Such a boundary also exists in XGBoost, where the concept of a model should be restricted to trees or linear weights only. I believe this is how the current IO interface was designed.

But now we have some more complicated situations: the Booster is used in pickle, transferred between workers for distributed learning, used as a checkpoint for model recovery, etc. In some of these scenarios, the optimizer (or, in XGBoost terms, the tree_method) and some objective parameters are crucial components of the saved file (or memory buffer, you know what I mean). So over time we added some training parameters (like the max_delta_step hyper-parameter) into the output file, and more are coming with the excellent work by @chenqin for better rabit recovery procedures.

Also, there are generic parameters that define the runtime configuration, like gpu_id, nthread and verbosity. These are even less related to the concept of "model", and saving them to disk doesn't really make sense. But it's sometimes necessary to serialize them in a distributed setting. For instance, right now the monitor doesn't work as expected in a distributed environment because verbosity is lost in the transferred booster.
Old Proposal
I previously proposed saving a "complete snapshot" of XGBoost while trying to introduce the JSON serialization format, which should mitigate the above issues. But as described, most of those parameters are not strictly related to the concept of "model". When a user only wants to save the resulting model and use it for prediction, or share it over the internet, those parameters are not only useless but also add extra complexity.
New one
So here I suggest we split model IO into two different logics: one for the model itself (the trees), and the other for a complete closure.
The first logic should be used in normal model IO, like Booster.save_model. We output only the trees and strictly related parameters like num_features. No max_delta_step, no predictor, no gpu_id. So it cannot be used for model recovery or for transferring between workers during distributed training. When users want to continue training on a saved model, they need to set the parameters again via booster.set_param, since a tree is just a model and doesn't know how to train itself. And we promise to keep this output stable.

For transferring the model between workers, or for model checkpointing, we save the "complete snapshot", including both the model (trees) and training parameters like max_delta_step, tree_method=gpu_hist, objective=softmax, etc. This output contains all the information needed to carry out training or whatever operations the booster is doing. But it should only be used to support distributed training, not for model sharing or anything related, so we don't have to maintain model compatibility for it.
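To make the intended workflow concrete, here is a rough sketch from the Python side; it uses the existing save_model, xgb_model continuation, and pickle APIs, and the exact mechanism for the complete snapshot is still an open question:

```python
import pickle

import numpy as np
import xgboost as xgb

X, y = np.random.rand(100, 10), np.random.randint(2, size=100)
dtrain = xgb.DMatrix(X, label=y)

params = {"max_delta_step": 1, "tree_method": "hist",
          "objective": "binary:logistic"}
booster = xgb.train(params, dtrain, num_boost_round=10)

# First logic: persist only the trees and strictly model-related parameters.
booster.save_model("trees_only.json")

# Continuing training from that file means supplying the training
# parameters again, because the bare model no longer carries them.
continued = xgb.train(params, dtrain, num_boost_round=5,
                      xgb_model="trees_only.json")

# Second logic (internal, no compatibility promise): a complete snapshot
# for checkpointing or transferring between workers, e.g. via pickling.
snapshot = pickle.dumps(booster)
restored = pickle.loads(snapshot)
```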
Problem
I'm not an expert in distributed computation, so my thoughts could be naive. Looking forward to your feedback.