trivialfis changed the title from "Add documentation warning for saving Spark trained model with non-default missing value in native format" to "[jvm-package] Add documentation warning for saving Spark trained model with non-default missing value in native format" on Aug 4, 2019
When training XGBoost models on Spark, it is possible to set the value of "missing" as one of the model's parameters (described at https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values). If you save the model in the format needed to load it in other bindings (via https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#interact-with-other-bindings-of-xgboost), this missing parameter gets dropped (the general problem of parameters not being included in the model is discussed in #4104). If the native model is then loaded, either on another platform or even back into Spark, the absence of the missing parameter will cause predictions to be inaccurate, because values that were treated as missing during training are scored as ordinary feature values. An especially confusing aspect is that in Python the missing parameter belongs to the DMatrix, so it is a property of the dataset fed to XGBoost, whereas in Spark it is part of the model's parameters, so it is a property of the model. That makes it easy to forget to set the parameter when constructing your DMatrix in Python, since it seems like it would already be baked into the model.
I imagine that including the value of the missing parameter along with the model is blocked by #3980. In the meantime, would it be possible to add a note to the documentation page about dealing with missing values (https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values) saying that, when a model is saved in native format, care needs to be taken to set the missing parameter correctly on the other side?
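To make the failure mode described above concrete, here is a minimal toy sketch in plain Python. It is not XGBoost's implementation: `encode`, `stump_predict`, the sentinel `-999.0`, the threshold and the leaf values are all hypothetical, chosen only for illustration. The assumption it models is that a tree sends missing entries down a learned default branch, while any value not declared as missing is compared against the split threshold like an ordinary number.

```python
import math

SENTINEL = -999.0  # hypothetical "missing" value used when the model was trained


def encode(row, missing):
    """Toy stand-in for dataset construction: NaN entries and entries equal to
    `missing` are marked as absent (None); everything else is kept as-is."""
    def is_absent(value):
        return math.isnan(value) or value == missing
    return [None if is_absent(value) else value for value in row]


def stump_predict(value, threshold=0.5, default_left=False,
                  left_leaf=0.1, right_leaf=0.9):
    """Toy one-split tree: absent values follow the default direction,
    present values are compared against the threshold."""
    if value is None:
        go_left = default_left
    else:
        go_left = value < threshold
    return left_leaf if go_left else right_leaf


row = [SENTINEL]

# Sentinel declared on the scoring side: the entry is absent and follows
# the default (right) branch.
declared = stump_predict(encode(row, missing=SENTINEL)[0])

# Sentinel not declared (only NaN counts as missing): -999.0 is treated as a
# real value, is smaller than the threshold, and goes left instead.
not_declared = stump_predict(encode(row, missing=float("nan"))[0])

print(declared, not_declared)  # 0.9 0.1
```

The same row yields two different predictions depending only on whether the scoring side was told about the sentinel, which is why the value of missing has to be carried over by hand when the model is loaded elsewhere.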