[jvm-packages] Models saved using xgboost4j-spark cannot be loaded in Python xgboost #2480
@CodingCat reading your docs on using XGBoost with Spark, I noticed that you stay within the MLlib environment. That works well for offline work but doesn't address online prediction scenarios. Have you had success loading XGBoost models built with Spark in other XGBoost libraries?
A saved XGBoostModel can only be read within XGBoost-Spark, but if you call XGBoostModel.booster().save(), the output will be usable by other modules.
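For concreteness, a minimal hedged sketch of that export path, assuming an already-trained 0.7x-era xgboost4j-spark XGBoostModel named xgboostModel; the output path is a placeholder:

```scala
// Minimal sketch: `xgboostModel` is an already-trained 0.7x-era
// xgboost4j-spark XGBoostModel (training omitted). xgboostModel.save(...)
// writes the Spark/MLlib-specific format; saving the underlying native
// booster instead produces a plain XGBoost binary file that the Python, R,
// and CLI bindings can read.
val exportPath = "/tmp/xgb-native.model"   // placeholder local path
xgboostModel.booster.saveModel(exportPath)
```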
@ssimeonov please see #2265 and #2115. Regarding online prediction, do you mean streaming retraining or only low-latency evaluation?
@CodingCat thanks.
@geoHeil I mean low-latency evaluation.
@ssimeonov maybe https://www.slideshare.net/GeorgHeiler/machine-learning-model-to-production from the Hadoop User Group Vienna is interesting. For xgb in particular, see https://github.com/komiya-atsushi/xgboost-predictor-java. In general there is a trade-off between using a different (fast) code base for one-off predictions vs. batch training.
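For the low-latency case, a hedged sketch of single-row scoring with plain xgboost4j (no Spark), assuming a booster exported as above; the model path, feature values, and dimensions are placeholders:

```scala
// Hedged sketch of low-latency, single-instance evaluation with plain
// xgboost4j. The path and feature values are made up for illustration.
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

val booster = XGBoost.loadModel("/tmp/xgb-native.model")

// One dense row with 3 features; DMatrix(data, nrow, ncol) takes row-major floats.
val features = Array(0.5f, 1.2f, 0.0f)
val single = new DMatrix(features, 1, 3)

val prediction = booster.predict(single)   // Array[Array[Float]], one row here
println(prediction(0)(0))
```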
Hi, even when I used xgboostModel.booster.saveModel(), model loading still failed, throwing an "Unknown gbm type" error. Please help.
@ssimeonov xgboostModel.booster.saveModel("/tmp/xgbm") succeeds. However, even though Python's booster loads successfully, the probability predicted by the Spark booster is not the same as the probability predicted by the Python booster, even on the same instance. Are you facing this issue?
@devhaufer, yeah, this is the biggest concern: our business partners are using the standalone version and we are using the Spark distributed version, and during validation they find more than a 30% gap in predicted probability for the same instance of data.
For anyone facing the inconsistent prediction problem, please check the README file in https://github.com/dmlc/xgboost/tree/master/jvm-packages
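To make the usual culprit concrete: Spark ML pipelines typically store feature rows as sparse vectors, where zeros are implicit, while a dense matrix scored elsewhere spells them out; that kind of mismatch is commonly behind such prediction gaps. A small hedged illustration (values are made up):

```scala
// Hedged illustration only: the same feature row shown sparse (Spark ML's
// usual representation, zeros implicit) versus dense (zeros explicit). If one
// side treats absent entries as missing while the other feeds literal zeros,
// predictions for the "same" row can differ.
import org.apache.spark.ml.linalg.Vectors

val sparseRow = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
println(sparseRow)          // (5,[1,3],[2.0,4.0])   -- zeros are implicit
println(sparseRow.toDense)  // [0.0,2.0,0.0,4.0,0.0] -- zeros are explicit
```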
What if we do not have data in libsvm format for either training or scoring? I loaded the data into a DataFrame from a Hive table and used trainWithDataframe.
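For reference, a hedged sketch of that DataFrame route, assuming the 0.7x-era XGBoost.trainWithDataFrame API and an existing SparkSession named spark; the Hive table and column names are placeholders:

```scala
// Hedged sketch, assuming the 0.7x-era xgboost4j-spark API; table and column
// names are placeholders.
import org.apache.spark.ml.feature.VectorAssembler
import ml.dmlc.xgboost4j.scala.spark.XGBoost

val raw = spark.table("my_db.training_table")   // hypothetical Hive table with a "label" column

// Assemble raw numeric columns into the "features" vector column expected
// alongside "label".
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))        // placeholder feature columns
  .setOutputCol("features")
val trainingDF = assembler.transform(raw).select("features", "label")

val params = Map("objective" -> "binary:logistic", "eta" -> 0.1, "max_depth" -> 6)
val xgboostModel = XGBoost.trainWithDataFrame(trainingDF, params, 100, 4)  // round = 100, nWorkers = 4
```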
@sgatamex how many data points do you have and how many workers did you set? Did you ever try to reduce the number of workers and check again?
Yes, I reduced the workers to 1. I have 2 million rows and 700 columns.
I will share the code in some time
Bumping this for the newest release (the refactored version published 8 days ago). Ref: #3387. @yanboliang @CodingCat I tried something like the following:
But this doesn't work, as ._boosted is private. I am looking to save the trained model for use in Python. Any advice is appreciated.
@beautifulskylfsd For XGBoost-Spark users, it doesn't make sense to expose internal variables. But I think your requirement is reasonable; what about adding a function?
I think it's reasonable to have another method for exporting an XGBoost-formatted model.
I am not that familiar with Scala, but I added the following, which seems to compile and run successfully:
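A hypothetical sketch of that kind of addition (the name saveNativeModel is made up here and was not part of xgboost4j-spark at the time, and the commenter's actual code may have looked different):

```scala
// Hypothetical sketch only; `saveNativeModel` is a made-up helper, not an
// actual xgboost4j-spark method.
import ml.dmlc.xgboost4j.scala.Booster

def saveNativeModel(booster: Booster, localPath: String): Unit = {
  // Booster.saveModel writes the plain XGBoost binary format. The file lands
  // on the local filesystem of whichever JVM executes this call (typically
  // the driver), which is relevant to the /tmp confusion described below.
  booster.saveModel(localPath)
}
```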
However, I am running into a confusing problem: I can only save to the /tmp folder, but after I run saveModel no files appear in /tmp on my master node. Nothing appears to be saved. I tried changing the directory to /home/ or somewhere else, but permission is denied. Forgive me if this is a basic question -- help is very much appreciated.
@sgatamex I am hitting the same problem. Have you fixed it yet?
I am hitting the same problem. Have you fixed it?
@DevHaufior, @sgatamex have you fixed it? I also face this problem: I loaded the data into a DataFrame from a Hive table, used ml.dmlc.xgboost4j.scala.XGBoost.train to train the model, and used model.saveModel to save the model to the local filesystem... Thanks.
A similar problem was reported in this issue, which was closed without any verification. The page cited as a reason for closing without verification claims there should be no problem, yet the claim flies in the face of multiple people having experienced the problem.
Here I'll attempt to provide specific steps to reproduce the problem based on the instructions for using XGBoost with Spark from Databricks. The steps should be reproducible in the Databricks Community Edition.
The instructions in the Scala notebook work sufficiently well for xgboostModel.save("/tmp/myXgboostModel") to generate /tmp/myXgboostModel/data and /tmp/myXgboostModel/metadata/part-00000 (and the associated _SUCCESS file) using saveModelAsHadoopFile() under the covers.
The data file (download it) is 90388 bytes in my environment and begins with ??_reg_??features??label?.
The metadata file is:
Attempting to load the model in Python with:
results in
The obvious question is whether the data file output is the same as typical model output? I can't find any info on this topic. If not, what's the correct way to read models output in Hadoop format in Python?
Environment information:
master at ed8bc4521e2967d7c6290a4be5895c10327f021a
Python build instructions: