[jvm-packages] Models saved using xgboost4j-spark cannot be loaded in Python xgboost #2480
@CodingCat reading your docs on using XGBoost with Spark, I noticed that you stay within the MLlib environment. That works well for offline work but doesn't address online prediction scenarios. Have you had success loading XGBoost models built with Spark in other XGBoost libraries?
A saved XGBoostModel can only be read within XGBoost-Spark, but if you call XGBoostModel.booster().save(), the output will be usable by other modules.
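For concreteness, a minimal hedged sketch of that export path, assuming an already-trained 0.7x-era xgboost4j-spark XGBoostModel named xgboostModel; the output path is a placeholder:

```scala
// Minimal sketch: `xgboostModel` is an already-trained 0.7x-era
// xgboost4j-spark XGBoostModel (training omitted). xgboostModel.save(...)
// writes the Spark/MLlib-specific format; saving the underlying native
// booster instead produces a plain XGBoost binary file that the Python, R,
// and CLI bindings can read.
val exportPath = "/tmp/xgb-native.model"   // placeholder local path
xgboostModel.booster.saveModel(exportPath)
```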
@ssimeonov please see #2265 and #2115. Regarding online prediction, do you mean streaming retraining or only low-latency evaluation?
@CodingCat thanks.
@geoHeil I mean low-latency evaluation.
@ssimeonov maybe https://www.slideshare.net/GeorgHeiler/machine-learning-model-to-production from the Hadoop User Group Vienna is interesting. For xgb in particular, see https://github.com/komiya-atsushi/xgboost-predictor-java. In general there is a trade-off between using a different (fast) code base for one-off predictions vs. batch training.
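For the low-latency case, a hedged sketch of single-row scoring with plain xgboost4j (no Spark), assuming a booster exported as above; the model path, feature values, and dimensions are placeholders:

```scala
// Hedged sketch of low-latency, single-instance evaluation with plain
// xgboost4j. The path and feature values are made up for illustration.
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

val booster = XGBoost.loadModel("/tmp/xgb-native.model")

// One dense row with 3 features; DMatrix(data, nrow, ncol) takes row-major floats.
val features = Array(0.5f, 1.2f, 0.0f)
val single = new DMatrix(features, 1, 3)

val prediction = booster.predict(single)   // Array[Array[Float]], one row here
println(prediction(0)(0))
```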
Hi, even when I used xgboostModel.booster.saveModel(), model loading still failed, throwing an "Unknown gbm type" error. Please help.
@ssimeonov xgboostModel.booster.saveModel("/tmp/xgbm") succeeds. However, even though Python's booster loads successfully, the probability predicted by the Spark booster is not the same as the probability predicted by the Python booster, even on the same instance. Are you facing this issue?
@devhaufer, yeah, this is the biggest concern: our business partners are using the standalone version and we are using the Spark distributed version, and during validation they find more than a 30% gap in predicted probability for the same instance of data.
For anyone facing the inconsistent prediction problem, please check the README file in https://github.com/dmlc/xgboost/tree/master/jvm-packages
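To make the usual culprit concrete: Spark ML pipelines typically store feature rows as sparse vectors, where zeros are implicit, while a dense matrix scored elsewhere spells them out; that kind of mismatch is commonly behind such prediction gaps. A small hedged illustration (values are made up):

```scala
// Hedged illustration only: the same feature row shown sparse (Spark ML's
// usual representation, zeros implicit) versus dense (zeros explicit). If one
// side treats absent entries as missing while the other feeds literal zeros,
// predictions for the "same" row can differ.
import org.apache.spark.ml.linalg.Vectors

val sparseRow = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
println(sparseRow)          // (5,[1,3],[2.0,4.0])   -- zeros are implicit
println(sparseRow.toDense)  // [0.0,2.0,0.0,4.0,0.0] -- zeros are explicit
```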
What if we do not have data in libsvm format for either training or scoring? I loaded the data into a DataFrame from a Hive table and used trainWithDataframe.
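For reference, a hedged sketch of that DataFrame route, assuming the 0.7x-era XGBoost.trainWithDataFrame API and an existing SparkSession named spark; the Hive table and column names are placeholders:

```scala
// Hedged sketch, assuming the 0.7x-era xgboost4j-spark API; table and column
// names are placeholders.
import org.apache.spark.ml.feature.VectorAssembler
import ml.dmlc.xgboost4j.scala.spark.XGBoost

val raw = spark.table("my_db.training_table")   // hypothetical Hive table with a "label" column

// Assemble raw numeric columns into the "features" vector column expected
// alongside "label".
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))        // placeholder feature columns
  .setOutputCol("features")
val trainingDF = assembler.transform(raw).select("features", "label")

val params = Map("objective" -> "binary:logistic", "eta" -> 0.1, "max_depth" -> 6)
val xgboostModel = XGBoost.trainWithDataFrame(trainingDF, params, 100, 4)  // round = 100, nWorkers = 4
```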
@sgatamex how many data points do you have and how many workers did you set? Did you ever try to reduce the number of workers and check again?
Yes, I reduced the workers to 1. I have 2 million rows and 700 columns.
I will share the code in some time
Bumping this for the newest release (the refactored version published 8 days ago). Ref: #3387. @yanboliang @CodingCat I tried something like the following:
But this doesn't work, as ._boosted is private. I am looking to save the trained model for use in Python. Any advice is appreciated.
@beautifulskylfsd For XGBoost-Spark users, it doesn't make sense to expose internal variables. But I think your requirement is reasonable; what about adding a function?
I think it's reasonable to have another method for exporting an XGBoost-formatted model.
I am not that familiar with Scala, but I added the following, which seems to compile and run successfully:
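A hypothetical sketch of that kind of addition (the name saveNativeModel is made up here and was not part of xgboost4j-spark at the time, and the commenter's actual code may have looked different):

```scala
// Hypothetical sketch only; `saveNativeModel` is a made-up helper, not an
// actual xgboost4j-spark method.
import ml.dmlc.xgboost4j.scala.Booster

def saveNativeModel(booster: Booster, localPath: String): Unit = {
  // Booster.saveModel writes the plain XGBoost binary format. The file lands
  // on the local filesystem of whichever JVM executes this call (typically
  // the driver), which is relevant to the /tmp confusion described below.
  booster.saveModel(localPath)
}
```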
However, I am running into a confusing problem: I can only save to the /tmp folder, but after I run saveModel no files appear in /tmp on my master node. Nothing appears to be saved. I tried changing the directory to /home/ or somewhere else, but permission is denied. Forgive me if this is a basic question -- help is very much appreciated.
@sgatamex I am hitting the same problem. Have you fixed it yet?
I am hitting the same problem. Have you fixed it?
@DevHaufior, @sgatamex have you fixed it? I also face this problem: I loaded the data into a DataFrame from a Hive table, used ml.dmlc.xgboost4j.scala.XGBoost.train to train the model, and used model.saveModel to save the model to the local filesystem... Thanks.
A similar problem was reported in this issue, which was closed without any verification. The page cited as a reason for closing without verification claims there should be no problem, yet the claim flies in the face of multiple people having experienced the problem.
Here I'll attempt to provide specific steps to reproduce the problem based on the instructions for using XGBoost with Spark from Databricks. The steps should be reproducible in the Databricks Community Edition.
The instructions in the Scala notebook work sufficiently well for xgboostModel.save("/tmp/myXgboostModel") to generate /tmp/myXgboostModel/data and /tmp/myXgboostModel/metadata/part-00000 (and the associated _SUCCESS file) using saveModelAsHadoopFile() under the covers.
The data file (download it) is 90388 bytes in my environment and begins with ??_reg_??features??label?.
The metadata file is:
Attempting to load the model in Python with:
results in
The obvious question is whether the data file output is the same as typical model output? I can't find any info on this topic. If not, what's the correct way to read models output in Hadoop format in Python?
Environment information:
master at ed8bc4521e2967d7c6290a4be5895c10327f021a
Python build instructions: