
[jvm-packages] PySpark Support Checklist #3370

Closed

thesuperzapper opened this issue Jun 8, 2018 · 30 comments

@thesuperzapper
Contributor

thesuperzapper commented Jun 8, 2018

Overview:

This is a meta issue for implementing PySpark support.

Related PRs:

TODO:

@CodingCat
Member

would you please send a WIP version of the PR while you are doing the work, so that we can discuss any problems along the way if necessary?

@thesuperzapper
Contributor Author

@CodingCat there are a few other non-essential things I think we should support.

  • We should fix saveAsHadoopFile() in the base Spark XGBoost API, so that there is at least rudimentary interoperability with the base Python XGBoost API. (See: [jvm-packages] Models saved using xgboost4j-spark cannot be loaded in Python xgboost #2480.)

    • While you can use .booster.saveModel("xx"), this does not support HDFS locations, so it simply does not work in yarn-cluster mode.

  • In a similar vein, I think we should implement some sort of toPython() method, which returns a Python API version of models/estimators, allowing us to use the visualization tools in that API. This would allow for an awesome pipeline where large data is used to train models on Spark, and those models are then visualized in base Python.

  • We should allow rawPredictionCol to be set on the base Spark XGBoost Estimator object. (Currently it can only be set on classification model objects.)

    • While this makes sense, as that estimator object could generate a regression model, it forces after-the-fact parameter injection if I want to rename the probabilities column (e.g. for BinaryClassificationEvaluator).

  • Dealing with null values is a bit of a pain, as using Float.NaN as the "missing" parameter seems to cause crashes. (Hilariously, Float.PositiveInfinity works fine, which is what I have been using; see the sketch after this list.)

  • We need to implement early stopping, or something to that effect. We could either implement the XGBoost object from the Scala API, or extend the XGBoostEstimator Scala object with some early stopping features.
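
A minimal sketch of that Float.PositiveInfinity workaround, assuming the pre-0.8 XGBoostEstimator API (rawDf is a hypothetical DataFrame, and the "missing" parameter key is my assumption about the era's parameter map):

import ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator

// Replace nulls/NaNs in the numeric feature columns with +Inf *before* assembling
// them into a vector, then tell XGBoost that +Inf is the "missing" sentinel.
val cleaned = rawDf.na.fill(Double.PositiveInfinity)
val estimator = new XGBoostEstimator(Map(
  "objective" -> "binary:logistic",
  "missing" -> Float.PositiveInfinity
))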

@CodingCat
Member

@thesuperzapper

We should fix saveAsHadoopFile....While you can use .booster.saveModel("xx")

I think you can pass in an OutputStream created by FileSystem.create(); then you can work with an HDFS cluster: https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j/src/main/java/ml/dmlc/xgboost4j/java/Booster.java#L338
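
As a minimal sketch of that approach (the path is illustrative, and model is assumed to be a trained XGBoost Spark model):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Write the booster through an HDFS OutputStream instead of a local path,
// using the Booster.saveModel(OutputStream) overload linked above.
val fs = FileSystem.get(new Configuration())
val out = fs.create(new Path("hdfs:///models/xgb.model"))
try model.booster.saveModel(out) finally out.close()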

...toPython....

it's a good suggestion, but I am not sure it is really worth much effort to deal with this complexity; using the file system to exchange models is good enough for me

...rawPredictionCol...

it's resolved in the ongoing refactoring work of xgboost4j-spark, #3313

...Dealing with null values is a bit of a pain, as using Float.NaN as the "missing" parameter seems to cause crashes. (Hilariously, Float.PositiveInfinity works fine, which is what I have been using.)

noticed it; do you know if it happens in other APIs?

...early stop...

does https://github.com/dmlc/xgboost/pull/2710/files work for you?

@thesuperzapper
Contributor Author

@CodingCat sorry about the massive delay; I have a bit more time now to get #3376 rebased.

In terms of saving the API-interchangeable model to a Hadoop location, I think this should be implemented in the new XGBoostClassifier/XGBoostRegressor Scala objects (#3313), and the PySpark wrapper should call down to that method.

In terms of the Float.NaN issue, I find that missing values crash all Spark-based XGBoost 0.72 APIs. (Specifying Float.PositiveInfinity as the missing value works fine, if you fill your nulls with that.) It seems to be the presence of Float.NaN in training that causes the crash, rather than what you specify as the missing value.

For early stopping, #2710 addresses it by wrapping the Spark estimator in a new Scala object, which drops support for things like Pipelines and ParamGridBuilder. I would prefer to add an early stopping feature to the new XGBoostClassifier/XGBoostRegressor Scala objects.
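
To make the proposal concrete, a sketch of what estimator-level early stopping could look like on the new XGBoostClassifier (the early-stopping parameter key below is hypothetical, not a merged API):

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val clf = new XGBoostClassifier(Map(
  "objective" -> "binary:logistic",
  "num_round" -> 200,
  "eval_metric" -> "auc",
  // hypothetical: stop if the eval metric has not improved for 10 rounds
  "num_early_stopping_rounds" -> 10
))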

@CodingCat
Member

In terms of saving the API-interchangeable model to a Hadoop location, I think this should be implemented in the new XGBoostClassifier/XGBoostRegressor Scala objects (#3313), and the PySpark wrapper should call down to that method.

if you are talking about the PySpark API rather than the pure Python API, you can directly use the MLlib model persistence APIs to read models, as we have implemented the MLWritable interface (https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostRegressor.scala#L310-L311)
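
For example, a minimal round trip with the 0.8-era API (the path is illustrative; model is an already-trained XGBoostClassificationModel):

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel

// Save with the standard MLlib persistence API...
model.write.overwrite().save("hdfs:///models/xgb-classifier")

// ...and read it back; a pyspark wrapper would call the counterpart of this load.
val loaded = XGBoostClassificationModel.load("hdfs:///models/xgb-classifier")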

@hcho3 mentioned this issue Jul 8, 2018
@CodingCat
Member

added the other two things to feature requests

@CodingCat
Member

about NaN leading to the JVM crash, I found https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24760 in the Spark JIRA... I think it might be relevant

@thesuperzapper
Contributor Author

@CodingCat for the saving thing, I meant the case where you want to save a model you trained in Spark and read it back into Python/R; this situation would benefit dramatically if I could call saveToHadoop() on the model object.

For the NaN crash, this happens in the base XGBoost Spark Scala API (at least in 0.72); the fact that it happens in my PySpark wrapper is a side effect of that.

@CodingCat
Member

@thesuperzapper with the current code, you can persist a model in xgboost4j-spark with the MLlib model persistence API and read it with the coming PySpark API; then, with the read model, you can do anything with model.booster

I do not see a problem here....

even if we implement the functionality to support an HDFS path in

public void saveModel(OutputStream out) throws XGBoostError, IOException {

are you sure you can read it with the pure Python API (not the PySpark API) directly?

@CodingCat
Member

for NaN, yes, the link I posted is about Pandas not being able to work with NaN + JVM either

I suspect there is something wrong with NaN when the JVM interacts with other programming languages; maybe it gets transformed into something weird in the native layer

@thesuperzapper
Contributor Author

@CodingCat I think we might not be understanding each other.

For model saving, you have to use .booster.saveModel(); if you use any other model-saving method you get XGBoostError: b'[16:31:29] src/gbm/gbm.cc:20: Unknown gbm type ' (see: #2480)

For the NaN issue, this happens even if you just use the Scala API (no Python anywhere in the chain), so to fix the bug we need to get it working in the Scala API.

@CodingCat
Member

CodingCat commented Jul 9, 2018

@thesuperzapper

  1. what you need is booster.saveModel("an HDFS path here"), right?

  2. my question is: even if you have the functionality in 1, are you sure the current xgboost supports loadModel("an HDFS path here")? (this is a question)

  3. on the other side, I think we need to be clear about how we want to establish the interchange: if you want to do spark <-> python/R, then we need to ensure a yes answer to question 2... if you want to do scala spark <-> pyspark <-> python/R (this is what I understand from you), we can do it through the already-existing MLlib persistence API

@CodingCat
Member

regarding NaN... I am not talking about Python, I am talking about cross-language conversion, from Scala to native

@thesuperzapper
Contributor Author

thesuperzapper commented Jul 11, 2018

@CodingCat

  1. Yeah, that would be good, or just fix the already-existing .saveModelAsHadoopFile() to create a model readable by the Python/R APIs.

  2. I am not sure, but I doubt that .loadModel supports HDFS. We could, however, implement the complement of the above, a .readModelFromHadoopFile() method, in the Spark API.

  3. I really don't know how you can use the existing MLlib persistence API to get a trained model into Python/R, can you explain?

  4. The NaN thing: yeah, it's definitely related to the Scala/C interaction, as the error it throws is a C one.

@CodingCat
Member

  1. looks like even loadModel does not support HDFS paths, but you can pass in an InputStream built via an HDFS FileSystem instance

check https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifier.scala#L418-L432
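
A minimal sketch of that InputStream-based load (the path is illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import ml.dmlc.xgboost4j.scala.XGBoost

// Open the model through the Hadoop FileSystem API rather than a local path.
val fs = FileSystem.get(new Configuration())
val in = fs.open(new Path("hdfs:///models/xgb.model"))
val booster = try XGBoost.loadModel(in) finally in.close()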

I really don't know how you can use the existing MLlib persistence API to get a trained model into Python/R, can you explain?

with the same piece of code, you can call the counterpart of XGBoostClassificationModel.load(....) in your pyspark API to read a model; then, with the counterpart of the nativeBooster accessor in the pyspark API, you can get the booster in Python... this booster is identical to the one in the Python API, so you can do whatever you want with the pure Python API
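
On the Scala side, the flow looks roughly like this (the pyspark wrapper would mirror each call; paths are illustrative):

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel

val model = XGBoostClassificationModel.load("hdfs:///models/xgb-classifier")
val booster = model.nativeBooster     // the same Booster object the pure Python API uses
booster.saveModel("/tmp/xgb.model")   // now loadable from the plain Python/R APIs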

@thesuperzapper
Contributor Author

@CodingCat I think I was forgetting that the Scala API has been rewritten; I will need to alter my wrapper for it to work with 0.8.

@CodingCat
Member

@thesuperzapper any update on this?

@thesuperzapper
Contributor Author

@CodingCat, sorry about the delay; I haven't started the rewrite yet, as I was partially waiting for stability in the API and partially extremely busy. I will give this a further look this week.

Are there any docs yet for the new API structure?

@CodingCat
Member

@thesuperzapper no problem,

the new API structure is very simple: it only contains the standard Spark MLlib transform and fit interfaces

all the other configurations, like transform leaf, etc., are enabled by setting a column name

we have an ongoing tutorial in CodingCat#4
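
A minimal sketch of that column-based configuration, assuming the 0.8-era API (the leaf-prediction setter name is my assumption; train and the column names are illustrative):

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val clf = new XGBoostClassifier(Map("objective" -> "binary:logistic", "num_round" -> 10))
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setLeafPredictionCol("predictLeaf")  // extra output enabled just by naming a column
val model = clf.fit(train)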

@CodingCat
Member

any update on this?

@thesuperzapper
Contributor Author

@CodingCat yeah, the issue I keep running into is supporting pipeline persistence. In my initial PR, I supported this by creating my own pipeline object, but I don't like that solution as it's messy and leaves way too much code to support in the future.

As of Spark 2.3, there is DefaultParamsWritable, and I think there must be a way to get that working with the default pipeline object. However, it is unlikely that a pipeline written in Python will be readable in the Scala API.

@CodingCat
Member

@thesuperzapper are you still active working on this?

@yanboliang can take over or help you if you are busy with other stuff

@thesuperzapper
Contributor Author

thesuperzapper commented Sep 30, 2018

@CodingCat yeah, a few weeks ago I put up a development build for 0.8 in issue #1698.

I really just need people to test it and tell me what I missed and what's not working. Additionally, I don't want to introduce the XGBoostPipeline object; if possible I would really like to use the new Python Spark 2.3 DefaultParamsWritable API, but I haven't seen any reference implementation for it yet.

The only serious issue I keep running into is that classification models won't load back after being saved, giving the error: TypeError: 'JavaPackage' object is not callable. However, strangely, XGBoostPipelineModel works just fine with an XGBoost classification stage. This leads me to think it is an issue on my end; can someone verify whether reading classification models works for them?

I will update the PR (#3376) with that dev build, but it still needs some work.

@Igosuki

Igosuki commented Nov 18, 2018

I intend to use this code with an incremental optimizer which has to run on just one machine, or at least a limited number of them (e.g. a Tree Parzen Estimator). This means I have to wrap XGBoostClassifier/Regressor in another Estimator that randomizes params according to uniform distributions. This system allows us to converge a lot faster than with GridSearch.
It's just the search space that changes; AUC-based cross-validation is still there.
Can this solution be easily implemented by hacking the Py4J wrappers that will come out of this work?

Another thing: we use multiple environments, and I see that the serialization code for the model is different between Python and XGBoost. Is interoperability planned through this PR?

@ecampana

ecampana commented Jan 8, 2019

@thesuperzapper I am using xgboost4j-spark-0.80.jar along with xgboost4j-0.80.jar, as well as the wrapper functions in sparkxgb.zip, which is compatible with version 0.80 of XGBoost. I am able to make predictions after fitting the data with a list of features and a target value. I would now like to extract the information gain of each feature, but I have not managed to figure out a way to obtain these results. Would someone be able to point me in the right direction? Much appreciated!

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from sparkxgb import XGBoostClassifier  # wrapper classes shipped in sparkxgb.zip

prediction = ['prediction']

features = ['feature_1', 'feature_2', 'feature_3', 'feature_4']

vectorAssembler = VectorAssembler().setInputCols(features).setOutputCol('features')

xgboost = XGBoostClassifier(
    featuresCol='features',
    labelCol='label',
    predictionCol='prediction'
)

pipeline = Pipeline().setStages([vectorAssembler, xgboost])

model = pipeline.fit(train)

results = model.transform(test).select(features + prediction)

results.select('prediction').distinct().collect()
[Row(prediction=0.0), Row(prediction=1.0)]

I have tried accessing the following, with no luck.

print(model.stages[1].nativeBooster.getFeatureScore.__doc__)

Help on class Booster in package ml.dmlc.xgboost4j.scala:

Booster implements scala.Serializable, com.esotericsoftware.kryo.KryoSerializable {
|  
|  Methods defined here:
|  
|  getFeatureScore(String) : Map
|  
|  ------------------------------------------------------------
|  Fields defined here:
|  
|  ------------------------------------------------------------
|  Internal classes defined here:
|  
}

@ecampana

@CodingCat has the issue mentioned above about information gain been resolved in version 1.0, which you mentioned in the other thread? Thank you in advance.

@petergtam

@thesuperzapper Are you still working on the wrapper?

@thesuperzapper
Contributor Author

This new PR is where we are working on it: #4656

@thesuperzapper
Contributor Author

@trivialfis @wbo4958 I have updated the TODOs based on my comment in #7578 (comment)

@trivialfis
Member

Closing as the initial support is merged. Thank you to everyone who has participated in the discussion!
