[jvm-packages] PySpark Support Checklist #3370
Would you please send a WIP version of the PR while you are doing the work, so that we can discuss any ongoing problems if necessary?
@CodingCat there are a few other non-essential things I think we should support.
I think you can pass in an OutputStream created by FileSystem.create(); then you can work with an HDFS cluster: https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j/src/main/java/ml/dmlc/xgboost4j/java/Booster.java#L338
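The idea above, saving a model through a stream rather than a path, decouples serialization from the storage backend. A minimal Python sketch of the pattern (the function name is hypothetical, and `io.BytesIO` stands in for an HDFS output stream):

```python
import io

def save_model_to_stream(model_bytes, stream):
    # Write the serialized model to any writable binary stream; the caller
    # decides whether the stream is backed by a local file, an in-memory
    # buffer, or a distributed file system client.
    return stream.write(model_bytes)

buf = io.BytesIO()  # stands in for a stream from FileSystem.create()
n = save_model_to_stream(b"\x00binary-model-bytes", buf)
```

Because the saver only sees a stream, supporting a new storage backend requires no change to the serialization code itself.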
It's a good suggestion, but I am not sure it is really worth the effort to deal with this complexity; using the file system to exchange models is good enough for me.
It's resolved in the ongoing refactoring work of xgboost4j-spark (#3313).
Noticed it; do you know if it happens in other APIs?
Does https://github.com/dmlc/xgboost/pull/2710/files work for you?
@CodingCat sorry about the massive delay, I have a bit more time to get #3376 rebased now. In terms of saving the API-interchangeable model to a Hadoop location, I think this should be implemented in the new XGBoostClassifier/XGBoostRegressor Scala objects (#3313), and the PySpark wrapper should call down to that method. In terms of the Float.NaN issue, I find that missing values crash all Spark-based XGBoost 0.72 APIs (but specifying Float.PositiveInfinity as the missing value works fine, if you fill your nulls with that); it seems to be the presence of Float.NaN in training which causes the crash, rather than what you specify as the missing value. For early stopping, #2710 addresses it by wrapping the Spark estimator in a new Scala object, which drops support for things like Pipelines and ParamGridBuilder. I would prefer to add an early stopping feature to the new XGBoostClassifier/XGBoostRegressor Scala objects.
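The workaround described above, replacing Float.NaN with positive infinity before training and then passing infinity as the missing-value marker, can be sketched in plain Python (the helper name is hypothetical; this is a preprocessing sketch, not part of the XGBoost API):

```python
import math

def replace_nan(rows, marker=float("inf")):
    # Swap every NaN for an explicit marker value; the marker can then be
    # passed as the missing-value parameter so NaN never reaches training.
    return [[marker if (isinstance(v, float) and math.isnan(v)) else v
             for v in row]
            for row in rows]

rows = [[1.0, float("nan")], [float("nan"), 2.0]]
cleaned = replace_nan(rows)
# Note that NaN compares unequal to everything, including itself, which is
# one reason it is easy to mishandle when values cross language boundaries.
```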
If you are talking about the PySpark API rather than the Python API, you can directly use the MLlib model persistence APIs to read models, as we have implemented the MLWritable interface (https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostRegressor.scala#L310-L311).
Added the other two things to feature requests.
About NaN leading to a JVM crash: I found https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24760 in the Spark JIRA; I think it might be relevant.
@CodingCat for the saving thing, I meant the case where you want to save a model that you trained in Spark and read it back into Python/R; this situation would benefit dramatically if I could call saveToHadoop(...) on the model object. For the NaN crash, this happens in the base XGBoost Spark Scala API (at least in 0.72); the fact that it happens in my PySpark wrapper is a side effect of this.
@thesuperzapper with the current code, you can persist a model in xgboost4j-spark with the MLlib model persistence API and read it with the coming PySpark API; then with the loaded model you can do anything with model.booster. I do not see a problem here, even if we implement the functionality to support HDFS paths in xgboost/jvm-packages/xgboost4j/src/main/java/ml/dmlc/xgboost4j/java/Booster.java (line 338 in 6bed54a).
For NaN, yes, the link I posted is about Pandas not working with NaN + JVM either. I suspect there is something wrong with NaN when the JVM interacts with other programming languages; maybe it gets transformed into something weird in the native layer.
@CodingCat I think we might not be understanding each other. For the model saving, you have to use ... For the NaN, this happens even if you just use the Scala API (no Python anywhere in the chain), so to fix the bug we need to get it working in the Scala API.
Regarding NaN: I am not talking about Python; I am talking about cross-language conversion, that is, from Scala to native.
With the same piece of code, you can call the counterpart of line 215 in 2200939.
@CodingCat I think I was forgetting that the Scala API had been rewritten; I will need to alter my wrapper for it to work with 0.8.
@thesuperzapper any update on this?
@CodingCat, sorry about the delay; I haven't started the rewrite yet, partially because I was waiting for stability in the API, and partially because I have been extremely busy. I will give this a further look this week. Are there any docs yet for the new API structure?
@thesuperzapper no problem, the new API structure is very simple: it only contains the standard Spark MLlib fit and transform interface. All the other configuration, like transforming leaves, etc., is enabled by setting a column name. We have a tutorial in progress in CodingCat#4.
Any update on this?
@CodingCat yeah, the issue I keep running into is allowing support for pipeline persistence. In my initial PR, I supported this by creating my own pipeline object, but I don't like that solution as it's messy and leaves way too much code to support in the future. As of Spark 2.3, there is DefaultParamsWritable, and I think there must be a way to get it working with the default pipeline object. However, it is unlikely that a pipeline written in Python will be readable in the Scala API.
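The appeal of DefaultParamsWritable is that a stage's param map is dumped as plain JSON metadata, so no per-estimator save/load code is needed. A dependency-free sketch of that round-trip (function names are hypothetical illustrations, not the pyspark API):

```python
import json

def write_params(params):
    # DefaultParamsWritable-style persistence: serialize the param map as
    # JSON metadata so the stage can be rebuilt without custom save code.
    return json.dumps({"paramMap": params}, sort_keys=True)

def read_params(metadata):
    # Reconstruct the param map from the saved metadata.
    return json.loads(metadata)["paramMap"]

saved = write_params({"eta": 0.1, "maxDepth": 6})
restored = read_params(saved)
```

Because the metadata is just JSON, any stage whose state lives entirely in its params can be persisted generically; stages with extra state still need custom writers.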
@thesuperzapper are you still actively working on this? @yanboliang can take over or help you if you are busy with other stuff.
@CodingCat yeah, a few weeks ago I put up a development build for 0.8 in issue #1698. I really just need people to test it and tell me what I missed and what's not working. Additionally, I don't want to introduce the XGBoostPipeline object; if possible I would really like to use the new Python Spark 2.3 DefaultParamsWritable API, but I haven't seen any reference implementation for that yet. The only serious issue I keep running into is that classification models won't load back after being saved, giving the error: TypeError: 'JavaPackage' object is not callable. Strangely, however, XGBoostPipelineModel works just fine with an XGBoost classification stage, which leads me to think it is an issue on my end. Can someone verify whether reading classification models works for them? I will update this PR (#3376) with that dev build, but it still needs some work.
I intend to use this code with an incremental optimizer which has to run on just one machine, or at least a limited number of them (e.g. Tree Parzen Estimator). This means I have to wrap XGBoostClassifier/Regressor into another Estimator that randomizes params according to uniform distributions. This system lets us converge a lot faster than with GridSearch. Another thing: we use multiple environments, and I see that the serialization code for the model differs between Python and XGBoost; is there planned interoperability through this PR?
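The wrapper described above, an estimator driver that samples params from uniform distributions instead of walking a grid, can be sketched as a small randomized search (all names are hypothetical, and a toy objective stands in for cross-validated training):

```python
import random

def sample_params(space, rng):
    # Draw one configuration: each param gets an independent uniform draw
    # from its (low, high) range, instead of a fixed grid point.
    return {name: rng.uniform(low, high) for name, (low, high) in space.items()}

def random_search(evaluate, space, n_trials=20, seed=0):
    # Keep the best-scoring configuration seen across n_trials random draws.
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_params(space, rng)
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective peaking near eta ~ 0.3 and max_depth ~ 6; in practice this
# would be a cross-validated fit of the wrapped XGBoost estimator.
space = {"eta": (0.01, 1.0), "max_depth": (2.0, 10.0)}
best, score = random_search(
    lambda p: -abs(p["eta"] - 0.3) - 0.1 * abs(p["max_depth"] - 6.0),
    space)
```

Random search covers a continuous space with a fixed trial budget, which is why it tends to find good regions faster than an equally sized grid; a Tree Parzen Estimator goes further by biasing later draws toward promising regions.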
@thesuperzapper I am using
I have tried accessing the below with no luck.
@CodingCat has the above issue about information gain been resolved in version 1.0, which you mentioned in the other thread? Thank you in advance.
@thesuperzapper Are you still working on the wrapper? |
This new PR is where we are working on it: #4656
@trivialfis @wbo4958 I have updated the TODOs based on my comment in #7578 (comment)
Closing as the initial support is merged. Thank you to everyone who has participated in the discussion! |
Overview:
This is a meta issue for implementing PySpark support.
Related PRs:
TODO:
- `sparkxgb` (which is pretty widely used at this point), `spark-xgboost`, `pyspark-xgboost` or `pyspark-xgb`
- `run_tests.py` file under `./jvm-packages/xgboost4j-spark/src/main/resources/sparkxgb/`
- `spark-submit` command to `./xgboost/blob/master/tests/ci_build/test_jvm_cross.sh#L38`
- `./doc/jvm/`