[jvm-packages] initial pyspark api (WIP) #4656
Conversation
@thesuperzapper Where is sparkxgb located? I can see one that is committed for version 0.80, but is there a newer one for 0.90?
@alibeyram You can use it in the same way as the above example, with the following change:
I would love it if people could test this and suggest changes we can make for usability.
@thesuperzapper Thank you, I was looking for that zip file. The one I had was sparkxgb 0.80, built for xgboost version 0.80.
@thesuperzapper Thank you for the awesome package. I encountered the following error when loading the booster back using `XGBoostClassificationModel.load()`:

```
/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/pyspark/ml/util.py in load(cls, path)
/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/pyspark/ml/util.py in load(self, path)
/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/pyspark/ml/wrapper.py in _from_java(java_stage)
/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/pyspark/ml/wrapper.py in __get_class(clazz)
ModuleNotFoundError: No module named 'ml'
```

but I can `import pyspark.ml` successfully without any problem. Do you have any idea why this happened? Thanks.
@debinqiu are you 100% sure you have added either the compiled jar from this repo, or the above .zip file, to your Spark job?
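Incidentally, the reason shipping the .zip (via `--py-files` or `PYTHONPATH`) works is that Python can import packages straight out of a zip archive through the built-in zipimport machinery. A self-contained sketch with a throwaway package (the package name and contents here are illustrative, not the real sparkxgb):

```python
import os
import sys
import tempfile
import zipfile

# Build a throwaway zip containing a tiny package, mimicking sparkxgb.zip
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "demo_pkg.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("demo_pkg/__init__.py", "VERSION = '0.90'\n")

# Putting the zip on sys.path is essentially what --py-files does for you
sys.path.insert(0, zip_path)

import demo_pkg  # resolved from inside the zip by zipimport
print(demo_pkg.VERSION)
```

If the archive's top-level layout is wrong (e.g. the package directory is nested one level too deep, or `__init__.py` is missing), the import fails with `ModuleNotFoundError`, which matches the symptom reported above.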
@thesuperzapper Thanks for the quick response. I actually used the 0.82 version instead of 0.90, because I only have Spark v2.3 in my environment. 0.90 requires 2.4+, right? I added an `__init__.py` to make it a Python module so that I can call it directly. I also tested using the Scala `XGBoostClassificationModel.load()` to load the booster back in my environment, which was also working for me.
@debinqiu it's quite likely that the .zip file above will work with 0.82. (The old .zip will not work properly with saving/loading.) Also, what do you mean you added an `__init__.py`?
@thesuperzapper Sorry, `__init__.py` is already in sparkxgb. So basically I just unzipped your file and added it to my `PYTHONPATH`, which makes it available as a normal Python module. Also, for a regression problem, `XGBoostRegressionModel.load()` works for me to load the booster back.
Hello @thesuperzapper. I ran the following code, as written by you, in a Jupyter Notebook.
However, I got the following error:
Can you help me figure out how to solve this problem? Thanks in advance.
@cjkini Have you solved your problem yet? I have the same problem as you!
@wei8171023 @cjkini just to check, are you using Windows or Linux? If you're using Windows, can you try changing the save/load paths to be fully qualified, like so:
If that's not the issue, can you please clarify which line of code is throwing the error (just progressively un-comment lines until you get the error, if you have to).
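One portable way to build fully qualified paths is to convert a local path into a `file://` URI, which Spark ML's save/load accepts on both Windows and Linux. A small sketch (the model path is an assumption for illustration, not from the thread):

```python
from pathlib import Path

# Convert a local filesystem path into a fully qualified file:// URI.
# On Windows this yields something like file:///C:/models/xgb, on
# Linux file:///tmp/xgboost_model - either form removes any ambiguity
# about which filesystem scheme Spark should use.
model_path = Path("/tmp/xgboost_model")
model_uri = model_path.as_uri()
print(model_uri)
```

The resulting URI can then be passed directly to `model.save(...)` and the corresponding `load(...)` call.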
@thesuperzapper This works just fine:
But if I try to set the `lambda_` parameter:
I get an error:
Note that this is not the same error that I get if I make up a fake parameter name:
As long as I don't try to set `lambda_`, everything works.
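For context, `lambda` is a reserved word in Python, which is presumably why the wrapper exposes the XGBoost parameter as `lambda_`. A sketch of how such aliases are typically mapped back before reaching the JVM side (the `_PARAM_ALIASES` table and helper are hypothetical, not the PR's actual code):

```python
import keyword

# est.set(lambda=1.0) is a SyntaxError because 'lambda' is a keyword,
# so Python wrappers conventionally append a trailing underscore (PEP 8).
assert keyword.iskeyword("lambda")

# Hypothetical alias table: Python-safe name -> real XGBoost parameter name
_PARAM_ALIASES = {"lambda_": "lambda"}

def to_xgboost_params(**kwargs):
    """Translate underscore-suffixed aliases back to XGBoost's own names."""
    return {_PARAM_ALIASES.get(name, name): value for name, value in kwargs.items()}

params = to_xgboost_params(eta=0.1, lambda_=1.0)
print(params)
```

An error on `lambda_` specifically (but a different error for a made-up name, as reported above) suggests the alias exists in the Python layer but the translation back to `lambda` is incomplete somewhere along the chain.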
Hi. By the way, when we build this Python code with the latest xgboost-spark, we see errors at runtime. It also seems that this branch is using Scala 2.11.12 and Java 1.7, while master seems to be using Scala 2.12.x and Java 1.8. I am guessing the error we are seeing is due to these version mismatches. Below is the error message we are observing while running our code: Traceback (most recent call last):
@thesuperzapper Thank you so much for this PySpark API! Would it be possible to make the evalSetsMap / eval_sets parameter available through the API? I'd like to include early stopping evaluated on a particular dataframe, but that doesn't seem to be currently possible, since the variable is protected in the underlying Scala implementation. Thanks again!
Hi guys, I am going to pick this back up again this week. I have been moving countries over the last month, so I have been a bit busy; I want to get this ready for 1.0 RC1. The main things left to do are:
Note:
`class XGboostEstimator(JavaEstimator, XGBoostReadable, JavaMLWritable, ParamGettersSetters):`
We also need specific `setXXXParam` methods. You could reference the code here (create a shared base class named `_XgboostParams`):
https://github.com/apache/spark/blob/8cf76f8d61b393bb3abd9780421b978e98db8cae/python/pyspark/ml/tree.py#L63
@WeichenXu123 This code automatically generates the getters/setters from the parameter list. Is there something I am missing?
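As a rough illustration of the approach described above, here is one way getters/setters can be synthesized from a parameter name list. All names here (`ParamGettersSetters`, `DemoEstimator`, the params) are hypothetical stand-ins, not the PR's actual implementation:

```python
class ParamGettersSetters:
    """Mixin that synthesizes getX/setX methods from a `params` name list."""

    def _create_param_accessors(self):
        for name in self.params:
            cap = name[0].upper() + name[1:]
            # Bind `name` via a default argument so each closure keeps
            # its own parameter instead of the loop's last value.
            setattr(self, "get" + cap,
                    lambda n=name: getattr(self, "_" + n))

            def setter(value, n=name):
                setattr(self, "_" + n, value)
                return self  # return self to allow chaining, Spark ML style

            setattr(self, "set" + cap, setter)


class DemoEstimator(ParamGettersSetters):
    params = ["eta", "maxDepth"]

    def __init__(self):
        self._eta, self._maxDepth = 0.3, 6
        self._create_param_accessors()


est = DemoEstimator().setEta(0.1).setMaxDepth(4)
print(est.getEta(), est.getMaxDepth())
```

One trade-off of generating accessors at runtime like this is discoverability: IDEs and linters cannot see the methods, which is one reason the Spark codebase linked above spells out explicit setters in a shared base class instead.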
@thesuperzapper please rebase onto the latest master and the lint error should be gone. Also, are there any updates on the ETA?
Hi @thesuperzapper, thanks for providing such a great library. Is there any update on this PR? I'd be glad to help with this library if you are busy these days. Some updates on this library: it has been included in rapidsai/xgboost. I cherry-picked this commit, fixed one or two minor bugs, and added GPU support to it: https://github.com/rapidsai/xgboost/commits/rapids-spark/jvm-packages/xgboost4j-spark/src/main/resources. I have also added some examples to demonstrate this Python library: https://github.com/rapidsai/spark-examples/tree/master/examples/apps/python/ai/rapids/spark/examples.
With 1.0 now out, are there any plans to start incorporating this in the main XGBoost codebase? Or any updates in this commit for 1.0 rather than using 0.90? |
@thesuperzapper Any updates? If you're busy, do you mind if I take over this PR? Thanks!
It seems the author is not responding, so I will take this over. @CodingCat, if you have any concerns or suggestions, let me know! :)
@WeichenXu123 Just saw this. It might be easier if we work together (or at least have a chat on Zoom first), as there are only a few things left that I just never found the time to do.
Hi, when I include "maxBins" as one of the parameters:

```python
paramMap = {"evalMetric": "auc", "maxBins": ...}
xgb = xgb_est(**paramMap)
```

I received an error saying `AttributeError: 'XGBoostClassifier' object has no attribute 'maxBins'`. Thanks.
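For what it's worth, in xgboost4j-spark `maxBins` is (as far as I know) only meaningful when `treeMethod` is set to `hist`, so a wrapper may simply not expose it in every configuration. One defensive pattern is to split a param map into names the estimator actually exposes and names it does not, before constructing it. This is a sketch with a hypothetical stub class, not the wrapper's real behaviour:

```python
class XGBoostClassifierStub:
    """Hypothetical estimator exposing only a subset of XGBoost parameters."""
    evalMetric = None
    treeMethod = None


def split_params(estimator, param_map):
    """Partition a param map into supported vs. unknown parameter names."""
    supported, unknown = {}, {}
    for name, value in param_map.items():
        target = supported if hasattr(estimator, name) else unknown
        target[name] = value
    return supported, unknown


supported, unknown = split_params(
    XGBoostClassifierStub(),
    {"evalMetric": "auc", "maxBins": 16},
)
print(sorted(unknown))
```

Filtering this way turns a hard `AttributeError` at construction time into an explicit report of which parameters the wrapper does not recognize.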
@thesuperzapper I know this is very late, but can you let me know how you have generated the .zip archive from the source code, i.e. the pyspark-xgboost_0.90_261ab52e07bec461c711d209b70428ab481db470.zip? I am trying to generate a zip archive for my spark job using your code but slightly modified, but it seems like I am doing it incorrectly since Spark is complaining that the 'ml' module is not found. |
Thank you for the work and for joining the design discussion! |
This is a cleaned-up version of the PySpark API, which previously had the PR #3376.

There are numerous improvements:

- The Python code now lives in `xgboost4j-spark/src/main/resources`, which means the Python code will end up in the .jar file.

Here is an example. (Don't forget to spark-submit the `xgboost4j.jar` and `xgboost4j-spark.jar`.)

Still to do:
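The submission step mentioned above might look roughly like this; the jar and archive file names are assumptions for illustration, not the exact artifacts produced by this PR:

```shell
# Ship the JVM jars and the Python wrapper archive alongside the job
spark-submit \
  --jars xgboost4j.jar,xgboost4j-spark.jar \
  --py-files sparkxgb.zip \
  train_with_xgboost.py
```

`--jars` puts the XGBoost JVM code on the executor classpath, while `--py-files` distributes the Python wrapper so `import sparkxgb` resolves on every worker.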