PySpark XGBoost End to End Script #5177
Comments
Hi @MrPhilosopher, I could run a sample application successfully. I am also using Spark 2.4.4 and these standard libraries. To make it simple, jupyter-notebook is not involved. I submitted my application to a standalone Spark cluster with one master and one worker node: Both the data file Please note, I have to add parameter So I think maybe you could try below solutions:
The
|
Hi @chuanlihao Can you please answer the following.
Thanks. |
Just found the "/PATH/TO/SPARK_HOME/data/mllib/sample_binary_classification_data.txt" file. I don't have a Jupyter notebook environment currently, but I could run your application successfully with both "spark-submit --jars xgboost4j-0.90.jar,xgboost4j-spark-0.90.jar --py-files pyspark-xgboost_0.90_261ab52e07bec461c711d209b70428ab481db470.zip sample.py" and "python sample.py". To run it with "spark-submit", there is no need to add any jars or Python packages. To run it with "python sample.py", I first installed pyspark with "pip install pyspark", and I also updated the program a bit:
Below is the updated program:
|
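The spark-submit invocation quoted in the comment above passes the two jars and the Python wrapper zip as comma-separated lists. As a minimal sketch (the helper name `build_spark_submit_cmd` is hypothetical, not part of Spark or XGBoost), the same command line can be assembled programmatically, which makes it easier to see which file goes to `--jars` and which to `--py-files`:

```python
import shlex


def build_spark_submit_cmd(script, jars=(), py_files=()):
    """Assemble a spark-submit argument list.

    --jars and --py-files each take a single comma-separated value.
    """
    cmd = ["spark-submit"]
    if jars:
        cmd += ["--jars", ",".join(jars)]
    if py_files:
        cmd += ["--py-files", ",".join(py_files)]
    cmd.append(script)
    return cmd


# File names are the ones mentioned in this thread.
cmd = build_spark_submit_cmd(
    "sample.py",
    jars=["xgboost4j-0.90.jar", "xgboost4j-spark-0.90.jar"],
    py_files=["pyspark-xgboost_0.90_261ab52e07bec461c711d209b70428ab481db470.zip"],
)

# Render the list as a shell-safe string for copy and paste.
command_line = " ".join(shlex.quote(arg) for arg in cmd)
print(command_line)
```

The list form can be handed to `subprocess.run(cmd)` directly, assuming `spark-submit` is on the PATH.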
Just to answer your questions:
|
@chuanlihao Thanks for the quick responses. I used the code you edited and it still throws the same error. Please see the complete console logs below, in case you can find the issue.
LOGS BELOW
(base) C:\Users\hafiz.qaiser50\Desktop>python xgBoostPySpark_new.py
20/01/02 17:14:00 WARN XGBoostSpark: train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
20/01/02 17:14:13 ERROR TaskSetManager: Task 0 in stage 5.0 failed 1 times; aborting job
20/01/02 17:14:13 ERROR RabitTracker: Uncaught exception thrown by worker:
Driver stacktrace:
Traceback (most recent call last): |
Oh, I'm using Windows. Is that why it says "ERROR NativeLibLoader: failed to load xgboost4j library from jar"? |
This is the error: It seems the xgboost4j-0.90.jar downloaded from maven only supports Linux & MacOS. Here is the discussion #1807 and please focus on the latest comments in June 2018. |
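One way to check whether a given jar bundles a native library for your platform is to list its entries, since a jar is an ordinary zip archive. The following is a rough sketch, not an official XGBoost tool: the function name `native_platforms_in_jar` is hypothetical, and mapping file extensions to platforms is an assumption (it does not check CPU architecture or the exact path the loader expects).

```python
import zipfile

# Assumed mapping from shared-library extension to platform.
NATIVE_SUFFIXES = {
    ".so": "Linux",
    ".dylib": "macOS",
    ".jnilib": "macOS",
    ".dll": "Windows",
}


def native_platforms_in_jar(jar_path):
    """Return the set of platforms for which the jar contains a native library."""
    platforms = set()
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            lowered = name.lower()
            for suffix, platform in NATIVE_SUFFIXES.items():
                if lowered.endswith(suffix):
                    platforms.add(platform)
    return platforms
```

For example, `native_platforms_in_jar("xgboost4j-0.90.jar")` returning a set without `"Windows"` would be consistent with the NativeLibLoader error reported above.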
So after replacing my xgboost4j-0.90 jar with xgboost4j-0.90-criteo-20190702_2.11-win64.jar I got another error. Can you please look into it?
ERROR BELOW
20/01/03 10:00:13 WARN XGBoostSpark: train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names' |
Until there's effort on pushing the PR I don't think we can answer any question about it. |
Hi Jia Ming, I discovered the issue. It is just a typo. The name should be 'maxBin', but the wrapper shows it as 'maxBins'.
Regards
On Tuesday, 18 August 2020, 02:50:41 am GMT+8, Jiaming Yuan <notifications@github.com> wrote:
Until there's effort on pushing the PR I don't think we can answer any question about it.
|
Yeah, sorry guys about abandoning the PR. (BTW, the latest code in the PR is actually newer than the latest zip file I uploaded.) The main thing that stopped me from getting it done was getting release infrastructure set up for a Python package. |
Can we help on the release infrastructure? |
@trivialfis probably, but I am still not sure of the best way to distribute the Python package. The main question is are we going to make a separate package in PyPI called I prefer the second case, and other packages like graphframes already do it, this lets people run commands like this which automatically prepend the Python package into their PYTHONPATH:
|
Hi Everyone.
I'm looking for a complete end-to-end script to run XGBoost in my PySpark environment.
The script I'm using at the moment is the one referenced by @thesuperzapper here: #4656
I'm using XGBoost 0.90,
Spark 2.4.4,
and jupyter-notebook.
I already added both jar files (downloaded from https://mvnrepository.com/artifact/ml.dmlc/xgboost4j/0.90 & https://mvnrepository.com/artifact/ml.dmlc/xgboost4j-spark/0.90 ) to the classpath by pasting them into pyspark/jars/,
but I still get an error at the training step when running the following code.
Please help. I'm new to XGBoost and PySpark, so I could be missing small things. My main goal is eventually to deploy a decision tree, so please also let me know whether it's possible to deploy a decision tree with XGBoost on PySpark.
CODE:
import os
from pyspark.sql import SparkSession
spark= SparkSession.builder.getOrCreate()
import pyspark
from pyspark.sql.types import *
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import col
spark.sparkContext.addPyFile("xgboost4j-spark-0.90.jar")
spark.sparkContext.addPyFile("xgboost4j-0.90.jar")
spark.sparkContext.addPyFile("pyspark-xgboost_0.90_261ab52e07bec461c711d209b70428ab481db470.zip")
from sparkxgb import XGBoostClassifier, XGBoostClassificationModel
dataPath = "sample_binary_classification_data.txt"
data = spark.read.format("libsvm").option("vectorType", "dense").load(dataPath)
dataSplit = data.randomSplit([0.8, 0.2], seed = 1000)
dataTrain = dataSplit[0]
dataTest = dataSplit[1]
# Define the model
paramMap = { "eta": 0.1, "maxDepth": 2, "objective": "binary:logistic", "numRound": 5, "numWorkers": 2 }
xgbClassifier = XGBoostClassifier(**paramMap) \
    .setFeaturesCol("features") \
    .setLabelCol("label")
xgboostModel = xgbClassifier.fit(dataTrain)
ERROR BELOW
Py4JJavaError: An error occurred while calling o34.fit.
: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:582)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:459)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:435)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:434)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:194)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:44)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)