PySpark XGBoost End to End Script #5177

Closed
MrPhilosopher opened this issue Jan 1, 2020 · 13 comments

@MrPhilosopher

Hi Everyone.

I'm looking for a complete end-to-end script to run XGBoost in my PySpark environment.

The script I'm using at the moment is the one referenced by @thesuperzapper in #4656.

I'm using XGBoost 0.90, Spark 2.4.4, and a Jupyter notebook. I've already added the classpath entries for both jar files (downloaded from https://mvnrepository.com/artifact/ml.dmlc/xgboost4j/0.90 and https://mvnrepository.com/artifact/ml.dmlc/xgboost4j-spark/0.90) by placing them in pyspark/jars/.

But I'm still getting an error when running the training code below. Please help; I'm new to XGBoost and PySpark, so I could be missing something small. Also, my eventual goal is to deploy a decision tree, so please let me know whether that's possible with XGBoost on PySpark.

CODE:

import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
import pyspark
from pyspark.sql.types import *
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

spark.sparkContext.addPyFile("xgboost4j-spark-0.90.jar")
spark.sparkContext.addPyFile("xgboost4j-0.90.jar")
spark.sparkContext.addPyFile("pyspark-xgboost_0.90_261ab52e07bec461c711d209b70428ab481db470.zip")
from sparkxgb import XGBoostClassifier, XGBoostClassificationModel

dataPath = "sample_binary_classification_data.txt"
data = spark.read.format("libsvm").option("vectorType", "dense").load(dataPath)
dataSplit = data.randomSplit([0.8, 0.2], seed = 1000)
dataTrain = dataSplit[0]
dataTest = dataSplit[1]

# Define the model
paramMap = { "eta": 0.1, "maxDepth": 2, "objective": "binary:logistic", "numRound": 5, "numWorkers": 2 }
xgbClassifier = XGBoostClassifier(**paramMap) \
    .setFeaturesCol("features") \
    .setLabelCol("label")

xgboostModel = xgbClassifier.fit(dataTrain)

ERROR BELOW

Py4JJavaError: An error occurred while calling o34.fit.
: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:582)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:459)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:435)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:434)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:194)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:44)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

@chuanlihao

Hi @MrPhilosopher,

I was able to run a sample application successfully. I am also using Spark 2.4.4 and the same standard libraries; to keep things simple, no Jupyter notebook is involved.

I submitted my application to a standalone Spark cluster with one master and one worker node: spark-submit --jars xgboost4j-0.90.jar,xgboost4j-spark-0.90.jar --py-files pyspark-xgboost_0.90_261ab52e07bec461c711d209b70428ab481db470.zip agaricus.py

Both the data file agaricus.csv and the program file agaricus.py can be found in agaricus.zip.

Please note that I had to add the parameter "missing": 0.0 to make my program run.

So you could try the following:

  1. add the parameter "missing": 0.0 to your program
  2. try running agaricus.py with the data file agaricus.csv (without Jupyter)
  3. if both 1 and 2 fail, please share your data file and more error logs (driver logs and executor logs) for further debugging

The agaricus.py is also listed here:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col 
from pyspark.sql.types import *
from pyspark.ml.feature import VectorAssembler
from ml.dmlc.xgboost4j.scala.spark import *

label = "label"
features = [ "feature_" + str(i) for i in range(0, 126) ]
schema = StructType([ StructField(x, FloatType()) for x in [label] + features ])

df = (SparkSession
    .builder
    .getOrCreate()
    .read
    .schema(schema)
    .csv("agaricus.csv"))

df = (VectorAssembler()
    .setInputCols(features)
    .setOutputCol("features")
    .transform(df)
    .select("features", label))

paramMap = {
    "missing": 0.0,
    "eta": 0.1,
    "maxDepth": 2,
    "objective": "binary:logistic",
    "numRound": 5,
    "numWorkers": 2,
}
xgbClassifier = (XGBoostClassifier(**paramMap)
    .setFeaturesCol("features")
    .setLabelCol("label"))

xgboostModel = xgbClassifier.fit(df)
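
If the fit succeeds, a quick sanity check on the fitted model could look like this (a minimal sketch; it assumes this wrapper exposes the standard Spark ML output columns "probability" and "prediction"):

# Score the training data with the fitted model and inspect a few rows
predictions = xgboostModel.transform(df)
predictions.select("label", "probability", "prediction").show(5)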

@MrPhilosopher
Author

Hi @chuanlihao
Thanks for the detailed answer. As I'm a beginner, I need a bit more elaboration from you to run this script successfully.

Could you please answer the following:

  1. Where do you want me to add "missing": 0.0?
  2. Could you tell me the procedure for submitting a .py file with spark-submit?

Thanks.

@chuanlihao

Just found the "/PATH/TO/SPARK_HOME/data/mllib/sample_binary_classification_data.txt" file.

I don't have a Jupyter notebook environment at the moment, but I could run your application successfully with both "spark-submit --jars xgboost4j-0.90.jar,xgboost4j-spark-0.90.jar --py-files pyspark-xgboost_0.90_261ab52e07bec461c711d209b70428ab481db470.zip sample.py" and "python sample.py".

To run it with "spark-submit", there's no need to add any jars or Python packages inside the program, since they are passed on the command line. To run it with "python sample.py", I first installed pyspark with "pip install pyspark", and I also updated the program a bit:

  1. removed spark.sparkContext.addPyFile("xgboost4j-spark-0.90.jar") and spark.sparkContext.addPyFile("xgboost4j-0.90.jar"), because these jars are not Python packages
  2. instead, added these jars via spark = SparkSession.builder.config("spark.jars", "xgboost4j-0.90.jar,xgboost4j-spark-0.90.jar").getOrCreate()

Below is the updated program:

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.jars", "xgboost4j-0.90.jar,xgboost4j-spark-0.90.jar").getOrCreate()

import pyspark
from pyspark.sql.types import *
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

# spark.sparkContext.addPyFile("xgboost4j-spark-0.90.jar")
# spark.sparkContext.addPyFile("xgboost4j-0.90.jar")
spark.sparkContext.addPyFile("pyspark-xgboost_0.90_261ab52e07bec461c711d209b70428ab481db470.zip")
from sparkxgb import XGBoostClassifier, XGBoostClassificationModel

dataPath = "sample_binary_classification_data.txt"
data = spark.read.format("libsvm").option("vectorType", "dense").load(dataPath)
dataSplit = data.randomSplit([0.8, 0.2], seed = 1000)
dataTrain = dataSplit[0]
dataTest = dataSplit[1]

dataTrain.show(1)

# Define the model
paramMap = { "eta": 0.1, "maxDepth": 2, "objective": "binary:logistic", "numRound": 5, "numWorkers": 2 }
xgbClassifier = XGBoostClassifier(**paramMap).setFeaturesCol("features").setLabelCol("label")

xgboostModel = xgbClassifier.fit(dataTrain)
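
Once the fit succeeds, the held-out dataTest split could be used for evaluation. A minimal sketch, assuming the model exposes the standard Spark ML "rawPrediction" column:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Score the held-out split and measure area under the ROC curve
predictions = xgboostModel.transform(dataTest)
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label")
print("Test AUC:", evaluator.evaluate(predictions))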

@chuanlihao

> Hi @chuanlihao
> Thanks for the detailed answer. As I'm a beginner, I need a bit more elaboration from you to run this script successfully.
>
> Could you please answer the following:
>
> 1. Where do you want me to add **`"missing": 0.0`**?
>
> 2. Could you tell me the procedure for submitting a .py file with spark-submit?
>
> Thanks.

Just to answer your questions:

  1. The "missing": 0.0 should be added to paramMap = { "eta": 0.1, "maxDepth": 2, ... }. I just ran your program with sample_binary_classification_data.txt, it seems that there is no need to add the parameter for this dataset.
  2. spark-submit is a command to submit applications to Spark clusters. It's located at "$SPARK_HOME/bin/spark-submit". You need to start a cluster first before running "spark-submit": https://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually
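
To make point 1 concrete, the paramMap with the extra entry would look like this (taken from the agaricus.py listing above):

# Same paramMap as before, with the "missing" parameter added
paramMap = {
    "missing": 0.0,
    "eta": 0.1,
    "maxDepth": 2,
    "objective": "binary:logistic",
    "numRound": 5,
    "numWorkers": 2,
}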

@MrPhilosopher
Author

MrPhilosopher commented Jan 2, 2020

@chuanlihao Thanks for the quick responses.

So I used the code you edited and it still throws the same error. Please see the complete console logs below, in case you can spot the issue.

LOGS BELOW

(base) C:\Users\hafiz.qaiser50\Desktop>python xgBoostPySpark_new.py
20/01/02 17:13:54 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2823)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2818)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2684)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.spark.deploy.DependencyUtils$.org$apache$spark$deploy$DependencyUtils$$resolveGlobPath(DependencyUtils.scala:191)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:147)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:145)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:145)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$3.apply(SparkSubmit.scala:343)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$3.apply(SparkSubmit.scala:343)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:343)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/01/02 17:13:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/01/02 17:13:56 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/01/02 17:13:57 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
+-----+--------------------+
|label| features|
+-----+--------------------+
| 0.0|[0.0,0.0,0.0,0.0,...|
+-----+--------------------+
only showing top 1 row

20/01/02 17:14:00 WARN XGBoostSpark: train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.0.75.1, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=2}
20/01/02 17:14:13 WARN XGBoostSpark: train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
20/01/02 17:14:13 ERROR NativeLibLoader: failed to load xgboost4j library from jar
20/01/02 17:14:13 ERROR DMatrix: Failed to load native library
java.io.FileNotFoundException: File /lib/xgboost4j.dll was not found inside JAR.
at ml.dmlc.xgboost4j.java.NativeLibLoader.createTempFileFromResource(NativeLibLoader.java:126)
at ml.dmlc.xgboost4j.java.NativeLibLoader.loadLibraryFromJar(NativeLibLoader.java:69)
at ml.dmlc.xgboost4j.java.NativeLibLoader.initXGBoost(NativeLibLoader.java:41)
at ml.dmlc.xgboost4j.java.XGBoostJNI.<clinit>(XGBoostJNI.java:34)
at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:53)
at ml.dmlc.xgboost4j.scala.DMatrix.<init>(DMatrix.scala:42)
at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:675)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:344)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:343)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/01/02 17:14:13 WARN BlockManager: Block rdd_26_1 could not be removed as it was not found on disk or in memory
20/01/02 17:14:13 WARN BlockManager: Block rdd_26_0 could not be removed as it was not found on disk or in memory
20/01/02 17:14:13 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 6)
java.lang.ExceptionInInitializerError
at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:53)
at ml.dmlc.xgboost4j.scala.DMatrix.<init>(DMatrix.scala:42)
at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:675)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:344)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:343)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: File /lib/xgboost4j.dll was not found inside JAR.
at ml.dmlc.xgboost4j.java.XGBoostJNI.<clinit>(XGBoostJNI.java:37)
... 26 more
Caused by: java.io.FileNotFoundException: File /lib/xgboost4j.dll was not found inside JAR.
at ml.dmlc.xgboost4j.java.NativeLibLoader.createTempFileFromResource(NativeLibLoader.java:126)
at ml.dmlc.xgboost4j.java.NativeLibLoader.loadLibraryFromJar(NativeLibLoader.java:69)
at ml.dmlc.xgboost4j.java.NativeLibLoader.initXGBoost(NativeLibLoader.java:41)
at ml.dmlc.xgboost4j.java.XGBoostJNI.<clinit>(XGBoostJNI.java:34)
... 26 more
20/01/02 17:14:13 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI
at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:53)
at ml.dmlc.xgboost4j.scala.DMatrix.<init>(DMatrix.scala:42)
at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:675)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:344)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:343)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/01/02 17:14:13 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, localhost, executor driver): java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI
at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:53)
at ml.dmlc.xgboost4j.scala.DMatrix.<init>(DMatrix.scala:42)
at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:675)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:344)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:343)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

20/01/02 17:14:13 ERROR TaskSetManager: Task 0 in stage 5.0 failed 1 times; aborting job
20/01/02 17:14:13 WARN TaskSetManager: Lost task 1.0 in stage 5.0 (TID 6, localhost, executor driver): java.lang.ExceptionInInitializerError
at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:53)
at ml.dmlc.xgboost4j.scala.DMatrix.<init>(DMatrix.scala:42)
at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:675)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:344)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:343)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: File /lib/xgboost4j.dll was not found inside JAR.
at ml.dmlc.xgboost4j.java.XGBoostJNI.<clinit>(XGBoostJNI.java:37)
... 26 more
Caused by: java.io.FileNotFoundException: File /lib/xgboost4j.dll was not found inside JAR.
at ml.dmlc.xgboost4j.java.NativeLibLoader.createTempFileFromResource(NativeLibLoader.java:126)
at ml.dmlc.xgboost4j.java.NativeLibLoader.loadLibraryFromJar(NativeLibLoader.java:69)
at ml.dmlc.xgboost4j.java.NativeLibLoader.initXGBoost(NativeLibLoader.java:41)
at ml.dmlc.xgboost4j.java.XGBoostJNI.<clinit>(XGBoostJNI.java:34)
... 26 more

20/01/02 17:14:13 ERROR RabitTracker: Uncaught exception thrown by worker:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5, localhost, executor driver): java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI
at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:53)
at ml.dmlc.xgboost4j.scala.DMatrix.<init>(DMatrix.scala:42)
at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:675)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:344)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:343)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:933)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:933)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2$$anon$1.run(XGBoost.scala:452)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI
at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:53)
at ml.dmlc.xgboost4j.scala.DMatrix.<init>(DMatrix.scala:42)
at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:675)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:344)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:343)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Traceback (most recent call last):
File "xgBoostPySpark_new.py", line 51, in
xgboostModel = xgbClassifier.fit(dataTrain)
File "C:\Users\hafiz.qaiser50\AppData\Local\Continuum\anaconda3\lib\site-packages\pyspark\ml\base.py", line 132, in fit
return self._fit(dataset)
File "C:\Users\hafiz.qaiser50\AppData\Local\Continuum\anaconda3\lib\site-packages\pyspark\ml\wrapper.py", line 295, in _fit
java_model = self._fit_java(dataset)
File "C:\Users\hafiz.qaiser50\AppData\Local\Continuum\anaconda3\lib\site-packages\pyspark\ml\wrapper.py", line 292, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "C:\Users\hafiz.qaiser50\AppData\Local\Continuum\anaconda3\lib\site-packages\py4j\java_gateway.py", line 1257, in call
answer, self.gateway_client, self.target_id, self.name)
File "C:\Users\hafiz.qaiser50\AppData\Local\Continuum\anaconda3\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\Users\hafiz.qaiser50\AppData\Local\Continuum\anaconda3\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o39.fit.
: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:582)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:459)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:435)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:434)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:194)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:44)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

@MrPhilosopher
Author

Ohh, I'm using Windows; is that why it's saying ERROR NativeLibLoader: failed to load xgboost4j library from jar?
I've downloaded the jar from Maven.

@chuanlihao

This is the error:
20/01/02 17:14:13 ERROR NativeLibLoader: failed to load xgboost4j library from jar
20/01/02 17:14:13 ERROR DMatrix: Failed to load native library
java.io.FileNotFoundException: File /lib/xgboost4j.dll was not found inside JAR.

It seems the xgboost4j-0.90.jar downloaded from Maven only supports Linux and macOS. See the discussion in #1807, in particular the latest comments from June 2018.
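
For anyone hitting this on Windows, a quick way to check whether a given xgboost4j jar actually ships a Windows native library is to list its contents, since a jar is just a zip archive. A minimal sketch using Python's standard zipfile module:

import zipfile

# List native libraries bundled under lib/ inside the jar;
# if there is no .dll entry, the jar cannot work on Windows.
with zipfile.ZipFile("xgboost4j-0.90.jar") as jar:
    print([n for n in jar.namelist() if n.startswith("lib/")])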

@MrPhilosopher
Author

So after replacing my xgboost4j-0.90 with xgboost4j-0.90-criteo-20190702_2.11-win64.jar, I got another error. Can you please look into it?

ERROR BELOW

20/01/03 10:00:13 WARN XGBoostSpark: train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
[Stage 5:> (0 + 2) / 2]
[0] train-error:0.000000
[0] train-error:0.000000
[1] train-error:0.000000
[1] train-error:0.000000
[2] train-error:0.000000
[2] train-error:0.000000
[3] train-error:0.000000
[3] train-error:0.000000
[4] train-error:0.000000
[4] train-error:0.000000
20/01/03 10:00:14 ERROR RabitTracker: Uncaught exception thrown by worker:
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:206)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:222)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:157)
at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:243)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:728)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:933)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:933)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2$$anon$1.run(XGBoost.scala:452)
Traceback (most recent call last):
File "xgBoostPySpark_new.py", line 51, in
xgboostModel = xgbClassifier.fit(dataTrain)
File "C:\Users\hafiz.qaiser50\AppData\Local\Continuum\anaconda3\lib\site-packages\pyspark\ml\base.py", line 132, in fit
return self._fit(dataset)
File "C:\Users\hafiz.qaiser50\AppData\Local\Continuum\anaconda3\lib\site-packages\pyspark\ml\wrapper.py", line 295, in _fit
java_model = self._fit_java(dataset)
File "C:\Users\hafiz.qaiser50\AppData\Local\Continuum\anaconda3\lib\site-packages\pyspark\ml\wrapper.py", line 292, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "C:\Users\hafiz.qaiser50\AppData\Local\Continuum\anaconda3\lib\site-packages\py4j\java_gateway.py", line 1257, in call
answer, self.gateway_client, self.target_id, self.name)
File "C:\Users\hafiz.qaiser50\AppData\Local\Continuum\anaconda3\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\Users\hafiz.qaiser50\AppData\Local\Continuum\anaconda3\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o39.fit.
: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:582)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:459)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:435)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:434)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:194)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:44)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

@trivialfis
Member

Until there's effort put into pushing the PR forward, I don't think we can answer any questions about it.

@d901971

d901971 commented Aug 21, 2020 via email

@thesuperzapper
Contributor

Yeah, sorry about abandoning the PR. (BTW, the latest code in the PR is actually newer than the latest zip file I uploaded.)

The main thing that stopped me from getting it done was setting up release infrastructure for a Python package.
I think the best way to go in the short term is embedding it in the jar file (like the PR does right now).

@trivialfis
Member

Can we help with the release infrastructure?

@thesuperzapper
Contributor

@trivialfis probably, but I am still not sure of the best way to distribute the Python package.

The main question is whether we should make a separate package on PyPI called sparkxgb, or just embed the package into the Spark jar.

I prefer the second option; other packages like graphframes already do this. It lets people run commands like these, which automatically prepend the Python package to their PYTHONPATH:

  • pyspark --packages XXX:XXX:XXX
  • spark-submit --jars XXX.jar --py-files XXX.jar
