
[jvm-packages] spark hangs when training is run in quick succession #4628

Closed
thesuperzapper opened this issue Jul 2, 2019 · 16 comments

@thesuperzapper
Contributor

I am getting an infinite hang when I run the following code a few times in quick succession:

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val dataPath = "__SPARK_HOME_LOCATION__/data/mllib/sample_binary_classification_data.txt"
val data = spark.read.format("libsvm").option("vectorType", "dense").load(dataPath)
val xgbClassifier = new XGBoostClassifier()

xgbClassifier.fit(data).transform(data).show()

Steps to reproduce:

  1. Open spark-shell with the XGBoost jars.
  2. Run the above code.
  3. Quickly rerun the last line until a hang happens.

Other information:
When the hang happens I only get the tracker message below and nothing after it, and I have to kill the Spark job. (If I wait between runs, they always succeed.)

Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=XXX.XXX.XXX.XXX, DMLC_TRACKER_PORT=9096, DMLC_NUM_WORKER=2}

My environment:

  • XGBoost Master
  • Spark 2.4.3
  • (Happens in both: Zeppelin and Spark-Shell)
@CodingCat
Member

Currently, parallel training for multiple models is not supported

@thesuperzapper
Contributor Author

@CodingCat

In this case, I am letting each training finish, so it's only one model.

However, you're almost certainly right about the cause. That is, I think there is some "cool down" period after training each model, during which you can't train another without hanging.

@thesuperzapper
Contributor Author

@CodingCat
After further investigation, this hang is caused by rerunning the same training within 60 seconds of the last time you ran it.

Steps to reproduce:

  1. Download Spark 2.4.3 from here, and extract it.
  2. Download the following jars from Maven: xgboost4j-0.90.jar, xgboost4j-spark-0.90.jar, akka-actor_2.11-2.3.11.jar, and config-1.2.1.jar.
  3. Run:
./spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local[1] --jars ./xgboost4j-0.90.jar,./xgboost4j-spark-0.90.jar,./akka-actor_2.11-2.3.11.jar,./config-1.2.1.jar
  4. Run the following code:
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val dataPath = "./spark-2.4.3-bin-hadoop2.7/data/mllib/sample_binary_classification_data.txt"
val data = spark.read.format("libsvm").option("vectorType", "dense").load(dataPath)
val xgbClassifier = new XGBoostClassifier()
  5. Run the following code:
xgbClassifier.fit(data).transform(data).show()
  6. Wait less than 60 seconds.
  7. Run the following code again:
xgbClassifier.fit(data).transform(data).show()
  8. ...Crash/Hang...

Note:

  • This also happens with the Scala Rabit tracker.
  • I have tested this on multiple servers/computers running Ubuntu and RedHat.

@thesuperzapper
Contributor Author

@CodingCat are you able to reproduce this?

@chenqin
Contributor

chenqin commented Aug 16, 2019

Can you try adding Rabit.shutdown() after calling

xgbClassifier.fit(data).transform(data).show()
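
For reference, a rough sketch of one way to try that on the executors (rather than only in the driver) from spark-shell. This assumes the Java binding ml.dmlc.xgboost4j.java.Rabit is on the classpath and is untested here, so treat it as an idea rather than a verified fix; a barrier-mode variant of the same idea appears later in this thread:

import ml.dmlc.xgboost4j.java.Rabit

xgbClassifier.fit(data).transform(data).show()

// Run Rabit.shutdown() inside executor tasks to clear any worker left
// initialized by the transform above. Note: without barrier mode there is
// no guarantee that every executor slot is covered.
val sc = spark.sparkContext
sc.parallelize(0 until sc.defaultParallelism, sc.defaultParallelism)
  .foreachPartition(_ => Rabit.shutdown())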

@leafjungle

I got the same problem:

org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:956)
ml.dmlc.xgboost4j.scala.spark.XGBoost$$anon$2.run(XGBoost.scala:295)

The Spark application has been hanging here for more than 24 hours, even though I have only 10 training rows and set the number of iterations to 2. I am not sure what it is waiting for.

@a-johnston
Contributor

a-johnston commented Aug 24, 2020

I understand that this issue is regarding successive training runs rather than parallel ones, but @CodingCat can you give some brief context for why parallel training runs are not supported? Ideally I would like to allocate N workers for each of M parallel training calls (given a cluster allocated to have N*M available workers) to better scale tuning, since there's a sharp falloff in benefit from adding more workers to a single training job. Even setting that aside, this issue seems like it could be a blocker for purely sequential tuning as well, which IMO would be a pretty significant issue.

Edit: Forgot to mention that this is affecting some code of mine. It appears that the models do train; the hang occurs after training completes but (of course) before control returns to non-xgboost4j code. Running with verbosity=3, it is unclear what the issue could be, as Rabit seems to track all the tasks as expected without warnings, errors, etc. I'm fairly new to the code and am not sure where to look first, but I would love to better understand the issue so I can implement a workaround, if not a proper fix.

@trivialfis
Member

Closing as barrier mode is now used in the spark package. Feel free to reopen if the issue persists with the latest branch.
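
For anyone unfamiliar, "barrier mode" refers to Spark's barrier execution mode, where all tasks of a stage are launched together and can synchronize. A minimal, generic illustration using only the public Spark API (this is not the actual xgboost4j-spark code, just the scheduling primitive it relies on):

import org.apache.spark.BarrierTaskContext

// All four tasks are scheduled at the same time; barrier() blocks each task
// until every task in the stage has reached the same point.
val rdd = spark.sparkContext.parallelize(1 to 4, numSlices = 4)
val doubled = rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  ctx.barrier()   // global sync point across all tasks in this stage
  iter.map(_ * 2)
}.collect()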

@austinzh
Contributor

austinzh commented Apr 17, 2023

The same issue still happens with Spark 3.2.2 and XGBoost 1.6.1:

spark-shell --master "local[1]" --packages ml.dmlc:xgboost4j-spark_2.12:1.6.1,ml.dmlc:xgboost4j_2.12:1.6.1,com.typesafe.akka:akka-actor_2.12:2.5.23,com.typesafe:config:1.3.3
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.2
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.18)

New observation:
In the same test, if I replace xgbClassifier.fit(data).transform(data).show() with xgbClassifier.fit(data), then I can run it without any failure.
And when it fails, I see a similar log line every time:
[07:49:07] [0] train-logloss:0.43654189050197600

I have confirmed that PreXGBoost.scala:transformDataset causes this issue.
We shut down Rabit only when next() is called and hasNext() is false, as shown below.

        override def next(): Row = {
          val ret = batchIterImpl.next()
          if (!batchIterImpl.hasNext) {
            Rabit.shutdown()
          }
          ret
        }

But when we call show() or limit() instead of collect() or write(), the iterator never reaches hasNext() == false, so Rabit.shutdown() is never called.
This can be proven by changing
xgbClassifier.fit(data).transform(data).show()
to
xgbClassifier.fit(data).transform(data).collect()
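
To illustrate the mechanism outside of xgboost4j-spark, here is a small self-contained sketch (CleanupIterator is a made-up name for this example, not a class in the codebase): a wrapper iterator that runs its cleanup only when fully drained, and a take-style consumer (effectively what show()/limit() does per partition) that therefore never triggers it:

// Hypothetical stand-in for the pattern quoted above: cleanup fires only once
// the underlying iterator has been fully consumed.
class CleanupIterator[T](underlying: Iterator[T], cleanup: () => Unit) extends Iterator[T] {
  override def hasNext: Boolean = underlying.hasNext
  override def next(): T = {
    val ret = underlying.next()
    if (!underlying.hasNext) cleanup()   // never reached if the consumer stops early
    ret
  }
}

val it = new CleanupIterator((1 to 100).iterator, () => println("cleanup ran"))
it.take(20).foreach(_ => ())   // like show(): stops early, "cleanup ran" never prints
// it.foreach(_ => ())         // like collect(): drains everything, so cleanup runs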

@trivialfis Would you mind reopening this ticket?

I can work on it if you'd like.

@austinzh
Contributor

@trivialfis and @wbo4958
There is one more thing I am not sure about: why do we need Rabit for prediction (transform)? Rabit exists for distributed training, but for prediction the workers don't need to talk to each other, so why Rabit?

trivialfis reopened this Apr 17, 2023
@trivialfis
Member

You are correct that prediction doesn't need communication. A PR for the proposed fix would be really appreciated!

@wbo4958
Contributor

wbo4958 commented Apr 17, 2023

@austinzh will you put up a PR to fix it?

@austinzh
Contributor

Sure. Let me take it.

@austinzh
Contributor

austinzh commented Apr 17, 2023

For folks with a similar issue:
A quick fix can be achieved by running this before your training job.

    val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
    val nWorker = spark.sparkContext.defaultParallelism
    spark.range(0, nWorker).rdd.barrier.mapPartitions { x => { ml.dmlc.xgboost4j.java.Rabit.shutdown(); x } }.collect()
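
The idea is that the barrier stage forces one task onto every available task slot at the same time, so any slot still holding a Rabit worker initialized by an earlier transform gets a Rabit.shutdown() call before your next fit() starts. I haven't verified that this covers every cluster configuration, so treat it as a workaround rather than a real fix.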

@austinzh
Contributor

@wbo4958 Any docs or tips for setting up a development environment for xgboost4j?
I am using a Mac with IntelliJ or VS Code.

@wbo4958
Contributor

wbo4958 commented Apr 24, 2023

@trivialfis since the PR is merged, could you help close this issue?
