[jvm-packages] spark hangs when training is run in quick succession #4628
Comments
Currently, parallel training for multiple models is not supported.
In this case, I am letting each training run finish, so it's only one model. However, you're almost certainly right about the cause: I think there is some "cool down" period after training each model, during which you can't train another without hanging.
@CodingCat Steps to reproduce:
Note:
@CodingCat are you able to reproduce this?
Can you try adding rabit.Shutdown() after calling […]?
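A minimal sketch of that suggestion, assuming an older xgboost4j release that still exposes the Rabit API directly (it was later replaced by the collective module); `trainingData` and the training call are placeholders, not the reporter's code:

```scala
// Sketch only: explicitly shutting down the Rabit engine after a training run.
// Assumes ml.dmlc.xgboost4j.java.Rabit is still part of the release in use,
// and that `trainingData` (an RDD[LabeledPoint]) is already in scope.
import ml.dmlc.xgboost4j.java.Rabit
import ml.dmlc.xgboost4j.scala.spark.XGBoost

val params = Map("objective" -> "binary:logistic", "eta" -> 0.1)
val model = XGBoost.trainWithRDD(trainingData, params, round = 10, nWorkers = 4)

// Hypothesis under test: releasing Rabit state here lets the next run
// re-initialize its tracker instead of hanging.
Rabit.shutdown()
```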
I got the same problem: the Spark application hangs at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:956) for more than 24 hours, even though I have only 10 training rows and set the iteration count to 2. I am not sure what it is waiting for.
I understand that this issue is about successive training runs rather than parallel ones, but @CodingCat can you give some brief context for why parallel training runs are not supported? Ideally I would like to allocate N workers to each of M parallel training calls (given a cluster with N*M available workers) to better scale tuning, since there is a sharp falloff in benefit from adding more workers to a single training job. Barring that, though, this issue seems like it could be a blocker even for purely sequential tuning, which IMO would be a pretty significant problem.

Edit: Forgot to mention that this is affecting some code of mine. The models do appear to train; the hang happens after training completes but (of course) before control returns to non-xgboost4j code. Running with verbosity=3, it is unclear what the issue could be, as Rabit seems to be tracking all the tasks as expected, without warnings or errors. I'm fairly new to the code and am not sure where to look first, but I would love to understand the issue well enough to implement a workaround, if not a proper fix.
Closing as barrier mode is now used in the spark package. Feel free to reopen if the issue persists with the latest branch.
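For context, this is a generic illustration of Spark's barrier execution mode (not the actual xgboost4j-spark code), assuming a spark-shell `sc` context: all tasks in the stage are scheduled together, which is what gang-scheduled distributed training needs.

```scala
// Generic sketch of barrier execution mode: every task in the stage starts
// together, and ctx.barrier() blocks until all tasks reach the same point.
// Note: barrier mode needs at least as many concurrent task slots as partitions.
import org.apache.spark.BarrierTaskContext

val doubled = sc
  .parallelize(1 to 100, numSlices = 4)
  .barrier()
  .mapPartitions { rows =>
    val ctx = BarrierTaskContext.get()
    ctx.barrier()       // global sync point, analogous to Rabit tracker startup
    rows.map(_ * 2)     // stand-in for per-partition training work
  }
  .collect()
```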
The same issue still happens with Spark 3.2.2 and xgboost 1.6.1:

```
$ spark-shell --master "local[1]" \
    --packages ml.dmlc:xgboost4j-spark_2.12:1.6.1,ml.dmlc:xgboost4j_2.12:1.6.1,com.typesafe.akka:akka-actor_2.12:2.5.23,com.typesafe:config:1.3.3
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.2
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.18)
```

New observation: I can confirm that it is PreXGBoost.scala:transformDataset that causes this issue. When we call show() or limit() instead of collect() or write(), the iterator is never fully consumed. @trivialfis would you mind reopening this ticket? I can work on it if you feel like it.
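A minimal sketch of that failure mode (the resource names are illustrative, not the actual PreXGBoost internals): if teardown is tied to iterator exhaustion inside mapPartitions, collect() and write() drain every row and trigger it, while show() and limit() stop early and never do.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

// Illustrative stand-ins for native setup/teardown (e.g. Rabit init/shutdown).
def acquireNativeResource(): Unit = println("acquired")
def releaseNativeResource(): Unit = println("released")

val ds = (1 to 100).toDS().mapPartitions { rows =>
  acquireNativeResource()
  new Iterator[Int] {
    def hasNext: Boolean = {
      val more = rows.hasNext
      if (!more) releaseNativeResource() // only runs if the partition is fully drained
      more
    }
    def next(): Int = rows.next()
  }
}

ds.collect() // consumes every row: "released" is printed
ds.show(5)   // stops early: hasNext never returns false, so no release
```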
@trivialfis and @wbo4958
You are correct that prediction doesn't need communication. A PR for the proposed fix would be really appreciated!
@austinzh will you put up a PR to fix it?
Sure. Let me take it.
For folks who have a similar issue:
@wbo4958 any docs or tips on setting up a development environment for xgboost4j?
@trivialfis since the PR is merged, could you help close this issue?
I am getting an infinite hang when I run the following code a few times in quick succession:
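(The exact snippet is not shown here; the sketch below is an illustrative stand-in, assuming the xgboost4j-spark XGBoostClassifier API and a DataFrame `trainDF` with "features" and "label" columns already in scope.)

```scala
// Stand-in sketch, not the reporter's original code: several training runs
// in quick succession, each of which spins up its own Rabit tracker.
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val params = Map(
  "objective"   -> "binary:logistic",
  "num_round"   -> 10,
  "num_workers" -> 2
)

(1 to 5).foreach { i =>
  val model = new XGBoostClassifier(params).fit(trainDF)
  println(s"run $i done") // without a pause between runs, a later run hangs
}
```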
Steps to reproduce:
Other information:
When the hang happens I only get the tracker message and nothing after that; I have to kill the Spark job. (If I wait between runs, they always succeed.)
My environment: