LightGBM stuck at "reduce at LightGBMClassifier.scala:150" #1053
Comments
👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.
Hi @OldDreamHunter, sorry about the trouble you are having. Have you tried increasing the socket timeout?
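In case it helps, a minimal sketch of what that might look like, assuming the PySpark wrapper from the mmlspark 1.0.0-rc1 package mentioned below and its `timeout` parameter; the import path, column names, and value chosen here are assumptions, not the maintainer's exact recommendation:

```python
# Sketch only: raise LightGBM's socket timeout so slow executors are not
# dropped during the distributed dataset setup phase.
from mmlspark.lightgbm import LightGBMClassifier  # assumed import path for 1.0.0-rc1

lgb = LightGBMClassifier(
    labelCol="label",        # assumed label column name
    featuresCol="features",  # assumed features column name
    timeout=3600.0,          # assumed: socket timeout in seconds, raised well above the default
)
model = lgb.fit(train_df)    # train_df: a prepared Spark DataFrame (assumed)
```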
Thanks for your reply @imatiach-msft. I haven't increased the socket timeout yet and will try it. The parameters of my model: lgb = LightGBMClassifier(boostingType='gbdt',
Hi @imatiach-msft, I have increased the timeout and changed the parallelism type to "voting_parallel", but the job still failed at "reduce at LightGBMBase.scala:230" with the failure reason "Job aborted due to stage failure: Task 8 in stage 4.0 failed 4 times, most recent failure: Lost task 8.3 in stage 4.0 (TID 6027, pro-dchadoop-195-81, executor 22): java.net.ConnectException: Connection refused (Connection refused)".
@OldDreamHunter I think that is a red herring; the real error is on one of the other nodes. Can you send all of the unique task error messages? Please ignore the connection refused error.
You can also try setting useBarrierExecutionMode=True; I think it might give a better error message.
I would only use voting_parallel if you have a high number of features; see the guide.
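Putting the last two suggestions together, a hedged sketch assuming the `useBarrierExecutionMode` and `parallelism` parameter names exposed by the PySpark wrapper:

```python
# Sketch only: barrier execution mode schedules all training tasks together and
# tends to surface the first real failure; voting parallel reduces network
# traffic but is only recommended when the feature count is high.
from mmlspark.lightgbm import LightGBMClassifier

lgb = LightGBMClassifier(
    useBarrierExecutionMode=True,    # fail the whole stage together for a clearer error
    parallelism="voting_parallel",   # assumed value; the default is "data_parallel"
)
```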
Same problem.
Same problem. Voting parallel works fine, but accuracy is very low; much data is skipped.
@Simon-LLong sorry about the problems you are encountering. Indeed, voting parallel can give lower accuracy, but with a much better speedup and lower memory usage. Can you also please try the new single dataset mode? In performance testing we saw a big speedup with the new single dataset mode and numThreads set to the number of cores minus one, as well as lower memory usage. For more information on the new single dataset mode, please see the PR description. This new mode was created after extensive internal benchmarking. I also have some ideas on how a streaming mode could be added to distributed LightGBM, where data is streamed into the native histogram-binned representation; that representation should take only a small fraction of the memory of the full Spark dataset. It might be a little slower to set up, but it should vastly reduce memory usage. This is something I will be looking into in the near future.
numThreads (int) – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores. |
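A hedged sketch of that suggestion, assuming `useSingleDatasetMode` and `numThreads` are available in the newer release the maintainer refers to; the executor core count here is a placeholder:

```python
# Sketch only: single dataset mode builds one shared LightGBM dataset per
# executor, and numThreads lets LightGBM use the cores itself (num cores - 1,
# leaving one core free, per the suggestion above).
from mmlspark.lightgbm import LightGBMClassifier

executor_cores = 8  # placeholder; match your spark.executor.cores setting

lgb = LightGBMClassifier(
    useSingleDatasetMode=True,       # assumed parameter from the newer releases
    numThreads=executor_cores - 1,   # "num cores - 1" as suggested above
)
```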
I have already seen issue #542, but the answer there does not solve my problem.
I have a dataset of nearly 72 GB with 145 columns. My Spark config is:
spark-submit \
  --master yarn \
  --deploy-mode client \
  --executor-memory 15g \
  --driver-memory 15g \
  --executor-cores 1 \
  --num-executors 20 \
  --packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 \
  --conf spark.default.parallelism=5000 \
  --conf spark.sql.shuffle.partitions=5000 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.memory.storageFraction=0.3 \
  --conf spark.executor.memoryOverhead=15g \
  --conf spark.driver.maxResultSize=10g \
If I reduce the dataset size to 24 GB, I can train the model in 40 minutes. But if I increase the dataset to 72 GB, the training process gets stuck at "reduce at LightGBMClassifier.scala:150" and reports failures such as "ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 128370 ms", "java.lang.Exception: Dataset create call failed in LightGBM with error: Socket recv error, code: 104", and "java.net.ConnectException: Connection refused".
AB#1188553