LightGBM stuck at "reduce at LightGBMClassifier.scala:150" #1053
Comments
👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.
Hi @OldDreamHunter, sorry about the trouble you are having. Have you tried increasing the socket timeout?
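In case it helps, a minimal sketch of what that might look like, assuming the PySpark wrapper from the mmlspark 1.0.0-rc1 package mentioned below and its `timeout` parameter; the import path, column names, and value chosen here are assumptions, not the maintainer's exact recommendation:

```python
# Sketch only: raise LightGBM's socket timeout so slow executors are not
# dropped during the distributed dataset setup phase.
from mmlspark.lightgbm import LightGBMClassifier  # assumed import path for 1.0.0-rc1

lgb = LightGBMClassifier(
    labelCol="label",        # assumed label column name
    featuresCol="features",  # assumed features column name
    timeout=3600.0,          # assumed: socket timeout in seconds, raised well above the default
)
model = lgb.fit(train_df)    # train_df: a prepared Spark DataFrame (assumed)
```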
Thanks for your reply @imatiach-msft. I haven't increased the socket timeout yet and will try it. The parameters of my model: lgb = LightGBMClassifier(boostingType='gbdt',
Hi @imatiach-msft, I have increased the timeout and changed the parallelism type to "voting_parallel", but the job still failed at "reduce at LightGBMBase.scala:230" with the failure reason "Job aborted due to stage failure: Task 8 in stage 4.0 failed 4 times, most recent failure: Lost task 8.3 in stage 4.0 (TID 6027, pro-dchadoop-195-81, executor 22): java.net.ConnectException: Connection refused (Connection refused)".
@OldDreamHunter I think that is a red herring; the real error is on one of the other nodes. Can you send all of the unique task error messages? Please ignore the connection refused error.
You can also try setting useBarrierExecutionMode=True; I think it might give a better error message.
I would only use voting_parallel if you have a high number of features; see the guide.
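Putting the last two suggestions together, a hedged sketch assuming the `useBarrierExecutionMode` and `parallelism` parameter names exposed by the PySpark wrapper:

```python
# Sketch only: barrier execution mode schedules all training tasks together and
# tends to surface the first real failure; voting parallel reduces network
# traffic but is only recommended when the feature count is high.
from mmlspark.lightgbm import LightGBMClassifier

lgb = LightGBMClassifier(
    useBarrierExecutionMode=True,    # fail the whole stage together for a clearer error
    parallelism="voting_parallel",   # assumed value; the default is "data_parallel"
)
```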
Same problem.
Same problem. Voting parallel works fine, but accuracy is very low; much data is skipped.
@Simon-LLong sorry about the problems you are encountering. Indeed, voting parallel can give lower accuracy, but with a much better speedup and lower memory usage. Can you also please try the new single dataset mode? In performance testing we saw a big speedup with the new single dataset mode and numThreads set to the number of cores minus one, as well as lower memory usage. For more information on the new single dataset mode, please see the PR description. This new mode was created after extensive internal benchmarking. I also have some ideas on how a streaming mode could be added to distributed LightGBM, where data is streamed into the native histogram-binned representation; that representation should take only a small fraction of the memory of the full Spark dataset. It might be a little slower to set up, but it should vastly reduce memory usage. This is something I will be looking into in the near future.
numThreads (int) – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores. |
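A hedged sketch of that suggestion, assuming `useSingleDatasetMode` and `numThreads` are available in the newer release the maintainer refers to; the executor core count here is a placeholder:

```python
# Sketch only: single dataset mode builds one shared LightGBM dataset per
# executor, and numThreads lets LightGBM use the cores itself (num cores - 1,
# leaving one core free, per the suggestion above).
from mmlspark.lightgbm import LightGBMClassifier

executor_cores = 8  # placeholder; match your spark.executor.cores setting

lgb = LightGBMClassifier(
    useSingleDatasetMode=True,       # assumed parameter from the newer releases
    numThreads=executor_cores - 1,   # "num cores - 1" as suggested above
)
```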
I have already seen issue #542, but the answer there does not solve my problem.
I have a dataset of nearly 72 GB with 145 columns. My Spark config is:
spark-submit \
  --master yarn \
  --deploy-mode client \
  --executor-memory 15g \
  --driver-memory 15g \
  --executor-cores 1 \
  --num-executors 20 \
  --packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 \
  --conf spark.default.parallelism=5000 \
  --conf spark.sql.shuffle.partitions=5000 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.memory.storageFraction=0.3 \
  --conf spark.executor.memoryOverhead=15g \
  --conf spark.driver.maxResultSize=10g \
If I reduce the dataset size to 24 GB, I can train the model in 40 minutes. But if I increase the dataset to 72 GB, the training process gets stuck at "reduce at LightGBMClassifier.scala:150" and reports failures such as "ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 128370 ms", "java.lang.Exception: Dataset create call failed in LightGBM with error: Socket recv error, code: 104", and "java.net.ConnectException: Connection refused".
AB#1188553