Spark on Angel: GBDT #376

Closed
kefault opened this issue Aug 23, 2018 · 1 comment

Comments

kefault commented Aug 23, 2018

When I train a GBDT model with Spark on Angel, the job crashes after training for a while and reports the following error:
18/08/23 03:20:57 INFO CoarseGrainedExecutorBackend: Got assigned task 14757
18/08/23 03:20:57 INFO Executor: Running task 5.0 in stage 14.0 (TID 14757)
18/08/23 03:20:57 INFO TorrentBroadcast: Started reading broadcast variable 14
18/08/23 03:20:57 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes in memory (estimated size 2.4 KB, free 4.4 GB)
18/08/23 03:20:57 INFO TorrentBroadcast: Reading broadcast variable 14 took 10 ms
18/08/23 03:20:57 INFO MemoryStore: Block broadcast_14 stored as values in memory (estimated size 4.1 KB, free 4.4 GB)
18/08/23 03:20:57 INFO ShuffleBlockFetcherIterator: Getting 4900 non-empty blocks out of 4900 blocks
18/08/23 03:20:57 INFO ShuffleBlockFetcherIterator: Started 5 remote fetches in 36 ms
18/08/23 03:21:27 INFO MemoryStore: Will not store rdd_24_5
18/08/23 03:21:27 WARN MemoryStore: Not enough space to cache rdd_24_5 in memory! (computed 3.5 GB so far)
18/08/23 03:21:27 INFO MemoryStore: Memory use = 216.1 MB (blocks) + 3.5 GB (scratch space shared across 1 tasks(s)) = 3.7 GB. Storage limit = 4.6 GB.
18/08/23 03:21:27 WARN BlockManager: Block rdd_24_5 could not be removed as it was not found on disk or in memory
18/08/23 03:21:27 WARN BlockManager: Putting block rdd_24_5 failed
18/08/23 03:21:27 INFO BlockManager: Found block rdd_27_5 locally
18/08/23 03:21:27 WARN BlockManager: Putting block rdd_30_5 failed due to an exception
18/08/23 03:21:27 WARN BlockManager: Block rdd_30_5 could not be removed as it was not found on disk or in memory
18/08/23 03:21:27 ERROR Executor: Exception in task 5.0 in stage 14.0 (TID 14757)
java.lang.IllegalArgumentException: requirement failed
    at scala.Predef$.require(Predef.scala:212)
    at com.tencent.angel.spark.ml.gbt.GBDTLearner$$anonfun$13.apply(GBDTLearner.scala:209)
    at com.tencent.angel.spark.ml.gbt.GBDTLearner$$anonfun$13.apply(GBDTLearner.scala:208)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1005)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:996)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:936)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:996)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:700)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

The submit script includes the following configuration:
--conf spark.ps.instances=10 \
--conf spark.ps.cores=2 \
--conf spark.ps.memory=10g \

Please help!

zyq11223 commented Oct 9, 2018

How was this resolved?
