When I train a GBDT model with Spark on Angel, the job runs for a while and then crashes with the following error:

18/08/23 03:20:57 INFO CoarseGrainedExecutorBackend: Got assigned task 14757
18/08/23 03:20:57 INFO Executor: Running task 5.0 in stage 14.0 (TID 14757)
18/08/23 03:20:57 INFO TorrentBroadcast: Started reading broadcast variable 14
18/08/23 03:20:57 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes in memory (estimated size 2.4 KB, free 4.4 GB)
18/08/23 03:20:57 INFO TorrentBroadcast: Reading broadcast variable 14 took 10 ms
18/08/23 03:20:57 INFO MemoryStore: Block broadcast_14 stored as values in memory (estimated size 4.1 KB, free 4.4 GB)
18/08/23 03:20:57 INFO ShuffleBlockFetcherIterator: Getting 4900 non-empty blocks out of 4900 blocks
18/08/23 03:20:57 INFO ShuffleBlockFetcherIterator: Started 5 remote fetches in 36 ms
18/08/23 03:21:27 INFO MemoryStore: Will not store rdd_24_5
18/08/23 03:21:27 WARN MemoryStore: Not enough space to cache rdd_24_5 in memory! (computed 3.5 GB so far)
18/08/23 03:21:27 INFO MemoryStore: Memory use = 216.1 MB (blocks) + 3.5 GB (scratch space shared across 1 tasks(s)) = 3.7 GB. Storage limit = 4.6 GB.
18/08/23 03:21:27 WARN BlockManager: Block rdd_24_5 could not be removed as it was not found on disk or in memory
18/08/23 03:21:27 WARN BlockManager: Putting block rdd_24_5 failed
18/08/23 03:21:27 INFO BlockManager: Found block rdd_27_5 locally
18/08/23 03:21:27 WARN BlockManager: Putting block rdd_30_5 failed due to an exception
18/08/23 03:21:27 WARN BlockManager: Block rdd_30_5 could not be removed as it was not found on disk or in memory
18/08/23 03:21:27 ERROR Executor: Exception in task 5.0 in stage 14.0 (TID 14757)
java.lang.IllegalArgumentException: requirement failed
    at scala.Predef$.require(Predef.scala:212)
    at com.tencent.angel.spark.ml.gbt.GBDTLearner$$anonfun$13.apply(GBDTLearner.scala:209)
    at com.tencent.angel.spark.ml.gbt.GBDTLearner$$anonfun$13.apply(GBDTLearner.scala:208)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1005)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:996)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:936)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:996)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:700)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
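For context, the "Storage limit = 4.6 GB" in the log is consistent with Spark's unified memory model. A rough check, assuming the default spark.memory.fraction of 0.6, the fixed 300 MB reserved heap, and an executor heap of about 8 GB (the actual executor size is not shown in the log, so that figure is an assumption):

```python
# Spark 2.x unified memory model: usable execution + storage memory
# is (heap - reserved) * spark.memory.fraction.
# Assumptions: default spark.memory.fraction = 0.6, fixed 300 MB reserved.
RESERVED_MB = 300
MEMORY_FRACTION = 0.6

def unified_memory_mb(executor_heap_mb: int) -> float:
    """Execution + storage memory available for a given executor heap."""
    return (executor_heap_mb - RESERVED_MB) * MEMORY_FRACTION

# An ~8 GB heap yields roughly the 4.6 GB storage limit seen in the log.
print(round(unified_memory_mb(8 * 1024) / 1024, 1))  # prints 4.6
```

Under these assumptions, a single partition materializing 3.5 GB already consumes most of the unified region, which matches the "Not enough space to cache rdd_24_5" warning.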
The submit script includes the following configuration:
--conf spark.ps.instances=10 \
--conf spark.ps.cores=2 \
--conf spark.ps.memory=10g \
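The flags above size the Angel parameter servers only; the failing memory limit in the log belongs to the Spark executors. One direction worth trying is to give the executors a larger heap and split the input into more, smaller partitions so no single cached block reaches 3.5 GB. A sketch of hypothetical additions to the spark-submit command (the flag values below are illustrative assumptions, not settings from the original job):

```shell
# Hypothetical executor-side settings; tune values to the cluster.
# spark.executor.memory raises the heap that backs the storage limit;
# spark.default.parallelism increases the partition count so each
# cached RDD block is smaller.
--conf spark.executor.memory=16g \
--conf spark.default.parallelism=9800 \
```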
Help, please!