Training hangs at one place. There is no error and no log. #102
Comments
More detailed log from the running app => 16/07/05 13:35:50 INFO BlockManager: Found block rdd_1_0 locally
Check the thread dump from the Spark UI while the job is stuck. It is likely that one of the transformer threads has died.
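If the dump is easier to read as text, a small script can summarize it. The following is only a rough sketch: it assumes the dump was saved in the plain-text layout that the JDK `jstack` tool prints (a quoted thread name on a header line, followed by a `java.lang.Thread.State:` line). The sample dump and the thread name `transformer-0` are hypothetical placeholders, not names taken from CaffeOnSpark.

```python
import re
from collections import defaultdict

# Header line of a thread in jstack-style output, e.g.
#   "Executor task launch worker-0" #42 daemon prio=5 ...
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# State line that follows the header, e.g.
#   java.lang.Thread.State: WAITING (parking)
THREAD_STATE = re.compile(r'java\.lang\.Thread\.State:\s+(?P<state>[A-Z_]+)')


def threads_by_state(dump_text):
    """Group thread names by the JVM thread state reported in the dump."""
    groups = defaultdict(list)
    current = None
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        state = THREAD_STATE.search(line)
        if state and current is not None:
            groups[state.group("state")].append(current)
            current = None  # wait for the next thread header
    return dict(groups)


# Hypothetical sample, for illustration only (not a real CaffeOnSpark dump).
SAMPLE_DUMP = '''"Executor task launch worker-0" #42 daemon prio=5 os_prio=0
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)

"transformer-0" #43 daemon prio=5 os_prio=0
   java.lang.Thread.State: RUNNABLE
'''

if __name__ == "__main__":
    for state, names in threads_by_state(SAMPLE_DUMP).items():
        print(state, names)
```

A thread that has already died does not appear in the dump at all, so the useful check is to compare the names listed here against the threads you expect to be running.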
I found the problem.
I am trying to train on 192000 images, which were resized to 32x32 while creating the LMDB using create_imageset.sh.
I put the LMDB files on HDFS and set their location in the network prototxt.
The network prototxt file I am using is the same as the one given in the CIFAR10 example.
The log generated is:
16/07/05 13:36:13 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
16/07/05 13:36:13 INFO scheduler.DAGScheduler: Job 4 finished: reduce at CaffeOnSpark.scala:210, took 23.290814 s
16/07/05 13:36:13 INFO spark.SparkContext: Starting job: reduce at CaffeOnSpark.scala:210
16/07/05 13:36:13 INFO scheduler.DAGScheduler: Got job 5 (reduce at CaffeOnSpark.scala:210) with 2 output partitions
16/07/05 13:36:13 INFO scheduler.DAGScheduler: Final stage: ResultStage 5 (reduce at CaffeOnSpark.scala:210)
16/07/05 13:36:13 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/05 13:36:13 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/05 13:36:13 INFO scheduler.DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[8] at mapPartitions at CaffeOnSpark.scala:190), which has no missing parents
16/07/05 13:36:13 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 3.4 KB, free 20.8 KB)
16/07/05 13:36:13 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 2.2 KB, free 22.9 KB)
16/07/05 13:36:13 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on 172.31.47.103:45502 (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:36:13 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
16/07/05 13:36:13 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 5 (MapPartitionsRDD[8] at mapPartitions at CaffeOnSpark.scala:190)
16/07/05 13:36:13 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 2 tasks
16/07/05 13:36:13 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 10, slave2-172-31-47-102, partition 0,PROCESS_LOCAL, 2244 bytes)
16/07/05 13:36:13 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 5.0 (TID 11, master-172-31-47-103, partition 1,PROCESS_LOCAL, 2300 bytes)
16/07/05 13:36:13 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on master-172-31-47-103:33837 (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:36:13 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on slave2-172-31-47-102:58709 (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on 172.31.47.103:45502 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on slave2-172-31-47-102:58709 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on master-172-31-47-103:33837 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO spark.ContextCleaner: Cleaned accumulator 5
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on 172.31.47.103:45502 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on slave2-172-31-47-102:58709 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on master-172-31-47-103:33837 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO spark.ContextCleaner: Cleaned accumulator 4
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on 172.31.47.103:45502 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on slave2-172-31-47-102:58709 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on master-172-31-47-103:33837 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO spark.ContextCleaner: Cleaned accumulator 3
16/07/05 13:42:10 INFO storage.BlockManager: Removing RDD 6
16/07/05 13:42:10 INFO spark.ContextCleaner: Cleaned RDD 6
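The log above goes silent between 13:36:13 and 13:42:10, right after the two tasks of stage 5.0 are started, which is one way to see where the job is stuck. A minimal sketch for finding such a gap automatically, assuming every log line starts with a `yy/MM/dd HH:mm:ss` timestamp as in the lines above (the function name `largest_gap` is made up for this example):

```python
from datetime import datetime

TS_FORMAT = "%y/%m/%d %H:%M:%S"  # matches "16/07/05 13:36:13"
TS_LENGTH = 17                   # number of characters in that timestamp


def largest_gap(log_lines):
    """Return (gap_seconds, line_before, line_after) for the longest pause
    between consecutive timestamped lines, or None if there are fewer than two."""
    previous = None  # (timestamp, line) of the last timestamped line seen
    best = None
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            continue  # skip lines without a leading timestamp
        if previous is not None:
            gap = (stamp - previous[0]).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, previous[1], line)
        previous = (stamp, line)
    return best


# Three lines copied from the log above.
SAMPLE_LINES = [
    "16/07/05 13:36:13 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 2 tasks",
    "16/07/05 13:36:13 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on slave2-172-31-47-102:58709 (size: 2.2 KB, free: 511.5 MB)",
    "16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on 172.31.47.103:45502 in memory (size: 2.2 KB, free: 511.5 MB)",
]

if __name__ == "__main__":
    gap, before, after = largest_gap(SAMPLE_LINES)
    print(f"{gap:.0f} s of silence after: {before}")
```

The last line before the gap shows what the driver was waiting on; here it is the stage 5.0 tasks, so the executor logs and thread dumps are the place to look next.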
The solver and train_test prototxt files are attached.
network.zip
The command used to run the script is attached in cmd.txt.
cmd.txt