This repository has been archived by the owner on Nov 16, 2019. It is now read-only.

Training hangs at one place. There is no error, no log. #102

Closed

abhaymise opened this issue Jul 5, 2016 · 3 comments

@abhaymise

I am trying to train on a dataset of 192,000 images, which were resized to 32x32 while creating the LMDB with create_imageset.sh.

I put the LMDB files on HDFS and pointed the network at that location.
The network prototxt file I am using is the same as the one given in the CIFAR10 example.
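
For reference, the setup was roughly like the following sketch; the paths and output names here are placeholders, the actual commands are in create_imageset.sh and the attached cmd.txt:

# Build a 32x32 LMDB from the raw images using Caffe's convert_imageset tool
# (placeholder paths; the real commands live in create_imageset.sh)
$CAFFE_ROOT/build/tools/convert_imageset \
  --resize_height=32 --resize_width=32 --shuffle \
  /data/images/ /data/train_labels.txt /data/train_lmdb

# Copy the LMDB onto HDFS; the data layer of the network prototxt points at this path
hdfs dfs -put /data/train_lmdb /user/abhay/cifar10_train_lmdb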

The log generated is:

16/07/05 13:36:13 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
16/07/05 13:36:13 INFO scheduler.DAGScheduler: Job 4 finished: reduce at CaffeOnSpark.scala:210, took 23.290814 s
16/07/05 13:36:13 INFO spark.SparkContext: Starting job: reduce at CaffeOnSpark.scala:210
16/07/05 13:36:13 INFO scheduler.DAGScheduler: Got job 5 (reduce at CaffeOnSpark.scala:210) with 2 output partitions
16/07/05 13:36:13 INFO scheduler.DAGScheduler: Final stage: ResultStage 5 (reduce at CaffeOnSpark.scala:210)
16/07/05 13:36:13 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/05 13:36:13 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/05 13:36:13 INFO scheduler.DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[8] at mapPartitions at CaffeOnSpark.scala:190), which has no missing parents
16/07/05 13:36:13 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 3.4 KB, free 20.8 KB)
16/07/05 13:36:13 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 2.2 KB, free 22.9 KB)
16/07/05 13:36:13 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on 172.31.47.103:45502 (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:36:13 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
16/07/05 13:36:13 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 5 (MapPartitionsRDD[8] at mapPartitions at CaffeOnSpark.scala:190)
16/07/05 13:36:13 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 2 tasks
16/07/05 13:36:13 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 10, slave2-172-31-47-102, partition 0,PROCESS_LOCAL, 2244 bytes)
16/07/05 13:36:13 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 5.0 (TID 11, master-172-31-47-103, partition 1,PROCESS_LOCAL, 2300 bytes)
16/07/05 13:36:13 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on master-172-31-47-103:33837 (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:36:13 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on slave2-172-31-47-102:58709 (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on 172.31.47.103:45502 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on slave2-172-31-47-102:58709 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on master-172-31-47-103:33837 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO spark.ContextCleaner: Cleaned accumulator 5
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on 172.31.47.103:45502 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on slave2-172-31-47-102:58709 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on master-172-31-47-103:33837 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO spark.ContextCleaner: Cleaned accumulator 4
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on 172.31.47.103:45502 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on slave2-172-31-47-102:58709 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on master-172-31-47-103:33837 in memory (size: 2.2 KB, free: 511.5 MB)
16/07/05 13:42:10 INFO spark.ContextCleaner: Cleaned accumulator 3
16/07/05 13:42:10 INFO storage.BlockManager: Removing RDD 6
16/07/05 13:42:10 INFO spark.ContextCleaner: Cleaned RDD 6

The solver and train_test prototxt files are attached:
network.zip

The command used to run the script is attached in cmd.txt:
cmd.txt

@abhaymise
Author

A more detailed log from the running application:

16/07/05 13:35:50 INFO BlockManager: Found block rdd_1_0 locally
I0705 13:35:50.376612 18557 solver.cpp:237] Iteration 0, loss = 2.9456
I0705 13:35:50.376672 18557 solver.cpp:253] Train net output #0: loss = 2.9456 (* 1 = 2.9456 loss)
I0705 13:35:50.390033 18557 sgd_solver.cpp:106] Iteration 0, lr = 0.001
I0705 13:35:52.638326 18557 solver.cpp:237] Iteration 100, loss = 26.1633
I0705 13:35:52.638391 18557 solver.cpp:253] Train net output #0: loss = 26.1633 (* 1 = 26.1633 loss)
I0705 13:35:52.652004 18557 sgd_solver.cpp:106] Iteration 100, lr = 0.001
I0705 13:35:54.924104 18557 solver.cpp:237] Iteration 200, loss = 67.1725
I0705 13:35:54.924163 18557 solver.cpp:253] Train net output #0: loss = 67.1725 (* 1 = 67.1725 loss)
I0705 13:35:54.941706 18557 sgd_solver.cpp:106] Iteration 200, lr = 0.001
I0705 13:35:57.788662 18557 solver.cpp:237] Iteration 300, loss = 34.1383
I0705 13:35:57.788727 18557 solver.cpp:253] Train net output #0: loss = 34.1383 (* 1 = 34.1383 loss)
I0705 13:35:57.803102 18557 sgd_solver.cpp:106] Iteration 300, lr = 0.001
I0705 13:36:00.677893 18557 solver.cpp:237] Iteration 400, loss = 27.2034
I0705 13:36:00.677945 18557 solver.cpp:253] Train net output #0: loss = 27.2034 (* 1 = 27.2034 loss)
I0705 13:36:00.694277 18557 sgd_solver.cpp:106] Iteration 400, lr = 0.001
I0705 13:36:03.324339 18557 solver.cpp:237] Iteration 500, loss = 37.5662
I0705 13:36:03.324394 18557 solver.cpp:253] Train net output #0: loss = 37.5662 (* 1 = 37.5662 loss)
I0705 13:36:03.339056 18557 sgd_solver.cpp:106] Iteration 500, lr = 0.001
I0705 13:36:05.590054 18557 solver.cpp:237] Iteration 600, loss = 35.8908
I0705 13:36:05.590111 18557 solver.cpp:253] Train net output #0: loss = 35.8908 (* 1 = 35.8908 loss)
I0705 13:36:05.604562 18557 sgd_solver.cpp:106] Iteration 600, lr = 0.001
I0705 13:36:07.884639 18557 solver.cpp:237] Iteration 700, loss = 34.151
I0705 13:36:07.884697 18557 solver.cpp:253] Train net output #0: loss = 34.151 (* 1 = 34.151 loss)
I0705 13:36:07.898679 18557 sgd_solver.cpp:106] Iteration 700, lr = 0.001
I0705 13:36:10.181591 18557 solver.cpp:237] Iteration 800, loss = 32.3517
I0705 13:36:10.181645 18557 solver.cpp:253] Train net output #0: loss = 32.3517 (* 1 = 32.3517 loss)
I0705 13:36:10.196151 18557 sgd_solver.cpp:106] Iteration 800, lr = 0.001
I0705 13:36:12.480675 18557 solver.cpp:237] Iteration 900, loss = 36.6911
I0705 13:36:12.480734 18557 solver.cpp:253] Train net output #0: loss = 36.6911 (* 1 = 36.6911 loss)
I0705 13:36:12.495223 18557 sgd_solver.cpp:106] Iteration 900, lr = 0.001
16/07/05 13:36:13 INFO Executor: Finished task 0.0 in stage 4.0 (TID 9). 2015 bytes result sent to driver
16/07/05 13:36:13 INFO CoarseGrainedExecutorBackend: Got assigned task 10
16/07/05 13:36:13 INFO Executor: Running task 0.0 in stage 5.0 (TID 10)
16/07/05 13:36:13 INFO TorrentBroadcast: Started reading broadcast variable 6
16/07/05 13:36:13 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 2.2 KB, free 19.6 KB)
16/07/05 13:36:13 INFO TorrentBroadcast: Reading broadcast variable 6 took 11 ms
16/07/05 13:36:13 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 3.4 KB, free 23.0 KB)
16/07/05 13:36:13 INFO BlockManager: Found block rdd_1_0 locally
I0705 13:36:14.781328 18557 solver.cpp:237] Iteration 1000, loss = 73.3627
I0705 13:36:14.781388 18557 solver.cpp:253] Train net output #0: loss = 73.3627 (* 1 = 73.3627 loss)
I0705 13:36:14.795292 18557 sgd_solver.cpp:106] Iteration 1000, lr = 0.001
I0705 13:36:17.078786 18557 solver.cpp:237] Iteration 1100, loss = 36.6632
I0705 13:36:17.078845 18557 solver.cpp:253] Train net output #0: loss = 36.6632 (* 1 = 36.6632 loss)
I0705 13:36:17.092947 18557 sgd_solver.cpp:106] Iteration 1100, lr = 0.001
16/07/05 13:42:10 INFO BlockManager: Removing RDD 6

@junshi15
Collaborator

junshi15 commented Jul 6, 2016

Check the thread dump from the Spark UI while the job is stuck. It is likely that one of the transformer threads has died.
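
If the Spark UI is not reachable, a thread dump can also be taken directly on the executor host with standard JDK tools (the PID below is a placeholder):

# Locate the executor JVM on the worker node
jps -l | grep CoarseGrainedExecutorBackend

# Dump all thread stacks for that executor (replace 12345 with the PID printed above)
jstack 12345 > executor_threaddump.txt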

@abhaymise
Author

I found the problem.
The executor memory had to be increased.
Once I increased it, training started to run.
Thanks.
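
For anyone who hits the same hang: the fix was simply giving each executor more memory when launching the job. The exact command is in the attached cmd.txt; a rough sketch (class name, jar, file paths and the 8g value are placeholders) looks like:

# Raise per-executor memory so the transformer threads are not starved
spark-submit --master yarn --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 8g \
  --class com.yahoo.ml.caffe.CaffeOnSpark \
  caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
  -train \
  -conf cifar10_quick_solver.prototxt \
  -model hdfs:///user/abhay/cifar10_quick.model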
