root@ip-172-31-29-83:~/CaffeOnSpark/data# spark-submit --master spark://$(hostname):7077 --files lenet_memory_train_test.prototxt,lenet_memory_solver.prototxt --conf spark.cores.max=${TOTAL_CORES} --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" --class com.yahoo.ml.caffe.CaffeOnSpark ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar -train -features accuracy,loss -label label -conf lenet_memory_solver.prototxt -clusterSize ${SPARK_WORKER_INSTANCES} -devices ${DEVICES} -connection ethernet -model /mnist.model -output /mnist_features_result Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 16/03/28 01:03:43 INFO SparkContext: Running Spark version 1.6.0 16/03/28 01:03:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/03/28 01:03:44 WARN SparkConf: SPARK_WORKER_INSTANCES was detected (set to '3'). This is deprecated in Spark 1.0+. Please instead use: - ./spark-submit with --num-executors to specify the number of executors - Or set SPARK_EXECUTOR_INSTANCES - spark.executor.instances to configure the number of instances in the spark config. 16/03/28 01:03:44 INFO SecurityManager: Changing view acls to: root 16/03/28 01:03:44 INFO SecurityManager: Changing modify acls to: root 16/03/28 01:03:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root) 16/03/28 01:03:44 INFO Utils: Successfully started service 'sparkDriver' on port 60452. 16/03/28 01:03:45 INFO Slf4jLogger: Slf4jLogger started 16/03/28 01:03:45 INFO Remoting: Starting remoting 16/03/28 01:03:45 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@172.31.29.83:39179] 16/03/28 01:03:45 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 39179. 16/03/28 01:03:45 INFO SparkEnv: Registering MapOutputTracker 16/03/28 01:03:45 INFO SparkEnv: Registering BlockManagerMaster 16/03/28 01:03:45 INFO DiskBlockManager: Created local directory at /mnt/spark/blockmgr-292f6266-d419-4e62-a581-f2ac5b73c410 16/03/28 01:03:45 INFO DiskBlockManager: Created local directory at /mnt2/spark/blockmgr-01796f1d-7fc3-4497-a874-f069ce27f4fd 16/03/28 01:03:45 INFO MemoryStore: MemoryStore started with capacity 511.5 MB 16/03/28 01:03:45 INFO SparkEnv: Registering OutputCommitCoordinator 16/03/28 01:03:45 INFO Utils: Successfully started service 'SparkUI' on port 4040. 16/03/28 01:03:45 INFO SparkUI: Started SparkUI at http://ec2-54-171-180-143.eu-west-1.compute.amazonaws.com:4040 16/03/28 01:03:45 INFO HttpFileServer: HTTP File server directory is /mnt/spark/spark-a40b6eca-399b-48fd-a6f9-068d0f1a3cb1/httpd-b70aa2b0-d131-41b3-8030-f1eeeb74cde0 16/03/28 01:03:45 INFO HttpServer: Starting HTTP Server 16/03/28 01:03:45 INFO Utils: Successfully started service 'HTTP file server' on port 43299. 16/03/28 01:03:45 INFO SparkContext: Added JAR file:/root/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar at http://172.31.29.83:43299/jars/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar with timestamp 1459127025615 16/03/28 01:03:45 INFO Utils: Copying /root/CaffeOnSpark/data/lenet_memory_train_test.prototxt to /mnt/spark/spark-a40b6eca-399b-48fd-a6f9-068d0f1a3cb1/userFiles-441d108a-61f8-47bb-bb7c-78cb0088873b/lenet_memory_train_test.prototxt 16/03/28 01:03:45 INFO SparkContext: Added file file:/root/CaffeOnSpark/data/lenet_memory_train_test.prototxt at http://172.31.29.83:43299/files/lenet_memory_train_test.prototxt with timestamp 1459127025724 16/03/28 01:03:45 INFO Utils: Copying /root/CaffeOnSpark/data/lenet_memory_solver.prototxt to /mnt/spark/spark-a40b6eca-399b-48fd-a6f9-068d0f1a3cb1/userFiles-441d108a-61f8-47bb-bb7c-78cb0088873b/lenet_memory_solver.prototxt 16/03/28 01:03:45 INFO SparkContext: Added file file:/root/CaffeOnSpark/data/lenet_memory_solver.prototxt at http://172.31.29.83:43299/files/lenet_memory_solver.prototxt with timestamp 1459127025731 16/03/28 01:03:45 INFO AppClient$ClientEndpoint: Connecting to master spark://ip-172-31-29-83.eu-west-1.compute.internal:7077... 16/03/28 01:03:45 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160328010345-0003 16/03/28 01:03:45 INFO AppClient$ClientEndpoint: Executor added: app-20160328010345-0003/0 on worker-20160328000757-172.31.26.194-59131 (172.31.26.194:59131) with 32 cores 16/03/28 01:03:45 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160328010345-0003/0 on hostPort 172.31.26.194:59131 with 32 cores, 53.0 GB RAM 16/03/28 01:03:45 INFO AppClient$ClientEndpoint: Executor added: app-20160328010345-0003/1 on worker-20160328000757-172.31.26.193-43515 (172.31.26.193:43515) with 32 cores 16/03/28 01:03:45 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160328010345-0003/1 on hostPort 172.31.26.193:43515 with 32 cores, 53.0 GB RAM 16/03/28 01:03:45 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50487. 16/03/28 01:03:45 INFO AppClient$ClientEndpoint: Executor added: app-20160328010345-0003/2 on worker-20160328000758-172.31.26.195-57911 (172.31.26.195:57911) with 32 cores 16/03/28 01:03:45 INFO NettyBlockTransferService: Server created on 50487 16/03/28 01:03:45 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160328010345-0003/2 on hostPort 172.31.26.195:57911 with 32 cores, 53.0 GB RAM 16/03/28 01:03:45 INFO BlockManagerMaster: Trying to register BlockManager 16/03/28 01:03:45 INFO BlockManagerMasterEndpoint: Registering block manager 172.31.29.83:50487 with 511.5 MB RAM, BlockManagerId(driver, 172.31.29.83, 50487) 16/03/28 01:03:45 INFO BlockManagerMaster: Registered BlockManager 16/03/28 01:03:45 INFO AppClient$ClientEndpoint: Executor updated: app-20160328010345-0003/0 is now RUNNING 16/03/28 01:03:45 INFO AppClient$ClientEndpoint: Executor updated: app-20160328010345-0003/1 is now RUNNING 16/03/28 01:03:45 INFO AppClient$ClientEndpoint: Executor updated: app-20160328010345-0003/2 is now RUNNING 16/03/28 01:03:48 INFO SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-172-31-26-193.eu-west-1.compute.internal:40877) with ID 1 16/03/28 01:03:48 INFO SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-172-31-26-195.eu-west-1.compute.internal:54929) with ID 2 16/03/28 01:03:48 INFO SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-172-31-26-194.eu-west-1.compute.internal:57463) with ID 0 16/03/28 01:03:48 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-26-193.eu-west-1.compute.internal:54036 with 37.9 GB RAM, BlockManagerId(1, ip-172-31-26-193.eu-west-1.compute.internal, 54036) 16/03/28 01:03:48 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-26-195.eu-west-1.compute.internal:58822 with 37.9 GB RAM, BlockManagerId(2, ip-172-31-26-195.eu-west-1.compute.internal, 58822) 16/03/28 01:03:48 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-26-194.eu-west-1.compute.internal:34356 with 37.9 GB RAM, BlockManagerId(0, ip-172-31-26-194.eu-west-1.compute.internal, 34356) 16/03/28 01:03:48 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 1.0 16/03/28 01:03:49 INFO DataSource$: Source data layer:0 16/03/28 01:03:49 INFO LMDB: Batch size:64 16/03/28 01:03:49 INFO SparkContext: Starting job: collect at CaffeOnSpark.scala:126 16/03/28 01:03:49 INFO DAGScheduler: Got job 0 (collect at CaffeOnSpark.scala:126) with 3 output partitions 16/03/28 01:03:49 INFO DAGScheduler: Final stage: ResultStage 0 (collect at CaffeOnSpark.scala:126) 16/03/28 01:03:49 INFO DAGScheduler: Parents of final stage: List() 16/03/28 01:03:49 INFO DAGScheduler: Missing parents: List() 16/03/28 01:03:49 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at map at CaffeOnSpark.scala:115), which has no missing parents 16/03/28 01:03:49 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.2 KB, free 3.2 KB) 16/03/28 01:03:50 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.0 KB, free 5.2 KB) 16/03/28 01:03:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.31.29.83:50487 (size: 2.0 KB, free: 511.5 MB) 16/03/28 01:03:50 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006 16/03/28 01:03:50 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at map at CaffeOnSpark.scala:115) 16/03/28 01:03:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks 16/03/28 01:03:50 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-172-31-26-195.eu-west-1.compute.internal, partition 0,PROCESS_LOCAL, 2200 bytes) 16/03/28 01:03:50 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, ip-172-31-26-193.eu-west-1.compute.internal, partition 1,PROCESS_LOCAL, 2200 bytes) 16/03/28 01:03:50 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, ip-172-31-26-194.eu-west-1.compute.internal, partition 2,PROCESS_LOCAL, 2200 bytes) 16/03/28 01:03:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-26-194.eu-west-1.compute.internal:34356 (size: 2.0 KB, free: 37.9 GB) 16/03/28 01:03:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-26-193.eu-west-1.compute.internal:54036 (size: 2.0 KB, free: 37.9 GB) 16/03/28 01:03:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-26-195.eu-west-1.compute.internal:58822 (size: 2.0 KB, free: 37.9 GB) 16/03/28 01:03:52 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 2299 ms on ip-172-31-26-194.eu-west-1.compute.internal (1/3) 16/03/28 01:03:52 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2312 ms on ip-172-31-26-193.eu-west-1.compute.internal (2/3) 16/03/28 01:03:52 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2381 ms on ip-172-31-26-195.eu-west-1.compute.internal (3/3) 16/03/28 01:03:52 INFO DAGScheduler: ResultStage 0 (collect at CaffeOnSpark.scala:126) finished in 2.384 s 16/03/28 01:03:52 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 16/03/28 01:03:52 INFO DAGScheduler: Job 0 finished: collect at CaffeOnSpark.scala:126, took 2.617513 s 16/03/28 01:03:52 INFO CaffeOnSpark: rank = 0, RDMAaddress = ,ip-172-31-26-195:39072,ip-172-31-26-195:39072, hostname = ip-172-31-26-195 16/03/28 01:03:52 INFO CaffeOnSpark: rank = 1, RDMAaddress = ip-172-31-26-193:40755,,ip-172-31-26-193:40755, hostname = ip-172-31-26-193 16/03/28 01:03:52 INFO CaffeOnSpark: rank = 2, RDMAaddress = ip-172-31-26-194:58031,ip-172-31-26-194:58031,, hostname = ip-172-31-26-194 16/03/28 01:03:52 INFO CaffeOnSpark: rank 0:ip-172-31-26-195 16/03/28 01:03:52 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 840.0 B, free 6.1 KB) 16/03/28 01:03:52 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 132.0 B, free 6.2 KB) 16/03/28 01:03:52 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.31.29.83:50487 (size: 132.0 B, free: 511.5 MB) 16/03/28 01:03:52 INFO SparkContext: Created broadcast 1 from broadcast at CaffeOnSpark.scala:145 16/03/28 01:03:52 INFO SparkContext: Starting job: collect at CaffeOnSpark.scala:154 16/03/28 01:03:52 INFO DAGScheduler: Got job 1 (collect at CaffeOnSpark.scala:154) with 3 output partitions 16/03/28 01:03:52 INFO DAGScheduler: Final stage: ResultStage 1 (collect at CaffeOnSpark.scala:154) 16/03/28 01:03:52 INFO DAGScheduler: Parents of final stage: List() 16/03/28 01:03:52 INFO DAGScheduler: Missing parents: List() 16/03/28 01:03:52 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at map at CaffeOnSpark.scala:148), which has no missing parents 16/03/28 01:03:52 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 8.8 KB) 16/03/28 01:03:52 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1580.0 B, free 10.3 KB) 16/03/28 01:03:52 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.31.29.83:50487 (size: 1580.0 B, free: 511.5 MB) 16/03/28 01:03:52 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006 16/03/28 01:03:52 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at map at CaffeOnSpark.scala:148) 16/03/28 01:03:52 INFO TaskSchedulerImpl: Adding task set 1.0 with 3 tasks 16/03/28 01:03:52 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3, ip-172-31-26-194.eu-west-1.compute.internal, partition 0,PROCESS_LOCAL, 2200 bytes) 16/03/28 01:03:52 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 4, ip-172-31-26-195.eu-west-1.compute.internal, partition 1,PROCESS_LOCAL, 2200 bytes) 16/03/28 01:03:52 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 5, ip-172-31-26-193.eu-west-1.compute.internal, partition 2,PROCESS_LOCAL, 2200 bytes) 16/03/28 01:03:52 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-26-194.eu-west-1.compute.internal:34356 (size: 1580.0 B, free: 37.9 GB) 16/03/28 01:03:52 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-26-195.eu-west-1.compute.internal:58822 (size: 1580.0 B, free: 37.9 GB) 16/03/28 01:03:52 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-26-193.eu-west-1.compute.internal:54036 (size: 1580.0 B, free: 37.9 GB) 16/03/28 01:03:52 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-26-194.eu-west-1.compute.internal:34356 (size: 132.0 B, free: 37.9 GB) 16/03/28 01:03:52 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-26-195.eu-west-1.compute.internal:58822 (size: 132.0 B, free: 37.9 GB) 16/03/28 01:03:52 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-26-193.eu-west-1.compute.internal:54036 (size: 132.0 B, free: 37.9 GB) 16/03/28 01:04:13 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 5) in 20529 ms on ip-172-31-26-193.eu-west-1.compute.internal (1/3) 16/03/28 01:04:13 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 3) in 20569 ms on ip-172-31-26-194.eu-west-1.compute.internal (2/3) 16/03/28 01:04:13 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 4) in 20563 ms on ip-172-31-26-195.eu-west-1.compute.internal (3/3) 16/03/28 01:04:13 INFO DAGScheduler: ResultStage 1 (collect at CaffeOnSpark.scala:154) finished in 20.572 s 16/03/28 01:04:13 INFO DAGScheduler: Job 1 finished: collect at CaffeOnSpark.scala:154, took 20.583677 s 16/03/28 01:04:13 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 16/03/28 01:04:13 INFO LmdbRDD: local LMDB path:/root/CaffeOnSpark/data/mnist_train_lmdb 16/03/28 01:04:13 INFO LmdbRDD: 3 LMDB RDD partitions 16/03/28 01:04:13 INFO SparkContext: Starting job: min at CaffeOnSpark.scala:180 16/03/28 01:04:13 INFO DAGScheduler: Got job 2 (min at CaffeOnSpark.scala:180) with 3 output partitions 16/03/28 01:04:13 INFO DAGScheduler: Final stage: ResultStage 2 (min at CaffeOnSpark.scala:180) 16/03/28 01:04:13 INFO DAGScheduler: Parents of final stage: List() 16/03/28 01:04:13 INFO DAGScheduler: Missing parents: List() 16/03/28 01:04:13 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[6] at mapPartitions at CaffeOnSpark.scala:170), which has no missing parents 16/03/28 01:04:13 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.4 KB, free 13.7 KB) 16/03/28 01:04:13 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.2 KB, free 15.9 KB) 16/03/28 01:04:13 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 172.31.29.83:50487 (size: 2.2 KB, free: 511.5 MB) 16/03/28 01:04:13 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006 16/03/28 01:04:13 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 2 (MapPartitionsRDD[6] at mapPartitions at CaffeOnSpark.scala:170) 16/03/28 01:04:13 INFO TaskSchedulerImpl: Adding task set 2.0 with 3 tasks 16/03/28 01:04:13 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 6, ip-172-31-26-193.eu-west-1.compute.internal, partition 0,PROCESS_LOCAL, 2152 bytes) 16/03/28 01:04:13 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 7, ip-172-31-26-194.eu-west-1.compute.internal, partition 1,PROCESS_LOCAL, 2152 bytes) 16/03/28 01:04:13 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 8, ip-172-31-26-195.eu-west-1.compute.internal, partition 2,PROCESS_LOCAL, 2152 bytes) 16/03/28 01:04:13 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-26-193.eu-west-1.compute.internal:54036 (size: 2.2 KB, free: 37.9 GB) 16/03/28 01:04:13 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-26-194.eu-west-1.compute.internal:34356 (size: 2.2 KB, free: 37.9 GB) 16/03/28 01:04:13 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-26-195.eu-west-1.compute.internal:58822 (size: 2.2 KB, free: 37.9 GB) 16/03/28 01:04:13 INFO BlockManagerInfo: Added rdd_1_0 on disk on ip-172-31-26-193.eu-west-1.compute.internal:54036 (size: 15.5 MB) 16/03/28 01:04:13 INFO BlockManagerInfo: Added rdd_1_1 on disk on ip-172-31-26-194.eu-west-1.compute.internal:34356 (size: 15.5 MB) 16/03/28 01:04:13 INFO BlockManagerInfo: Added rdd_1_2 on disk on ip-172-31-26-195.eu-west-1.compute.internal:58822 (size: 15.5 MB)