
createchmtrainjob.py

Chris Churas edited this page Jul 18, 2017 · 29 revisions

createchmtrainjob.py is a tool that creates a script to run a CHM train job on a compute cluster such as Comet or Rocce.

[Figure: createchmtrainjob overview graphic]

Example usage:

createchmtrainjob.py ./images ./labels ./run --chmbin /bin/chm.img \
 --cluster rocce

Example usage on Rocce

This example takes a training dataset, downloaded in Step 2 and similar to the images in the figure above, and runs CHM Train to generate what is known as a trained model. A trained model is simply a directory containing some Matlab files.

Step 1 Connect to Rocce

Open a terminal and connect to Rocce via ssh. Replace <USER> with your username.

ssh <USER>@rocce.ucsd.edu

Step 2 Download training data

Replace <USER> with your username.

cd /data/scratch/<USER>
mkdir -p testtrain/images testtrain/labels
cd testtrain/images
wget https://raw.githubusercontent.com/wiki/CRBS/chmutil/images/traindata/images/x.000.png
wget https://raw.githubusercontent.com/wiki/CRBS/chmutil/images/traindata/images/x.001.png
cd ../labels
wget https://raw.githubusercontent.com/wiki/CRBS/chmutil/images/traindata/labels/x.000.png
wget https://raw.githubusercontent.com/wiki/CRBS/chmutil/images/traindata/labels/x.001.png
cd ..

If the downloads succeeded, running the tree command from the testtrain directory should output the following:

tree
.
|-- images
|   |-- x.000.png
|   `-- x.001.png
`-- labels
    |-- x.000.png
    `-- x.001.png
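The layout above can also be checked programmatically. The following is a minimal sketch, assuming each image is expected to have a label file with the same name (as in the downloaded dataset); the function name `unpaired_files` is hypothetical and is not part of chmutil.

```python
from pathlib import Path


def unpaired_files(images_dir, labels_dir, suffix=".png"):
    """Return (images without a label, labels without an image), by file name.

    Assumption: images and labels are matched by identical file names.
    """
    image_names = {p.name for p in Path(images_dir).glob(f"*{suffix}")}
    label_names = {p.name for p in Path(labels_dir).glob(f"*{suffix}")}
    return sorted(image_names - label_names), sorted(label_names - image_names)


if __name__ == "__main__":
    # Run from the testtrain directory created in this step.
    missing_labels, missing_images = unpaired_files("images", "labels")
    print("images without a label:", missing_labels)
    print("labels without an image:", missing_images)
```

For the dataset downloaded in this step, both lists should be empty.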

Step 3 Run createchmtrainjob.py to create the CHM train job script

createchmtrainjob.py ./images ./labels ./run \
--chmbin <path to chm singularity image> \
--cluster rocce --maxmem 40 --stage 2 \
--level 1

If the command succeeds, text similar to the following will be displayed:

To submit run: cd /data/scratch/<USER>/testtrain/run; qsub /data/scratch/<USER>/testtrain/run/runtrain.rocce

In the output above, <USER> will be replaced by your username. Run the displayed command to submit the job.

Step 4 Check and wait for job completion (takes ~6 hours to run)

NOTE: This two-image training dataset takes roughly 6 hours to run on Rocce.

To see whether the job has finished, run:

qstat

If the job is still running, the output will look like this:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
  12711 0.50500 chmtrainjo churas       r     06/13/2017 11:30:08 all.q@compute-0-22.local           1        

The state column shows whether the job is queued to run (qw) or running (r). If the job is no longer listed, it is done.
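The state check described above can be scripted. The sketch below is illustrative only: it assumes the column layout shown in the sample qstat output (state is the fifth whitespace-separated field) and that job names contain no spaces. The helper name `job_state` is hypothetical.

```python
def job_state(qstat_output, job_id):
    """Return the state column for job_id, or None if the job is not listed."""
    for line in qstat_output.splitlines():
        fields = line.split()
        # Data rows start with the numeric job ID; the header and
        # separator rows do not, so they are skipped here.
        if len(fields) > 4 and fields[0] == str(job_id):
            # Assumed columns: job-ID, prior, name, user, state, ...
            return fields[4]
    return None


# Sample text copied from the qstat output shown above.
SAMPLE = """\
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  12711 0.50500 chmtrainjo churas       r     06/13/2017 11:30:08 all.q@compute-0-22.local           1
"""

print(job_state(SAMPLE, 12711))  # r
print(job_state(SAMPLE, 99999))  # None
```

On the cluster, the text could come from `subprocess.run(["qstat"], capture_output=True, text=True).stdout` instead of the sample string. A result of `None` corresponds to the "not listed" case, meaning the job is done.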

Step 5 Check completion status

Look at the stdout/#####.out file, where ##### is the job ID, and check that the job completed with exit code 0:

cat stdout/12711.out

HOST: compute-0-22.local
DATE: Tue Jun 13 11:30:08 PDT 2017
JOBID: 12711
Extracting features ... stage 1 level 0 
Start learning LDNN ... stage 1 level 0 
Run clustering...Done. It took 43.825148  
Number of training samples = 524288 
Epoch No. 1 ... error = 0.064621 
Epoch No. 2 ... error = 0.052156 
Epoch No. 3 ... error = 0.049860 
Epoch No. 4 ... error = 0.048393 
.
.
.
Generating outputs ... stage 2 level 0 
	Command being timed: "/data/scratch/<USER>/chm_s22.img train /data/scratch/<USER>/testtrain/images /data/scratch/<USER>/testtrain/labels -S 2 -L 2 -m /data/<USER>/testtrain/tmp"
	User time (seconds): 20301.93
	System time (seconds): 29.93
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 5:39:03
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 36365776
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1085
	Minor (reclaiming a frame) page faults: 677771
	Voluntary context switches: 25307
	Involuntary context switches: 2037831
	Swaps: 0
	File system inputs: 272114
	File system outputs: 218920
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
chm_s22.img exited with code: 0
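The exit-code check on the log above can be automated. This is a minimal sketch, assuming the log ends with a line of the form `<image> exited with code: N` as in the sample; the helper name `chm_exit_code` is hypothetical and is not part of chmutil.

```python
import re


def chm_exit_code(log_text):
    """Return the exit code reported in the job log, or None if not found."""
    # Assumption: the job script writes "<image> exited with code: N".
    match = re.search(r"exited with code:\s*(-?\d+)", log_text)
    return int(match.group(1)) if match else None


# Abbreviated sample based on the log shown above.
SAMPLE_LOG = """\
Generating outputs ... stage 2 level 0
\tExit status: 0
chm_s22.img exited with code: 0
"""

print(chm_exit_code(SAMPLE_LOG))  # 0
```

For a real run, the text could be read with `pathlib.Path("stdout/12711.out").read_text()`. A return value of `None` suggests the job has not finished writing its log or did not reach the final line.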

If the command ran successfully, the trained model will be in the model/ directory:

ls
model  Outputs_v2.mat  readme.txt  runtrain.rocce  stdout  tmp