createchmtrainjob.py
createchmtrainjob.py is a tool that creates a script to run a CHM train job on a compute cluster such as Comet or Rocce.
Example usage:
createchmtrainjob.py ./images ./labels ./run --chmbin /bin/chm.img \
--cluster rocce
This example takes a training dataset, which is downloaded in Step 2 and looks like the images in the figure above, and runs CHM Train to generate what is known as a trained model. A trained model is simply a directory containing some Matlab files.
Open a terminal and connect to Rocce via ssh, replacing <USER> with your username:
ssh <USER>@rocce.ucsd.edu
Create a working directory and download the example training data, again replacing <USER> with your username:
cd /data/scratch/<USER>
mkdir -p testtrain/images testtrain/labels
cd testtrain/images
wget https://raw.githubusercontent.com/wiki/CRBS/chmutil/images/traindata/images/x.000.png
wget https://raw.githubusercontent.com/wiki/CRBS/chmutil/images/traindata/images/x.001.png
cd ../labels
wget https://raw.githubusercontent.com/wiki/CRBS/chmutil/images/traindata/labels/x.000.png
wget https://raw.githubusercontent.com/wiki/CRBS/chmutil/images/traindata/labels/x.001.png
cd ..
If the above is successful, running the tree command as shown below should produce the following output:
tree
.
|-- images
|   |-- x.000.png
|   `-- x.001.png
`-- labels
    |-- x.000.png
    `-- x.001.png
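Before creating the job, it can help to confirm that every training image has a matching label. The following is a minimal Python sketch of such a check. The helper name unmatched_files is hypothetical and not part of chmutil, and the rule that an image and its label share the same file name is an assumption based on the example dataset above.

```python
import os


def unmatched_files(images_dir, labels_dir, ext=".png"):
    """Return (images_without_label, labels_without_image) as sorted lists.

    Assumes each training image is paired with a label of the same file
    name, as in the example dataset. Hypothetical helper, not part of
    chmutil.
    """
    # Collect file names with the expected extension from each directory.
    images = {f for f in os.listdir(images_dir) if f.endswith(ext)}
    labels = {f for f in os.listdir(labels_dir) if f.endswith(ext)}
    # Set differences give the names present on one side only.
    return sorted(images - labels), sorted(labels - images)
```

Calling unmatched_files("./images", "./labels") from the testtrain directory should return two empty lists when the example data downloaded correctly.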
createchmtrainjob.py ./images ./labels ./run \
--chmbin <path to chm singularity image> \
--cluster rocce --maxmem 40 --stage 2 \
--level 1
If the above command works, text similar to the following will be displayed:
To submit run: cd /data/scratch/<USER>/testtrain/run; qsub /data/scratch/<USER>/testtrain/run/runtrain.rocce
In the output above, <USER> will be set to your username. Run the displayed command to submit the job to the cluster.
NOTE: This two-image training dataset takes roughly 6 hours to run on Rocce.
To see if the job has finished, run:
qstat
If the job is still running, the output will look like this:
job-ID  prior    name        user    state  submit/start at      queue                     slots  ja-task-ID
------------------------------------------------------------------------------------------------------------
12711   0.50500  chmtrainjo  churas  r      06/13/2017 11:30:08  all.q@compute-0-22.local  1
The state column shows whether the job is queued to run (qw) or running (r). If the job is not listed, it is done.
Look at the stdout/#####.out file, where ##### is the job ID, and check that the job completed with exit code 0:
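The state lookup described above can be scripted. The following is a minimal Python sketch that assumes the qstat column layout matches the sample output shown here, with the job ID in the first column and the state in the fifth. The function names job_state and current_job_state are hypothetical and not part of chmutil.

```python
import subprocess


def job_state(qstat_output, job_id):
    """Return the state column (e.g. 'qw' or 'r') for job_id, or None if not listed."""
    for line in qstat_output.splitlines():
        fields = line.split()
        # Data rows start with the job ID; the state is the fifth column
        # in the sample layout (an assumption about the local qstat).
        if len(fields) >= 5 and fields[0] == str(job_id):
            return fields[4]
    # A job that is no longer listed has left the queue.
    return None


def current_job_state(job_id):
    """Run qstat on the cluster login node and look up job_id in its output."""
    result = subprocess.run(["qstat"], capture_output=True, text=True, check=True)
    return job_state(result.stdout, job_id)
```

For the sample output above, job_state returns "r" for job 12711. A return value of None means the job is not listed, which by itself does not say whether it succeeded.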
cat stdout/12711.out
HOST: compute-0-22.local
DATE: Tue Jun 13 11:30:08 PDT 2017
JOBID: 12711
Extracting features ... stage 1 level 0
Start learning LDNN ... stage 1 level 0
Run clustering...Done. It took 43.825148
Number of training samples = 524288
Epoch No. 1 ... error = 0.064621
Epoch No. 2 ... error = 0.052156
Epoch No. 3 ... error = 0.049860
Epoch No. 4 ... error = 0.048393
.
.
.
Generating outputs ... stage 2 level 0
Command being timed: "/data/scratch/<USER>/chm_s22.img train /data/scratch/<USER>/testtrain/images /data/scratch/<USER>/testtrain/labels -S 2 -L 2 -m /data/<USER>/testtrain/tmp"
User time (seconds): 20301.93
System time (seconds): 29.93
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 5:39:03
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 36365776
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1085
Minor (reclaiming a frame) page faults: 677771
Voluntary context switches: 25307
Involuntary context switches: 2037831
Swaps: 0
File system inputs: 272114
File system outputs: 218920
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
chm_s22.img exited with code: 0
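Checking the log by eye works for a single job. For repeated runs, the exit code and elapsed time can be pulled out programmatically. The following is a minimal Python sketch that assumes the log contains the "exited with code:" and "Elapsed (wall clock) time" lines in the form shown above. The function name summarize_log is hypothetical and not part of chmutil.

```python
import re


def summarize_log(text):
    """Extract the exit code and wall-clock seconds from a train job log.

    Returns a dict with keys 'exit_code' and 'elapsed_seconds'; a value is
    None when the corresponding line is absent (e.g. job still running).
    """
    code = re.search(r"exited with code:\s*(-?\d+)", text)
    wall = re.search(
        r"Elapsed \(wall clock\) time \(h:mm:ss or m:ss\):\s*([\d:.]+)", text
    )
    elapsed = None
    if wall:
        # Fold h:mm:ss or m:ss into seconds, most significant part first.
        elapsed = 0.0
        for part in wall.group(1).split(":"):
            elapsed = elapsed * 60 + float(part)
    return {
        "exit_code": int(code.group(1)) if code else None,
        "elapsed_seconds": elapsed,
    }
```

One way to use it is summarize_log(open("stdout/12711.out").read()), treating any exit_code other than 0 (including None) as a job that needs a closer look.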
If the command ran successfully, the trained model will be in the model/ directory:
ls
model Outputs_v2.mat readme.txt runtrain.rocce stdout tmp