
switch from data2/ to data/: find {task,scripts,data} -type f \( -iname '*.py' -o -iname '*.sh' -o -iname '*.prototxt' -o -iname '*.txt' \) | xargs -i sed -i 's/\/data2\//\/data\//g' {}

gradient checker: look inside caffe/test/

cmd: ./build/tools/caffe train -solver models/clampdet/solver.prototxt
-weights models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel

err:
I1013 18:51:06.336516 26677 net.cpp:67] Creating Layer data
I1013 18:51:06.336534 26677 net.cpp:356] data -> data
I1013 18:51:06.336555 26677 net.cpp:356] data -> label
I1013 18:51:06.336573 26677 net.cpp:96] Setting up data
I1013 18:51:06.336585 26677 image_data_layer.cpp:30] Opening file data/clampdet/train.txt
I1013 18:51:06.336643 26677 image_data_layer.cpp:45] A total of 0 images.
Segmentation fault

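sol: the {train,val}.txt lists under data/ were missing label numbers; every line must read "<image path> <integer label>", otherwise the layer parses 0 images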
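quick sanity check for such list files before launching training. A minimal sketch, assuming the usual "<image path> <integer label>" line format; `check_list_file` and the sample paths are hypothetical names made up for illustration, not part of Caffe.

```python
def check_list_file(text):
    """Return (line_number, line) pairs that are not '<path> <integer label>'."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        # Split off the last whitespace-separated field, which should be the label.
        parts = line.rsplit(None, 1)
        if len(parts) != 2 or not parts[1].lstrip("-").isdigit():
            bad.append((lineno, line))
    return bad


# Hypothetical sample: the second line has no label.
sample = "img/a.jpg 0\nimg/b.jpg\nimg/c.jpg 1\n"
bad_lines = check_list_file(sample)
print(bad_lines)
```

to use on a real file: check_list_file(open(path).read()); an empty list means every line carries a label.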

dirty protobuf hack:
layer_factory.cpp: case OLD: return IMPOSTOR
loss_layers.hpp: IMPOSTOR::LayerParameter_LayerType() { return OLD }

cleaner protobuf hack: caffe.proto: give new layer smaller ID, swap


during make all:

/usr/bin/ld: cannot find -lcblas
/usr/bin/ld: cannot find -latlas


scp ~/.local/lib

create image mean:

Check failed: proto.SerializeToOstream(&output)

hack solution:

use a sufficiently similar, previously computed image mean


paths specified in make_* do not exist, fix

threshold layer:

Check failed: (*top)[0]->num() == (*top)[1]->num() (0 vs. 50) The data and label should have the same number.


you'd scp -r'ed the data from another graphic machine; symlinks were followed, so the actual images ended up in the data dir. that's not really supposed to be a problem though.

list index out of range

look at log.{train,test} and see if the last line is malformed

python wrappers:

ImportError: No module named _caffe


make pycaffe

leveldb locked:

IO error: lock *_leveldb/LOCK: already held by process

solution 1

rm -rf *leveldb ./create

solution 2

is the data_param { source: } reference in {train,val}.prototxt correct?

inf nan consec Test Score in log:

Test score #32: 0.146749
Test score #33: -0.214647
Test score #34: 0.0004478
Test score #35: -0.312895
[...]
Iteration 1, lr = 9.995e-05
Iteration 1, loss = nan


you have a layer L linking to layer L+1 and L+2 or smth like that
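to find where a run first goes bad without eyeballing the log, something like this can help. A minimal sketch, assuming log lines of the form "Iteration N, loss = X" as in the excerpt above; `first_bad_loss` is a hypothetical helper name.

```python
import math
import re

# Matches lines such as "Iteration 1, loss = nan".
LOSS_RE = re.compile(r"Iteration (\d+), loss = (\S+)")


def first_bad_loss(log_text):
    """Return the first iteration whose loss is nan or inf, or None if all are finite."""
    for match in LOSS_RE.finditer(log_text):
        try:
            value = float(match.group(2))  # float() also parses "nan" and "inf"
        except ValueError:
            continue  # not a number at all, skip this line
        if not math.isfinite(value):
            return int(match.group(1))
    return None


sample_log = (
    "Iteration 0, loss = 0.69\n"
    "Iteration 1, lr = 9.995e-05\n"
    "Iteration 1, loss = nan\n"
)
print(first_bad_loss(sample_log))
```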

adding new layer

need to modify this crazy

"\0011"\351\001\n\024ConvolutionParameter" stuff in build/src/caffe/proto/ ?

ImportError: No module named _caffe

a leading _ usually means a compiled extension module (.so) written in C(++)!

cd /data/add6813/caffe
make pycaffe


ValueError: axes don't match array

when it works:
shape of data blob: (10, 3, 227, 227)
shape of mean file: (3, 256, 256)
but for some reason we want mean to have shape: (3, 227, 227)

when it doesn't:
shape of data blob: (10, 3, 227, 227)
shape of mean file: (1, 3, 256, 256)
but for some reason we want mean to have shape: (3, 227, 227)


mean_f = mean_f[0]
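same fix written defensively, so it works for both mean shapes seen above. A minimal sketch: `prepare_mean` is a hypothetical helper, and the centre crop down to 227 is an assumption about how to get the (3, 227, 227) shape; the wrapper in use may instead resize the mean.

```python
import numpy as np


def prepare_mean(mean, crop=227):
    """Drop a leading singleton axis if present, then centre-crop to (C, crop, crop)."""
    mean = np.asarray(mean)
    if mean.ndim == 4 and mean.shape[0] == 1:
        mean = mean[0]  # (1, C, H, W) -> (C, H, W), same as mean_f = mean_f[0]
    if mean.ndim != 3:
        raise ValueError("expected a (C, H, W) mean, got shape %s" % (mean.shape,))
    _, height, width = mean.shape
    if height < crop or width < crop:
        raise ValueError("mean is smaller than the requested crop")
    top = (height - crop) // 2
    left = (width - crop) // 2
    return mean[:, top:top + crop, left:left + crop]


print(prepare_mean(np.zeros((1, 3, 256, 256))).shape)
print(prepare_mean(np.zeros((3, 256, 256))).shape)
```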

badmin all over the place on clampdet at some point

ok, just realised:

  • clampdet/conv1 bad min (conv2,3 also)
  • clampdet/none_reinit no bad min

what the hell? I thought without US (under-sampling), impossible to get no bad min
so what is the magic trick?
-> H1 enable backprop on conv1?
-> H2 re-initialise fc6?
-> H3 STUPID MISTAKE?
-> conv1 has train fc7_new, val fc7

clampdet/none trains on graphic09
-> no reinit, so if it works, H2 wrong
-> but batchsize 96 might create bad min
-> compare with clampdet/none_reinit as well, just interesting
-> compare with clampdet/tl_wout for Transfer Learning test run

looks like it was stupid mistake.
-> so can revert to studying all of transfer learning without under sampling
-> so stupid mistake made val error nonsensical, and yet we were getting consistent 0.5 pca.
-> this accuracy layer is still confusing; need to understand what's going on (?)

clampdet_os/none_reinit: badmin
-> very confusing. badmin with osampling, not with normal??
-> once again, maybe not badmin, just mistake in oversampling

clampdet/conv1 trained again
-> if works well now, shows stupid mistake last time
-> if so, need to pick more challenging bad min for task 3
-> show that transfer learning helps tackle imbalance


attrib | varname | meaning

prob_.num() | num | batchSize
prob_.count() | |
prob_.cpu_data() | prob_data |

bottom[1] |
bottom[1]->count() |

labels_ | labels_.count() |


the main functions from which net is trained:
":Solve(" in src/caffe/solver.cpp
":Forward(" in src/caffe/net.cpp
":Backward(" in src/caffe/net.cpp
":Backward(const" in src/caffe/layer.hpp
":ComputeUpdateValue(" in src/caffe/solver.cpp
":Update(" in src/caffe/
":Update(" in src/caffe/blob.cpp (crux)
"void caffe_cpu_axpby(" in src/caffe/util/math_functions.cpp


params_[0] dimensions: num: 96 channels: 3 height: 11 width: 11 count: 34848

params_[1] dimensions: num: 1 channels: 1 height: 1 width: 96 count: 96


params_[2] dimensions: num: 256 channels: 48 height: 5 width: 5 count: 307200

params_[3] dimensions: num: 1 channels: 1 height: 1 width: 256 count: 256


params_[4] dimensions: num: 384 channels: 256 height: 3 width: 3 count: 884736

params_[5] dimensions: num: 1 channels: 1 height: 1 width: 384 count: 384


params_[6] dimensions: num: 384 channels: 192 height: 3 width: 3 count: 663552

params_[7] dimensions: num: 1 channels: 1 height: 1 width: 384 count: 384


params_[8] dimensions: num: 256 channels: 192 height: 3 width: 3 count: 442368

params_[9] dimensions: num: 1 channels: 1 height: 1 width: 256 count: 256


params_[10] dimensions: num: 1 channels: 1 height: 4096 width: 9216 count: 37748736

params_[11] dimensions: num: 1 channels: 1 height: 1 width: 4096 count: 4096


params_[12] dimensions: num: 1 channels: 1 height: 4096 width: 4096 count: 16777216

params_[13] dimensions: num: 1 channels: 1 height: 1 width: 4096 count: 4096


softmax weights!

params_[14] dimensions: num: 1 channels: 1 height: 2 (one for each softmax neuron) width: 4096 (one for each neuron below) count: 8192

params_[15] dimensions: num: 1 channels: 1 height: 1 width: 2 count: 2
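sanity check on the dump above: count should always equal num * channels * height * width. A minimal sketch that re-derives every count from the dimensions listed; the shapes and counts are copied from the notes, the dict names are made up. (The 48 and 192 channel values are consistent with grouped convolutions, where each filter only sees half of the input channels.)

```python
from math import prod

# (num, channels, height, width) per params_ index, copied from the dump above.
shapes = {
    0: (96, 3, 11, 11),    1: (1, 1, 1, 96),
    2: (256, 48, 5, 5),    3: (1, 1, 1, 256),
    4: (384, 256, 3, 3),   5: (1, 1, 1, 384),
    6: (384, 192, 3, 3),   7: (1, 1, 1, 384),
    8: (256, 192, 3, 3),   9: (1, 1, 1, 256),
    10: (1, 1, 4096, 9216), 11: (1, 1, 1, 4096),
    12: (1, 1, 4096, 4096), 13: (1, 1, 1, 4096),
    14: (1, 1, 2, 4096),    15: (1, 1, 1, 2),
}

# Counts as printed in the dump.
logged_counts = {
    0: 34848, 1: 96, 2: 307200, 3: 256, 4: 884736, 5: 384,
    6: 663552, 7: 384, 8: 442368, 9: 256, 10: 37748736, 11: 4096,
    12: 16777216, 13: 4096, 14: 8192, 15: 2,
}

computed = {idx: prod(shape) for idx, shape in shapes.items()}
mismatches = [idx for idx in shapes if computed[idx] != logged_counts[idx]]
print("mismatching blobs:", mismatches)
print("total parameters:", sum(computed.values()))
```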


solved. you were wrong, fwd pass not ok. move to step 2

debug SBL:

  • fwd pass: OK

  • bwd pass: OK

  • update: PROB
    -> solver.cpp l.250:
       caffe_cpu_axpby(net_params[param_id]->count(), local_rate,
                       net_params[param_id]->cpu_diff(), momentum,
                       history_[param_id]->mutable_cpu_data());

    -> cpu_diff() might be where PROB is
    -> only for param_id = {14,15} do we have nonzero diff, why??
    -> because backprop accidentally active on fc8 only
    -> woah! exploding/vanishing cpu_diff() with SBL
    -> which stage outscales the cpu_diff()? none!
       what happens to cpu_diff() b4/after bwd pass? add couts in net.cpp l.269
    -> net.hpp::ForwardBackward
         calls net::Backward
           calls layer::Backward
             calls specific_layer::Backward_cpu
       solver::ComputeUpdateValue
       solver::net_->Update()
    -> compared w/ benchmark throughout an iteration, similar values (also outscale for benchmark)

    -> actual parameter values might be where PROB is
    -> solver::net_->Update() calls net::Update calls blob::Update
    -> compare logs
    -> PROB1: layer[10] is 0 for sbl only
    -> should be ..?
    -> loss just after ForwardBackward: already fucked up
    -> cpu_diff just after ForwardBackward:
                 sbl max   benchmark max
       net_15    e+34      e+31
       net_14    0.92      0.99
       net_13    0.015     0.009
       net_12    0 !       0.013   (but no neg values!)
       net_11    0.0055    0.0027
    -> cpu_diff just after ComputeUpdateValue(): fine
    -> current params, diff, new params: fine
    -> cpu_diff just after Update(): fine

    -> so the fucked up stuff occurs inside ForwardBackward()
    -> occurs inside net.hpp::Forward() or before
    ->
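for reference, what that axpby line computes. caffe_cpu_axpby(N, alpha, X, beta, Y) does Y = alpha*X + beta*Y, so here history = local_rate * diff + momentum * history. A minimal numpy sketch, assuming the history is then subtracted from the parameters in the update step; `axpby` and `sgd_momentum_step` are illustrative names, not Caffe functions, and the numbers are made up.

```python
import numpy as np


def axpby(alpha, x, beta, y):
    """In-place y <- alpha * x + beta * y."""
    y *= beta
    y += alpha * x
    return y


def sgd_momentum_step(param, diff, history, local_rate, momentum):
    """One update: history <- local_rate*diff + momentum*history, then param <- param - history."""
    axpby(local_rate, diff, momentum, history)
    param -= history
    return param, history


# Made-up values, only to trace the arithmetic by hand.
param = np.array([1.0, -2.0])
diff = np.array([0.5, 0.5])
history = np.zeros_like(param)

sgd_momentum_step(param, diff, history, local_rate=0.1, momentum=0.9)
print("history:", history)
print("param:  ", param)
```

printing max(abs(diff)) and max(abs(history)) around this step for each param_id is one way to see which blob blows up first.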


issue with the update. after 1 iteration, next loss is 14 for SBL, 73 for SL. bottom_diff takes on rubbish values.


  • bwd pass is wrong
  • weight update is wrong

-> after 1 iteration, net only outputs 1s or 0s! -> so z = <x,w> + b can easily = 0 ? how?
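one generic way a softmax ends up printing exact 1s and 0s (not necessarily what happened here): once the weights blow up, the logits are far apart and the smaller exponential underflows in float32. A minimal sketch with made-up logits; `softmax` is written out by hand rather than taken from any library.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array, computed in float32."""
    z = np.asarray(z, dtype=np.float32)
    e = np.exp(z - z.max())  # subtract the max so the largest exponent is 0
    return e / e.sum()


small = softmax([0.3, -0.2])      # logits close together: soft probabilities
large = softmax([120.0, -120.0])  # logits far apart: saturates
print(small)
print(large)
```

with saturated outputs, log(0) on the wrong class gives inf in a cross-entropy loss, which is one route to inf/nan in the log.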

-> ok, cost function seems to be working now. no nans or infs, and training error gets minimised.


Why is accuracy so weird?
-> find out whether same net loaded in by printing out param values
-> have identical train and val files with just 128 cases, compare output probs

Examine outputs
-> is this harsh error preventing the net from learning anything?
-> ie all outputs are around 0.5, it's very confused
-> find out by comparing benchmarks
-> ground sheet outputs
-> min class is 1, so other way around
-> uh oh, forgot to get SL to print them out
-> 23-08-2014 has them, 22-08-2014 doesnt
-> 22-08-2014 is from old build, you can compare train time series with 23* to make sure new build isn't doing anything different or wrong
-> scrape zones outputs
-> min class is 0
-> if so, make it less harsh?
-> less extreme renormalisation
-> only penalise if output <=0.5 ie introduce kink in cost function
-> formula?
-> implement under-sampling like the paper says

Test if correctly implemented:
-> graphic06: on a dataset of 6 images, perfectly balanced, batchsize 6, train and val sets the same
-> prob outputs not same for val as for train
-> calling bottom[0] in SBL, prob_ in PCA: is one of them wrong? maybe SBL is wrong, hence bad results below?
-> loss same for sbl and sl at iter_1, but not afterwards
-> bottom_diff not same for sbl as for sl
-> CAREFUL! after debugging, get back data/ground_sheet/temp

Read the threshold paper properly!

IDENTIFIED PROBS & SOLS:
-> what if prior is (1,0)
-> must implement under- and over-sampling as well
   shit that will be hard
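rough idea of what under-sampling on the list files could look like. A minimal sketch of plain random under-sampling, which may differ from what the paper prescribes; `under_sample`, the 0.5 target fraction and the file names are assumptions for illustration.

```python
import random


def under_sample(items, min_label, target_min_fraction, seed=0):
    """Keep every minority item and a random subset of the majority.

    items: list of (path, label) pairs.
    target_min_fraction: desired share of minority items in the result (0 < f < 1).
    """
    if not 0.0 < target_min_fraction < 1.0:
        raise ValueError("target_min_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    minority = [item for item in items if item[1] == min_label]
    majority = [item for item in items if item[1] != min_label]
    # Largest majority size that still gives the requested minority share.
    max_majority = int(len(minority) * (1.0 - target_min_fraction) / target_min_fraction)
    kept = rng.sample(majority, min(len(majority), max_majority))
    result = minority + kept
    rng.shuffle(result)
    return result


# Hypothetical 98:2 imbalance, balanced down to 50:50.
items = [("maj%d.jpg" % i, 0) for i in range(98)] + [("min0.jpg", 1), ("min1.jpg", 1)]
balanced = under_sample(items, min_label=1, target_min_fraction=0.5)
print(len(balanced), sum(1 for _, label in balanced if label == 1))
```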


Fuck it, that is too hard. And it might not even work. Threshold works, is easier to implement, and probably has more powerful results.

currently:
-> graphic07 writing python wrappers for running the net
-> idea is to get the prediction probs, and assign flags based on them & threshold
-> debugging
-> done
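the thresholding step itself is tiny once the probs are out of the net. A minimal sketch, assuming the wrapper returns an (N, 2) array of class probabilities and that column 1 is the class to flag; `flag_from_probs` and the example numbers are hypothetical.

```python
import numpy as np


def flag_from_probs(probs, threshold=0.5, positive_class=1):
    """Return a boolean flag per image: True where P(positive_class) >= threshold."""
    probs = np.asarray(probs, dtype=float)
    return probs[:, positive_class] >= threshold


# Made-up probabilities for three images.
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
flags_default = flag_from_probs(probs)
flags_lenient = flag_from_probs(probs, threshold=0.3)
print(flags_default)
print(flags_lenient)
```

lowering the threshold flags more images as the positive class, which is the knob for trading precision against recall on the minority class at test time.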


Need to:

  • train nets -> use optimal backprop freeze

  • use Redbox data
    -> script to use Redbox data from a certain date
    -> graphic07
    -> try multiple threshold dates
    -> use best performing network so far
    -> clampdet 94%, what arch was that?
    -> clampdet 0.2
    -> no_thresh-fine 0.12
    -> thresh 0.12
    -> thresh_freeze_backprop5/13 0.7
    -> thresh_freeze_backprop5/14 0.7
    -> thresh_freeze_backprop5.5/11 0.15
    -> thresh_freeze_backprop5.5/12 0.15
    -> thresh_freeze_backprop5.5/13 0.39
    -> thresh_freeze_backprop5.5/14 0.4
    -> thresh_freeze_backprop5.5/15 0.18
    -> thresh_freeze_backprop6/11 0.17
    -> thresh_freeze_backprop6/13 0.4
    -> thresh_freeze_backprop7/11 0.17

    ok seems perf driven by:

    • expressiveness
    • whether lr_policy fucked up

    still space for optimising both

    -> better than optimal backprop freeze?

  • write up threshold

  • write up sbl


screw the Redbox data. focus on running experiments from below.

-> ReLU maths -> neat writeup

-> Early stopping maths -> draft -> neat writeup

-> Generic clamp -> restructure -> neat writeup

-> Transfer learning -> freezing backprop -> initialising weights

-> Class imbalance -> under-sampling -> in-net threshold -> SBL -> test-time threshold

-> Final Results



  • Generic Clamp: -> mis-labelling, how to show?

  • Transfer Learning
    -> test run
       -> with: DONE
       -> w/out: DONE
    -> clampdet, freeze backprop on:
       -> none: clampdet/none DONE
       -> conv1: clampdet/ DONE
       -> conv2: clampdet/ DONE
       -> conv3: clampdet/ DONE
       -> conv4: clampdet/ TODO
       -> conv5: clampdet/ TODO
       -> fc6: clampdet/ TODO?
       -> fc7: clampdet/ DONE
    -> weight initialisation
       -> reinit: clampdet/none_reinit DONE
       -> ¬reinit: clampdet/none DONE
    -> parametric vs non parametric
       -> linear SVM: clampdet/linSVM TODO
       -> best net from above: clampdet/none?

Class Imbalance:
-> Test Run without TL
   clampdet/tl_wout DONE
   clampdetCI98/tl_wout DONE
-> Transfer Learning
   clampdetCI98/tl_wout DONE
   clampdetCI98/none_reinit_bs128_lr4 DONE
   clampdetCI98/none_bs128_lr4 DONE
-> Batch Size
   clampdetCI98/none_bs128_lr4 DONE
   clampdetCI98/none_bs256_lr4 DONE
-> Learning Rate
   clampdetCI98/none_bs256_lr4 DONE
   clampdetCI98/none_bs256_lr5 TRAINING
-> Examine Impact with run_classifier
   clampdet/none DONE
   clampdetCI(97)/none(_bs256_lr5) DONE
   clampdetCI98/none(_bs256_lr5) DONE
-> SBL
   clampdetCI98/none_bs256_lr4 DONE
   clampdetCI98/none_bs256_lr4_sbl TRAINING
   clampdetCI98/none_bs256_lr5_sbl RETRAINING
   clampdetCI98/conv5_bs256_lr5_sbl TRAINING
   clampdetCI98/fc7_bs256_lr5_sbl DONE

Conserving Spatial Information:
-> Test Run
   soil_contam/tl_wout
   soil_contam/noneC_lr5_sbl TRAINING
   note bs128 saturates gpu mem
-> Remove pooling and an fc layer
   soil_contam/nopool_sl TRAINING
-> SBL
   soil_contam/nopool TRAINING

Final Results:
-> transfer top notch clampdet net instead?
   clampdet/none_best DONE
   soil_contam/noneC DONE
   hatch_markings/noneC DONE
   hatch_markings/none TODO
-> pooling loses spatial information
   soil_contam/poolless TODO

-> what is the best arch?
-> do NOT reinit (not enough data, at least not with UnderSampling)
-> clampdet
-> ground sheet
-> hatch markings
-> insertion depth markings
-> scrape zones
-> joint misaligned
-> contamination
-> fitting proximity
-> scraping peeling


TRAIN OLD Class Imbalance:
-> Examine Impact:

-> Test Run
   clampdetCI/BULLSHIT DONE
   clampdetCI/none TRAINING
-> Under-Sampling
   clampdetCI/tl_wout TODO
   clampdetCI_us/tl_wout TODO
   clampdet/none TRAINING
   clampdet_us/none DONE
-> Transfer Learning
   clampdet/tl_wout DONE
   clampdet/none TRAINING
   clampdetCI_us0.5/none (*) TODO
   clampdetCI_usAbove/none (a) TODO - need (*)
   clampdetCI_usBelow/none (b) TODO - need (*)
   --- if fail: {freezeBest}
   (clampdetCI_usAbove/fc{6or7}?) TODO? - dep (a,b)
   (clampdetCI_usBelow/fc{6or7}?) TODO? - dep (a,b)
-> Bayesian Cross Entropy
   clampdetCI{best_from_above}/sbl TODO - dep trans l
-> Over-Sampling
   clampdetCI_os/none TODO
-> Test time Threshold
   clampdetCI{best_from_above}/thresh at target_min



What do I still need to write (from scratch)?

  • Background:
    -> why neural nets so good? because they generalise so well
       why do we care about generalising well? because of curse of dimensionality
       how does neural net generalise so well? with distributed representation
       ie hierarchical representation
       ie compositionality of parameters
       ie exponential compactness

    -> grad descent polynomial approximation

    -> AlexNet in detail: stoch pooling paper not Rob Fergus tutorial! too long

  • Justify independent binary classifiers

  • SBL


ANALYSE:
-> on them
-> plots
-> write up:
-> comments
-> plot
-> table from run_classifier

DONE when final plots & rough comments present TODO otherwise

Transfer Learning:
-> Test Run TODO
-> Freeze Backprop TODO
   clampdet/conv1
   clampdet/conv2
   clampdet/conv3
   clampdet/conv4
   clampdet/conv5
   clampdet/fc6
   clampdet/fc7
-> Reinit Weights TODO
   clampdet/none_reinit
   clampdet/none
-> Parametric vs Non-Parametric TODO
   clampdet/linSVM
   clampdet/none ? (best so far)

Class Imbalance:
-> Test Run without TL
   clampdet/tl_wout
   clampdetCI98/tl_wout
-> Transfer Learning
   clampdetCI98/tl_wout
   clampdetCI98/none_reinit_bs128_lr4
   clampdetCI98/none_bs128_lr4
-> Batch Size
   clampdetCI98/none_bs128_lr4
   clampdetCI98/none_bs256_lr4
-> Learning Rate
   clampdetCI98/none_bs256_lr4
   clampdetCI98/none(_bs256_lr5)
-> Examine Impact with run_classifier
   clampdet/none
   clampdetCI(97)/none(_bs256_lr5)
   clampdetCI98/none(_bs256_lr5)

-> SBL
   talk about choosing appropriate backprop
   ground_sheet_sbl/logs/pcba/GettingThere has trace

-> Under-Sampling
-> Transfer Learning
-> Bayesian Cross Entropy
-> Over-Sampling
-> Test time Threshold
   clampdetCI/tl_wout - benchmark
   clampdetCI{best_from_above}/thresh at target_min

Final Results:
-> transfer top notch clampdet instead?
   clampdet_fine_train_iter_{best}
-> transfer top notch clampdet net instead?
   clampdet/none_best TODO
   soil_contam/noneC TODO
   hatch_markings/noneC TODO
   hatch_markings/none TODO
-> pooling loses spatial information
   soil_contam/poolless TODO


ANALYSE OLD class imbalance

-> Test Run / batchSize TODO
   clampdetCI/BULLSHIT
   clampdetCI/none
-> Under-Sampling
   clampdetCI/tl_wout
   --- plot_clampdet_none
-> Transfer Learning
   clampdetCI/tl_wout
   --- clampdetCI_us0.5/none
   --- : us{Best}
   clampdetCI/tl_wout - benchmark
   --- if fail: {freezeBest} (clampdetCI_usAbove/fc{6or7}?)
-> Bayesian Cross Entropy
   clampdetCI/tl_wout - benchmark
   clampdetCI{best_from_above}
   clampdetCI{best_from_above}/sbl
-> Over-Sampling
   with clampdet you didnt try no reinit, do so now:
   clampdetCI/tl_wout - benchmark
   clampdetCI{best_from_above}
   clampdetCI_os/none
   --- if clampdetCI_os/none better than clampdetCI_us0.5/none: clampdetCI_os/{freezeBest} ADD TO TODO LIST!
-> Test time Threshold
   clampdetCI/tl_wout - benchmark
   clampdetCI{best_from_above}/thresh at target_min
