
Guideline for training a model from scratch #389

Closed
dagap opened this issue Jul 24, 2019 · 6 comments

@dagap commented Jul 24, 2019

One thing that I am puzzled by is how to train a single-class model from scratch (without using any pretrained weights).

I have curated a dataset in the required format, but when I run train.py the first thing it does is attempt to download the darknet weights. I am not sure how I can actually train the model from scratch, i.e. initialized with random weights instead of pretrained weights.

@glenn-jocher (Member)

@dagap the line that does this is line 132, shown below. You can simply comment out lines 131 and 132 to prevent a backbone from loading.

yolov3/train.py, lines 128 to 133 at commit 5a34d3c:

    else:  # Initialize model with backbone (optional)
        if '-tiny.cfg' in cfg:
            cutoff = load_darknet_weights(model, weights + 'yolov3-tiny.conv.15')
        else:
            cutoff = load_darknet_weights(model, weights + 'darknet53.conv.74')
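A minimal sketch of that edit (assuming this commit's layout; here the whole backbone block is commented out so neither weights file is loaded and the model keeps its random initialization):

    else:  # Initialize model with backbone (optional)
        # Commented out to train from scratch (random weights, no pretrained backbone):
        # if '-tiny.cfg' in cfg:
        #     cutoff = load_darknet_weights(model, weights + 'yolov3-tiny.conv.15')
        # else:
        #     cutoff = load_darknet_weights(model, weights + 'darknet53.conv.74')
        pass  # nothing to load; weights stay as initialized by the model constructor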

@dagap (Author) commented Jul 24, 2019

@glenn-jocher Thank you for such a quick reply. That is exactly what I just tried, but then I get this error:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f28e64de441 in /home/pd/anaconda3/envs/t1/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f28e64ddd7a in /home/pd/anaconda3/envs/t1/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ec (0x7f28e7004abc in /home/pd/anaconda3/envs/t1/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x6c753d (0x7f28e6ffa53d in /home/pd/anaconda3/envs/t1/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130fac (0x7f28e6a63fac in /home/pd/anaconda3/envs/t1/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #30: __libc_start_main + 0xf0 (0x7f28eae6f830 in /lib/x86_64-linux-gnu/libc.so.6)

This happens after 1 epoch. I am not sure if these two issues are related; I assume not, but I was not sure.
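For reference, the error message itself suggests enabling unused-parameter detection; a minimal sketch of that change (the exact place where train.py wraps the model in DistributedDataParallel may differ by commit) would be:

    # Hypothetical sketch: pass find_unused_parameters=True when wrapping the model.
    model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)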

So, I have the following data file:

classes=1
train=./data/custom/train.txt
valid=./data/custom/valid.txt
names=./data/custom/classes.names
backup=backup
eval=coco

I am not sure what to do about the last two parameters, or whether they are responsible for this error. I am also using cfg/yolov3-1cls.cfg as the configuration.

My command line is:

python train.py --data data/custom/mine.data --cfg cfg/yolov3-1cls.cfg

@dagap (Author) commented Jul 24, 2019

OK, this got fixed by making the validation and training set sizes even, which is very strange.

@glenn-jocher (Member)

This is odd indeed. The coco.data training has dataset sizes of 117263 and 5000, which are odd and even respectively, and both work.

@dagap (Author) commented Jul 24, 2019

Yeah, I thought so as well. I will have a bit of a look into this and report if I find anything.

@glenn-jocher (Member)

@dagap ok sounds good!
