
Guideline for training a model from scratch #389

Closed
dagap opened this issue Jul 24, 2019 · 6 comments

@dagap commented Jul 24, 2019

One thing that I am puzzled by is how to train a single-class model from scratch (without using any pretrained weights).

I have curated a dataset in the required format, but when I run train.py the first thing it does is attempt to download the darknet weights. I am not sure how I can actually train the model from scratch, i.e. initialized with random weights instead of pretrained weights.

@glenn-jocher (Member)

@dagap the line that does this is line 132, shown below. You can simply comment out lines 131 and 132 to prevent a backbone from loading.

yolov3/train.py, lines 128 to 133 at commit 5a34d3c:

    else:  # Initialize model with backbone (optional)
        if '-tiny.cfg' in cfg:
            cutoff = load_darknet_weights(model, weights + 'yolov3-tiny.conv.15')
        else:
            cutoff = load_darknet_weights(model, weights + 'darknet53.conv.74')
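A minimal sketch of that edit (assuming this commit's layout; here the whole backbone block is commented out so neither weights file is loaded and the model keeps its random initialization):

    else:  # Initialize model with backbone (optional)
        # Commented out to train from scratch (random weights, no pretrained backbone):
        # if '-tiny.cfg' in cfg:
        #     cutoff = load_darknet_weights(model, weights + 'yolov3-tiny.conv.15')
        # else:
        #     cutoff = load_darknet_weights(model, weights + 'darknet53.conv.74')
        pass  # nothing to load; weights stay as initialized by the model constructor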

@dagap (Author) commented Jul 24, 2019

@glenn-jocher Thank you for such a quick reply. That is exactly what I just tried, but then I get this error:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f28e64de441 in /home/pd/anaconda3/envs/t1/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f28e64ddd7a in /home/pd/anaconda3/envs/t1/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ec (0x7f28e7004abc in /home/pd/anaconda3/envs/t1/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x6c753d (0x7f28e6ffa53d in /home/pd/anaconda3/envs/t1/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130fac (0x7f28e6a63fac in /home/pd/anaconda3/envs/t1/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #30: __libc_start_main + 0xf0 (0x7f28eae6f830 in /lib/x86_64-linux-gnu/libc.so.6)

This happens after 1 epoch. I am not sure if these two issues are related; I assume not, but I was not sure.
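For reference, the error message itself suggests enabling unused-parameter detection; a minimal sketch of that change (the exact place where train.py wraps the model in DistributedDataParallel may differ by commit) would be:

    # Hypothetical sketch: pass find_unused_parameters=True when wrapping the model.
    model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)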

So, I have the following data file:

classes=1
train=./data/custom/train.txt
valid=./data/custom/valid.txt
names=./data/custom/classes.names
backup=backup
eval=coco

I am not sure what to do about the last two parameters, or whether they are responsible for this error. I am also using cfg/yolov3-1cls.cfg as the configuration.

My command line is:

python train.py --data data/custom/mine.data --cfg cfg/yolov3-1cls.cfg

@dagap (Author) commented Jul 24, 2019

OK, this got fixed by making the validation and training set sizes even, which is very strange.

@glenn-jocher (Member)

This is odd indeed. The coco.data training has dataset sizes of 117263 and 5000, which are odd and even respectively, and both work.

@dagap (Author) commented Jul 24, 2019

Yeah, I thought so as well. I will have a bit of a look into this and report if I find anything.

@glenn-jocher (Member)

@dagap ok sounds good!
