Custom model training fails, need to downgrade torch (and setuptools) #15

glemoine62 · 2022-11-02T08:23:56Z

Hi,

I am using the deepquestai/deepstack:gpu-2022.01.1 container for custom training. It ships with torch built for CUDA 11.3, but train.py fails during model initialization (see error below). The failure goes away when I downgrade to torch built for CUDA 11.0 (pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html, as per the Colab notebook).

docker run --gpus all -it --rm -v /home/eouser/deepstack:/deepstack/code -w /deepstack/code/deepstack-trainer deepquestai/deepstack_updated:gpu python3 train.py --dataset-path /deepstack/code/data
Traceback (most recent call last):
  File "train.py", line 530, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 90, in train
    model = Model(opt.cfg or ckpt['model'].yaml, ch=3, nc=nc).to(device) # create
  File "/deepstack/code/deepstack-trainer/models/yolo.py", line 96, in __init__
    self._initialize_biases() # only run once
  File "/deepstack/code/deepstack-trainer/models/yolo.py", line 151, in _initialize_biases
    b[:, 4] += math.log(8 / (640 / s) ** 2) # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.
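
The RuntimeError is raised by the in-place `+=` on a view of the bias parameter. For anyone who would rather patch the source than downgrade torch, one possible workaround is to perform that update under `torch.no_grad()`. The sketch below is only an illustration under assumptions: `init_obj_bias`, the layer shape, and the stride value are hypothetical, it is not the trainer's actual code, and it has not been tested inside this container.

```python
import math

import torch
import torch.nn as nn


def init_obj_bias(conv: nn.Conv2d, na: int, stride: float) -> None:
    """Add the objectness prior to the bias without tracking the update in autograd.

    Hypothetical helper: mirrors the failing line in models/yolo.py,
    but wraps the in-place update in torch.no_grad().
    """
    with torch.no_grad():
        b = conv.bias.view(na, -1)  # view of the leaf parameter, shape (na, 5 + nc)
        b[:, 4] += math.log(8 / (640 / stride) ** 2)  # obj (8 objects per 640 image)


# Hypothetical detection layer: 3 anchors, 5 box/obj outputs + 2 classes each.
conv = nn.Conv2d(16, 3 * 7, kernel_size=1)
before = conv.bias.detach().clone()
init_obj_bias(conv, na=3, stride=8.0)
```

Because the view shares storage with `conv.bias`, the parameter itself is updated and still has `requires_grad=True` afterwards, so training can proceed as usual.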

By the way, I first have to downgrade setuptools inside the container, because otherwise the tensorboard import fails with:

Traceback (most recent call last):
  File "train.py", line 21, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
    LooseVersion = distutils.version.LooseVersion
AttributeError: module 'setuptools._distutils' has no attribute 'version'

(resolved with: pip install setuptools==59.5.0)
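
A small pre-flight check can flag this before train.py is started. The sketch below is an assumption-laden illustration: it treats 59.5.0 (the pin reported above) as the last known-good setuptools version and flags anything newer, which may be stricter than necessary. The names `version_tuple` and `setuptools_needs_pin` are hypothetical.

```python
import re

LAST_KNOWN_GOOD = "59.5.0"  # the pin reported to work in this issue


def version_tuple(text: str) -> tuple:
    """Return the leading numeric components, e.g. '65.5.0' -> (65, 5, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def setuptools_needs_pin(installed: str, last_known_good: str = LAST_KNOWN_GOOD) -> bool:
    """True when the installed version is newer than the last known-good pin."""
    return version_tuple(installed) > version_tuple(last_known_good)


print(setuptools_needs_pin("65.5.0"))  # True
print(setuptools_needs_pin("59.5.0"))  # False
```

Inside the container, the installed version could be passed in as `setuptools.__version__`.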

I am now happily training with the revised setup, so nothing too urgent, but maybe worth checking out.

Thx for this wonderful framework!

Guido
