Adaptive Learning Rate got reset when I restart the training? #2866
Comments
@GiorgioSgl Resuming an interrupted training run is simple. There are two options:

python train.py --resume                    # resume latest training
python train.py --resume path/to/last.pt    # specify resume checkpoint

If you started the training with a multi-GPU command then you must resume it with the same exact configuration (and vice versa). Multi-GPU resume commands are here, assuming you are using 8 GPUs:

python -m torch.distributed.launch --nproc_per_node 8 train.py --resume                    # resume latest training
python -m torch.distributed.launch --nproc_per_node 8 train.py --resume path/to/last.pt    # specify resume checkpoint

Note that you may not change any settings when resuming; you must resume with the same exact settings.
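To illustrate why the learning rate is not reset on resume: schedules of this kind compute the LR as a function of the epoch counter, so restoring the last epoch from the checkpoint puts the schedule exactly where it left off. Below is a minimal, self-contained sketch of a cosine one-cycle-style schedule; the function name, default values, and checkpoint contents are illustrative assumptions, not YOLOv5's actual code.

```python
import math

def one_cycle_lr(epoch, epochs=300, lr0=0.01, lrf=0.2):
    # Cosine decay from lr0 toward lr0 * lrf over `epochs` epochs,
    # similar in spirit to a one-cycle schedule (values are illustrative).
    return lr0 * (((1 - math.cos(epoch * math.pi / epochs)) / 2) * (lrf - 1) + 1)

# Fresh run: the epoch counter starts at 0, so the LR starts at lr0.
print(one_cycle_lr(0))            # 0.01

# Resumed run: the checkpoint stores the last completed epoch (hypothetical
# dict here), so the schedule continues from that point instead of resetting.
ckpt = {"epoch": 149}
start_epoch = ckpt["epoch"] + 1
resumed_lr = one_cycle_lr(start_epoch)
assert resumed_lr < one_cycle_lr(0)  # LR has already decayed partway
```

The key design point is that the LR depends only on the stored epoch, so as long as the checkpoint carries the epoch counter, resuming reproduces the same LR trajectory a single uninterrupted run would have followed.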
@glenn-jocher "github: up to date with https://github.com/ultralytics/yolov5 ✅" The result is no different with python train.py --resume or python train.py --resume path/to/last.pt. Thanks.
@suyong2 thanks for the message! I think this may be related to a recent change in save_dir handling in train.py. We switched to Path() directories for this variable, which may not be playing well with yaml save and resume; that would be my fault if true. I'll check it out.
@suyong2 good news 😃! Your original issue has now been fixed ✅ in PR #2876. To receive this update you can pull the latest code from the repository.
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
@glenn-jocher Fortunately, it works well with the updated YOLOv5 source.
❔Question
Hi,
My question is simple: I'm training YOLOv5 on Google Colab, and Colab generally kills my session after about 5 hours of GPU use. So I can't complete the training in one shot; I have to restart it repeatedly from the last weight checkpoint (last.pt).
As you know, the learning rate changes over epochs, and I want to ask whether it gets reset when I restart the training or not.
Thank you in advance,
Giorgio
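For intuition on how resuming preserves the LR schedule: PyTorch-style schedulers expose state_dict()/load_state_dict(), and saving that state in the checkpoint lets a restarted run continue from the same point. Below is a minimal pure-Python mimic of that interface (the class, its parameters, and the values are illustrative assumptions, not YOLOv5's actual implementation).

```python
class StepDecay:
    """Toy LR scheduler mimicking the PyTorch scheduler interface."""

    def __init__(self, base_lr=0.01, gamma=0.5, step_size=10):
        self.base_lr, self.gamma, self.step_size = base_lr, gamma, step_size
        self.last_epoch = 0

    def step(self):
        # Called once per epoch to advance the schedule.
        self.last_epoch += 1

    def get_lr(self):
        # Decay by `gamma` every `step_size` epochs.
        return self.base_lr * self.gamma ** (self.last_epoch // self.step_size)

    def state_dict(self):
        return {"last_epoch": self.last_epoch}

    def load_state_dict(self, state):
        self.last_epoch = state["last_epoch"]

# Train for 25 "epochs", then checkpoint the scheduler state.
sched = StepDecay()
for _ in range(25):
    sched.step()
ckpt = sched.state_dict()

# Simulate a Colab disconnect: a brand-new scheduler resets the LR...
fresh = StepDecay()
assert fresh.get_lr() == 0.01

# ...but restoring the saved state resumes the decayed LR.
fresh.load_state_dict(ckpt)
assert fresh.get_lr() == 0.01 * 0.5 ** 2  # 25 // 10 == 2 decay steps
```

So the practical answer for interrupted Colab sessions is that the LR does not reset, provided you resume from a checkpoint that carries the scheduler/epoch state rather than starting a new training run on last.pt as initial weights.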