
Adaptive Learning Rate got reset when I restart the training? #2866

Closed
GiorgioSgl opened this issue Apr 20, 2021 · 5 comments · Fixed by #2876
Labels
question Further information is requested

Comments

@GiorgioSgl

❔ Question

Hi,

My question is simple. I'm training YOLOv5 on Google Colab, and Colab generally kills my session after about 5 hours of GPU use, so I can't complete training in one shot; I have to restart it repeatedly from the last weights checkpoint (last.pt).

As you know, the learning rate changes over epochs, and I want to ask whether it gets reset when I restart the training or not.

Thank you in advance,
Giorgio

Additional context

@GiorgioSgl GiorgioSgl added the question Further information is requested label Apr 20, 2021
@glenn-jocher
Member

glenn-jocher commented Apr 20, 2021

@GiorgioSgl --resume picks up the training seamlessly from where you left off and continues the previous LR scheduler and warmup as well. You can observe your learning rate over epochs in W&B to verify (see also the checkpoint-inspection sketch at the end of this comment):

[Screenshot: W&B plot of learning rate over epochs]

Resuming Training

Resuming an interrupted training run is simple. There are two options:

python train.py --resume  # resume latest training
python train.py --resume path/to/last.pt  # specify resume checkpoint

If you started training with a multi-GPU command, then you must resume it with the exact same configuration (and vice versa). Multi-GPU resume commands are shown below, assuming you are using 8 GPUs:

python -m torch.distributed.launch --nproc_per_node 8 train.py --resume  # resume latest training
python -m torch.distributed.launch --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint

Note that you may not change any settings when resuming; you must resume with the exact same settings (--epochs, --batch, --data, etc.).
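
If you want to confirm this directly from the saved weights rather than in W&B, here is a minimal sketch; the checkpoint path and the 'epoch'/'optimizer' keys are assumptions about the checkpoint layout, not a documented interface:

# Minimal sketch: inspect last.pt before resuming (run from within the yolov5/
# directory so the pickled model classes resolve). Key names are assumptions.
import torch

ckpt = torch.load("runs/train/exp/weights/last.pt", map_location="cpu")
print("last completed epoch:", ckpt.get("epoch"))

# If optimizer state was saved, the per-parameter-group learning rates show
# that the LR schedule continues from where it stopped rather than resetting.
if ckpt.get("optimizer") is not None:
    for group in ckpt["optimizer"]["param_groups"]:
        print("saved lr:", group["lr"])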

@suyong2

suyong2 commented Apr 21, 2021

@glenn-jocher
Hi.
When I resume training, I get an error.

"github: up to date with https://github.com/ultralytics/yolov5
Traceback (most recent call last):
File "train.py", line 509, in
opt = argparse.Namespace(**yaml.load(f, Loader=yaml.SafeLoader)) # replace
File "/usr/local/lib/python3.7/dist-packages/yaml/init.py", line 114, in load
return loader.get_single_data()
File "/usr/local/lib/python3.7/dist-packages/yaml/constructor.py", line 51, in get_single_data
return self.construct_document(node)
File "/usr/local/lib/python3.7/dist-packages/yaml/constructor.py", line 60, in construct_document
for dummy in generator:
File "/usr/local/lib/python3.7/dist-packages/yaml/constructor.py", line 413, in construct_yaml_map
value = self.construct_mapping(node)
File "/usr/local/lib/python3.7/dist-packages/yaml/constructor.py", line 218, in construct_mapping
return super().construct_mapping(node, deep=deep)
File "/usr/local/lib/python3.7/dist-packages/yaml/constructor.py", line 143, in construct_mapping
value = self.construct_object(value_node, deep=deep)
File "/usr/local/lib/python3.7/dist-packages/yaml/constructor.py", line 100, in construct_object
data = constructor(self, node)
File "/usr/local/lib/python3.7/dist-packages/yaml/constructor.py", line 429, in construct_undefined
node.start_mark)
yaml.constructor.ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object/apply:pathlib.PosixPath'
in "runs/train/exp8/opt.yaml", line 39, column 11"

The result is the same whether I run python train.py --resume or python train.py --resume path/to/last.pt.
I want to resume training after I stop the Colab session myself or after Colab terminates it.

Thanks.

@glenn-jocher
Member

@suyong2 thanks for the message! I think this may be related to a recent change in save_dir handling in train.py. We switched to Path() directories for this variable, which may not be playing well with the YAML save and resume; if so, that would be my fault.

I'll check it out.
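
For anyone hitting the same traceback, the failure mode can be reproduced in isolation. This is a minimal sketch of the problem described above: dumping a pathlib.Path with PyYAML's default Dumper embeds a Python-specific tag that SafeLoader then refuses to construct on --resume. The str() cast at the end is only an illustrative workaround, not necessarily the fix applied in PR #2876.

import pathlib
import yaml

# Dumping a Path with the default (unsafe) Dumper embeds a Python-specific tag
# such as !!python/object/apply:pathlib.PosixPath in the YAML output.
opt = {"save_dir": pathlib.Path("runs/train/exp8")}
text = yaml.dump(opt)
print(text)

# Reading it back with SafeLoader (as train.py does on --resume) fails with the
# ConstructorError reported above.
try:
    yaml.load(text, Loader=yaml.SafeLoader)
except yaml.constructor.ConstructorError as e:
    print(e)

# Casting the Path to a plain string before dumping keeps opt.yaml loadable.
safe_text = yaml.dump({"save_dir": str(opt["save_dir"])})
print(yaml.load(safe_text, Loader=yaml.SafeLoader))  # {'save_dir': 'runs/train/exp8'}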

@glenn-jocher
Member

@suyong2 good news 😃! Your original issue has now been fixed ✅ in PR #2876. To receive this update you can:

  • git pull from within your yolov5/ directory
  • git clone https://github.com/ultralytics/yolov5 again

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@suyong2

suyong2 commented Apr 22, 2021

@glenn-jocher Fortunately, it works well with the updated YOLOv5 source.
It would also be nice if a YAML config file could be used instead of command-line options.
Thank you for the quick update.
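
A minimal sketch of that suggestion (the --cfg-file flag and the override keys are hypothetical, not an existing YOLOv5 option): argparse defaults could be overridden from a YAML file before training starts.

import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--cfg-file", type=str, default="", help="optional YAML file with option overrides (hypothetical flag)")
parser.add_argument("--epochs", type=int, default=300)
parser.add_argument("--batch-size", type=int, default=16)
opt = parser.parse_args()

# Values from the YAML file (e.g. {'epochs': 100, 'batch_size': 32}) override
# the argparse defaults.
if opt.cfg_file:
    with open(opt.cfg_file) as f:
        overrides = yaml.safe_load(f) or {}
    for k, v in overrides.items():
        setattr(opt, k.replace("-", "_"), v)

print(opt)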
