🐛 [BUG] Cannot use training loss as metrics key #213

peastman · 2022-05-23T21:13:42Z

Describe the bug
The example config file includes this line:

metrics_key: validation_loss                                                       # metrics used for scheduling and saving best model. Options: `set`_`quantity`, set can be either "train" or "validation, "quantity" can be loss or anything that appears in the validation batch step header, such as f_mae, f_rmse, e_mae, e_rmse

Following those instructions, I set the value to train_loss. When I do, it fails with the exception

RuntimeError: metrics_key should start with either validation or training

Apparently it actually wants the value to be training_loss instead of train_loss. But when I change it to that, it fails with a different exception:

KeyError: 'training_loss'

It seems that some parts of the code expect one and other parts expect the other, so that neither works.
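
The first failure is consistent with a prefix check along the following lines. This is a hypothetical sketch, not nequip's code: `split_metrics_key` and `VALID_PREFIXES` are made-up names, and the accepted prefixes are inferred only from the error message above.

```python
# Hypothetical sketch of a metrics_key prefix check; NOT nequip's actual code.
# The accepted prefixes are an assumption inferred from the error message.
VALID_PREFIXES = ("training", "validation")


def split_metrics_key(metrics_key: str) -> tuple:
    """Split a key such as 'validation_loss' into (set, quantity)."""
    # partition() splits on the first underscore only, so a quantity such as
    # 'f_mae' keeps its own underscore.
    prefix, sep, quantity = metrics_key.partition("_")
    if prefix not in VALID_PREFIXES or not sep or not quantity:
        raise RuntimeError(
            "metrics_key should start with either validation or training"
        )
    return prefix, quantity
```

Under this assumption, `train_loss` is rejected because `train` is not an accepted prefix, even though the config comment lists "train" as an option.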

Environment (please complete the following information):

OS: Ubuntu 18.04
python version (python --version) 3.9
python environment (commands are given for python interpreter):
- nequip version (import nequip; nequip.__version__) 0.5.4
- e3nn version (import e3nn; e3nn.__version__) 0.4.4
- pytorch version (import torch; torch.__version__) 1.10.0
(if relevant) GPU support with CUDA
- cuda Version according to nvcc (nvcc --version)
- cuda version according to PyTorch (import torch; torch.version.cuda)


Linux-cpp-lisp · 2022-05-24T01:34:25Z

Hi @peastman,

Thanks for the report. This is a bug arising from an unsupported combination of options. Using a training key for metrics_key errors out because the first evaluation of the untrained model on the validation set, which runs before any training, counts as an "epoch". That epoch has no training metrics, so the code fails when it tries to access the nonexistent training_loss to determine whether this is the best model so far.

Set:

report_init_validation: False

in your config to disable this first "epoch".

I've added an error to the code so that this fails more informatively.
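
The failure mode and an up-front check can be sketched roughly as follows. Everything here is illustrative: `best_epoch`, `check_config`, the per-epoch dictionaries, and the loss values are assumptions made for this example, not nequip's implementation or the error that was actually added.

```python
# Hypothetical sketch; names, data layout and numbers are assumptions,
# not nequip's code.

def best_epoch(metrics_key, epochs):
    """Return the index of the epoch with the lowest value of metrics_key."""
    best_index, best_value = None, float("inf")
    for index, metrics in enumerate(epochs):
        value = metrics[metrics_key]  # raises KeyError if the epoch lacks the key
        if value < best_value:
            best_index, best_value = index, value
    return best_index


def check_config(metrics_key, report_init_validation):
    """Reject the unsupported combination before training starts."""
    if report_init_validation and metrics_key.startswith("training"):
        raise ValueError(
            f"metrics_key={metrics_key!r} needs training metrics, but the "
            "initial validation pass has none; "
            "set report_init_validation: False"
        )


# The initial "epoch" is a validation-only pass on the untrained model.
init_epoch = {"validation_loss": 1.9}
trained_epochs = [
    {"training_loss": 1.2, "validation_loss": 1.4},
    {"training_loss": 0.8, "validation_loss": 1.1},
]

try:
    best_epoch("training_loss", [init_epoch] + trained_epochs)
except KeyError as err:
    print("KeyError:", err)  # prints: KeyError: 'training_loss'

# Without the initial pass, every epoch has the key and the lookup succeeds.
print(best_epoch("training_loss", trained_epochs))  # prints: 1
```

Calling something like `check_config("training_loss", True)` at startup would raise immediately with a message that names the conflicting option, instead of a bare KeyError during the first epoch.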

This resolves the issue in my repro; let me know if you have further issues.

peastman · 2022-05-24T03:47:00Z

Thanks! That works.

peastman added the bug label May 23, 2022

Linux-cpp-lisp assigned Linux-cpp-lisp and unassigned Linux-cpp-lisp May 23, 2022

Linux-cpp-lisp closed this as completed May 24, 2022

Linux-cpp-lisp mentioned this issue Jun 16, 2022 in 0.5.5 #221 (merged)
