Describe the bug
The example config file includes this line:
metrics_key: validation_loss # metrics used for scheduling and saving best model. Options: `set`_`quantity`, set can be either "train" or "validation, "quantity" can be loss or anything that appears in the validation batch step header, such as f_mae, f_rmse, e_mae, e_rmse
Following those instructions, I set the value to `train_loss`. When I do, it fails with the exception:
RuntimeError: metrics_key should start with either validation or training
Apparently it actually wants the value to be training_loss instead of train_loss. But when I change it to that, it fails with a different exception:
KeyError: 'training_loss'
It seems that some parts of the code expect one and other parts expect the other, so that neither works.
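The first error can be reproduced in isolation with a minimal Python sketch. The function `check_metrics_key` below is hypothetical, not nequip's code. Its only basis is the wording of the RuntimeError, which suggests the key is checked for a `validation` or `training` prefix. Under that assumption, the `train_...` form suggested by the example config's comment can never pass the check.

```python
# Hypothetical sketch: mimics the prefix check implied by the RuntimeError
# message quoted above; this is NOT nequip's actual source.
def check_metrics_key(metrics_key: str) -> str:
    """Return the set prefix ("validation" or "training") of a metrics key."""
    for prefix in ("validation", "training"):
        if metrics_key.startswith(prefix):
            return prefix
    # "train_loss" lands here: "train" is not a recognised prefix.
    raise RuntimeError("metrics_key should start with either validation or training")


for key in ("validation_loss", "training_loss", "train_loss"):
    try:
        print(key, "->", check_metrics_key(key))
    except RuntimeError as err:
        print(key, "->", err)
```

Only `validation_loss` and `training_loss` pass; `train_loss` reproduces the RuntimeError, which is why the config comment's `set` value of "train" is misleading.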
Environment (please complete the following information):
OS: Ubuntu 18.04
python version (`python --version`): 3.9
python environment (commands are given for python interpreter):
nequip version (`import nequip; nequip.__version__`): 0.5.4
e3nn version (`import e3nn; e3nn.__version__`): 0.4.4
pytorch version (`import torch; torch.__version__`): 1.10.0
Thanks for the report. This is a bug arising from an unsupported combination of options. Using a training key for `metrics_key` errors out because the first evaluation of the untrained model on the validation set, before any training has happened, counts as an "epoch". The code then tries to access the nonexistent `training_loss` to determine whether this is the best model so far.
Set:
report_init_validation: False
in your config to disable this first "epoch".
I've added an error to the code so that this fails more informatively.
This resolves the issue in my repro; let me know if you have further issues.
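The failure mode explained above can be illustrated with a minimal Python sketch. `run` and its per-epoch metric dicts are hypothetical stand-ins, not nequip's trainer. The assumptions are that the initial validation pass produces a metrics dict containing only `validation_*` entries, and that the best-model check indexes that dict directly by `metrics_key`. The loss values are made up.

```python
# Hypothetical sketch of the failure mode; the function and dict layout are
# illustrative assumptions, not nequip's actual code.
import math
from typing import Dict, List


def run(metrics_key: str, report_init_validation: bool, n_epochs: int = 2) -> List[float]:
    """Simulate per-"epoch" metric dicts and track the best value of metrics_key."""
    epochs: List[Dict[str, float]] = []
    if report_init_validation:
        # The untrained model is evaluated on the validation set only,
        # so this pseudo-epoch has no training_* entries.
        epochs.append({"validation_loss": 10.0})
    for i in range(n_epochs):
        loss = 5.0 / (i + 1)  # made-up training loss for real epoch i
        epochs.append({"training_loss": loss, "validation_loss": loss + 0.5})

    best = math.inf
    history: List[float] = []
    for metrics in epochs:
        value = metrics[metrics_key]  # raises KeyError if the key is absent
        if value < best:
            best = value  # this epoch would be saved as the best model
        history.append(best)
    return history
```

Under these assumptions, `run("training_loss", report_init_validation=True)` raises `KeyError: 'training_loss'` on the very first pseudo-epoch, while `run("training_loss", report_init_validation=False)` returns `[5.0, 2.5]`. A `validation_loss` key works either way, because every pseudo-epoch contains validation metrics.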