❓ [QUESTION] Restart run #343

Closed
IZugec opened this issue Jun 3, 2023 · 1 comment
Labels
question Further information is requested


IZugec commented Jun 3, 2023

Hello,

I have a really huge dataset, so large that even with multiprocessing it takes a day and a half to two days to preprocess. Due to an unexpected crash on the node, I would now like to continue training starting from the best_model.pth weights. However, I would really like to avoid processing this huge dataset again.

I tried both initial_model_state / initialize_from_state and load_model_state / load_model_state

However, when I first started training the model, the append key was set to false, so now when I set it to false again I get this error:

Traceback (most recent call last):
  File "/home/user/.conda/envs/nequip_stress/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/user/.conda/envs/nequip_stress/lib/python3.10/site-packages/nequip/scripts/train.py", line 65, in main
    raise RuntimeError(
RuntimeError: Training instance exists at /path_to_traning_dir; either set append to True or use a different root or runname

However, when I start it with append set to true, I get the following error:

Traceback (most recent call last):
  File "/home/user/.conda/envs/nequip_stress/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/user/.conda/envs/nequip_stress/lib/python3.10/site-packages/nequip/scripts/train.py", line 74, in main
    trainer = restart(config)
  File "/home/user/.conda/envs/nequip_stress/lib/python3.10/site-packages/nequip/scripts/train.py", line 220, in restart
    raise ValueError(
ValueError: Key "append" is different in config and the result trainer.pth file. Please double check
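
The two tracebacks above suggest why neither value of append works here. The following is only a toy model of that behaviour, inferred from the error messages alone; `check_training_dir` and its arguments are hypothetical names and this is not nequip's actual code.

```python
def check_training_dir(config, saved_config, run_exists):
    """Toy model of the two checks implied by the tracebacks (not nequip's code)."""
    if not run_exists:
        # No earlier training instance under this root/run name: start fresh.
        return "fresh"
    if not config.get("append", False):
        # An instance exists and append is false: refuse to overwrite it.
        raise RuntimeError(
            "Training instance exists; set append to True "
            "or use a different root or run name"
        )
    # append is true: a restart compares the new config with the saved one.
    for key, saved_value in saved_config.items():
        if key in config and config[key] != saved_value:
            raise ValueError(f'Key "{key}" is different in config and the saved file')
    return "restart"


# The original run was saved with append set to false.
saved = {"append": False, "run_name": "first-run"}

outcomes = []
for append in (False, True):
    try:
        check_training_dir(
            {"append": append, "run_name": "first-run"}, saved, run_exists=True
        )
    except (RuntimeError, ValueError) as err:
        outcomes.append(type(err).__name__)

print(outcomes)
```

Under these assumptions, both settings fail for an existing run, which is consistent with the advice in the error message to use a different root or run name.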

I guess the question is whether there is a way to reuse the already processed dataset along with the model state.

Thanks in advance for any advice,
Ivan

Linux-cpp-lisp (Collaborator) commented Jun 5, 2023

Hi @IZugec ,

> I tried both initial_model_state / initialize_from_state and load_model_state / load_model_state

This will be the easiest way forward, and will load the cached processed dataset unless something goes wrong. I think there should be a full discussion of how to do this in the issue linked below. You want initialize_from_state and a new run name:

#235
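
As a minimal sketch of the suggestion above, the config change can be pictured as a small transformation of the original settings. The helper `make_restart_config` is hypothetical, and the key names `run_name`, `model_builders` and `initial_model_state` are assumptions based on the names mentioned in this thread; the builder names and paths in the example are placeholders. Check the linked issue and your own config file for the exact form.

```python
import copy


def make_restart_config(old_config, weights_path, new_run_name):
    """Return a copy of old_config that starts a new run from saved weights.

    Assumed key names (not confirmed here): run_name, model_builders,
    initial_model_state.
    """
    new_config = copy.deepcopy(old_config)
    # A new run name avoids the "Training instance exists" error.
    new_config["run_name"] = new_run_name
    # Add the initialize_from_state builder last, after the model is built.
    builders = list(new_config.get("model_builders", []))
    if "initialize_from_state" not in builders:
        builders.append("initialize_from_state")
    new_config["model_builders"] = builders
    # Point the builder at the weights saved by the crashed run.
    new_config["initial_model_state"] = weights_path
    # Dataset-related keys are deliberately left unchanged.
    return new_config


# Placeholder values for illustration only.
old = {
    "root": "results",
    "run_name": "first-run",
    "model_builders": ["EnergyModel", "ForceOutput"],
    "dataset_file_name": "data.xyz",
}
new = make_restart_config(old, "results/first-run/best_model.pth", "first-run-restart")
print(new["run_name"], new["model_builders"][-1])
```

The dataset keys are left untouched on the assumption that an unchanged dataset configuration is what allows the cached processed dataset to be picked up again, as the reply indicates.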
