Add CI pipeline #28
steinmig commented Dec 17, 2024
- Move existing tests to a separate folder
- Convert some existing tutorials into regression tests
- Add a CI pipeline that, for now, only runs the default installation and pytest
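A minimal sketch of the two CI steps described above, handy for reproducing a CI run locally. It assumes the package installs from the repository root with `pip install .` and that the relocated tests live in a folder called `tests`; both the folder name and the helper names (`ci_commands`, `run_ci`) are hypothetical, not taken from the actual pipeline.

```python
import subprocess
import sys

# Hypothetical folder name for the relocated tests.
TEST_DIR = "tests"


def ci_commands(test_dir: str = TEST_DIR) -> list[list[str]]:
    """Return the two CI steps: default installation, then pytest."""
    return [
        [sys.executable, "-m", "pip", "install", "."],
        [sys.executable, "-m", "pytest", test_dir],
    ]


def run_ci(test_dir: str = TEST_DIR) -> int:
    """Run each step in order from the repo root; stop at the first failure."""
    for cmd in ci_commands(test_dir):
        print("+", " ".join(cmd))
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode
    return 0
```

Using `sys.executable -m ...` keeps both steps on the same interpreter, which avoids installing into one environment and testing in another.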
I think a few changes are worth making before merging this branch, namely:
- reverting the use of the `_ax` imports (these need to be integrated with other files and removed, so we don't want any new imports from them if we can avoid it)
- cleaning up the tests so that they're distinct from the tutorials (removing excess comments, etc.)
Some of the training, as it exists in the tutorials right now, can also take quite a long time to run. Do you think we should take precautions to ensure the CI/CD pipeline doesn't take too long to run?
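One possible precaution is to make long-running training tests opt-in, so the default CI run stays short. The sketch below uses only the standard library; the environment variable `NFF_RUN_SLOW` and the decorator name `slow` are hypothetical. The check runs at call time, and pytest reports a raised `unittest.SkipTest` as a skipped test.

```python
import functools
import os
import unittest

# Hypothetical opt-in switch: slow tests run only when NFF_RUN_SLOW=1.
SLOW_ENV_VAR = "NFF_RUN_SLOW"


def slow_tests_enabled() -> bool:
    """Return True if the opt-in environment variable is set to "1"."""
    return os.environ.get(SLOW_ENV_VAR) == "1"


def slow(test_func):
    """Skip a long-running test unless slow tests are explicitly enabled."""
    @functools.wraps(test_func)
    def wrapper(*args, **kwargs):
        if not slow_tests_enabled():
            raise unittest.SkipTest(f"set {SLOW_ENV_VAR}=1 to run this slow test")
        return test_func(*args, **kwargs)
    return wrapper


@slow
def test_excited_state_training():
    # Placeholder body: the real test would train the excited-state model here.
    pass
```

With pytest specifically, `pytest.mark.skipif` or a custom marker selected via `-m` achieves the same effect.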
Is this file a modified version of the tutorial for training a potential for molecules with excited states? If so, maybe we can add a file docstring at the top that describes its origin.
Yes, this file and `test_training` are regression tests I copied from the tutorials. Adding a remark and removing some of the comments sounds good.
I mean, ideally one would want to test all of the tutorials, so that we would immediately know when they are out of date, but:
- I am not too familiar with notebooks in a CI context
- Most of the current tutorials do not work out of the box at the moment, which is also worth putting on a list
I will check with some of the other maintainers, but I believe that @xiaochendu and @recisic checked all of the notebook tutorials worked when they integrated updates to the internal version of the repo. Maybe they could weigh in? If it is still a problem, we can open an issue so that someone can address it.
Not all tutorial notebooks were functional as of PR #23 (merging repos, commit 24fe0ca); this is also mentioned in the internal repo (https://github.mit.edu/MLMat/NeuralForceField/pull/117).
I’ve already sent this via DM but am sharing it here for tracking purposes.
[✓] 01_training.ipynb
[✓] 02_ase.ipynb
[✓] 03_excited.ipynb
[✓] 04_md.ipynb
[✓] 05_transfer.ipynb
[x] 06_cp3d.ipynb -- GEOM is 50GB to download, didn't try running rest of the script
[✓] 07_dimenet.ipynb
[x] 08_reactive_md.ipynb -- requires appending aRMSD dir to PYTHONPATH
[✓] 09_painn.ipynb
[✓] 11_spooky_net.ipynb
[✓] 12_spooky_painn.ipynb
[x] 13_tully_namd.ipynb -- taking quite a lot of time, maybe I messed up the environment
[x] 14_zn_namd.ipynb -- taking quite a lot of time, maybe I messed up the environment
[✓] 15_nff_md17_dataset.ipynb
[✓] 16_UmbrellaSampling_and_ScansWithHarmonicConstraints.ipynb
[✓] 17_eABF_and_WTMeABF.ipynb
[✓] 18_fine_tuning_foundation_models.ipynb
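A sketch of how the notebooks could be smoke-tested in CI while excluding the ones marked [x] above. It shells out to `jupyter nbconvert --execute`, which exits with an error if any cell raises; the tutorial directory, the `executed_notebooks` output folder, and the helper names are hypothetical. Pytest plugins such as nbmake (`pytest --nbmake`) are an alternative.

```python
import subprocess
from pathlib import Path

# Notebooks reported above as not running cleanly, with the reported reason.
KNOWN_BROKEN = {
    "06_cp3d.ipynb": "GEOM download is ~50 GB",
    "08_reactive_md.ipynb": "needs the aRMSD dir on PYTHONPATH",
    "13_tully_namd.ipynb": "very long runtime",
    "14_zn_namd.ipynb": "very long runtime",
}


def notebooks_to_run(names: list[str]) -> list[str]:
    """Keep only notebooks not on the known-broken list, in sorted order."""
    return sorted(n for n in names if n.endswith(".ipynb") and n not in KNOWN_BROKEN)


def nbconvert_command(notebook: Path, timeout_s: int = 600) -> list[str]:
    """Build an nbconvert call that executes the notebook and fails on errors."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        f"--ExecutePreprocessor.timeout={timeout_s}",
        "--output-dir", "executed_notebooks",
        str(notebook),
    ]


def run_notebooks(tutorial_dir: Path) -> list[str]:
    """Execute every runnable notebook; return the names of those that failed."""
    failed = []
    for name in notebooks_to_run([p.name for p in tutorial_dir.glob("*.ipynb")]):
        result = subprocess.run(nbconvert_command(tutorial_dir / name))
        if result.returncode != 0:
            failed.append(name)
    return failed
```

Keeping the reasons next to the exclusions makes the skip list double as the to-do list of notebooks to fix.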
Alright, then maybe some of the tutorials also didn't work for me due to environment issues. Once we have clearly documented requirements and a clean CI build, such issues should also become easier to track down. From what I remember, the requirement hell I ran into was that either:
- our code was incompatible with the new ASE
- our dependencies such as pymatgen bring in a scipy version that clashes with the old ASE
So maybe updating our code to the new ASE, then revisiting the tutorials and integrating them all into the CI, is a good way forward?
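While the requirements are being sorted out, printing the installed versions of the packages involved at the top of each CI log makes clashes like the ASE/scipy one easier to spot. A minimal sketch using `importlib.metadata` from the standard library; the package list reflects the libraries named in this thread, and the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in this thread as sources of version conflicts.
PACKAGES = ("ase", "scipy", "pymatgen", "numpy", "torch")


def installed_versions(packages=PACKAGES) -> dict[str, str]:
    """Map each distribution name to its installed version, or "not installed"."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report


def print_version_report(packages=PACKAGES) -> None:
    """Print one aligned line per package, e.g. at the start of a CI job."""
    for name, ver in installed_versions(packages).items():
        print(f"{name:>10}: {ver}")
```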
See thread above.
Yes, makes sense.
Yes, I reduced the epochs and raised the expected MAE relative to the tutorials to save some time. The excited-state training still takes ~20 minutes, which is why I have disabled that test for now. The general potential training only takes a few seconds. You can see the CI execution time under
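The pattern of a tutorial turned into a regression test (fixed seed, fewer epochs, looser MAE bound) can be sketched with a toy model. This stand-in fits y = 2x + 1 with plain gradient descent instead of training an NFF potential; `N_EPOCHS`, `MAE_THRESHOLD`, and the function names are hypothetical values chosen for illustration.

```python
import random

N_EPOCHS = 200        # hypothetical: reduced from a longer tutorial run
MAE_THRESHOLD = 0.05  # hypothetical: relaxed to match the shorter training


def make_data(n=50, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the test deterministic
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 for x in xs]
    return xs, ys


def train(xs, ys, epochs=N_EPOCHS, lr=0.1):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mae(xs, ys, w, b):
    return sum(abs(w * x + b - y) for x, y in zip(xs, ys)) / len(xs)


def test_training_regression():
    xs, ys = make_data()
    w, b = train(xs, ys)
    assert mae(xs, ys, w, b) < MAE_THRESHOLD
```

The threshold only has to catch regressions, not match tutorial-quality accuracy, which is what allows the epoch count to drop.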
@steinmig Sorry for just getting back to these changes now. I am not sure that we need to worry about tackling the
I just adapted the comments. Either torch 2.6 or NumPy 2 caused a new issue, but it should be resolved pretty soon.
The 25 min should come only from the excited-state training, which is too much for my taste for a single test that spends 99% of that time just crunching numbers. It should be disabled now, so the run should only take ~3 min.
The work you did on this has been awesome, @steinmig! Thanks so much for tackling this issue.