Add CI pipeline #28

Merged
merged 11 commits into from
Jan 29, 2025

Conversation

steinmig
Contributor

  • Moves existing tests to a separate folder
  • Converts some existing tutorials into regression tests
  • Adds a CI pipeline with only the default installation and pytest execution for now

Contributor

@ajhoffman1229 ajhoffman1229 left a comment

I think there might be a few changes worth making before merging this branch. Namely:

  1. reverting the use of the _ax imports (these need to be integrated with other files and removed, so we don't want any new imports from them if we can avoid it)
  2. cleaning up tests so that they're distinct from tutorials (removing excess comments, etc.)

Some of the training can also take quite a long time to run as it exists in the tutorials right now... Do you think we should take precautions to ensure the CI/CD pipeline doesn't take too long to run?

nff/tests/dynamics_test.py (resolved)
nff/tests/test_ase.py (resolved)
Contributor

Is this file a modified version of the tutorial for training a potential for molecules with excited states? If so, maybe we can add a file docstring at the top that describes its origin.

Contributor Author

Yes, this file and test_training are regression tests that I copied from the tutorials. Adding a remark and removing some of the comments sounds good.

Contributor Author

I mean ideally one would want to simply test all of the tutorials, so we would also immediately know once they are out of date, but

  1. I am not too familiar with notebooks in a CI context
  2. Most of the current tutorials do not work out of the box at the moment -> also worth putting on a list

Contributor

I will check with some of the other maintainers, but I believe that @xiaochendu and @recisic checked that all of the notebook tutorials worked when they integrated updates to the internal version of the repo. Maybe they could weigh in? If it is still a problem, we can open an issue so that someone can address it.

Member

Not all tutorial notebooks were functional as of PR #23 (merging repos, commit 24fe0ca); this is also mentioned in the internal repo (https://github.mit.edu/MLMat/NeuralForceField/pull/117).
I’ve already sent this via DM but am sharing it here for tracking purposes.

[✓] 01_training.ipynb
[✓] 02_ase.ipynb
[✓] 03_excited.ipynb
[✓] 04_md.ipynb
[✓] 05_transfer.ipynb
[x] 06_cp3d.ipynb -- GEOM is 50GB to download, didn't try running rest of the script
[✓] 07_dimenet.ipynb
[x] 08_reactive_md.ipynb -- requires appending aRMSD dir to PYTHONPATH
[✓] 09_painn.ipynb
[✓] 11_spooky_net.ipynb
[✓] 12_spooky_painn.ipynb
[x] 13_tully_namd.ipynb -- taking quite a lot of time, maybe I messed up the environment
[x] 14_zn_namd.ipynb -- taking quite a lot of time, maybe I messed up the environment
[✓] 15_nff_md17_dataset.ipynb
[✓] 16_UmbrellaSampling_and_ScansWithHarmonicConstraints.ipynb
[✓] 17_eABF_and_WTMeABF.ipynb
[✓] 18_fine_tuning_foundation_models.ipynb
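
A checklist like the one above could in principle be produced automatically. The following is only a minimal sketch, assuming Jupyter with nbconvert is installed in the CI environment; the helper names, the skip list constant, and the `executed_notebooks` output directory are hypothetical and not part of the repository. The skip list simply mirrors the four notebooks marked [x] above.

```python
import subprocess
from pathlib import Path

# Notebooks marked [x] in the checklist above; skipped here so that the
# check only covers notebooks that are expected to run (assumption).
KNOWN_SLOW_OR_BROKEN = frozenset({
    "06_cp3d.ipynb",
    "08_reactive_md.ipynb",
    "13_tully_namd.ipynb",
    "14_zn_namd.ipynb",
})


def nbconvert_command(notebook: Path, timeout_s: int = 600) -> list[str]:
    """Build a command that executes a notebook from top to bottom."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        f"--ExecutePreprocessor.timeout={timeout_s}",  # per-cell timeout in seconds
        "--output-dir", "executed_notebooks",  # hypothetical scratch directory
        str(notebook),
    ]


def find_notebooks(root: Path, skip: frozenset[str] = frozenset()) -> list[Path]:
    """Return the notebooks directly under root, sorted, minus the skipped names."""
    return sorted(p for p in root.glob("*.ipynb") if p.name not in skip)


def run_notebook(notebook: Path, timeout_s: int = 600) -> bool:
    """Return True if nbconvert executed the notebook without an error."""
    result = subprocess.run(
        nbconvert_command(notebook, timeout_s),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

A CI job could call `find_notebooks` on the tutorials directory (its location is an assumption) and fail if `run_notebook` returns False for any entry.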

Contributor Author

Alright, then maybe some of the tutorials also didn't work for me due to environment issues. Once we have clearly documented requirements and a clean CI build, such issues might also get easier to resolve. What I remember: the requirement hell that I encountered was that either

  1. our code was incompatible with the new ASE
  2. our dependencies such as pymatgen bring in a scipy version that clashes with the old ASE

So maybe updating our code to the new ASE and then revisiting the tutorials and integrating them all into the CI is a good way forward?
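
When chasing version clashes like the ones described above, it can help to print the versions that are actually installed in the environment. This is a minimal sketch using only the standard library; the package list is taken from the packages named in this thread, and the helper name is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Packages mentioned in this conversation; extend as needed.
    report = installed_versions(["ase", "scipy", "pymatgen", "numpy", "torch"])
    for name, ver in report.items():
        print(f"{name:10s} {ver or 'not installed'}")
```

Running this at the start of a CI job would record the resolved versions next to any test failure, which makes it easier to tell an environment problem from a code problem.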

nff/tests/test_training.py (outdated, resolved)
@steinmig
Contributor Author

> 1. reverting the use of the `_ax` imports (these need to be integrated with other files and removed, so we don't want any new imports from them if we can avoid it)

See the thread above.

> 2. cleaning up tests so that they're distinct from tutorials (removing excess comments, etc.)

Yes, makes sense.

> Some of the training can also take quite a long time to run as it exists in the tutorials right now... Do you think we should take precautions to ensure the CI/CD pipeline doesn't take too long to run?

Yes, I reduced the epochs and increased the expected MAE relative to the tutorials to save some time. The excited states training still takes ~20 minutes, which is why I have disabled that test for now. The general potential training only takes a few seconds. You can see the CI execution time under Actions.
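
The pattern described above (fewer epochs, a looser error tolerance, and a disabled long-running test) might look roughly like the following in pytest. This is only an illustrative sketch: a toy linear fit stands in for the actual potential training, and none of the function names, data, or thresholds come from the NFF test suite.

```python
import pytest


def train_linear(xs, ys, epochs, lr=0.05):
    """Fit y = w*x + b by plain gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mean_absolute_error(xs, ys, w, b):
    """Average absolute deviation between predictions and targets."""
    return sum(abs(w * x + b - y) for x, y in zip(xs, ys)) / len(xs)


# Toy data generated from y = 2x + 1 (hypothetical stand-in for a dataset).
XS = [0.0, 0.5, 1.0, 1.5, 2.0]
YS = [1.0, 2.0, 3.0, 4.0, 5.0]


def test_short_training_stays_within_loose_tolerance():
    # Few epochs keep the test fast; the tolerance is loosened to match.
    w, b = train_linear(XS, YS, epochs=50)
    assert mean_absolute_error(XS, YS, w, b) < 0.5


@pytest.mark.skip(reason="takes ~20 minutes; disabled in CI for now")
def test_excited_state_training():
    # Placeholder for the long-running regression test.
    ...
```

Using `pytest.mark.skip` keeps the slow test visible in the test report as skipped, so it is not forgotten once a faster variant exists.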

@ajhoffman1229
Contributor

@steinmig Sorry for just getting back to these changes now. I am not sure that we need to worry about tackling the _ax files now with this PR for the CI/CD pipeline; that can probably wait. Did you have a chance to remove some of the comments on the tests? If so, we can try merging and firing up the CI/CD right away.

> You can see the CI execution time under Actions

It looks like the execution time is ~25 min... That seems a little long but not totally unreasonable right now, though I'm not super familiar with typical CI/CD execution times for code repos of this size. I remember ASE's taking about twice as long, but I don't know if they have 2x as much code.

@steinmig
Contributor Author

> @steinmig Sorry for just getting back to these changes now. I am not sure that we need to worry about tackling the _ax files now with this PR for the CI/CD pipeline; that can probably wait. Did you have a chance to remove some of the comments on the tests? If so, we can try merging and firing up the CI/CD right away.

I just adapted the comments. Either torch 2.6 or numpy 2 caused a new issue, but it should be resolved pretty soon.

> It looks like the execution time is ~25 min... That seems a little long but not totally unreasonable right now, but I'm not super familiar with typical CI/CD execution times for code repos of this size. I remember ASE's taking about twice as long, but I don't know if they have 2x as much code.

The 25 min should only occur with the excited states training, which is too much for my taste for a single test that spends 99% of that time just crunching numbers. It should be disabled now, so the pipeline should only take ~3 min.

Contributor

@ajhoffman1229 ajhoffman1229 left a comment

The work you did on this has been awesome, @steinmig! Thanks so much for tackling this issue.

@ajhoffman1229 ajhoffman1229 merged commit f309fb6 into learningmatter-mit:master Jan 29, 2025