Release sage 2.0.0 #418
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,50 @@ | ||||||
# OpenFF Sage 2.0.0 Training Optimization v1.0 | ||||||
|
||||||
### Description | ||||||
|
||||||
A quantum chemical (QC) dataset curated to train the [OpenFF 2.0.0 Sage](https://github.com/openforcefield/openff-sage) force field, which has reparametrized Lennard-Jones (LJ) and valence parameters; the latter are relevant to this dataset. This QC dataset, computed at the OpenFF default level of theory, B3LYP-D3BJ/DZVP, is used to benchmark Sage geometries and energetics. These optimized conformer geometries were used in conjunction with the QC dataset used to train one-dimensional torsional profiles. This Generation 2 dataset increases chemical diversity compared to Generation 1, which is of value to our industry partners. Large molecules (>20 heavy atoms) were also included, among them more flexible molecules with a greater degree of conformational variation, which provide intramolecular interactions. | ||||||
|
||||||
### General Information | ||||||
|
||||||
- Date: 2024 12 12 | ||||||
Suggested change
|
||||||
- Class: OpenFF Optimization Dataset | ||||||
- Purpose: B3LYP-D3BJ/DZVP conformers applicable to drug-like molecules for OpenFF 2.0.0 Sage | ||||||
Similar to above, I'd focus on this exact dataset (publication version of existing datasets), not on the high-level purpose. Something like |
||||||
- Collection: OptimizationDataset | ||||||
nit: I'm not familiar with this entry, I thought this was covered by the
It came with the template README.md file when I followed the User Quickstart instructions. Is this not needed now?
Yes, probably not needed anymore and was for the previous QCStack API, or at least should be swapped for something like Background |
||||||
- Name: OpenFF Sage 2.0.0 Training Optimization v1.0 | ||||||
nit: It sounds kind of redundant in the context of the repo, but we often include
Agree with Brent, or at least having this match the name of the directory (which does include -Dataset) would be preferable! |
||||||
- Number of unique molecules 1025 | ||||||
- Number of filtered molecules 0 | ||||||
- Number of conformers 3663 | ||||||
- Number of conformers min mean max 1.00, 3.53, 10.00 | ||||||
Could you please add colons to each of these? Markdown doesn't render horizontal whitespace neatly if you preview the file. |
||||||
- Mean molecular weight: 261.38 | ||||||
- Max molecular weight: 544.64 | ||||||
- Set of charges: -2.0 -1.0 0.0 1.0 | ||||||
Suggested change
Just a bit easier to read! |
||||||
- Dataset Submitter: Jennifer A. Clark | ||||||
- Dataset Curator: Simon Boothroyd | ||||||
- Dataset Generator: Hyesu Jang | ||||||
|
||||||
### QCSubmit generation pipeline | ||||||
|
||||||
- `generate-combined-dataset.ipynb`: A notebook which shows how the dataset was prepared from the input files. | ||||||
Suggested change
I'd also add |
||||||
|
||||||
### QCSubmit Manifest | ||||||
|
||||||
- `generate-combined-dataset.ipynb` | ||||||
- `dataset.json.bz2`: The basic dataset ready for submission. | ||||||
- `dataset.pdf`: A pdf file containing molecule 2D structures. | ||||||
- `dataset.smi`: SMILES for every molecule in the submission. | ||||||
|
||||||
### Metadata | ||||||
|
||||||
* Elements: {O, Br, H, F, P, C, Cl, N, I, S} | ||||||
* QC Specifications: default | ||||||
* basis: DZVP | ||||||
* implicit_solvent: None | ||||||
* keywords: {} | ||||||
* maxiter: 200 | ||||||
* method: B3LYP-D3BJ | ||||||
* program: psi4 | ||||||
* SCF Properties: | ||||||
* dipole | ||||||
* quadrupole | ||||||
* wiberg_lowdin_indices | ||||||
* mayer_indices |
This is a pretty thorough description! Personally, I'd probably leave most of these details in the original datasets and focus on the fact that this is a "publication version" of the already-used Sage 2.0.0 training data. Something more like "This is the complete optimization dataset used for training Sage 2.0.0, consisting of the x, y, z datasets, which were further filtered to remove ...", where `x`, `y`, `z` are the names (or links!) to the original datasets, and `...` is a summary or even a list of the exact `Filter`s applied in the Sage 2.0.0 curation script.
Mostly agree with Brent, there's not a huge downside to keeping this information in but I'd prioritize listing the names of the "origin" datasets in case people want to go hunting for further provenance information. As Brent mentioned I would also describe the filters used to curate the dataset, ideally in words but noting that further information can be found in the repo scripts.
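On the earlier request to add colons to the summary lines: a minimal sketch of how those lines could be produced from per-molecule conformer counts is given below. The function name `summarize_conformers` and the input mapping are hypothetical, chosen only for illustration; this is not the QCSubmit API and not the script used in this PR.

```python
from statistics import mean


def summarize_conformers(conformers_per_molecule):
    """Build README-style summary lines from a {molecule: n_conformers} mapping.

    The mapping is a hypothetical input: keys identify molecules (for example
    SMILES strings) and values are the number of conformers for each molecule.
    """
    counts = list(conformers_per_molecule.values())
    if not counts:
        raise ValueError("no molecules given")
    return [
        f"- Number of unique molecules: {len(counts)}",
        f"- Number of conformers: {sum(counts)}",
        "- Number of conformers (min, mean, max): "
        f"{min(counts):.2f}, {mean(counts):.2f}, {max(counts):.2f}",
    ]


# Toy input for illustration only; these are not values from the dataset.
example = {"CCO": 1, "CCCC": 3, "c1ccccc1O": 2}
for line in summarize_conformers(example):
    print(line)
```

Writing the label and the value with an explicit colon keeps the lines readable in a Markdown preview, which does not preserve runs of horizontal whitespace.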