Release sage 2.0.0 #418

Merged
merged 11 commits into from
Dec 18, 2024
@@ -0,0 +1,50 @@
# OpenFF Sage 2.0.0 Training Optimization v1.0

### Description

A quantum chemical (QC) dataset curated to train the [OpenFF 2.0.0 Sage](https://github.com/openforcefield/openff-sage) force field, with reparametrized Lennard-Jones (LJ) and valence parameters, the latter relevant to this dataset. This QC dataset, computed at the OpenFF default level of theory, B3LYP-D3BJ/DZVP, is used to benchmark Sage geometries and energetics. These optimized conformer geometries were used in conjunction with the QC dataset used to train one-dimensional torsional profiles. This Generation 2 dataset increases chemical diversity compared to Generation 1, which is of value to our industry partners. Large molecules (>20 heavy atoms) were also included, including more flexible molecules and a greater degree of conformational variation, which provide intramolecular interactions.
Collaborator

This is a pretty thorough description! Personally, I'd probably leave most of these details in the original datasets and focus on the fact that this is a "publication version" of the already-used Sage 2.0.0 training data. Something more like "This is the complete optimization dataset used for training Sage 2.0.0, consisting of the x, y, z datasets, which were further filtered to remove ...", where x, y, z are the names (or links!) to the original datasets, and ... is a summary or even a list of the exact Filters applied in the Sage 2.0.0 curation script.

Contributor

Mostly agree with Brent; there's not a huge downside to keeping this information in, but I'd prioritize listing the names of the "origin" datasets in case people want to go hunting for further provenance information. As Brent mentioned, I would also describe the filters used to curate the dataset, ideally in words but noting that further information can be found in the repo scripts.


### General Information

- Date: 2024 12 12
Contributor

Suggested change
- Date: 2024 12 12
- Date: 2024-12-12

- Class: OpenFF Optimization Dataset
- Purpose: B3LYP-D3BJ/DZVP conformers applicable to drug-like molecules for OpenFF 2.0.0 Sage
Collaborator

Similar to above, I'd focus on this exact dataset (publication version of existing datasets), not on the high-level purpose. Something like Sage 2.0.0 training data, optionally throwing in Complete set or something to emphasize that this is the totality of the optimization training data.

- Collection: OptimizationDataset
Collaborator

nit: I'm not familiar with this entry, I thought this was covered by the Class above. Just marking as a "nit" since I don't think it hurts to have it either.

Collaborator Author

It came with the template README.md file when I followed the User Quickstart instructions. Is this not needed now?

Contributor

Yes, it's probably not needed anymore and was for the previous QCStack API, or at least it should be swapped for something like dataset type: "optimization".

Background
I would guess this redundancy is from the differentiation between OpenFF-QCSubmit and MolSSI's QCPortal -- the OpenFF Optimization Dataset probably corresponds to a QCSubmit class whereas Collection: OptimizationDataset was likely intended for functions like FractalClient.list_collections("OptimizationDataset"). (as a side-note, those docs are for a legacy version -- up-to-date docs are now at the molssi github subdomain https://molssi.github.io/QCFractal/user_guide/datasets.html). I think that functionality has now been removed in the 0.50+ version of the now-named PortalClient -- all I see is PortalClient.list_datasets now. To be fully user-friendly, PortalClient.get_dataset requires both the dataset type and name, so swapping it to that information would be more useful.
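
For readers who want to retrieve the dataset with the newer client described in the comment above, a minimal sketch follows. It assumes `qcportal` 0.50+ is installed, that the public QCArchive server is reachable at `https://api.qcarchive.molssi.org:443`, and that the dataset is registered under the name given in this diff. All three are assumptions, and the final name may differ if the naming suggestions in this review are adopted. The helper name `fetch_sage_dataset` is hypothetical.

```python
# Assumed values: the dataset type string and the name as written in this README diff.
DATASET_TYPE = "optimization"
DATASET_NAME = "OpenFF Sage 2.0.0 Training Optimization v1.0"

# Assumed address of the public QCArchive server.
QCARCHIVE_ADDRESS = "https://api.qcarchive.molssi.org:443"


def fetch_sage_dataset(address=QCARCHIVE_ADDRESS):
    """Return the dataset object, looked up by both type and name.

    qcportal is imported lazily so that this module can be imported
    without qcportal installed or a network connection available.
    """
    from qcportal import PortalClient

    client = PortalClient(address)
    # get_dataset needs the dataset type as well as the dataset name.
    return client.get_dataset(DATASET_TYPE, DATASET_NAME)
```

Calling `fetch_sage_dataset()` contacts the server, so it is left out of the block itself. This is also why recording the dataset type alongside the name in the README would be the more useful replacement for the `Collection` entry.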

- Name: OpenFF Sage 2.0.0 Training Optimization v1.0
Collaborator

nit: It sounds kind of redundant in the context of the repo, but we often include Dataset in the dataset names. From checking the README, it looks like that was less common in earlier datasets, so this is probably fine, it just reads a little weird to me now.

Contributor

Agree with Brent, or at least having this match the name of the directory (which does include -Dataset) would be preferable!

- Number of unique molecules 1025
- Number of filtered molecules 0
- Number of conformers 3663
- Number of conformers min mean max 1.00, 3.53, 10.00
Contributor

Could you please add colons to each of these? Markdown doesn't render horizontal whitespace neatly if you preview the file.

- Mean molecular weight: 261.38
- Max molecular weight: 544.64
- Set of charges: -2.0 -1.0 0.0 1.0
Contributor

Suggested change
- Set of charges: -2.0 -1.0 0.0 1.0
- Set of charges: -2.0, -1.0, 0.0, 1.0

Just a bit easier to read!

- Dataset Submitter: Jennifer A. Clark
- Dataset Curator: Simon Boothroyd
- Dataset Generator: Hyesu Jang
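
The summary lines in the list above (molecule and conformer counts, min/mean/max conformers, charge set) could be generated along the following lines, with the colons and comma-separated charges requested in review. This is only a sketch with toy data: the function name `summarize`, the input mapping of SMILES to conformer counts, and the example molecules are all hypothetical and are not taken from the actual generation script.

```python
from statistics import mean


def summarize(conformers_per_molecule, charges):
    """Build README-style summary lines.

    conformers_per_molecule: dict mapping a molecule identifier to its
        number of conformers (hypothetical input format).
    charges: iterable of total molecular charges, one per molecule.
    """
    counts = list(conformers_per_molecule.values())
    charge_text = ", ".join(f"{c:.1f}" for c in sorted(set(charges)))
    return [
        f"- Number of unique molecules: {len(counts)}",
        f"- Number of conformers: {sum(counts)}",
        "- Number of conformers (min, mean, max): "
        f"{min(counts)}, {mean(counts):.2f}, {max(counts)}",
        f"- Set of charges: {charge_text}",
    ]


# Toy data for illustration only; not the Sage 2.0.0 training set.
toy_counts = {"CCO": 2, "c1ccccc1": 1, "CC(=O)[O-]": 3}
toy_charges = [0.0, 0.0, -1.0]
summary_lines = summarize(toy_counts, toy_charges)
```

Printing `"\n".join(summary_lines)` gives a block that renders cleanly in a Markdown preview because each label ends in a colon.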

### QCSubmit generation pipeline

- `generate-combined-dataset.ipynb`: A notebook which shows how the dataset was prepared from the input files.
Contributor

Suggested change
- `generate-combined-dataset.ipynb`: A notebook which shows how the dataset was prepared from the input files.
- `generate-combined-dataset.py`: A script which shows how the dataset was prepared from the input files.

I'd also add output.txt since that's the output of this script.


### QCSubmit Manifest

- `generate-combined-dataset.ipynb`
- `dataset.json.bz2`: The basic dataset ready for submission.
- `dataset.pdf`: A pdf file containing molecule 2D structures.
- `dataset.smi`: SMILES for every molecule in the submission.
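
To inspect the manifest files locally, something like the following standard-library sketch may help. It assumes `dataset.json.bz2` is a bz2-compressed JSON document and that `dataset.smi` holds one SMILES string per line; no assumption is made about the JSON schema, and the helper names are hypothetical.

```python
import bz2
import json


def load_dataset_json(path):
    """Read a bz2-compressed JSON file and return the parsed object."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return json.load(handle)


def count_smiles(path):
    """Count non-empty lines in a .smi file (one SMILES per line is assumed)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())
```

For example, `sorted(load_dataset_json("dataset.json.bz2"))` lists the top-level keys when the parsed document is a JSON object, and `count_smiles("dataset.smi")` can be compared against the molecule count reported in the README.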

### Metadata

* Elements: {O, Br, H, F, P, C, Cl, N, I, S}
* QC Specifications: default
* basis: DZVP
* implicit_solvent: None
* keywords: {}
* maxiter: 200
* method: B3LYP-D3BJ
* program: psi4
* SCF Properties:
* dipole
* quadrupole
* wiberg_lowdin_indices
* mayer_indices
Git LFS file not shown
Binary file not shown.