Release sage 2.0.0 torsion #419

Open
wants to merge 11 commits into
base: master

Collaborator

@jaclark5 jaclark5 commented Dec 18, 2024

New Submission Checklist

  • Created a new folder in the submissions directory containing the dataset
  • Added README.md describing the dataset (see here for examples)
  • All files used to produce the dataset are included with a description
  • Dataset follows the QCSubmit schema defined for Datasets, OptimizationDatasets and TorsionDriveDatasets
  • Dataset filename matches pattern dataset*.json; may feature a compression extension, such as .bz2
  • A PDF depicting the molecules is attached; for torsiondrives, this should highlight the central bond, which can be done automatically using qcsubmit.
  • QCSubmit validation passed
  • Made a new dataset entry in the mapping table in repository README.md
  • Ready to submit!
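
The filename rule in the checklist can be checked locally before pushing. The following is a minimal sketch, not the check QCSubmit or the repository CI actually runs: the helper name `matches_dataset_pattern` and the list of compression suffixes are assumptions for illustration (the checklist only names `.bz2` as an example).

```python
import fnmatch
from pathlib import PurePosixPath

# Assumption: illustrative suffix list; the checklist only mentions ".bz2".
COMPRESSION_SUFFIXES = (".bz2", ".gz", ".xz")


def matches_dataset_pattern(path: str) -> bool:
    """Return True if the file name matches dataset*.json, optionally compressed."""
    name = PurePosixPath(path).name
    # Strip at most one compression suffix before matching the pattern.
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return fnmatch.fnmatchcase(name, "dataset*.json")
```

For the file in this submission, `matches_dataset_pattern("submissions/2024-12-17-OpenFF-Sage-2.0.0-Torsion-Drive-Training-Dataset-v1.0/dataset.json.bz2")` should return `True`.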

@openff-dangerbot
Contributor

QCSubmit Validation Report

submissions/2024-12-17-OpenFF-Sage-2.0.0-Torsion-Drive-Training-Dataset-v1.0/dataset.json.bz2
Dataset Name: OpenFF Sage 2.0.0 Torsion Drive Training Dataset v1.0
Dataset Type: TorsionDriveDataset
Elements: N, H, P, C, S, I, Cl, Br, O, F
Valid Cmiles: 🔥
Connected Dihedrals: 🔥
No Linear Torsions: 🔥
No Molecular Complexes: 🔥
Valid Constraints: 🔥
Complete Metadata: 🔥

QC Specification Report

submissions/2024-12-17-OpenFF-Sage-2.0.0-Torsion-Drive-Training-Dataset-v1.0/dataset.json.bz2/default
Specification Name: default
Method: B3LYP-D3BJ
Basis: DZVP
Wavefunction Protocol: none
Implicit Solvent:
Keywords: {}
Validated: 🔥
Valid SCF Properties: 🔥
Full Basis Coverage: 🔥

QCSubmit version information
openff.qcsubmit: 0.54.0
openff.toolkit: 0.16.7
basis_set_exchange: 0.10
qcelemental: 0.28.0
rdkit: 2024.09.3

Contributor

@lilyminium lilyminium left a comment

Largely LGTM, just a few minor nitpicks!

rec_ids_cmiles = {}
for _, results in Opt.entries.items():
    tmp_rec_ids_cmiles = {result.record_id: result.cmiles for result in results}
    # TODO: Check if updating dic would change the number of records

Is this still todo?
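
One way to resolve the TODO would be to count key collisions while merging. This is only a sketch under stated assumptions: `merge_with_check` is a hypothetical helper, plain dictionaries stand in for the per-entry mappings built in the loop, and the short strings are toy values rather than real CMILES.

```python
from typing import Dict, List, Tuple


def merge_with_check(
    merged: Dict[int, str], incoming: Dict[int, str]
) -> Tuple[List[int], List[int]]:
    """Merge `incoming` into `merged` and report key collisions.

    Returns (duplicates, conflicts): record IDs already present with the same
    value, and record IDs already present with a different value.
    """
    duplicates = [k for k, v in incoming.items() if k in merged and merged[k] == v]
    conflicts = [k for k, v in incoming.items() if k in merged and merged[k] != v]
    merged.update(incoming)  # later values win, as with a plain dict.update
    return duplicates, conflicts


# Toy stand-in for the per-entry dictionaries (hypothetical IDs and values).
rec_ids_cmiles: Dict[int, str] = {}
per_entry = [{1: "CC", 2: "CO"}, {2: "CO", 3: "CN"}]
total_seen = 0
for tmp_rec_ids_cmiles in per_entry:
    total_seen += len(tmp_rec_ids_cmiles)
    duplicates, conflicts = merge_with_check(rec_ids_cmiles, tmp_rec_ids_cmiles)

# A positive value means the update collapsed that many records.
collapsed = total_seen - len(rec_ids_cmiles)
```

If `collapsed` is zero for the real data, updating the dictionary does not change the number of records and the TODO could be dropped.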

Comment on lines 5 to 8
A quantum chemical (QC) dataset curated to train the OpenFF 2.0.0 Sage torsion potentials. This QC dataset with the OpenFF default level of theory, B3LYP-D3BJ/DZVP, is used to benchmark Sage geometries and energetics. These optimized conformer geometries where used to train one dimensional torsional profiles. This Generation 2 dataset increases chemical diversity when compared to Generation 1, which are of value to our industry partners. Large molecules (>20 heavy atoms) were also included, including more flexible molecules and a greater degree of conformational variation which provide intramolecular interactions. This is the complete optimization dataset used for training OpenFF 2.0.0 Sage, consisting of the following datasets:

'OpenFF Gen 2 Torsion Set 1 Roche',
'OpenFF Gen 2 Torsion Set 2 Coverage', 'OpenFF Gen 2 Torsion Set 3 Pfizer Discrepancy', 'OpenFF Gen 2 Torsion Set 4 eMolecules - Discrepancy', 'OpenFF Gen 2 Torsion Set 5 Bayer' and 'OpenFF Gen 2 Torsion Set 6 supplemental 2'. The `HydrogenBondFilter(method='baker-hubbard')` filter was applied, and the following record IDs were dropped due to issues with ForceBalance: 6098580, 2703504, 2703505, 18045478. Further information can be found in the curation scripts for the linked repositories.

In the optimization training set PR you linked the OpenFF Sage repo, as well as the directories of each of the source torsion drive sets. That was quite nice, could you please do that here as well?
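
For readers of the README, dropping the listed record IDs could look roughly like the sketch below. It is not the curation script from this PR: `drop_records` is a hypothetical helper, and the entries are modeled as plain dictionaries keyed by server address, each holding a list of `{"record_id", "cmiles"}` dictionaries, which is an assumed stand-in for the real result objects.

```python
from typing import Dict, List, Set

# Record IDs quoted in the README as dropped due to issues with ForceBalance.
DROPPED_RECORD_IDS: Set[int] = {6098580, 2703504, 2703505, 18045478}


def drop_records(
    entries: Dict[str, List[dict]], dropped_ids: Set[int]
) -> Dict[str, List[dict]]:
    """Return a copy of `entries` without results whose record_id is in `dropped_ids`."""
    kept: Dict[str, List[dict]] = {}
    for address, results in entries.items():
        remaining = [r for r in results if r["record_id"] not in dropped_ids]
        if remaining:  # skip addresses left with no results
            kept[address] = remaining
    return kept
```

The input is left unmodified, so the number of results before and after can be compared to confirm that exactly the expected records were removed.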

{
"dataset_name": "OpenFF Sage 2.0.0 Torsion Drive Training Dataset v1.0",
"dataset_tagline": "B3LYP-D3BJ/DZVP conformers applicable to drug-like molecules for OpenFF 2.0.0 Sage",
"description": "A quantum chemical (QC) dataset curated to train the OpenFF 2.0.0 Sage torsion potentials. This QC dataset with the OpenFF default level of theory, B3LYP-D3BJ/DZVP, is used to benchmark Sage geometries and energetics. These optimized conformer geometries where used to train one dimensional torsional profiles. This Generation 2 dataset increases chemical diversity when compared to Generation 1, which are of value to our industry partners. Large molecules (>20 heavy atoms) were also included, including more flexible molecules and a greater degree of conformational variation which provide intramolecular interactions. This is the complete optimization dataset used for training OpenFF 2.0.0 Sage, consisting of the following datasets: 'OpenFF Gen 2 Torsion Set 1 Roche', 'OpenFF Gen 2 Torsion Set 2 Coverage', 'OpenFF Gen 2 Torsion Set 3 Pfizer Discrepancy', 'OpenFF Gen 2 Torsion Set 4 eMolecules - Discrepancy', 'OpenFF Gen 2 Torsion Set 5 Bayer' and 'OpenFF Gen 2 Torsion Set 6 supplemental 2'. The `HydrogenBondFilter(method='baker-hubbard')` filter was applied, and the following record IDs were dropped due to issues with ForceBalance: 6098580, 2703504, 2703505, 18045478. Further information can be found in the curation scripts for the linked repositories.",

Suggested change
- "description": "A quantum chemical (QC) dataset curated to train the OpenFF 2.0.0 Sage torsion potentials. This QC dataset with the OpenFF default level of theory, B3LYP-D3BJ/DZVP, is used to benchmark Sage geometries and energetics. These optimized conformer geometries where used to train one dimensional torsional profiles. This Generation 2 dataset increases chemical diversity when compared to Generation 1, which are of value to our industry partners. Large molecules (>20 heavy atoms) were also included, including more flexible molecules and a greater degree of conformational variation which provide intramolecular interactions. This is the complete optimization dataset used for training OpenFF 2.0.0 Sage, consisting of the following datasets: 'OpenFF Gen 2 Torsion Set 1 Roche', 'OpenFF Gen 2 Torsion Set 2 Coverage', 'OpenFF Gen 2 Torsion Set 3 Pfizer Discrepancy', 'OpenFF Gen 2 Torsion Set 4 eMolecules - Discrepancy', 'OpenFF Gen 2 Torsion Set 5 Bayer' and 'OpenFF Gen 2 Torsion Set 6 supplemental 2'. The `HydrogenBondFilter(method='baker-hubbard')` filter was applied, and the following record IDs were dropped due to issues with ForceBalance: 6098580, 2703504, 2703505, 18045478. Further information can be found in the curation scripts for the linked repositories.",
+ "description": "A quantum chemical (QC) dataset curated to train the OpenFF 2.0.0 Sage torsion potentials. This QC dataset with the OpenFF default level of theory, B3LYP-D3BJ/DZVP, is used to benchmark Sage geometries and energetics. These optimized conformer geometries were used to train one dimensional torsional profiles. This Generation 2 dataset increases chemical diversity when compared to Generation 1, which are of value to our industry partners. Large molecules (>20 heavy atoms) were also included, including more flexible molecules and a greater degree of conformational variation which provide intramolecular interactions. This is the complete TorsionDrive dataset used for training OpenFF 2.0.0 Sage, consisting of the following datasets: 'OpenFF Gen 2 Torsion Set 1 Roche', 'OpenFF Gen 2 Torsion Set 2 Coverage', 'OpenFF Gen 2 Torsion Set 3 Pfizer Discrepancy', 'OpenFF Gen 2 Torsion Set 4 eMolecules - Discrepancy', 'OpenFF Gen 2 Torsion Set 5 Bayer' and 'OpenFF Gen 2 Torsion Set 6 supplemental 2'. The `HydrogenBondFilter(method='baker-hubbard')` filter was applied, and the following record IDs were dropped due to issues with ForceBalance: 6098580, 2703504, 2703505, 18045478. Further information can be found in the curation scripts for the linked repositories.",

Some typos and suggestions.

…g-Dataset-v1.0/generate-combined-dataset.py

Co-authored-by: Lily Wang <31115101+lilyminium@users.noreply.github.com>
@jaclark5 jaclark5 requested a review from lilyminium December 19, 2024 19:01