Release sage 2.0.0 #418

Merged
merged 11 commits into from
Dec 18, 2024

Collaborator

@jaclark5 jaclark5 commented Dec 13, 2024

New Submission Checklist

  • Created a new folder in the submissions directory containing the dataset
  • Added README.md describing the dataset (see here for examples)
  • All files used to produce the dataset are included with a description
  • Dataset follows the QCSubmit schema defined for Datasets, OptimizationDatasets and TorsionDriveDatasets
  • [NA] Dataset filename matches pattern dataset*.json; may feature a compression extension, such as .bz2
  • A PDF depicting the molecules is attached, in the case of torsiondrives this should include the highlighting of the central bond, this can be done automatically using qcsubmit.
  • QCSubmit validation passed
  • Made a new dataset entry in the mapping table in repository README.md
  • Ready to submit!

@openff-dangerbot
Contributor

QCSubmit Validation Report

submissions/2024-12-12-OpenFF-Sage-2.0.0-Training-Optimization-Dataset-v1.0/dataset.json.bz2
Dataset Name OpenFF Sage 2.0.0 Training Optimization v1.0
Dataset Type OptimizationDataset
Elements O ,Br ,H ,F ,P ,C ,Cl ,N ,I ,S
Valid Cmiles 🔥
Connected Dihedrals 🔥
No Linear Torsions 🔥
No Molecular Complexes 🔥
Valid Constraints 🔥
Complete Metatdata 🔥

QC Specification Report

submissions/2024-12-12-OpenFF-Sage-2.0.0-Training-Optimization-Dataset-v1.0/dataset.json.bz2/default
Specification Name default
Method B3LYP-D3BJ
Basis DZVP
Wavefunction Protocol none
Implicit Solvent
Keywords {}
Validated 🔥
Valid SCF Properties 🔥
Full Basis Coverage 🔥
QCSubmit version information(click to expand)
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

Collaborator

@ntBre ntBre left a comment

This looks great overall! I think all of my comments are just subjectively what I think we "usually" do, and not necessarily requirements. I know you said on Slack this isn't ready for approval, but just to make a note, there are some missing files too:

  • The input conda environment
  • The final, fully-resolved conda environment (from conda env export)
  • The output of the Python script

This last one is bordering on a "nice to have" rather than a strict requirement. I've definitely made some submissions without it, but it helps to fill the gaps between the Jupyter notebook submissions and a plain script (which I prefer for reviewing). I think *.log files are ignored in this repo, so make sure you save it as output.txt or something instead.

This one is covered by the checklist, but you'll also want to update the table in the main README.

Again, it looks good overall, so I'd usually drop a preemptive Approve here, but given the circumstances, I think I'll leave explicit approval to Lily.

Btw, nice job with the git-lfs stuff. I definitely messed that up on my first dataset.


### Description

A quantum chemical (QC) dataset curated to train the [OpenFF 2.0.0 Sage](https://github.com/openforcefield/openff-sage) force field, with reparametrized Lennard-Jones (LJ) and valence parameters, the latter relevant to this dataset. This QC dataset, computed at the OpenFF default level of theory, B3LYP-D3BJ/DZVP, is used to benchmark Sage geometries and energetics. These optimized conformer geometries were used in conjunction with the QC dataset used to train one-dimensional torsional profiles. This Generation 2 dataset increases chemical diversity compared to Generation 1, which is of value to our industry partners. Large molecules (>20 heavy atoms) were also included, including more flexible molecules and a greater degree of conformational variation, which provide intramolecular interactions.
Collaborator

This is a pretty thorough description! Personally, I'd probably leave most of these details in the original datasets and focus on the fact that this is a "publication version" of the already-used Sage 2.0.0 training data. Something more like "This is the complete optimization dataset used for training Sage 2.0.0, consisting of the x, y, z datasets, which were further filtered to remove ...", where x, y, z are the names (or links!) to the original datasets, and ... is a summary or even a list of the exact Filters applied in the Sage 2.0.0 curation script.

Contributor

Mostly agree with Brent, there's not a huge downside to keeping this information in but I'd prioritize listing the names of the "origin" datasets in case people want to go hunting for further provenance information. As Brent mentioned I would also describe the filters used to curate the dataset, ideally in words but noting that further information can be found in the repo scripts.


- Date: 2024 12 12
- Class: OpenFF Optimization Dataset
- Purpose: B3LYP-D3BJ/DZVP conformers applicable to drug-like molecules for OpenFF 2.0.0 Sage
Collaborator

Similar to above, I'd focus on this exact dataset (publication version of existing datasets), not on the high-level purpose. Something like Sage 2.0.0 training data, optionally throwing in Complete set or something to emphasize that this is the totality of the optimization training data.

- Date: 2024 12 12
- Class: OpenFF Optimization Dataset
- Purpose: B3LYP-D3BJ/DZVP conformers applicable to drug-like molecules for OpenFF 2.0.0 Sage
- Collection: OptimizationDataset
Collaborator

nit: I'm not familiar with this entry, I thought this was covered by the Class above. Just marking as a "nit" since I don't think it hurts to have it either.

Collaborator Author

It came with the template README.md file when I followed the User Quickstart instructions. Is this not needed now?

Contributor

Yes, it's probably not needed anymore; it was for the previous QCStack API. At the least, it should be swapped for something like dataset type: "optimization".

Background
I would guess this redundancy is from the differentiation between OpenFF-QCSubmit and MolSSI's QCPortal -- the OpenFF Optimization Dataset probably corresponds to a QCSubmit class whereas Collection: OptimizationDataset was likely intended for functions like FractalClient.list_collections("OptimizationDataset"). (as a side-note, those docs are for a legacy version -- up-to-date docs are now at the molssi github subdomain https://molssi.github.io/QCFractal/user_guide/datasets.html). I think that functionality has now been removed in the 0.50+ version of the now-named PortalClient -- all I see is PortalClient.list_datasets now. To be fully user-friendly, PortalClient.get_dataset requires both the dataset type and name, so swapping it to that information would be more useful.
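As a rough illustration of why the dataset type and name together are the useful pair to record in the README, here is a minimal sketch of a lookup that needs both. `FakePortalClient` is a hypothetical stand-in so the example runs without a server, and the dictionary keys returned by its `list_datasets` are an assumption about the listing format, not something confirmed in this thread. Only the method names `list_datasets` and `get_dataset` come from the comment above.

```python
class FakePortalClient:
    """Hypothetical stand-in for a qcportal PortalClient (no network access)."""

    def __init__(self, datasets):
        # Maps (dataset_type, dataset_name) -> dataset object.
        self._datasets = datasets

    def list_datasets(self):
        # Assumed shape: one dict per dataset holding its type and name.
        return [
            {"dataset_type": dtype, "dataset_name": name}
            for dtype, name in self._datasets
        ]

    def get_dataset(self, dataset_type, dataset_name):
        return self._datasets[(dataset_type, dataset_name)]


def fetch_dataset(client, dataset_type, dataset_name):
    """Fetch a dataset after checking that the (type, name) pair is listed."""
    listed = {
        (entry["dataset_type"], entry["dataset_name"])
        for entry in client.list_datasets()
    }
    if (dataset_type, dataset_name) not in listed:
        raise KeyError(f"no {dataset_type!r} dataset named {dataset_name!r}")
    return client.get_dataset(dataset_type, dataset_name)


NAME = "OpenFF Sage 2.0.0 Training Optimization v1.0"
client = FakePortalClient({("optimization", NAME): "<dataset object>"})
dataset = fetch_dataset(client, "optimization", NAME)
```

With a real client, the same `fetch_dataset` helper would be called with the type and name copied straight from the README entry.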

- Class: OpenFF Optimization Dataset
- Purpose: B3LYP-D3BJ/DZVP conformers applicable to drug-like molecules for OpenFF 2.0.0 Sage
- Collection: OptimizationDataset
- Name: OpenFF Sage 2.0.0 Training Optimization v1.0
Collaborator

nit: It sounds kind of redundant in the context of the repo, but we often include Dataset in the dataset names. From checking the README, it looks like that was less common in earlier datasets, so this is probably fine; it just reads a little weird to me now.

Contributor

Agree with Brent, or at least having this match the name of the directory (which does include -Dataset) would be preferable!

Comment on lines 67 to 68
molecule=None,
initial_molecules=[rec.initial_molecule for rec, _ in records],
Collaborator

(not a review) Just for your future reference, this is the key difference forcing the use of the existing (qcelemental) molecules. The default usage here is to provide an openff.toolkit.Molecule to the molecule kwarg, which does the round-trip through the toolkit Lexie mentioned in her comments. Providing molecule=None and initial_molecules (which are qcelemental.models.Molecules) instead bypasses the round trip.
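A minimal sketch of the bypass described above, using stand-ins so it runs without the OpenFF stack. `RecordingDataset` is a hypothetical placeholder for the dataset object, and using the CMILES string as `index` is an assumption; only the `molecule=None` and `initial_molecules=[...]` arguments are taken from the lines under review. A real call may need further keyword arguments that are not shown in this thread.

```python
from types import SimpleNamespace


class RecordingDataset:
    """Hypothetical stand-in that only records add_molecule calls."""

    def __init__(self):
        self.calls = []

    def add_molecule(self, index, molecule, **kwargs):
        self.calls.append({"index": index, "molecule": molecule, **kwargs})


def add_grouped_records(dataset, records_by_cmiles):
    """Add each group of conformers without rebuilding a toolkit molecule."""
    for cmiles, records in records_by_cmiles.items():
        dataset.add_molecule(
            index=cmiles,  # assumption: entries are indexed by CMILES
            molecule=None,  # no toolkit molecule, so no round trip
            initial_molecules=[rec.initial_molecule for rec, _ in records],
        )


# Toy records: strings stand in for qcelemental molecules.
rec_a = SimpleNamespace(initial_molecule="qcel-molecule-a")
rec_b = SimpleNamespace(initial_molecule="qcel-molecule-b")
grouped = {"toy-cmiles": [(rec_a, "offmol"), (rec_b, "offmol")]}

dataset = RecordingDataset()
add_grouped_records(dataset, grouped)
```

The stored geometries are passed through untouched, which is the point of the `molecule=None` form.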

Contributor

Thanks Brent, helpful to know!

@jaclark5 jaclark5 requested a review from ntBre December 13, 2024 17:48
"/Users/jennifer.clark/bin/openff-sage/data-set-curation/quantum-chemical/data-sets/1-2-0-opt-set-v3.json",
# "/Users/jennifer.clark/bin/openff-sage/data-set-curation/quantum-chemical/data-sets/1-2-0-td-set.json",
)
file = requests.get("https://raw.githubusercontent.com/openforcefield/openff-sage/refs/heads/main/data-set-curation/quantum-chemical/data-sets/1-2-0-opt-set-v3.json")
Collaborator

If you want to be super, super pedantic, you can use the GitHub permalink: https://raw.githubusercontent.com/openforcefield/openff-sage/37a36e7eeaf6cdca795847089a288bdff168c08a/data-set-curation/quantum-chemical/data-sets/1-2-0-opt-set-v3.json. It's kinda annoying to get this (click the ... in the upper right, copy permalink, then click raw for the raw file at the permalink), but it should always point to this exact file version instead of whatever main happens to be. This is especially unlikely to matter here, though, because this repo is basically an archive anyway.
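The difference between a branch URL and a permalink can be sketched as follows. The helper names `raw_github_url` and `is_pinned` are hypothetical; the owner, repository, path, and commit SHA are the ones quoted in the comment above. The sketch only builds the URL and does not download anything.

```python
from urllib.parse import quote

RAW_HOST = "https://raw.githubusercontent.com"


def raw_github_url(owner: str, repo: str, ref: str, path: str) -> str:
    """Build a raw-file URL; `ref` may be a branch name or a commit SHA."""
    return f"{RAW_HOST}/{owner}/{repo}/{ref}/{quote(path)}"


def is_pinned(ref: str) -> bool:
    """True when `ref` looks like a full 40-character commit SHA."""
    return len(ref) == 40 and all(c in "0123456789abcdef" for c in ref.lower())


PATH = "data-set-curation/quantum-chemical/data-sets/1-2-0-opt-set-v3.json"
SHA = "37a36e7eeaf6cdca795847089a288bdff168c08a"

# Pinned: always resolves to the same file contents.
pinned_url = raw_github_url("openforcefield", "openff-sage", SHA, PATH)
```

Passing `pinned_url` to `requests.get`, as the script already does for the branch URL, would then fetch the file exactly as it existed at that commit.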

@jaclark5 jaclark5 requested a review from lilyminium December 13, 2024 20:17
Contributor

@lilyminium lilyminium left a comment

This mostly looks great -- thanks @jaclark5 for putting this together and @ntBre for the thorough review! Mine largely echoes Brent's comments, but I'm also not sure that the CMILES being used to index molecules between datasets will remain the same between software versions; it would be more robust to just use the CMILES in the original dataset, as commented.

Outside the scope of this PR but just a thought -- we should also consider having a separate table for force field datasets in the top-level README, otherwise it's a slight hassle to find.


### General Information

- Date: 2024 12 12
Contributor

Suggested change
- Date: 2024 12 12
- Date: 2024-12-12

- Number of conformers min mean max 1.00, 3.53, 10.00
- Mean molecular weight: 261.38
- Max molecular weight: 544.64
- Set of charges: -2.0 -1.0 0.0 1.0
Contributor

Suggested change
- Set of charges: -2.0 -1.0 0.0 1.0
- Set of charges: -2.0, -1.0, 0.0, 1.0

Just a bit easier to read!


### QCSubmit generation pipeline

- `generate-combined-dataset.ipynb`: A notebook which shows how the dataset was prepared from the input files.
Contributor

Suggested change
- `generate-combined-dataset.ipynb`: A notebook which shows how the dataset was prepared from the input files.
- `generate-combined-dataset.py`: A script which shows how the dataset was prepared from the input files.

I'd also add output.txt since that's the output of this script.

provenance1 = dataset_factory1.provenance(ToolkitRegistry([RDKitToolkitWrapper]))

dataset1 = OptimizationDataset(
dataset_name="OpenFF Sage 2.0.0 Training Optimization v1.0",
Contributor

As above -- could the name of this dataset please match the directory name?

initial_mols = [rec[0].initial_molecule for rec in rec_and_mol]

dataset_factory1 = OptimizationDatasetFactory()
provenance1 = dataset_factory1.provenance(ToolkitRegistry([RDKitToolkitWrapper]))
Contributor

Just looking at your environment, it looks like OpenEye is installed. Do you have a license? If so, it takes precedence over RDKit so probably best to add that here too.

Contributor

Just following up on this, it's less important if CMILES isn't being generated, but if RDKit is there OpenEye should be considered too.

# Have to add records this way to avoid a round trip through the toolkit.
records_by_cmiles = {}
for record, molecule in rec_and_mol:
    cmiles = molecule.to_smiles(isomeric=True, explicit_hydrogens=True, mapped=True)
Contributor

A bit nitpicky, but since this might establish a protocol for future tasks -- there's no guarantee that running a to_smiles will give you the same cmiles for the same molecule across different OpenEye/RDKit versions, or especially if you have one toolkit installed but another was used to generate the source datasets. It would be slightly more robust to use the exact cmiles in the dataset result. The easiest way to do this is probably by inspecting filtered_and_combined.entries, which is a dictionary where the key is the portal address and the value is a list of OptimizationResults. Each OptimizationResult has an id and cmiles attribute that should correspond to the record.id of each record in rec_and_mol. Otherwise there may be other ways similar to what QCSubmit does when it downloads datasets (https://github.com/openforcefield/openff-qcsubmit/blob/d4e6b6986a58f5cf0184ba14dc4f7419e9978b67/openff/qcsubmit/results/results.py#L307-L353), but I haven't looked into those.

I'm not sure this actually matters very much; at the end of the day an OptimizationResultCollection or OptimizationDataset is a flat list of single-conformer results, and functionality like our ConformerRMSDFilter checks whether each record is a conformer of an existing molecule on the fly. However, it would be better to keep the cmiles consistent between datasets, just in case it does matter in the future.
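The lookup suggested above can be sketched like this: build a record-id to CMILES map from the collection's `entries`, then group records by the stored string instead of regenerating it with a toolkit. `StoredResult`, `cmiles_by_record_id`, and `group_by_stored_cmiles` are hypothetical names, the attribute name `record_id` is an assumption, and the CMILES strings are toy values for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StoredResult:
    """Hypothetical stand-in for one result entry in the collection."""

    record_id: int
    cmiles: str


def cmiles_by_record_id(entries):
    """Map record id -> stored CMILES across every server address."""
    return {
        result.record_id: result.cmiles
        for results in entries.values()
        for result in results
    }


def group_by_stored_cmiles(record_ids, id_to_cmiles):
    """Group record ids by the CMILES already stored with each result."""
    grouped = defaultdict(list)
    for record_id in record_ids:
        grouped[id_to_cmiles[record_id]].append(record_id)
    return dict(grouped)


# Toy data: two conformers of one molecule and a single conformer of another.
entries = {
    "https://api.qcarchive.molssi.org:443/": [
        StoredResult(1, "[O:1]([H:2])[H:3]"),
        StoredResult(2, "[O:1]([H:2])[H:3]"),
        StoredResult(3, "[C:1]([H:2])([H:3])([H:4])[H:5]"),
    ]
}
id_to_cmiles = cmiles_by_record_id(entries)
grouped = group_by_stored_cmiles([1, 2, 3], id_to_cmiles)
```

Because the strings are read rather than recomputed, the grouping does not depend on which toolkit or toolkit version happens to be installed.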

Collaborator Author

Oh thanks for this! I did notice that the CMILES changes depending on whether I use RDKit or OpenEye.

Collaborator

Good catch Lily! Just in Jen's defense, I believe this code came straight from Lexie's NAGL2 combined dataset script. So I've also reviewed and approved it twice now, missing this subtlety. Also this is how OptimizationResultCollection.create_basic_dataset is implemented (except there explicit_hydrogens=False, mapped=False are passed), so Lexie probably got the code from there anyway.

This section of qcsubmit contains all of the locations you can check for a CMILES in an OptimizationResultCollection: https://github.com/openforcefield/openff-qcsubmit/blob/d4e6b6986a58f5cf0184ba14dc4f7419e9978b67/openff/qcsubmit/results/results.py#L483-L496. You can also be a bit clever with bool chaining:

 def get_cmiles(entry):
     mol = entry.initial_molecule
     return (
         mol.identifiers.canonical_isomeric_explicit_hydrogen_mapped_smiles
         or mol.extras.get("canonical_isomeric_explicit_hydrogen_mapped_smiles")
         or entry.attributes.get(
             "canonical_isomeric_explicit_hydrogen_mapped_smiles"
         )
     )

But entry here has to be fetched from qcportal, so I agree that pulling the CMILES from the OptimizationResultCollection itself would be easier.

As a final observation, I think this line is not actually the problematic use of to_smiles in the script because this cmiles is only used to key a dict. to_smiles is called again on line 65, where it's actually stored in the dataset and would cause the kind of problem Lily is pointing out.

Contributor

Thanks, good catches Brent, yes line 65 is definitely where it matters!

You're more familiar with the code here than me -- if you think this will be problematic for create_basic_dataset, could you please raise an issue in qcsubmit?

@jaclark5 please ping me when you'd like a re-review -- I have notifications turned off for PR pushes so won't see any changes made.

Collaborator

Just opened openforcefield/openff-qcsubmit#310 referencing this.

@openff-dangerbot
Contributor

QCSubmit Validation Report

submissions/2024-12-12-OpenFF-Sage-2.0.0-Training-Optimization-Dataset-v1.0/dataset.json.bz2
Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type OptimizationDataset
Elements H ,C ,Cl ,P ,F ,Br ,S ,N ,I ,O
Valid Cmiles 🔥
Connected Dihedrals 🔥
No Linear Torsions 🔥
No Molecular Complexes 🔥
Valid Constraints 🔥
Complete Metatdata 🔥

QC Specification Report

submissions/2024-12-12-OpenFF-Sage-2.0.0-Training-Optimization-Dataset-v1.0/dataset.json.bz2/default
Specification Name default
Method B3LYP-D3BJ
Basis DZVP
Wavefunction Protocol none
Implicit Solvent
Keywords {}
Validated 🔥
Valid SCF Properties 🔥
Full Basis Coverage 🔥
QCSubmit version information(click to expand)
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

Contributor

@lilyminium lilyminium left a comment

LGTM @jaclark5, just one final question on OpenEye!

I'll also add a tracking label to this -- in QCA-DS a lot of action happens in the CI, which monitors pull requests with particular labels to trigger workflows. The tracking label means a PR gets tracked on our project board, which in turn has associated CI workflows to manage dataset submission and error cycling. In general we try to add the tracking label and any compute- labels before a PR gets merged, although we don't expect this one to require any additional computation.


def pull_record_id_cmiles(Opt: Type[OptimizationResultCollection]):
Contributor

@lilyminium lilyminium Dec 17, 2024

(non-blocking since the type signature won't affect performance) -- I hadn't actually seen typing.Type used before so had to look it up. It looks like it's an alias of type? If so I think Opt: OptimizationResultCollection is correct -- this tells the typer that Opt should be an object of OptimizationResultCollection, whereas type would refer to a class itself.
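The distinction can be shown with a small, self-contained sketch. `ResultCollection` is a hypothetical stand-in for the real collection class; the two functions differ only in whether the annotation names an instance or the class itself.

```python
from typing import Type


class ResultCollection:
    """Hypothetical stand-in for a results collection class."""

    def __init__(self, n_results: int) -> None:
        self.n_results = n_results


def count_results(collection: ResultCollection) -> int:
    """Annotated with the class name: expects an instance."""
    return collection.n_results


def build_collection(cls: Type[ResultCollection], n_results: int) -> ResultCollection:
    """Annotated with Type[...]: expects the class object itself."""
    return cls(n_results)


collection = build_collection(ResultCollection, 3)
n_results = count_results(collection)
```

A function that receives an already-built collection, as in the line under review, matches the first form.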

Contributor

If black did this automatically, it might be because of the capitalization of Opt confusing it.

Contributor

@lilyminium lilyminium left a comment

LGTM -- please merge when you're ready!

@jaclark5 jaclark5 merged commit 7f8ed2a into master Dec 18, 2024
1 check passed
@jaclark5 jaclark5 deleted the release_sage_2.0.0 branch December 18, 2024 22:36
@openff-dangerbot
Contributor

Lifecycle - QCSubmit Submission Report : SUCCESS

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-18 22:48 UTC

Response from public QCArchive:

None

QCSubmit version information(click to expand)
version
openff.qcsubmit 0.53.0
openff.toolkit 0.16.4
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.1

@openff-dangerbot
Contributor

Current status - Error Cycling

Consider manually moving this.

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-19 04:10 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 0 0 3663 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-19 12:04 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 0 0 3663 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-19 18:25 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 27 48 3588 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3
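
Note: the status rows in these reports are whitespace-delimited counts under the header columns. A minimal sketch of reading one row and computing the completed fraction is below. The function name `parse_status_row` is hypothetical, and the header and row are copied from the 2024-12-19 18:25 UTC report above, not fetched from QCArchive.

```python
# Header and row copied from the 2024-12-19 18:25 UTC Error Cycling Report.
HEADER = "specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED"
ROW = "default 27 48 3588 0 0 0 0"


def parse_status_row(header: str, row: str) -> tuple[str, dict[str, int]]:
    """Split a status row into (specification name, {status: count})."""
    columns = header.split()[1:]  # drop the leading 'specification' label
    name, *values = row.split()
    if len(values) != len(columns):
        raise ValueError("row does not match the header columns")
    return name, dict(zip(columns, map(int, values)))


spec, counts = parse_status_row(HEADER, ROW)
total = sum(counts.values())
fraction_complete = counts["COMPLETE"] / total
print(f"{spec}: {counts['COMPLETE']}/{total} records complete "
      f"({fraction_complete:.1%})")
```

The same function applies to the later rows in this thread, since each one uses the same column layout.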

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-20 12:04 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-21 02:59 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-21 03:21 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-21 12:03 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-22 12:04 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-23 12:04 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-24 12:04 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-25 12:04 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-26 12:04 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-27 12:04 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-28 12:03 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-29 12:03 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-30 12:04 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2024-12-31 12:04 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2025-01-01 12:04 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2025-01-02 12:04 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2025-01-03 12:04 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2025-01-04 12:03 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2025-01-05 12:04 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot
Contributor

Lifecycle - Error Cycling Report

Dataset Name OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
Dataset Type optimization
UTC Datetime 2025-01-06 12:04 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 176 0 3487 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.7
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

Projects
Status: Error Cycling
4 participants