Interaction Datasets #40
Merged
51 commits
bd3fcf9 started splitting datasets into 'interaction' and 'potential' (mcneela)
a800ea5 add num_unique_molecules property (mcneela)
9d6fca6 added logging (mcneela)
794e63f started base interaction dataset (mcneela)
0db4765 add interaction __init__ file and revise potential __init__ file (mcneela)
6e5a002 add des370k interaction to config_factory.py (mcneela)
8e1e003 have BaseInteractionDataset inherit BaseDataset (mcneela)
d68bae6 implemented read_raw_entries for DES370K (mcneela)
5e94d67 finished implementation of DES370K interaction (mcneela)
3c9508b finished implementation of DES370K interaction (mcneela)
768fb2e update BaseDataset import path (mcneela)
8aeadd8 added Metcalf dataset (mcneela)
9cf6034 updated DES370K based on Prudencio's comments (mcneela)
ce2c53b Merge branch 'interaction' into metcalf (mcneela)
6206665 added const molecule_groups lookup for DES370K dataset (mcneela)
5cb57d9 updated subsets for DES370K (mcneela)
e18b710 added download url for des5m_interaction (mcneela)
54cadbf updated README with new datasets (mcneela)
7f83eb5 Merge branch 'metcalf' into interaction (mcneela)
a922ef7 Added DES5M dataset (mcneela)
2146058 added des_s66 dataset (mcneela)
4d9a4ba added DESS66x8 dataset (mcneela)
c2229e3 small update to __init__ file (mcneela)
9349454 added L7 dataset (mcneela)
c3bdc64 added X40 dataset (mcneela)
23c0739 add new datasets to __init__.py (mcneela)
74f87a6 added splinter dataset (mcneela)
f046ea9 fixed a couple splinter things (mcneela)
3c84ee9 update default data shapes for interaction datasets (mcneela)
04c81ae updated test_dummy.py with new import structure (mcneela)
11e2858 fix test_import.py (mcneela)
78f0423 code cleanup for the linter (mcneela)
bd58fdf fix ani import (mcneela)
5dfcf55 Merge branch 'refactoring' into interaction (mcneela)
4bc3a49 fix base dataset import (mcneela)
b046eea black formatting (mcneela)
fe54044 ran precommit (mcneela)
ef2528c removed DES from datasets/__init__.py (mcneela)
c0ef5b1 removed DES from datasets/__init__.py (mcneela)
ad55296 fix X40 energy methods (mcneela)
0a51e7c added interaction dataset docstrings (mcneela)
b6c3a6a update readme with all interaction datasets (mcneela)
07f70b8 update metcalf __energy_methods__ (mcneela)
1443450 refactored des370k and des5m (mcneela)
802b70b update base interaction dataset to add n_atoms_first property (mcneela)
e969b54 update L7 and X40 to use python base yaml package (mcneela)
5725fed modify interaction/base.py to save keys other than force/energy in pr… (mcneela)
6c6b286 fix base dataset issue (mcneela)
46c5ebe fix circular imports (mcneela)
d5ec053 merge origin/develop into interaction (mcneela)
cb9987c removed print statements (mcneela)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,25 +1,4 @@ | ||
from .base import BaseDataset # noqa | ||
from .interaction import AVAILABLE_INTERACTION_DATASETS # noqa | ||
from .interaction import DES # noqa | ||
from .potential import AVAILABLE_POTENTIAL_DATASETS # noqa | ||
from .potential.ani import ANI1, ANI1CCX, ANI1X # noqa | ||
from .potential.comp6 import COMP6 # noqa | ||
from .potential.dummy import Dummy # noqa | ||
from .potential.gdml import GDML # noqa | ||
from .potential.geom import GEOM # noqa | ||
from .potential.iso_17 import ISO17 # noqa | ||
from .potential.molecule3d import Molecule3D # noqa | ||
from .potential.multixcqm9 import MultixcQM9 # noqa | ||
from .potential.nabladft import NablaDFT # noqa | ||
from .potential.orbnet_denali import OrbnetDenali # noqa | ||
from .potential.pcqm import PCQM_B3LYP, PCQM_PM6 # noqa | ||
from .potential.qm7x import QM7X # noqa | ||
from .potential.qmugs import QMugs # noqa | ||
from .potential.sn2_rxn import SN2RXN # noqa | ||
from .potential.solvated_peptides import SolvatedPeptides # noqa | ||
from .potential.spice import Spice # noqa | ||
from .potential.tmqm import TMQM # noqa | ||
from .potential.transition1x import Transition1X # noqa | ||
from .potential.waterclusters3_30 import WaterClusters # noqa | ||
|
||
AVAILABLE_DATASETS = {**AVAILABLE_POTENTIAL_DATASETS, **AVAILABLE_INTERACTION_DATASETS} |
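The combined `AVAILABLE_DATASETS` mapping is built with dict unpacking, where a key present in both registries would be silently taken from the later one. A minimal sketch of a merge that fails loudly on such a collision instead; the two classes and the `merge_registries` helper are placeholders invented for illustration, not part of this PR:

```python
# Placeholder classes standing in for real dataset classes (hypothetical).
class PotentialA:
    pass


class InteractionB:
    pass


AVAILABLE_POTENTIAL_DATASETS = {"potential_a": PotentialA}
AVAILABLE_INTERACTION_DATASETS = {"interaction_b": InteractionB}


def merge_registries(*registries):
    """Merge name -> class mappings, raising if any name appears twice."""
    merged = {}
    for registry in registries:
        overlap = merged.keys() & registry.keys()
        if overlap:
            raise KeyError(f"duplicate dataset names: {sorted(overlap)}")
        merged.update(registry)
    return merged


AVAILABLE_DATASETS = merge_registries(
    AVAILABLE_POTENTIAL_DATASETS, AVAILABLE_INTERACTION_DATASETS
)
```

With disjoint keys, as in the diff, this gives the same result as `{**a, **b}`; the only difference is the explicit error when two registries share a name.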
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,121 @@ | ||
import os | ||
from typing import Dict, List | ||
|
||
import numpy as np | ||
import yaml | ||
from loguru import logger | ||
|
||
from openqdc.datasets.interaction.base import BaseInteractionDataset | ||
from openqdc.utils.molecule import atom_table | ||
|
||
|
||
class DataItemYAMLObj: | ||
def __init__(self, name, shortname, geometry, reference_value, setup, group, tags): | ||
self.name = name | ||
self.shortname = shortname | ||
self.geometry = geometry | ||
self.reference_value = reference_value | ||
self.setup = setup | ||
self.group = group | ||
self.tags = tags | ||
|
||
|
||
class DataSetYAMLObj: | ||
def __init__(self, name, references, text, method_energy, groups_by, groups, global_setup): | ||
self.name = name | ||
self.references = references | ||
self.text = text | ||
self.method_energy = method_energy | ||
self.groups_by = groups_by | ||
self.groups = groups | ||
self.global_setup = global_setup | ||
|
||
|
||
def data_item_constructor(loader: yaml.SafeLoader, node: yaml.nodes.MappingNode): | ||
"""Construct a DataItemYAMLObj from a YAML mapping node.""" | ||
return DataItemYAMLObj(**loader.construct_mapping(node)) | ||
|
||
|
||
def dataset_constructor(loader: yaml.SafeLoader, node: yaml.nodes.MappingNode): | ||
"""Construct a DataSetYAMLObj from a YAML mapping node.""" | ||
return DataSetYAMLObj(**loader.construct_mapping(node)) | ||
|
||
|
||
def get_loader(): | ||
"""Add constructors to PyYAML loader.""" | ||
loader = yaml.SafeLoader | ||
loader.add_constructor("!ruby/object:ProtocolDataset::DataSetItem", data_item_constructor) | ||
loader.add_constructor("!ruby/object:ProtocolDataset::DataSetDescription", dataset_constructor) | ||
return loader | ||
|
||
|
||
class L7(BaseInteractionDataset): | ||
""" | ||
The L7 interaction energy dataset as described in: | ||
|
||
Accuracy of Quantum Chemical Methods for Large Noncovalent Complexes | ||
Robert Sedlak, Tomasz Janowski, Michal Pitoňák, Jan Řezáč, Peter Pulay, and Pavel Hobza | ||
Journal of Chemical Theory and Computation 2013 9 (8), 3364-3374 | ||
DOI: 10.1021/ct400036b | ||
|
||
Data was downloaded and extracted from: | ||
http://cuby4.molecular.cz/dataset_l7.html | ||
""" | ||
|
||
__name__ = "L7" | ||
__energy_unit__ = "hartree" | ||
__distance_unit__ = "ang" | ||
__forces_unit__ = "hartree/ang" | ||
__energy_methods__ = [ | ||
"CSD(T) | QCISD(T)", | ||
"DLPNO-CCSD(T)", | ||
"MP2/CBS", | ||
"MP2C/CBS", | ||
"fixed", | ||
"DLPNO-CCSD(T0)", | ||
"LNO-CCSD(T)", | ||
"FN-DMC", | ||
] | ||
|
||
energy_target_names = [] | ||
|
||
def read_raw_entries(self) -> List[Dict]: | ||
yaml_fpath = os.path.join(self.root, "l7.yaml") | ||
logger.info(f"Reading L7 interaction data from {self.root}") | ||
yaml_file = open(yaml_fpath, "r") | ||
data = [] | ||
data_dict = yaml.load(yaml_file, Loader=get_loader()) | ||
charge0 = int(data_dict["description"].global_setup["molecule_a"]["charge"]) | ||
charge1 = int(data_dict["description"].global_setup["molecule_b"]["charge"]) | ||
|
||
for idx, item in enumerate(data_dict["items"]): | ||
energies = [] | ||
name = np.array([item.shortname]) | ||
fname = item.geometry.split(":")[1] | ||
energies.append(item.reference_value) | ||
xyz_file = open(os.path.join(self.root, f"{fname}.xyz"), "r") | ||
lines = list(map(lambda x: x.strip().split(), xyz_file.readlines())) | ||
lines.pop(1) | ||
n_atoms = np.array([int(lines[0][0])], dtype=np.int32) | ||
n_atoms_first = np.array([int(item.setup["molecule_a"]["selection"].split("-")[1])], dtype=np.int32) | ||
subset = np.array([item.group]) | ||
energies += [float(val[idx]) for val in list(data_dict["alternative_reference"].values())] | ||
energies = np.array([energies], dtype=np.float32) | ||
pos = np.array(lines[1:])[:, 1:].astype(np.float32) | ||
elems = np.array(lines[1:])[:, 0] | ||
atomic_nums = np.expand_dims(np.array([atom_table.GetAtomicNumber(x) for x in elems]), axis=1) | ||
natoms0 = n_atoms_first[0] | ||
natoms1 = n_atoms[0] - natoms0 | ||
charges = np.expand_dims(np.array([charge0] * natoms0 + [charge1] * natoms1), axis=1) | ||
atomic_inputs = np.concatenate((atomic_nums, charges, pos), axis=-1, dtype=np.float32) | ||
|
||
item = dict( | ||
energies=energies, | ||
subset=subset, | ||
n_atoms=n_atoms, | ||
n_atoms_first=n_atoms_first, | ||
atomic_inputs=atomic_inputs, | ||
name=name, | ||
) | ||
data.append(item) | ||
return data |
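The per-item parsing in `read_raw_entries` can be hard to follow in diff form: it reads the atom count from the first XYZ line, drops the comment line, takes the size of the first monomer from a `"1-N"` selection string, and tags every atom with the charge of the monomer it belongs to. A standard-library sketch of that flow, assuming a hypothetical `parse_dimer_xyz` helper and a tiny made-up element table in place of openqdc's `atom_table`; the geometry is invented for illustration:

```python
from typing import List, Tuple

# Hypothetical mini lookup table; the PR itself uses openqdc's atom_table.
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8}

AtomRow = Tuple[int, int, float, float, float]


def parse_dimer_xyz(
    text: str, selection: str, charge_a: int, charge_b: int
) -> Tuple[int, int, List[AtomRow]]:
    """Return (n_atoms, n_atoms_first, rows) for a two-monomer XYZ string.

    Each row is (atomic_number, monomer_charge, x, y, z).
    """
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])
    atom_lines = lines[2 : 2 + n_atoms]  # skip the count and comment lines
    # A selection such as "1-3" means atoms 1..3 belong to the first monomer.
    n_atoms_first = int(selection.split("-")[1])
    rows = []
    for i, line in enumerate(atom_lines):
        elem, x, y, z = line.split()[:4]
        charge = charge_a if i < n_atoms_first else charge_b
        rows.append((ATOMIC_NUMBERS[elem], charge, float(x), float(y), float(z)))
    return n_atoms, n_atoms_first, rows


# Made-up geometry, for illustration only.
EXAMPLE_XYZ = """5
illustrative comment line
O 0.00 0.00 0.00
H 0.00 0.00 0.96
H 0.93 0.00 -0.24
N 3.00 0.00 0.00
H 3.00 0.00 1.01
"""

n_atoms, n_first, rows = parse_dimer_xyz(EXAMPLE_XYZ, "1-3", charge_a=0, charge_b=-1)
```

When reading from disk, wrapping the file access in `with open(path) as fh:` would also close each handle promptly, which the bare `open(...)` calls in the diff leave to the garbage collector.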
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
import os | ||
from typing import Dict, List | ||
|
||
import numpy as np | ||
import yaml | ||
from loguru import logger | ||
|
||
from openqdc.datasets.interaction.base import BaseInteractionDataset | ||
from openqdc.datasets.interaction.L7 import get_loader | ||
from openqdc.utils.molecule import atom_table | ||
|
||
|
||
class X40(BaseInteractionDataset): | ||
""" | ||
X40 interaction dataset of 40 dimer pairs as | ||
introduced in the following paper: | ||
|
||
Benchmark Calculations of Noncovalent Interactions of Halogenated Molecules | ||
Jan Řezáč, Kevin E. Riley, and Pavel Hobza | ||
Journal of Chemical Theory and Computation 2012 8 (11), 4285-4292 | ||
DOI: 10.1021/ct300647k | ||
|
||
Dataset retrieved and processed from: | ||
http://cuby4.molecular.cz/dataset_x40.html | ||
""" | ||
|
||
__name__ = "X40" | ||
__energy_unit__ = "hartree" | ||
__distance_unit__ = "ang" | ||
__forces_unit__ = "hartree/ang" | ||
__energy_methods__ = [ | ||
"CCSD(T)/CBS", | ||
"MP2/CBS", | ||
"dCCSD(T)/haDZ", | ||
"dCCSD(T)/haTZ", | ||
"MP2.5/CBS(aDZ)", | ||
] | ||
|
||
energy_target_names = [] | ||
|
||
def read_raw_entries(self) -> List[Dict]: | ||
yaml_fpath = os.path.join(self.root, "x40.yaml") | ||
logger.info(f"Reading X40 interaction data from {self.root}") | ||
yaml_file = open(yaml_fpath, "r") | ||
data = [] | ||
data_dict = yaml.load(yaml_file, Loader=get_loader()) | ||
charge0 = int(data_dict["description"].global_setup["molecule_a"]["charge"]) | ||
charge1 = int(data_dict["description"].global_setup["molecule_b"]["charge"]) | ||
|
||
for idx, item in enumerate(data_dict["items"]): | ||
energies = [] | ||
name = np.array([item.shortname]) | ||
energies.append(float(item.reference_value)) | ||
xyz_file = open(os.path.join(self.root, f"{item.shortname}.xyz"), "r") | ||
lines = list(map(lambda x: x.strip().split(), xyz_file.readlines())) | ||
setup = lines.pop(1) | ||
n_atoms = np.array([int(lines[0][0])], dtype=np.int32) | ||
n_atoms_first = setup[0].split("-")[1] | ||
n_atoms_first = np.array([int(n_atoms_first)], dtype=np.int32) | ||
subset = np.array([item.group]) | ||
energies += [float(val[idx]) for val in list(data_dict["alternative_reference"].values())] | ||
energies = np.array([energies], dtype=np.float32) | ||
pos = np.array(lines[1:])[:, 1:].astype(np.float32) | ||
elems = np.array(lines[1:])[:, 0] | ||
atomic_nums = np.expand_dims(np.array([atom_table.GetAtomicNumber(x) for x in elems]), axis=1) | ||
natoms0 = n_atoms_first[0] | ||
natoms1 = n_atoms[0] - natoms0 | ||
charges = np.expand_dims(np.array([charge0] * natoms0 + [charge1] * natoms1), axis=1) | ||
atomic_inputs = np.concatenate((atomic_nums, charges, pos), axis=-1, dtype=np.float32) | ||
|
||
item = dict( | ||
energies=energies, | ||
subset=subset, | ||
n_atoms=n_atoms, | ||
n_atoms_first=n_atoms_first, | ||
atomic_inputs=atomic_inputs, | ||
name=name, | ||
) | ||
data.append(item) | ||
return data |
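Both readers build each energy row as the reference value followed by one entry per `alternative_reference` column, so the row lines up with `__energy_methods__` only by position and by the insertion order of the YAML mapping. A small sketch of that alignment with an explicit length check; the `energies_for_item` helper, the shortened method list, and the numbers are assumptions made for illustration, not values from the dataset:

```python
from typing import Dict, List, Mapping, Sequence

# Shortened, illustrative method list (the real list lives on the dataset class).
ENERGY_METHODS: List[str] = ["CCSD(T)/CBS", "MP2/CBS", "dCCSD(T)/haDZ"]


def energies_for_item(
    idx: int,
    reference_value,
    alternative_reference: Mapping[str, Sequence],
) -> Dict[str, float]:
    """Pair each method name with the energy of item `idx`.

    The first method gets the reference value; the remaining methods get
    column `idx` of each alternative reference, in mapping order.
    """
    row = [float(reference_value)]
    row += [float(values[idx]) for values in alternative_reference.values()]
    if len(row) != len(ENERGY_METHODS):
        raise ValueError(
            f"expected {len(ENERGY_METHODS)} energies, got {len(row)}"
        )
    return dict(zip(ENERGY_METHODS, row))


# Invented numbers, for illustration only.
alt = {"MP2/CBS": [-1.0, -2.0], "dCCSD(T)/haDZ": [-1.5, -2.5]}
second_item = energies_for_item(1, "-2.2", alt)
```

The length check is the useful part: if a YAML file gained or lost an alternative reference column, the mismatch would surface at read time rather than as mislabelled energies.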
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,21 @@ | ||
from .des import DES | ||
from .base import BaseInteractionDataset | ||
from .des5m import DES5M | ||
from .des370k import DES370K | ||
from .dess66 import DESS66 | ||
from .dess66x8 import DESS66x8 | ||
from .L7 import L7 | ||
from .metcalf import Metcalf | ||
from .splinter import Splinter | ||
from .X40 import X40 | ||
|
||
AVAILABLE_INTERACTION_DATASETS = {"des": DES} | ||
AVAILABLE_INTERACTION_DATASETS = { | ||
"base": BaseInteractionDataset, | ||
"des5m": DES5M, | ||
"des370k": DES370K, | ||
"dess66": DESS66, | ||
"dess66x8": DESS66x8, | ||
"l7": L7, | ||
"metcalf": Metcalf, | ||
"splinter": Splinter, | ||
"x40": X40, | ||
} |
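The registry keys are lowercase (`"l7"`, `"x40"`) while the class names are not, so a lookup by class name would miss. A hedged sketch of a forgiving lookup; `get_dataset_class` and the stand-in classes are hypothetical and not part of openqdc:

```python
import difflib


# Stand-in classes so the sketch runs on its own (hypothetical).
class L7:
    pass


class X40:
    pass


REGISTRY = {"l7": L7, "x40": X40}


def get_dataset_class(name, registry=REGISTRY):
    """Look up a dataset class by name, ignoring case and outer whitespace."""
    key = name.strip().lower()
    try:
        return registry[key]
    except KeyError:
        close = difflib.get_close_matches(key, list(registry), n=3)
        hint = f" Did you mean: {', '.join(close)}?" if close else ""
        raise KeyError(f"Unknown dataset {name!r}.{hint}") from None
```

For example, `get_dataset_class("L7")` and `get_dataset_class(" x40 ")` both resolve, while an unknown name raises a `KeyError` that lists near matches.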
For now this is fine, but we need to add more information about these datasets to the README.