Interaction Datasets #40
name = np.array([smiles0 + "." + smiles1])
item = dict(
I think we only need this. A lot of the information in the sub-dict can be retrieved from the info below alone!
item = dict(
    energies=energies,
    subset=np.array(["DES370K"]),  # In DES they have subsets for each monomer, no? So maybe the subset here can be "subset1.subset2"
    n_atoms=np.array([natoms0 + natoms1], dtype=np.int32),
    n_atoms_first=np.array([natoms0], dtype=np.int32),
    atomic_inputs=atomic_inputs,  # with n_atoms_first we can re-split this, so we can leave it as is and split in the getitem
    name=name,  # already smiles1 and smiles2 can be
)
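A minimal sketch of the re-split that the comment on `atomic_inputs` describes, assuming `atomic_inputs` is a 2-D array with one row per atom and that the first monomer's rows come first. The helper name `split_dimer` and the column layout of the toy array are hypothetical, not part of openQDC.

```python
import numpy as np


def split_dimer(atomic_inputs, n_atoms_first):
    """Split a dimer's per-atom rows into (monomer 0, monomer 1).

    Hypothetical helper: assumes rows are ordered with monomer 0 first.
    """
    # n_atoms_first is stored as a length-1 array in the item dict.
    n_first = int(np.asarray(n_atoms_first).ravel()[0])
    return atomic_inputs[:n_first], atomic_inputs[n_first:]


# Toy dimer: 2 atoms in the first monomer, 3 in the second.
# The 5 columns are placeholders chosen purely for illustration.
atomic_inputs = np.arange(25, dtype=np.float32).reshape(5, 5)
item = dict(
    n_atoms=np.array([5], dtype=np.int32),
    n_atoms_first=np.array([2], dtype=np.int32),
    atomic_inputs=atomic_inputs,
)

mol0, mol1 = split_dimer(item["atomic_inputs"], item["n_atoms_first"])
print(mol0.shape, mol1.shape)
```

Storing only `n_atoms_first` keeps the item flat; the per-monomer views are slices of the same array, so nothing is duplicated on disk.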
Remove yaml dependency
Comments have been deleted and docstrings have been added! 🧨
Refactoring of DES370K/5M is now done as well
Fixed now
Just to double check, is every dataset in Hartree and Angstrom units?
# l7 = dict(
#     dataset_name="l7",
#     links={"l7.zip": "http://www.begdb.org/moldown.php?id=40"}
# )
I think this should be uncommented.
Also I don't see the X40 and Splinter downloads in the config_factory, is there a particular reason why we didn't add them there?
I downloaded those datasets manually. There were multiple files and such so it seemed difficult to use the config_factory.
| --- |
| [DES370K](https://www.nature.com/articles/s41597-021-00833-x) |
| [DES5M](https://www.nature.com/articles/s41597-021-00833-x) |
| [Metcalf](https://pubs.aip.org/aip/jcp/article/152/7/074103/1059677/Approaches-for-machine-learning-intermolecular) |
| [DESS66](https://www.nature.com/articles/s41597-021-00833-x) |
| [DESS66x8](https://www.nature.com/articles/s41597-021-00833-x) |
| [Splinter](https://www.nature.com/articles/s41597-023-02443-1) |
| [X40](https://pubs.acs.org/doi/10.1021/ct300647k) |
| [L7](https://pubs.acs.org/doi/10.1021/ct400036b) |
For now this is fine, but we need to add more information about these datasets in the README.
openqdc/datasets/interaction/base.py
Outdated
csum = np.cumsum(res.get("n_atoms"))
print(csum)
x = np.zeros((csum.shape[0], 2), dtype=np.int32)
Remove the print
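The excerpt above does not show how `x` is filled in. One plausible reading, offered only as an assumption, is that each row holds the start and end offsets of an item's atoms in the concatenated per-atom array. A sketch under that assumption (the function name `position_index_ranges` is hypothetical):

```python
import numpy as np


def position_index_ranges(n_atoms):
    """Return an (N, 2) int32 array of [start, end) row offsets per item."""
    csum = np.cumsum(n_atoms)
    x = np.zeros((csum.shape[0], 2), dtype=np.int32)
    x[1:, 0] = csum[:-1]  # each item starts where the previous one ended
    x[:, 1] = csum        # exclusive end offset of each item
    return x


# Three items with 5, 3 and 4 atoms respectively.
ranges = position_index_ranges(np.array([5, 3, 4], dtype=np.int32))
print(ranges)
```

With offsets like these, item `i` can be read back as `atomic_inputs[ranges[i, 0]:ranges[i, 1]]`, and no debugging `print` is needed inside the function.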
I'm not positive. I suggest we merge and then open a new PR if there are any issues.
I'm OK with merging (as soon as you remove the print statement), but I'll leave it to you to correct the units in case they are wrong.
The DES370K interaction dataset has been created. The data dictionary elements additionally have keys "mol0" and "mol1" to make getting the individual dimers easier for users.
P.S. The current DES dataset we are using pulls the geometries from `.mol` files, which were not created by the original dataset authors and appear to be incorrect. The correct atomic coordinates can be obtained from the original csv file, as I am doing.