Interaction Datasets #40

mcneela · 2024-03-04T19:19:29Z

The DES370K interaction dataset has been created. The data dictionary elements additionally have keys "mol0" and "mol1" to make getting the individual dimers easier for users.

P.S. The current DES dataset we are using pulls the geometries from .mol files which were not created by the original dataset authors and appear to be incorrect. The correct atomic coordinates can be obtained from the original CSV file, as I am doing here.

mcneela · 2024-03-04T19:19:46Z

@FNTwin

src/openqdc/datasets/interaction/des370k.py

prtos · 2024-03-05T03:47:09Z

src/openqdc/datasets/interaction/des370k.py

+
+            name = np.array([smiles0 + "." + smiles1])
+
+            item = dict(


I think we only need this. A lot of the information in the sub-dict can be retrieved from the info below!

item = dict(
    energies=energies,
    subset=np.array(["DES370K"]),  # In DES they have subsets for each monomer, no? So maybe the subset here can be "subset1.subset2"
    n_atoms=np.array([natoms0 + natoms1], dtype=np.int32),
    n_atoms_first=np.array([natoms0], dtype=np.int32),
    atomic_inputs=atomic_inputs,  # with n_atoms_first we can resplit this, so we can leave this and split in the getitem
    name=name,  # already smiles1 and smiles2 can be
)
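A minimal sketch of the resplit mentioned in the comment above, assuming `atomic_inputs` is a 2D array with one row per atom and the first monomer's atoms stored first. The helper name `split_dimer` and the column layout are hypothetical and are not taken from the openqdc code.

```python
import numpy as np


def split_dimer(atomic_inputs, n_atoms_first):
    """Split stacked dimer rows back into the two monomers (hypothetical helper)."""
    n_first = int(np.asarray(n_atoms_first).ravel()[0])
    if not 0 <= n_first <= atomic_inputs.shape[0]:
        raise ValueError("n_atoms_first is out of range for atomic_inputs")
    # Rows [0, n_first) belong to the first monomer, the rest to the second.
    return atomic_inputs[:n_first], atomic_inputs[n_first:]


# Toy dimer: 3 atoms in the first monomer, 2 in the second.
# Assumed columns: atomic number, charge, x, y, z.
atomic_inputs = np.arange(25, dtype=np.float32).reshape(5, 5)
mol0, mol1 = split_dimer(atomic_inputs, np.array([3], dtype=np.int32))
print(mol0.shape, mol1.shape)
```

Storing only `n_atoms_first` keeps the saved item small, and the split can be done lazily when an item is fetched.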

FNTwin

Remove yaml dependency

openqdc/datasets/interaction/X40.py

openqdc/datasets/interaction/L7.py

mcneela · 2024-03-08T16:25:53Z

Thank you Danny for all the work!

I would remove the commented-out parts of the code and add docstrings for the new classes/datasets, like we have in the other datasets.

Also, many of the read_raw_entries methods are pretty long and handle multiple tasks internally, so I would split them into several functions to make them more readable.

Last thing: some datasets can probably be written with a bit more inheritance to make everything cleaner and reduce clutter in the interaction dataset folder. As an example:
class des370k(...):
    ....

class des5m(des370k):
    ....
If you can, add a dummy class as a test for the new object return type.

Comments have been deleted and docstrings have been added! 🧨

mcneela · 2024-03-08T16:45:05Z

Refactoring of DES370K/5M is now done as well
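For readers following along, here is a rough sketch of the inheritance pattern discussed above, not the code that was merged. The class names, attribute names, and file names are all hypothetical stand-ins; the point is only that the child overrides what differs and reuses the parent's parsing logic.

```python
class DES370KSketch:
    """Hypothetical stand-in for the DES370K interaction dataset."""

    dataset_name = "des370k_interaction"
    filename = "DES370K.csv"  # assumed file name, for illustration only

    def read_raw_entries(self):
        # Shared parsing logic lives in the parent; a real implementation
        # would read self.filename here instead of looping over dummy rows.
        return [self._convert_row(i) for i in range(2)]

    def _convert_row(self, row_id):
        # Placeholder for the per-row conversion shared by both datasets.
        return {"subset": self.dataset_name, "row": row_id}


class DES5MSketch(DES370KSketch):
    """Only the attributes that differ from the parent are overridden."""

    dataset_name = "des5m_interaction"
    filename = "DES5M.csv"  # assumed file name, for illustration only


print(DES5MSketch().read_raw_entries())
```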

mcneela · 2024-03-12T17:58:09Z

Remove yaml dependency

Fixed now

FNTwin

Just to double check, is every dataset in Hartree and Angstrom units?

FNTwin · 2024-03-14T13:23:26Z

openqdc/raws/config_factory.py

+    # l7 = dict(
+    #     dataset_name="l7",
+    #     links={"l7.zip": "http://www.begdb.org/moldown.php?id=40"}
+    # )


I think this should be uncommented.
Also, I don't see the X40 and Splinter downloads in the config_factory. Is there a particular reason why we didn't add them there?

I downloaded those datasets manually. There were multiple files and such so it seemed difficult to use the config_factory.

FNTwin · 2024-03-14T13:25:27Z

README.md

+| --- |
+| [DES370K](https://www.nature.com/articles/s41597-021-00833-x) |
+| [DES5M](https://www.nature.com/articles/s41597-021-00833-x)   |
+| [Metcalf](https://pubs.aip.org/aip/jcp/article/152/7/074103/1059677/Approaches-for-machine-learning-intermolecular) |
+| [DESS66](https://www.nature.com/articles/s41597-021-00833-x) |
+| [DESS66x8](https://www.nature.com/articles/s41597-021-00833-x) |
+| [Splinter](https://www.nature.com/articles/s41597-023-02443-1) |
+| [X40](https://pubs.acs.org/doi/10.1021/ct300647k) |
+| [L7](https://pubs.acs.org/doi/10.1021/ct400036b)  |


For now this is fine, but we need to add more information about these datasets in the README.

FNTwin · 2024-03-14T13:26:23Z

openqdc/datasets/interaction/base.py

+        csum = np.cumsum(res.get("n_atoms"))
+        print(csum)
+        x = np.zeros((csum.shape[0], 2), dtype=np.int32)


Remove the print
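A minimal sketch of how the debugging output could go through `logging` instead of `print`. Only the `np.cumsum` and `np.zeros` lines come from the diff above; the function name `position_index_range` and the two assignments that fill `x` are assumptions about what the surrounding code does.

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)


def position_index_range(n_atoms):
    """Return (start, end) row offsets per entry from per-entry atom counts."""
    csum = np.cumsum(n_atoms)
    logger.debug("cumulative atom counts: %s", csum)  # replaces print(csum)
    x = np.zeros((csum.shape[0], 2), dtype=np.int32)
    x[1:, 0] = csum[:-1]  # each entry starts where the previous one ended
    x[:, 1] = csum  # exclusive end offset of each entry
    return x


ranges = position_index_range(np.array([5, 3, 7], dtype=np.int32))
print(ranges.tolist())  # [[0, 5], [5, 8], [8, 15]]
```

With the default logging configuration, the debug message stays silent unless a caller opts in.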

mcneela · 2024-03-14T16:12:07Z

Just to double check, is every dataset in Hartree and Angstrom units?

I'm not positive. I suggest we merge and then open a new PR if there are any issues.

FNTwin · 2024-03-14T16:57:08Z

Just to double check, is every dataset in Hartree and Angstrom units?

I'm not positive. I suggest we merge and then open a new PR if there are any issues.

I'm OK with merging (as soon as you remove the print statement), but I'll leave it to you to correct the units in case they are wrong.
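If a follow-up PR does need to normalize units, a conversion helper could look roughly like the sketch below. The function names and the unit strings are hypothetical and do not reflect an existing openqdc API; the conversion factors are the CODATA 2018 values.

```python
import numpy as np

# CODATA 2018 based conversion factors.
KCAL_PER_MOL_PER_HARTREE = 627.5094740631
ANGSTROM_PER_BOHR = 0.529177210903


def to_hartree(energies, unit):
    """Convert energies given in `unit` ("hartree" or "kcal/mol") to Hartree."""
    energies = np.asarray(energies, dtype=np.float64)
    if unit == "hartree":
        return energies
    if unit == "kcal/mol":
        return energies / KCAL_PER_MOL_PER_HARTREE
    raise ValueError(f"unsupported energy unit: {unit}")


def to_angstrom(positions, unit):
    """Convert positions given in `unit` ("ang" or "bohr") to Angstrom."""
    positions = np.asarray(positions, dtype=np.float64)
    if unit == "ang":
        return positions
    if unit == "bohr":
        return positions * ANGSTROM_PER_BOHR
    raise ValueError(f"unsupported distance unit: {unit}")


print(to_hartree([-5.0], "kcal/mol"))
print(to_angstrom([[1.0, 0.0, 0.0]], "bohr"))
```

Raising on unknown units makes a mislabelled dataset fail loudly instead of silently mixing scales.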

mcneela added 10 commits March 1, 2024 16:07

started splitting datasets into 'interaction' and 'potential'

bd3fcf9

add num_unique_molecules property

a800ea5

added logging

9d6fca6

started base interaction dataset

794e63f

add interaction __init__ file and revise potential __init__ file

0db4765

add des370k interaction to config_factory.py

6e5a002

have BaseInteractionDataset inherit BaseDataset

8e1e003

implemented read_raw_entries for DES370K

d68bae6

finished implementation of DES370K interaction

5e94d67

finished implementation of DES370K interaction

3c9508b

mcneela requested a review from prtos March 4, 2024 19:19

mcneela added 4 commits March 4, 2024 14:23

update BaseDataset import path

768fb2e

added Metcalf dataset

8aeadd8

updated DES370K based on Prudencio's comments

9cf6034

Merge branch 'interaction' into metcalf

ce2c53b

prtos reviewed Mar 5, 2024

mcneela added 8 commits March 5, 2024 15:09

added const molecule_groups lookup for DES370K dataset

6206665

updated subsets for DES370K

5cb57d9

added download url for des5m_interaction

e18b710

updated README with new datasets

54cadbf

Merge branch 'metcalf' into interaction

7f83eb5

Added DES5M dataset

a922ef7

added des_s66 dataset

2146058

added DESS66x8 dataset

4d9a4ba

mcneela changed the title ~~DES370K Interaction~~ Interaction Datasets Mar 6, 2024

mcneela added 4 commits March 6, 2024 09:56

small update to __init__ file

c2229e3

added L7 dataset

9349454

added X40 dataset

c3bdc64

add new datasets to __init__.py

23c0739

mcneela added 8 commits March 8, 2024 10:00

code cleanup for the linter

78f0423

fix ani import

bd58fdf

Merge branch 'refactoring' into interaction

5dfcf55

fix base dataset import

4bc3a49

black formatting

b046eea

ran precommit

fe54044

removed DES from datasets/__init__.py

ef2528c

removed DES from datasets/__init__.py

c0ef5b1

FNTwin requested changes Mar 8, 2024

openqdc/datasets/interaction/X40.py (outdated)

openqdc/datasets/interaction/L7.py (outdated)

mcneela added 2 commits March 8, 2024 11:04

fix X40 energy methods

ad55296

added interaction dataset docstrings

0a51e7c

mcneela added 3 commits March 8, 2024 11:29

update readme with all interaction datasets

b6c3a6a

update metcalf __energy_methods__

07f70b8

refactored des370k and des5m

1443450

update base interaction dataset to add n_atoms_first property

802b70b

Base automatically changed from refactoring to develop March 8, 2024 19:28

update L7 and X40 to use python base yaml package

e969b54

mcneela force-pushed the interaction branch from 0a2ee3f to e969b54 Compare March 12, 2024 18:17

mcneela added 4 commits March 13, 2024 11:23

modify interaction/base.py to save keys other than force/energy in props.pkl

5725fed

fix base dataset issue

6c6b286

fix circular imports

46c5ebe

merge origin/develop into interaction

d5ec053

FNTwin approved these changes Mar 14, 2024

removed print statements

cb9987c

mcneela merged commit e1456e6 into develop Mar 14, 2024
5 checks passed
