Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interaction Datasets #40

Merged
merged 51 commits into from
Mar 14, 2024
Merged

Interaction Datasets #40

merged 51 commits into from
Mar 14, 2024

Conversation

mcneela
Copy link
Collaborator

@mcneela mcneela commented Mar 4, 2024

The DES370K interaction dataset has been created. The data dictionary elements additionally have keys "mol0" and "mol1" to make getting the individual dimers easier for users.

P.S. The current DES dataset we are using pulls the geometries from .mol files which were not created by the original dataset authors and appear to be correct. The correct atomic coordinates can be obtained from the original csv file as I am doing.

@mcneela mcneela requested a review from prtos March 4, 2024 19:19
@mcneela
Copy link
Collaborator Author

mcneela commented Mar 4, 2024

@FNTwin

src/openqdc/datasets/interaction/des370k.py Outdated Show resolved Hide resolved

name = np.array([smiles0 + "." + smiles1])

item = dict(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we only need this. a lot of the information can in the sub dict can be retrieved only with the info below!

        item = dict(
            energies=energies,
            subset=np.array(["DES370K"]), #In dess they have subsets for each monomer no? so mabe the subset here can be "subset1.subset2"
            n_atoms=np.array([natoms0 + natoms1], dtype=np.int32),
            n_atoms_first=np.array([natoms0], dtype=np.int32),
            atomic_inputs=atomic_inputs, # with n_atoms_first we can resplit this so we can leave this and split in the getitem
            name=name, # already smiles1 and smiles2 can be 
        )

@mcneela mcneela changed the title DES370K Interaction Interaction Datasets Mar 6, 2024
Copy link
Collaborator

@FNTwin FNTwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove yaml dependency

openqdc/datasets/interaction/X40.py Outdated Show resolved Hide resolved
openqdc/datasets/interaction/L7.py Outdated Show resolved Hide resolved
@mcneela
Copy link
Collaborator Author

mcneela commented Mar 8, 2024

Thank you Danny for all the work!

I would remove the commented parts of the code and add some docstrings for the different classes/new datasets like we have in the other datasets.

Also lots of read_raw_entries are pretty long and handling multiple functions internally so I would refactor them a little bit into multiple function to make them more readable.

Last thing, is some dataset can probably be written with a little bit more inheritance to make everything cleaner and have less clutter in the interaction dataset folder , as an example:

class des370k(...):
    ....

class des5m(des370k):
    ....

If you can add a dummy class as a test for the new object return type

Comments have been deleted and docstrings have been added! 🧨

@mcneela
Copy link
Collaborator Author

mcneela commented Mar 8, 2024

Thank you Danny for all the work!
I would remove the commented parts of the code and add some docstrings for the different classes/new datasets like we have in the other datasets.
Also lots of read_raw_entries are pretty long and handling multiple functions internally so I would refactor them a little bit into multiple function to make them more readable.
Last thing, is some dataset can probably be written with a little bit more inheritance to make everything cleaner and have less clutter in the interaction dataset folder , as an example:

class des370k(...):
    ....

class des5m(des370k):
    ....

If you can add a dummy class as a test for the new object return type

Comments have been deleted and docstrings have been added! 🧨

Refactoring of DES370K/5M is now done as well

Base automatically changed from refactoring to develop March 8, 2024 19:28
@mcneela
Copy link
Collaborator Author

mcneela commented Mar 12, 2024

Remove yaml dependency

Fixed now

Copy link
Collaborator

@FNTwin FNTwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just do double check, is every datasets in Hartree and Angstrom units?

Comment on lines +160 to +163
# l7 = dict(
# dataset_name="l7",
# links={"l7.zip": "http://www.begdb.org/moldown.php?id=40"}
# )
Copy link
Collaborator

@FNTwin FNTwin Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be uncommented.
Also I don't see the X40 and Splinter downloads in the config_factory, is there a particular reason why we didn't add them there?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I downloaded those datasets manually. There were multiple files and such so it seemed difficult to use the config_factory.

Comment on lines +84 to +92
| --- |
| [DES370K](https://www.nature.com/articles/s41597-021-00833-x) |
| [DES5M](https://www.nature.com/articles/s41597-021-00833-x) |
| [Metcalf](https://pubs.aip.org/aip/jcp/article/152/7/074103/1059677/Approaches-for-machine-learning-intermolecular) |
| [DESS66](https://www.nature.com/articles/s41597-021-00833-x) |
| [DESS66x8](https://www.nature.com/articles/s41597-021-00833-x) |
| [Splinter](https://www.nature.com/articles/s41597-023-02443-1) |
| [X40](https://pubs.acs.org/doi/10.1021/ct300647k) |
| [L7](https://pubs.acs.org/doi/10.1021/ct400036b) |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now this is fine but we need to add more informations about these datasets in the readme

Comment on lines 39 to 41
csum = np.cumsum(res.get("n_atoms"))
print(csum)
x = np.zeros((csum.shape[0], 2), dtype=np.int32)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the print

@mcneela
Copy link
Collaborator Author

mcneela commented Mar 14, 2024

Just do double check, is every datasets in Hartree and Angstrom units?

I'm not positive, I suggest we merge and then do a new PR if there are any issues.

@FNTwin
Copy link
Collaborator

FNTwin commented Mar 14, 2024

Just do double check, is every datasets in Hartree and Angstrom units?

I'm not positive, I suggest we merge and then do a new PR if there are any issues.

I'm ok merging (as soon as you remove the print statement) but I'll leave it to you correcting the units in case they are wrong

@mcneela mcneela merged commit e1456e6 into develop Mar 14, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants