[Project]: Train REINVENT Mol2MolSimilarity model to predict molecules similar in 3d shape #1128
Comments
Hello @ankitskvmdam, After some thought, I suggest the following:
The process will be the following:
I hope this makes sense. I also have more ideas to extend this later (for example, searching against other databases such as ZINC in CHEESE, running multiple similarity searches, or penalizing molecules that are similar in 2D to favour scaffold hopping). But let's go step by step. Please let me know if something is not clear, @ankitskvmdam!
@miquelduranfrigola It makes sense. I will proceed with this.
Following up on our last meeting:
Cheese API updates:
Validation of newly generated molecules
To validate the newly trained model, we'll develop a classifier model. This classifier will be trained using two datasets:
The classifier's purpose will be to determine how many of the newly generated molecules belong to the class of "400 molecules".
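As a rough illustration of the validation idea above, the check amounts to scoring each generated molecule with the classifier and reporting the fraction assigned to the "400 molecules" class. This is only a sketch: `predict_proba` is a hypothetical stand-in for whatever trained classifier is used, and the 0.5 decision threshold is an assumption, not something agreed in this thread.

```python
from typing import Callable, Iterable


def fraction_in_reference_class(
    smiles_list: Iterable[str],
    predict_proba: Callable[[str], float],  # hypothetical: P(molecule belongs to the "400 molecules" class)
    threshold: float = 0.5,  # assumed decision threshold
) -> float:
    """Return the fraction of generated SMILES the classifier assigns to the reference class."""
    smiles = list(smiles_list)
    if not smiles:
        return 0.0
    hits = sum(1 for s in smiles if predict_proba(s) >= threshold)
    return hits / len(smiles)


if __name__ == "__main__":
    # Toy stand-in classifier, for demonstration only (not a real model).
    def toy_classifier(s: str) -> float:
        return 0.9 if len(s) >= 3 else 0.1

    print(fraction_in_reference_class(["CCO", "C", "CCCC"], toy_classifier))
```

A higher fraction would suggest the generated molecules look more like the reference set to the classifier; it says nothing by itself about how good the classifier is.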
Thanks @ankitskvmdam this is a perfect summary.
Sure! Here are the links:
Follow up
Hi @ankitskvmdam this is fantastic. As a reference library, you can use the following one: https://github.com/ersilia-os/groverfeat/blob/main/data/reference_library.csv
I have updated the notebook with the changes above. I have a question regarding the reference library, which contains 1,000,000 SMILES. How should we select SMILES from this library? If we select randomly, there is a possibility of selecting SMILES with similar 3D structures. In the notebook, I used RMSD to filter out 3D-similar molecules. Also, instead of training a classifier model, can we just use RMSD?
Thanks @ankitskvmdam Let's discuss this now in our meeting. There is a possibility of selecting molecules with similar 3D structures, but I think this possibility is residual and we should not worry too much about it, especially since we are "just" evaluating the model and the worst that can happen is that we underestimate performance. As for the classifier vs RMSD: the classifier should be faster, since no conformers need to be generated.
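For readers unfamiliar with the RMSD-based filtering mentioned above, here is a minimal sketch of the general idea: a greedy pass that keeps a candidate only if it is far enough from everything already kept. It is not the notebook's code. The `rmsd` helper assumes two coordinate sets with the same number of atoms in the same order and already aligned, which real molecules from a library will generally not satisfy without conformer generation and alignment; the cutoff value is also an assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

Coords = Sequence[Tuple[float, float, float]]
T = TypeVar("T")


def rmsd(a: Coords, b: Coords) -> float:
    """Plain RMSD between two pre-aligned coordinate sets with matching atom order (assumption)."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(a, b)
    )
    return math.sqrt(squared / len(a))


def greedy_diverse_subset(
    items: Sequence[T], distance: Callable[[T, T], float], cutoff: float
) -> List[T]:
    """Keep an item only if its distance to every already-kept item exceeds the cutoff."""
    kept: List[T] = []
    for item in items:
        if all(distance(item, other) > cutoff for other in kept):
            kept.append(item)
    return kept


if __name__ == "__main__":
    # Toy two-atom "conformers", for demonstration only.
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1)]  # nearly identical to a
    c = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]  # clearly different from a
    print(len(greedy_diverse_subset([a, b, c], rmsd, cutoff=0.5)))
```

The greedy pass costs one distance evaluation per kept item for each candidate, which is part of why a classifier that needs no conformers is attractive at library scale.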
I made the following changes to the notebook:
P.S. I tried generating 30,000 new molecules; it took 3 minutes, and in the end I was able to get 1,000 molecules.
Thanks @ankitskvmdam this looks very good and the notebook is great.
@miquelduranfrigola, what I am doing is keeping only the SMILES that have more than a 60% probability of having the same similarity score. Questions:
If we train the model with 1:10 and then do |
Thanks for the clarification @ankitskvmdam, I understand. Those are very valid points. In any case, I don't think we need to worry too much about it. It is still fine to keep molecules with "low" scores. I definitely think we can keep the top 1,000 even if they do not have high scores. You mentioned that you will ask for 20k or 30k molecules. Based on your experience, do you think we will ever get a case where we have fewer than 1,000 molecules? If this is the case, I agree, no classifier is needed.
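To make the "keep the top 1,000 regardless of score" suggestion concrete, here is a small sketch. It ranks generated molecules by score, keeps the best `top_n`, and separately reports how many of those would also have cleared the 60% cutoff discussed above. The function name and the `(smiles, score)` tuple layout are assumptions for illustration, not the notebook's actual code.

```python
from typing import List, Sequence, Tuple

Scored = Tuple[str, float]  # (SMILES, score); layout assumed for illustration


def top_n_with_threshold_count(
    scored: Sequence[Scored], top_n: int = 1000, threshold: float = 0.6
) -> Tuple[List[Scored], int]:
    """Return the top_n molecules by score and how many of them score above the threshold."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    kept = ranked[:top_n]
    n_above = sum(1 for _, score in kept if score > threshold)
    return kept, n_above


if __name__ == "__main__":
    # Toy scores, for demonstration only.
    molecules = [("CCO", 0.9), ("CCN", 0.2), ("CCC", 0.7), ("C", 0.5)]
    kept, n_above = top_n_with_threshold_count(molecules, top_n=3, threshold=0.6)
    print(kept, n_above)
```

With this approach the output size is fixed whenever at least `top_n` molecules are generated, and the count above the threshold can still be reported as a quality indicator.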
I have updated the notebook:
Hi @ankitskvmdam thanks for the update. Let's do 15k molecules; this sounds like a good compromise.
Hi @miquelduranfrigola and @ankitskvmdam What is the status of this? Are we aiming to complete it, or have we put it on hold?
This is essentially finished - Mol2MolSimilarity works great. I am closing the issue. @ankitskvmdam feel free to re-open if there are things you want to discuss further.
Summary
Currently, the REINVENT 4 Mol2MolSimilarity model generates new molecules with a similar 2D structure but not necessarily a similar 3D structure. Our goal is to train the REINVENT 4 Mol2MolSimilarity model to produce molecules with a 3D structure similar to the input molecule.
Approach 1
To achieve this, we will train the Mol2MolSimilarity model to generate new molecules with a similar 3D shape. The training process involves the following steps:
Approach 2
While most steps are similar to Approach 1, this approach explores alternative tools for generating 3D conformers and calculating 3D shape similarity scores. One such tool we can investigate is Cheese.
Scope
Initiative 🐋
Objective(s)
To develop a model that can efficiently produce new molecules with a 3D structure similar to that of the input molecule.
Team
Timeline
TBD
Documentation